Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-18 Thread Peng Cheng
Really? I guess PageRank in Mahout was removed due to the inherent network
bottleneck of MapReduce. But I didn't know MLlib has an implementation.
Is the MLlib implementation based on Lanczos or SSVD? Just curious...


On 17/02/2014 11:11 PM, Dmitriy Lyubimov wrote:

I bet PageRank in MLlib in Spark finds the stationary distribution much faster.
On Feb 17, 2014 1:33 PM, peng pc...@uowmail.edu.au wrote:


Agreed, and this is the case where the Lanczos algorithm is obsolete.
My point is: if SSVD is unable to find the eigenvectors of an asymmetric
matrix (this is a common formulation of PageRank, some random walks,
and many other things), then we still have to rely on a large-scale Lanczos
algorithm.

On Mon 17 Feb 2014 04:25:16 PM EST, Ted Dunning wrote:


For the symmetric case, SVD is eigen decomposition.
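
To spell out why this holds, here is a minimal worked sketch, assuming a real symmetric matrix A with an orthonormal eigenbasis Q (the symbols Q, \Lambda, U, \Sigma, V are notation introduced here for illustration, with sign(0) taken as 1 and columns reordered by decreasing |\lambda_i|). The singular values are the absolute values of the eigenvalues, so the two decompositions coincide exactly when A is also positive semidefinite. The second line is a small non-symmetric counterexample where they differ.

```latex
\begin{align*}
A = Q \Lambda Q^{\top},\quad Q^{\top} Q = I
  &\;\Longrightarrow\;
  A = U \Sigma V^{\top}
  \quad\text{with}\quad
  U = Q,\;\; \Sigma = |\Lambda|,\;\; V = Q\,\operatorname{sign}(\Lambda), \\
A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}
  &\;\Longrightarrow\;
  \lambda_1 = \lambda_2 = 1,
  \qquad
  \sigma_{1,2} = \sqrt{\tfrac{3 \pm \sqrt{5}}{2}} = \tfrac{\sqrt{5} \pm 1}{2} \approx 1.618,\; 0.618 .
\end{align*}
```

In the second case the singular vectors of A are eigenvectors of A^T A and A A^T, not of A itself, which is the gap the PageRank question runs into.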




On Mon, Feb 17, 2014 at 1:12 PM, peng pc...@uowmail.edu.au wrote:

If SSVD is not designed for this kind of eigenvector problem, then I would vote
for retaining the Lanczos algorithm.
However, I would like to see the opposite case: I have tested both
algorithms on the symmetric case, and SSVD is much faster and more accurate
than its competitor.

Yours Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:

In PageRank I'm afraid I have no other option than the eigenvector
\lambda, not the singular vectors u & v :) The PageRank in Mahout was
removed along with the other graph-based algorithms.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

SSVD is very probably better than Lanczos for any large decomposition.

That said, it does SVD, not eigendecomposition, which means that the
question of symmetric matrices or positive definiteness doesn't much
matter.

Do you really need eigen-decomposition?



On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:

Just asking for a possible replacement of our Lanczos-based PageRank
implementation. - Peng







Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-18 Thread Sebastian Schelter
You can also use Giraph for a superfast PageRank implementation. Giraph
even runs on standard Hadoop clusters.

PageRank is usually computed by power iteration, which is much simpler than
Lanczos or SSVD and only gives the eigenvector associated with the largest
eigenvalue.
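
As a rough illustration of the power iteration mentioned above, here is a minimal single-machine sketch in Python. It is not Giraph's or Mahout's implementation; the function name `pagerank`, the damping factor 0.85, the tolerance and the toy graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank (illustrative sketch only).

    links maps each node to the list of nodes it links to.
    Returns a dict of node -> rank; the ranks sum to 1.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Each node passes an equal share of its rank to every out-link.
        for v in nodes:
            targets = links.get(v, [])
            for t in targets:
                new_rank[t] += damping * rank[v] / len(targets)
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Each pass is one sparse matrix-vector product, which is why only the dominant eigenvector (the stationary distribution) comes out and why no full decomposition is needed.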

[jira] [Created] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel

2014-02-18 Thread Isabel Drost-Fromm (JIRA)
Isabel Drost-Fromm created MAHOUT-1418:
--

 Summary: Removal of write access to anything but CMS for username isabel
 Key: MAHOUT-1418
 URL: https://issues.apache.org/jira/browse/MAHOUT-1418
 Project: Mahout
  Issue Type: Task
Reporter: Isabel Drost-Fromm
Assignee: Isabel Drost-Fromm
Priority: Trivial


Hi,

Please remove write access to user name isabel - effective mid March. For 
background check the Mahout board report of October last year*.

Don't worry - I'm not planning to go completely silent and offline by then. 
However I know from several years of doing Berlin Buzzwords that being 
completely sleep deprived is not a good state to commit to subversion - except 
that when sleep deprived I usually don't remember this insight. So this is my 
security net forcing me to go through the regular submit patch in jira, get it 
reviewed and committed cycle (except for documentation changes).

* 
http://www.apache.org/foundation/records/minutes/2013/board_minutes_2013_10_16.txt



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel

2014-02-18 Thread Isabel Drost-Fromm (JIRA)

[ https://issues.apache.org/jira/browse/MAHOUT-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903884#comment-13903884 ]

Isabel Drost-Fromm commented on MAHOUT-1418:


Created INFRA issue due mid March to execute the permission change.



[jira] [Commented] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel

2014-02-18 Thread Suneel Marthi (JIRA)

[ https://issues.apache.org/jira/browse/MAHOUT-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904200#comment-13904200 ]

Suneel Marthi commented on MAHOUT-1418:
---

Are you serious? Do you really need to do this?



Re: Mahout 0.9 Release Notes - First Draft

2014-02-18 Thread Suneel Marthi
Could someone please point me to the URL for adding Mahout release notes?  




On Monday, February 17, 2014 3:27 PM, Ellen Friedman 
b.ellen.fried...@gmail.com wrote:
 

Hi Suneel,

Thanks for notes. I'm inquiring about status of the notes and update to the 
website to announce 0.9: Ted has reviewed the release notes - were you waiting 
for additional input or are they ready to go on the website? Are you the one 
who updates the site?

I've been asked to write a short blog on the release but wanted to wait until 
the site is updated.

Thanks much
Ellen





On Tue, Feb 11, 2014 at 10:06 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Here's a draft of the release notes for Mahout 0.9. Please review.

--



The Apache Mahout PMC is pleased to announce the release of Mahout 0.9.
Mahout's goal is to build scalable machine learning libraries focused
primarily in the areas of collaborative filtering (recommenders),
clustering and classification (known collectively as the 3Cs), as well as the
necessary infrastructure to support those implementations including, but
not limited to, math packages for statistics, linear algebra and others
as well as Java primitive collections, local and distributed vector and
matrix classes and a variety of integrative code to work with popular
packages like Apache Hadoop, Apache Lucene, Apache HBase, Apache
Cassandra and much more. The 0.9 release is mainly a cleanup release in
preparation for an upcoming 1.0 release targeted for the first half of 2014,
but there are a few significant new features, which are highlighted below.

To get started with Apache Mahout 0.9, download the release artifacts and 
signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the central 
Maven repository.


As with any release, we wish to thank all of the users and contributors
to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for
individual credits, as there are too many to list here.

GETTING STARTED

In the release package, the examples directory contains several working
examples of the core functionality available in Mahout. These can be run via
scripts in the examples/bin directory and will prompt you for more information
to help you try things out. Most examples do not need a Hadoop cluster in
order to run.

RELEASE HIGHLIGHTS

The highlights of the Apache Mahout 0.9 release include, but are not
limited to the list below. For further information, see the included 
CHANGELOG[1] file.

-  MAHOUT-1297: Scala DSL Bindings for Mahout Math Linear Algebra.
   See http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html
-  MAHOUT-1288: Recommenders as a Search.
   See https://github.com/pferrel/solr-recommender
-  MAHOUT-1364: Upgrade Mahout to Lucene 4.6.1
-  MAHOUT-1361: Online Algorithm for computing accurate Quantiles using
   1-dimensional Clustering.
   See https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf
   for the details.
-  MAHOUT-1265: MultiLayer Perceptron (MLP) classifier.
   This is an early implementation of MLP to solicit user feedback; it needs
   to be integrated into Mahout’s processing pipeline to work with Mahout’s
   vectors.

- Removed Deprecated algorithms as they have been either replaced by better 
performing algorithms or lacked user support and maintenance.

- the usual bug fixes. See [2] for more information on the 0.9 release.

A total of 113 separate JIRA issues were addressed in this release.


The following algorithms that were marked deprecated in 0.8 have been removed 
in 0.9:

- From Clustering:
    Switched LDA implementation from using Dirichlet to Collapsed Variational
    Bayes (CVB)
    Meanshift
    MinHash - removed due to poor performance, lack of support and lack of usage

- From Classification (both are sequential implementations)
    Winnow - lack of actual usage and support
    Perceptron - lack of actual usage and support

- Collaborative Filtering
    SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and
    org.apache.mahout.cf.taste.impl.recommender.slopeone
    Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
    TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender

- Mahout Math
    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy


CONTRIBUTING

Mahout is always looking for contributions focused on the 3Cs. If you are
interested in contributing, please see our contribution page 
http://mahout.apache.org/developers/how-to-contribute.html or contact us via 
email at dev@mahout.apache.org.


As the project moves towards a 1.0 release, the community will be focused on 
key algorithms that are proven to scale in production and have seen 
wide-spread adoption.

[1] http://svn.apache.org/viewvc/mahout/trunk/CHANGELOG?view=markup&pathrev=1563661
[2] 

Apache Mahout 0.9 released

2014-02-18 Thread Suneel Marthi
The Apache Mahout PMC is pleased to announce the release of Mahout 0.9.
Mahout's goal is to build scalable machine learning libraries focused
primarily in the areas of collaborative filtering (recommenders),
clustering and classification (known collectively as the 3Cs), as well as the
necessary infrastructure to support those implementations including, but
not limited to, math packages for statistics, linear algebra and others
as well as Java primitive collections, local and distributed vector and
matrix classes and a variety of integrative code to work with popular
packages like Apache Hadoop, Apache Lucene, Apache HBase, Apache
Cassandra and much more. The 0.9 release is mainly a cleanup release in
preparation for an upcoming 1.0 release targeted for the first half of 2014,
but there are a few significant new features, which are highlighted below.

To get started with Apache Mahout 0.9, download the release artifacts and 
signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the central 
Maven repository.

As with any release, we wish to thank all of the users and contributors
to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for
individual credits, as there are too many to list here.

GETTING STARTED

In the release package, the examples directory contains several working
examples of the core functionality available in Mahout. These can be run via
scripts in the examples/bin directory and will prompt you for more information
to help you try things out. Most examples do not need a Hadoop cluster in
order to run.

RELEASE HIGHLIGHTS

The highlights of the Apache Mahout 0.9 release include, but are not
limited to the list below. For further information, see the included 
CHANGELOG[1] file.

-  MAHOUT-1245: A new and improved Mahout website based on Apache CMS
-  MAHOUT-1265: MultiLayer Perceptron (MLP) classifier.
   This is an early implementation of MLP to solicit user feedback; it needs
   to be integrated into Mahout’s processing pipeline to work with Mahout’s
   vectors.
-  MAHOUT-1297: Scala DSL Bindings for Mahout Math Linear Algebra.
   See http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html
-  MAHOUT-1288: Recommenders as a Search.
   See https://github.com/pferrel/solr-recommender
-  MAHOUT-1300: Support for easy functional Matrix views and derivatives
-  MAHOUT-1343: JSON output format for ClusterDumper
-  MAHOUT-1345: Enable randomised testing for all Mahout modules using Carrot
   RandomizedRunner.
-  MAHOUT-1361: Online Algorithm for computing accurate Quantiles using
   1-dimensional Clustering.
   See https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf
   for the details.
-  MAHOUT-1364: Upgrade Mahout to Lucene 4.6.1


- Removed Deprecated algorithms as they have been either replaced by better 
performing algorithms or lacked user support and maintenance.

- the usual bug fixes. See [2] for more information on the 0.9 release.

A total of 113 separate JIRA issues were addressed in this release.

The following algorithms that were marked deprecated in 0.8 have been removed 
in 0.9:

- From Clustering:
    Switched LDA implementation from using Gibbs Sampling to Collapsed
    Variational Bayes (CVB)
    Meanshift
    MinHash - removed due to poor performance, lack of support and lack of usage

- From Classification (both are sequential implementations)
    Winnow - lack of actual usage and support
    Perceptron - lack of actual usage and support

- Collaborative Filtering
    SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and
    org.apache.mahout.cf.taste.impl.recommender.slopeone
    Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
    TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender

- Mahout Math
    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy


CONTRIBUTING

Mahout is always looking for contributions focused on the 3Cs. If you are
interested in contributing, please see our contribution page 
http://mahout.apache.org/developers/how-to-contribute.html or contact us via 
email at dev@mahout.apache.org.


As the project moves towards a 1.0 release, the community will be focused on 
key algorithms that are proven to scale in production and have seen wide-spread 
adoption. 

[1] http://svn.apache.org/viewvc/mahout/trunk/CHANGELOG?view=markup&pathrev=1563661
[2] https://issues.apache.org/jira/browse/MAHOUT-1411?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%20%220.9%22

Re: Mahout 0.9 Release Notes - First Draft

2014-02-18 Thread Suneel Marthi
Below are the release notes; I'm not sure where they should go on the website.
If someone could point me to a location, I will go ahead and update it.

=

The Apache Mahout PMC is pleased to announce the release of Mahout 0.9.
Mahout's goal is to build scalable machine learning libraries focused
primarily in the areas of collaborative filtering (recommenders),
clustering and classification (known collectively as the 3Cs), as well as the
necessary infrastructure to support those implementations including, but
not limited to, math packages for statistics, linear algebra and others
as well as Java primitive collections, local and distributed vector and
matrix classes and a variety of integrative code to work with popular
packages like Apache Hadoop, Apache Lucene, Apache HBase, Apache
Cassandra and much more. The 0.9 release is mainly a cleanup release in
preparation for an upcoming 1.0 release targeted for the first half of 2014,
but there are a few significant new features, which are highlighted below.

To get started with Apache Mahout 0.9, download the release artifacts and 
signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the central 
Maven repository.

As with any release, we wish to thank all of the users and contributors
to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for
individual credits, as there are too many to list here.

GETTING STARTED

In the release package, the examples directory contains several working
examples of the core functionality available in Mahout. These can be run via
scripts in the examples/bin directory and will prompt you for more information
to help you try things out. Most examples do not need a Hadoop cluster in
order to run.

RELEASE HIGHLIGHTS

The highlights of the Apache Mahout 0.9 release include, but are not
limited to the list below. For further information, see the included 
CHANGELOG[1] file.

-  MAHOUT-1245: A new and improved Mahout website based on Apache CMS
-  MAHOUT-1265: MultiLayer Perceptron (MLP) classifier.
   This is an early implementation of MLP to solicit user feedback; it needs
   to be integrated into Mahout’s processing pipeline to work with Mahout’s
   vectors.
-  MAHOUT-1297: Scala DSL Bindings for Mahout Math Linear Algebra.
   See http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html
-  MAHOUT-1288: Recommenders as a Search.
   See https://github.com/pferrel/solr-recommender
-  MAHOUT-1300: Support for easy functional Matrix views and derivatives
-  MAHOUT-1343: JSON output format for ClusterDumper
-  MAHOUT-1345: Enable randomised testing for all Mahout modules using Carrot
   RandomizedRunner.
-  MAHOUT-1361: Online Algorithm for computing accurate Quantiles using
   1-dimensional Clustering.
   See https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf
   for the details.
-  MAHOUT-1364: Upgrade Mahout to Lucene 4.6.1


- Removed Deprecated algorithms as they have been either replaced by better
performing algorithms or lacked user support and maintenance.

- the usual bug fixes. See [2] for more information on the 0.9 release.

A total of 113 separate JIRA issues were addressed in this release.

The following algorithms that were marked deprecated in 0.8 have been removed 
in 0.9:

- From Clustering:
    Switched LDA implementation from using Gibbs Sampling to Collapsed
    Variational Bayes (CVB)
    Meanshift
    MinHash - removed due to poor performance, lack of support and lack of usage

- From Classification (both are sequential implementations)
    Winnow - lack of actual usage and support
    Perceptron - lack of actual usage and support

- Collaborative Filtering
    SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and
    org.apache.mahout.cf.taste.impl.recommender.slopeone
    Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
    TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender

- Mahout Math
    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy


CONTRIBUTING

Mahout is always looking for contributions focused on the 3Cs. If you are
interested in contributing, please see our contribution page 
http://mahout.apache.org/developers/how-to-contribute.html or contact us via 
email at dev@mahout.apache.org.


As the project moves towards a 1.0 release, the community will be focused on 
key algorithms that are proven to scale in production and have seen wide-spread 
adoption. 

[1] http://svn.apache.org/viewvc/mahout/trunk/CHANGELOG?view=markup&pathrev=1563661
[2] https://issues.apache.org/jira/browse/MAHOUT-1411?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%20%220.9%22






[jira] [Issue Comment Deleted] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel

2014-02-18 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1418:
--

Comment: was deleted

(was: Created INFRA issue due mid March to execute the permission change.)



[jira] [Issue Comment Deleted] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel

2014-02-18 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1418:
--

Comment: was deleted

(was: Are you serious? Do you really need to do this?)



[jira] [Created] (MAHOUT-1419) Random decision forest is excessively slow on numeric features

2014-02-18 Thread Sean Owen (JIRA)
Sean Owen created MAHOUT-1419:
-

 Summary: Random decision forest is excessively slow on numeric features
 Key: MAHOUT-1419
 URL: https://issues.apache.org/jira/browse/MAHOUT-1419
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.9, 0.8, 0.7
Reporter: Sean Owen


Follow-up to MAHOUT-1417. There's a customer running this and observing it take 
an unreasonably long time on about 2GB of data -- like, 24 hours when other 
RDF M/R implementations take 9 minutes. The difference is big enough to 
probably be considered a defect. MAHOUT-1417 got that down to about 5 hours. I 
am trying to further improve it.

One key issue seems to be how splits are evaluated over numeric features. A
split is tried for every distinct numeric value of the feature in the whole
data set. Since these are floating point values, they could be (and in the
customer's case are) all distinct. 200K rows means 200K splits to evaluate
every time a node is built on the feature.

A better approach is to sample percentiles out of the feature and evaluate only
those as splits. Doing that really efficiently would require a lot of rewriting.
However, there are some modest changes possible which get some of the benefit,
and appear to make it run about 3x faster. That is, on a data set that
exhibits this problem, meaning one using numeric features which are generally
distinct, which is not exotic.

There are comparable but different problems with handling of categorical 
features, but that's for a different patch.

I have a patch, but it changes behavior to some extent since it is evaluating
only a sample of splits instead of every single possible one. In particular it
makes the output of OptIgSplit no longer match that of DefaultIgSplit. However,
I think the point is that "optimized" may mean making different choices of
split here, which could yield differing trees. So that test probably has to go.

(Along the way I found a number of micro-optimizations in this part of the code 
that added up to maybe a 3% speedup. And fixed an NPE too.)

I will propose a patch shortly with all of this for thoughts.
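
To make the percentile idea above concrete, here is a minimal Python sketch of choosing a bounded set of candidate thresholds for one numeric feature. It is not the proposed Mahout patch: the function name `candidate_splits` and the cap of 16 candidates are hypothetical, and the real code would still have to score each candidate by information gain.

```python
def candidate_splits(values, max_candidates=16):
    """Pick at most max_candidates split thresholds for one numeric feature.

    Thresholds are taken at evenly spaced ranks of the sorted values
    (approximate percentiles) rather than at every distinct value.
    """
    if max_candidates < 2:
        raise ValueError("max_candidates must be at least 2")
    distinct = sorted(set(values))
    if len(distinct) <= max_candidates:
        # Few distinct values: trying all of them is already cheap.
        return distinct
    ordered = sorted(values)
    last = len(ordered) - 1
    step = last / (max_candidates - 1)
    # A set removes repeats when several ranks land on the same value.
    picks = {ordered[min(last, round(i * step))] for i in range(max_candidates)}
    return sorted(picks)


# 200K all-distinct values, similar in shape to the case described above.
feature = [i * 0.001 for i in range(200_000)]
splits = candidate_splits(feature, max_candidates=16)
```

With this cap, a node built on the feature evaluates 16 thresholds instead of 200K, at the cost of possibly choosing a slightly different split than the exhaustive search would.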





Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-18 Thread peng
Thanks a lot Sebastian, Ted and Dmitriy. I'll try Giraph for a
performance benchmark.
You are right, power iteration is just the simplest form of Lanczos;
it shouldn't be in scope.



Build failed in Jenkins: mahout-nightly » Mahout Release Package #1501

2014-02-18 Thread Apache Jenkins Server
See https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-distribution/1501/

--
[INFO] 
[INFO] 
[INFO] Building Mahout Release Package 1.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ mahout-distribution ---
[INFO] Deleting https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-distribution/ws/target
[INFO] 
[INFO] --- build-helper-maven-plugin:1.8:remove-project-artifact (remove-old-mahout-artifacts) @ mahout-distribution ---
[INFO] /home/jenkins/jenkins-slave/maven-repositories/0/org/apache/mahout/mahout-distribution removed.
[INFO] 
[INFO] --- maven-assembly-plugin:2.4:single (bin-assembly) @ mahout-distribution ---
[INFO] Reading assembly descriptor: src/main/assembly/bin.xml


Build failed in Jenkins: mahout-nightly #1501

2014-02-18 Thread Apache Jenkins Server
See https://builds.apache.org/job/mahout-nightly/1501/

--
[...truncated 1978 lines...]
[INFO] Copying jackson-mapper-asl-1.9.12.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jackson-mapper-asl-1.9.12.jar
[INFO] Copying lucene-benchmark-4.6.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-benchmark-4.6.1.jar
[INFO] Copying junit-4.11.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/junit-4.11.jar
[INFO] Copying lucene-sandbox-4.6.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-sandbox-4.6.1.jar
[INFO] Copying slf4j-api-1.7.5.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/slf4j-api-1.7.5.jar
[INFO] Copying randomizedtesting-runner-2.0.15.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/randomizedtesting-runner-2.0.15.jar
[INFO] Copying jcl-over-slf4j-1.7.5.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jcl-over-slf4j-1.7.5.jar
[INFO] Copying mahout-math-1.0-SNAPSHOT-tests.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/mahout-math-1.0-SNAPSHOT-tests.jar
[INFO] Copying jaxb-api-2.2.2.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jaxb-api-2.2.2.jar
[INFO] Copying jettison-1.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jettison-1.1.jar
[INFO] Copying lucene-spatial-4.6.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-spatial-4.6.1.jar
[INFO] Copying commons-httpclient-3.0.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-httpclient-3.0.1.jar
[INFO] Copying commons-compress-1.4.1.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-compress-1.4.1.jar
[INFO] Copying hamcrest-core-1.3.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/hamcrest-core-1.3.jar
[INFO] Copying commons-lang-2.4.jar to https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-lang-2.4.jar
[INFO] Copying lucene-queries-4.6.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-queries-4.6.1.jar
[INFO] Copying commons-codec-1.4.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-codec-1.4.jar
[INFO] Copying solr-commons-csv-3.5.0.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/solr-commons-csv-3.5.0.jar
[INFO] Copying commons-collections-3.2.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-collections-3.2.1.jar
[INFO] Copying xercesImpl-2.9.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/xercesImpl-2.9.1.jar
[INFO] Copying mahout-core-1.0-SNAPSHOT.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/mahout-core-1.0-SNAPSHOT.jar
[INFO] Copying xstream-1.4.4.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/xstream-1.4.4.jar
[INFO] Copying mahout-integration-1.0-SNAPSHOT.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/mahout-integration-1.0-SNAPSHOT.jar
[INFO] Copying guava-16.0.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/guava-16.0.jar
[INFO] Copying slf4j-log4j12-1.7.5.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/slf4j-log4j12-1.7.5.jar
[INFO] Copying cglib-nodep-2.2.2.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/cglib-nodep-2.2.2.jar
[INFO] Copying commons-configuration-1.6.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-configuration-1.6.jar
[INFO] Copying commons-io-2.4.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-io-2.4.jar
[INFO] Copying jackson-xc-1.7.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jackson-xc-1.7.1.jar
[INFO] Copying jersey-json-1.8.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jersey-json-1.8.jar
[INFO] Copying commons-math-2.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-math-2.1.jar
[INFO] Copying jaxb-impl-2.2.3-1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jaxb-impl-2.2.3-1.jar
[INFO] Copying jersey-core-1.8.jar to 

[jira] [Commented] (MAHOUT-1419) Random decision forest is excessively slow on numeric features

2014-02-18 Thread Ted Dunning (JIRA)

[ https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905056#comment-13905056 ]

Ted Dunning commented on MAHOUT-1419:
-

With t-digest in the OnlineSummarizer now, it is quite possible to have very 
accurate and quick quantile estimates.  That should allow very quick picking of 
splits as well.

The basic idea would be to keep an OnlineSummarizer for each numerical 
variable.  When a split is needed, pick a random number in [0,1) and then the 
split point is that quantile.  If you want 10 random splits, do this 10 times.  
If you want structured points like every percentile from 20 to 80 %, that is 
just as simple.

It seems like this would be as simple as changing a loop from looping over 
distinct values to looping over quantile values.
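
A rough sketch of that loop change, under stated assumptions: a sorted copy of
the feature values stands in for the streaming summary (it is not t-digest and
not the OnlineSummarizer API), and the helper names quantile, random_splits
and percentile_splits are made up for illustration.

```python
import random


def quantile(sorted_values, q):
    """Return the value at quantile q in [0, 1], linearly interpolated."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def random_splits(sorted_values, k, rng):
    """Pick k candidate split points at uniformly random quantiles."""
    return [quantile(sorted_values, rng.random()) for _ in range(k)]


def percentile_splits(sorted_values, lo=20, hi=80):
    """Pick one candidate split per percentile from lo to hi inclusive."""
    return [quantile(sorted_values, p / 100.0) for p in range(lo, hi + 1)]


# Stand-in data: 200K floating point values, sorted once up front.
rng = random.Random(42)
values = sorted(rng.random() for _ in range(200_000))

ten_random = random_splits(values, 10, rng)   # 10 random split candidates
structured = percentile_splits(values)        # percentiles 20 through 80
```

Either list replaces the per-distinct-value candidates, so the number of
splits evaluated per node depends on the number of quantiles requested rather
than on the number of rows.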

 Random decision forest is excessively slow on numeric features
 --

 Key: MAHOUT-1419
 URL: https://issues.apache.org/jira/browse/MAHOUT-1419
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.7, 0.8, 0.9
Reporter: Sean Owen

 Follow-up to MAHOUT-1417. There's a customer running this and observing it 
 take an unreasonably long time on about 2GB of data -- like, 24 hours when 
 other RDF M/R implementations take 9 minutes. The difference is big enough to 
 probably be considered a defect. MAHOUT-1417 got that down to about 5 hours. 
 I am trying to further improve it.
 One key issue seems to be how splits are evaluated over numeric features. A 
 split is tried for every distinct numeric value of the feature in the whole 
 data set. Since these are floating point values, they could be (and in the 
 customer's case are) all distinct. 200K rows means 200K splits to evaluate 
 every time a node is built on the feature.
 A better approach is to sample percentiles out of the feature and evaluate 
 only those as splits. Really doing that efficiently would require a lot of 
 rewrite. However, there are some modest changes possible which get some of 
 the benefit, and appear to make it run about 3x faster. That is, on a data 
 set that exhibits this problem, meaning one using numeric features which 
 are generally distinct, which is not exotic.
 There are comparable but different problems with handling of categorical 
 features, but that's for a different patch.
 I have a patch, but it changes behavior to some extent since it is evaluating 
 only a sample of splits instead of every single possible one. In particular 
 it makes the output of OptIgSplit no longer match the DefaultIgSplit. 
 Although I think the point is that optimized may mean giving different 
 choices of split here, which could yield differing trees. So that test 
 probably has to go.
 (Along the way I found a number of micro-optimizations in this part of the 
 code that added up to maybe a 3% speedup. And fixed an NPE too.)
 I will propose a patch shortly with all of this for thoughts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)