date:20140217

[jira] [Commented] (MAHOUT-1417) Random decision forest implementation fails in Hadoop 2

2014-02-17 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903074#comment-13903074
]

Sean Owen commented on MAHOUT-1417:
---

Ah thanks for catching that test failure!

Random decision forest implementation fails in Hadoop 2
---

Key: MAHOUT-1417
URL: https://issues.apache.org/jira/browse/MAHOUT-1417
Project: Mahout
Issue Type: Bug
Components: Classification
Affects Versions: 0.7, 0.8, 0.9
Environment: CDH 4.5.0.1 + Mahout 0.7+patches
Reporter: Sean Owen
Labels: classifier, random-decision-forests, rdf
Fix For: 1.0

Attachments: MAHOUT-1417.patch

Original Estimate: 24h
Remaining Estimate: 24h

We've observed two errors in the RDF implementation, one of which stops it
from working on Hadoop 2 (at least I think it is Hadoop 2 only), and one of
which just makes the workload quite imbalanced.

A key piece of logic in PartialBuilder.java queries mapred.map.tasks to know
the total number of mappers. However, this has never been guaranteed to be set
to the number of mappers; it is how a caller sets a default number of
mappers, which may be overridden by Hadoop, and which defaults to 1.

I suspect that this may have actually been set, in some or all cases, to the
number of mappers in Hadoop 1, but I am not sure. Certainly, sometimes it
will happen to be set to a value that equals the number of mappers used.

But when it doesn't, it causes the distribution of trees to mappers to be
quite wrong. For example, with 20 trees and 8 mappers in one example, I find
that mapred.map.tasks=1. Logging messages indicate that mapper 0 handles all
trees (0-19), mapper 1 handles non-existent 20-39, etc.

The result is that most mappers do nothing and one does everything. This
results in empty part-m-x files, and that in turn fails the job. (This
part I also suspect is new, or situation-specific, behavior in Hadoop 2. In
any event, this code should never have idle mappers, and fixing that avoids
whatever is going on there.)

There's a second, less serious issue in how trees are assigned to mappers.
When the number of trees is not a multiple of the number of mappers, the
remainder is assigned entirely to mapper 0. So with 20 trees and 8 mappers,
all mappers build 2 trees, but mapper 0 builds 6. This is unnecessarily
imbalanced.

Patch coming once I can verify the fix, but the current proposal is to:
- Compute the number of maps ahead of time using TextInputFormat and set
mapred.map.tasks
- Fix the method that computes trees per mapper to spread as evenly as
possible (i.e. all mappers build either N or N+1 trees)
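
The arithmetic in the description can be sketched in a few lines of Python. This is only an illustration reconstructed from the text above, not the code in PartialBuilder.java or the attached patch; all function names here are hypothetical.

```python
def trees_remainder_to_first(num_trees, num_maps, partition):
    """Imbalanced scheme from the report: the whole remainder goes to mapper 0."""
    per_map = num_trees // num_maps
    if partition == 0:
        return per_map + num_trees % num_maps
    return per_map


def trees_balanced(num_trees, num_maps, partition):
    """Proposed scheme: every mapper builds either N or N+1 trees."""
    base, extra = divmod(num_trees, num_maps)
    return base + (1 if partition < extra else 0)


def first_tree_id_balanced(num_trees, num_maps, partition):
    """ID of the first tree built by this mapper under the balanced scheme."""
    base, extra = divmod(num_trees, num_maps)
    # Mappers before this one built `base` trees each, plus one extra
    # for each of the first `extra` mappers.
    return partition * base + min(partition, extra)


def tree_range_assuming(num_trees, assumed_maps, partition):
    """Inclusive tree ID range a mapper claims if it believes there are
    `assumed_maps` mappers (e.g. because mapred.map.tasks was left at 1)."""
    per_map = num_trees // assumed_maps
    first = partition * per_map
    return first, first + per_map - 1


if __name__ == "__main__":
    print([trees_remainder_to_first(20, 8, p) for p in range(8)])
    print([trees_balanced(20, 8, p) for p in range(8)])
    print(tree_range_assuming(20, 1, 0), tree_range_assuming(20, 1, 1))
```

With 20 trees and 8 mappers, the first scheme gives 6 trees to mapper 0 and 2 to each of the others, while the balanced scheme gives 3 trees to four mappers and 2 to the remaining four. With the mapper count wrongly assumed to be 1, mapper 0 claims trees 0-19 and mapper 1 claims the non-existent 20-39, matching the log messages described.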

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Re: Mahout 0.9 Release Notes - First Draft

2014-02-17 Thread Ted Dunning

On Tue, Feb 11, 2014 at 10:06 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Switched LDA implementation from using Dirichlet to Collapsed
Variational Bayes (CVB)


This line should read:

Switched LDA implementation from using Gibbs sampling to Collapsed
Variational Bayes (CVB)


Otherwise, it looks pretty good.

Re: Mahout 0.9 Release Notes - First Draft

2014-02-17 Thread Ellen Friedman

Hi Suneel,

Thanks for the notes. I'm inquiring about the status of the notes and the
update to the website to announce 0.9. Ted has reviewed the release notes:
were you waiting for additional input, or are they ready to go on the
website? Are you the one who updates the site?

I've been asked to write a short blog on the release but wanted to wait
until the site is updated.

Thanks much
Ellen

On Tue, Feb 11, 2014 at 10:06 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

Here's a draft of the Release Notes for Mahout 0.9. Please review it.

The Apache Mahout PMC is pleased to announce the release of Mahout 0.9.
Mahout's goal is to build scalable machine learning libraries focused
primarily in the areas of collaborative filtering (recommenders),
clustering and classification (known collectively as the 3Cs), as well
as the necessary infrastructure to support those implementations including,
but not limited to, math packages for statistics, linear algebra and others
as well as Java primitive collections, local and distributed vector and
matrix classes and a variety of integrative code to work with popular
packages like Apache Hadoop, Apache Lucene, Apache HBase, Apache
Cassandra and much more. The 0.9 release is mainly a clean-up release in
preparation for an upcoming 1.0 release targeted for the first half of 2014,
but there are a few significant new features, which are highlighted below.

To get started with Apache Mahout 0.9, download the release artifacts and
signatures at http://www.apache.org/dyn/closer.cgi/mahout or visit the
central Maven repository.

As with any release, we wish to thank all of the users and contributors
to Mahout. Please see the CHANGELOG [1] and JIRA Release Notes [2] for
individual credits, as there are too many to list here.

GETTING STARTED

In the release package, the examples directory contains several working
examples of the core functionality available in Mahout. These can be run
via scripts in the examples/bin directory and will prompt you for more
information to help you try things out.
Most examples do not need a Hadoop cluster in order to run.

RELEASE HIGHLIGHTS

The highlights of the Apache Mahout 0.9 release include, but are not
limited to the list below. For further information, see the included
CHANGELOG[1] file.

- MAHOUT-1297: Scala DSL Bindings for Mahout Math Linear Algebra.
See
http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html
- MAHOUT-1288: Recommenders as a Search. See
https://github.com/pferrel/solr-recommender
- MAHOUT-1364: Upgrade Mahout to Lucene 4.6.1
- MAHOUT-1361: Online Algorithm for computing accurate Quantiles using
1-dimensional Clustering
See
https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf
for the details.
- MAHOUT-1265: MultiLayer Perceptron (MLP) classifier
This is an early implementation of MLP to solicit user feedback; it still
needs to be integrated into Mahout's processing pipeline to work with
Mahout's vectors.

- Removed deprecated algorithms, as they have been either replaced by
better performing algorithms or lacked user support and maintenance.

- The usual bug fixes. See [2] for more information on the 0.9 release.

A total of 113 separate JIRA issues were addressed in this release.

The following algorithms that were marked deprecated in 0.8 have been
removed in 0.9:

- From Clustering:
Switched LDA implementation from using Dirichlet to Collapsed
Variational Bayes (CVB)

Meanshift

MinHash - removed due to poor performance, lack of support and lack of
usage

- From Classification (both are sequential implementations)

Winnow - lack of actual usage and support

Perceptron - lack of actual usage and support

- Collaborative Filtering
SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone
and org.apache.mahout.cf.taste.impl.recommender.slopeone
Distributed pseudo recommender in
org.apache.mahout.cf.taste.hadoop.pseudo
TreeClusteringRecommender in
org.apache.mahout.cf.taste.impl.recommender

- Mahout Math
Hadoop entropy stuff in org.apache.mahout.math.stats.entropy

CONTRIBUTING

Mahout is always looking for contributions focused on the 3Cs. If you are
interested in contributing, please see our contribution page
http://mahout.apache.org/developers/how-to-contribute.html or contact us
via email at dev@mahout.apache.org.

As the project moves towards a 1.0 release, the community will be focused
on key algorithms that are proven to scale in production and have seen
wide-spread adoption.

[1]
http://svn.apache.org/viewvc/mahout/trunk/CHANGELOG?view=markup&pathrev=1563661
[2]
https://issues.apache.org/jira/browse/MAHOUT-1411?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%20%220.9%22

On Monday, December 23, 2013 7:41 PM, Dmitriy Lyubimov dlie...@gmail.com
wrote:

On Sun, Dec 22, 2013 at

Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-17 Thread peng

If SSVD is not designed for this kind of eigenvector problem, then I would
vote for retaining the Lanczos algorithm.
However, I would like to see the opposite case: I have tested both
algorithms on the symmetric case, and SSVD is much faster and more accurate
than its competitor.


Yours Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:

In PageRank I'm afraid I have no other option than eigenvector
\lambda, but not singular vector u & v :) The PageRank in Mahout was
removed along with the other graph-based algorithms.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

SSVD is very probably better than Lanczos for any large decomposition.
That said, it does SVD, not eigen decomposition, which means that the
question of symmetrical matrices or positive definiteness doesn't much
matter.

Do you really need eigen-decomposition?



On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:


Just asking for possible replacement of our Lanczos-based PageRank
implementation. - Peng

Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-17 Thread Ted Dunning

For the symmetric case, SVD is eigen decomposition.
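
Spelled out as a short sketch, assuming a real symmetric matrix A (the symbols Q, \Lambda and S are notation introduced here for illustration, not taken from the thread):

```latex
\begin{align*}
A &= Q \Lambda Q^{\top}, \qquad Q^{\top} Q = I, \qquad
     \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n) \\
A &= U \Sigma V^{\top} \quad \text{with} \quad
     U = Q, \quad
     \Sigma = \operatorname{diag}(|\lambda_1|, \dots, |\lambda_n|), \quad
     V = Q S, \\
S &= \operatorname{diag}(s_1, \dots, s_n), \qquad
     s_i = \begin{cases} 1 & \lambda_i \ge 0 \\ -1 & \lambda_i < 0 \end{cases} \\
U \Sigma V^{\top} &= Q \, \Sigma S \, Q^{\top} = Q \Lambda Q^{\top} = A
\end{align*}
```

So, up to the ordering of the singular values, the singular values are the absolute values of the eigenvalues and the singular vectors are the eigenvectors up to sign. When A is also positive semidefinite, S = I and the two decompositions coincide. No such correspondence holds in general for a non-symmetric matrix.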





Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-17 Thread peng


Agreed, and this is the case where the Lanczos algorithm is obsolete.
My point is: if SSVD is unable to find the eigenvector of an asymmetric
matrix (this is a common formulation of PageRank, and some random
walks, and many other things), then we still have to rely on a
large-scale Lanczos algorithm.
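
To make the asymmetric eigenvector formulation concrete, here is a minimal sketch of plain power iteration for PageRank on a tiny directed graph. It is neither Lanczos nor SSVD, and nothing in it comes from Mahout; the 3-node link structure, the damping factor of 0.85, and the function name are assumptions chosen for illustration.

```python
def pagerank_power_iteration(links, damping=0.85, tol=1e-12, max_iter=1000):
    """Return the PageRank vector of a small directed graph.

    links[j] lists the nodes that node j links to; every node is assumed
    to have at least one outgoing link. The implied transition matrix is
    not symmetric in general.
    """
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Teleportation term, shared equally by all nodes.
        new_rank = [(1.0 - damping) / n] * n
        # Each node passes its damped rank evenly to the nodes it links to.
        for j, outgoing in enumerate(links):
            share = damping * rank[j] / len(outgoing)
            for i in outgoing:
                new_rank[i] += share
        change = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if change < tol:
            break
    return rank


if __name__ == "__main__":
    # Hypothetical graph: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
    print(pagerank_power_iteration([[1], [2], [0, 1]]))
```

The result is the dominant eigenvector (eigenvalue 1) of the damped transition matrix, normalized to sum to 1. An SVD of the same matrix would not return this vector in general, which is the distinction being drawn here.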



Build failed in Jenkins: mahout-nightly » Mahout Release Package #1500

2014-02-17 Thread Apache Jenkins Server

See https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-distribution/1500/

--
[INFO] 
[INFO] 
[INFO] Building Mahout Release Package 1.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ mahout-distribution ---
[INFO] Deleting https://builds.apache.org/job/mahout-nightly/org.apache.mahout$mahout-distribution/ws/target
[INFO] 
[INFO] --- build-helper-maven-plugin:1.8:remove-project-artifact (remove-old-mahout-artifacts) @ mahout-distribution ---
[INFO] /home/jenkins/jenkins-slave/maven-repositories/0/org/apache/mahout/mahout-distribution removed.
[INFO] 
[INFO] --- maven-assembly-plugin:2.4:single (bin-assembly) @ mahout-distribution ---
[INFO] Reading assembly descriptor: src/main/assembly/bin.xml

Build failed in Jenkins: mahout-nightly #1500

2014-02-17 Thread Apache Jenkins Server

See https://builds.apache.org/job/mahout-nightly/1500/

--
[...truncated 1977 lines...]
[INFO] Copying jackson-mapper-asl-1.9.12.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jackson-mapper-asl-1.9.12.jar
[INFO] Copying lucene-benchmark-4.6.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-benchmark-4.6.1.jar
[INFO] Copying junit-4.11.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/junit-4.11.jar
[INFO] Copying lucene-sandbox-4.6.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-sandbox-4.6.1.jar
[INFO] Copying slf4j-api-1.7.5.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/slf4j-api-1.7.5.jar
[INFO] Copying randomizedtesting-runner-2.0.15.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/randomizedtesting-runner-2.0.15.jar
[INFO] Copying jcl-over-slf4j-1.7.5.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jcl-over-slf4j-1.7.5.jar
[INFO] Copying mahout-math-1.0-SNAPSHOT-tests.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/mahout-math-1.0-SNAPSHOT-tests.jar
[INFO] Copying jaxb-api-2.2.2.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jaxb-api-2.2.2.jar
[INFO] Copying jettison-1.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jettison-1.1.jar
[INFO] Copying lucene-spatial-4.6.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-spatial-4.6.1.jar
[INFO] Copying commons-httpclient-3.0.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-httpclient-3.0.1.jar
[INFO] Copying commons-compress-1.4.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-compress-1.4.1.jar
[INFO] Copying hamcrest-core-1.3.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/hamcrest-core-1.3.jar
[INFO] Copying commons-lang-2.4.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-lang-2.4.jar
[INFO] Copying lucene-queries-4.6.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/lucene-queries-4.6.1.jar
[INFO] Copying commons-codec-1.4.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-codec-1.4.jar
[INFO] Copying solr-commons-csv-3.5.0.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/solr-commons-csv-3.5.0.jar
[INFO] Copying commons-collections-3.2.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-collections-3.2.1.jar
[INFO] Copying xercesImpl-2.9.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/xercesImpl-2.9.1.jar
[INFO] Copying mahout-core-1.0-SNAPSHOT.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/mahout-core-1.0-SNAPSHOT.jar
[INFO] Copying xstream-1.4.4.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/xstream-1.4.4.jar
[INFO] Copying mahout-integration-1.0-SNAPSHOT.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/mahout-integration-1.0-SNAPSHOT.jar
[INFO] Copying guava-16.0.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/guava-16.0.jar
[INFO] Copying slf4j-log4j12-1.7.5.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/slf4j-log4j12-1.7.5.jar
[INFO] Copying cglib-nodep-2.2.2.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/cglib-nodep-2.2.2.jar
[INFO] Copying commons-configuration-1.6.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-configuration-1.6.jar
[INFO] Copying commons-io-2.4.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-io-2.4.jar
[INFO] Copying jackson-xc-1.7.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jackson-xc-1.7.1.jar
[INFO] Copying jersey-json-1.8.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jersey-json-1.8.jar
[INFO] Copying commons-math-2.1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/commons-math-2.1.jar
[INFO] Copying jaxb-impl-2.2.3-1.jar to 
https://builds.apache.org/job/mahout-nightly/ws/trunk/examples/target/dependency/jaxb-impl-2.2.3-1.jar
[INFO] Copying jersey-core-1.8.jar to

Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-17 Thread Dmitriy Lyubimov

I bet PageRank in MLlib in Spark finds the stationary distribution much faster.
