Re: Tackling the "legacy dilemma"

2014-04-13 Thread Sebastian Schelter

On 04/14/2014 08:00 AM, Dmitriy Lyubimov wrote:

Unfortunately, not all things map gracefully into algebra. But hopefully
some of it still can be.


Yes, that's why I was asking Andy if there are enough constructs. If 
not, we might have to add more.




I am even a little bit worried that we may develop almost too much ML (is
there such a thing?) before we have a chance to crystallize data frames
and perhaps dictionary discussions. These are more tools to keep abstracted.


I think it's a very good thing to have early ML implementations on the 
DSL, because it allows us to validate whether we are on the right path. 
We should start by providing the things that are most popular in Mahout, 
like the item-based recommender from MAHOUT-1464. Having a few 
implementations on the DSL also helps with designing new abstractions, 
because for every proposed feature we can look at the existing code and 
see how helpful the new feature would be.
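
To make that concrete: the algebraic core of an item-based recommender is essentially a cooccurrence computation, which the DSL can express directly as distributed matrix algebra. The following is a hypothetical sketch only (not code from MAHOUT-1464); the DrmLike type, the .t transpose and the %*% product follow the spark bindings documentation, and the exact package paths are assumptions.

    // Hypothetical sketch: the cooccurrence step of an item-based recommender
    // on the Scala DSL. Package paths and signatures are assumptions taken
    // from the spark bindings docs, not from MAHOUT-1464.
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // drmA is the distributed user x item interaction matrix
    def itemCooccurrences(drmA: DrmLike[Int]): DrmLike[Int] =
      drmA.t %*% drmA // A'A, the item-item cooccurrence matrix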




I just don't want Mahout to be yet another MLlib. I shudder every time
somebody says "we want to create a Spark version of (an|the) algorithm". I
know it will create the wrong talking points for somebody anxious to draw
parallels.


Totally agree here. Looks like history repeats itself, from "I want to create 
a Hadoop implementation" to "I want to create a Spark implementation" :)





On Sun, Apr 13, 2014 at 10:51 PM, Sebastian Schelter  wrote:


Andy, that would be awesome. Have you had a look at our new scala DSL [1]?
Does it offer enough constructs for you to rewrite your implementation with
it?

--sebastian


[1] https://mahout.apache.org/users/sparkbindings/home.html


On 04/14/2014 07:47 AM, Andy Twigg wrote:


  +1 to removing present Random Forests. Andy Twigg had provided a Spark-based
Streaming Random Forests impl sometime last year. It's time to restart
that conversation and integrate it into the codebase if the contributor
is still willing.



I'm happy to contribute this, but as it stands it's written against
spark, even forgetting the 'streaming' aspect. Do you have any advice
on how to proceed?
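
One possible path (a sketch, not a recommendation from the thread): keep the Spark-specific plumbing as ordinary RDD code, and lift the data into a distributed row matrix (DRM) wherever a step is pure linear algebra, so that part becomes engine-agnostic. The drmWrap helper and collect call below follow the spark bindings documentation; treat the exact signatures and package paths as assumptions.

    // Hypothetical sketch: wrapping an existing Spark RDD of Mahout vectors
    // into a DRM so the algebraic part of an algorithm runs on the DSL.
    import org.apache.mahout.math.{Matrix, Vector}
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._
    import org.apache.spark.rdd.RDD

    def gramMatrix(rows: RDD[(Int, Vector)]): Matrix = {
      val drmA = drmWrap(rows)    // lift the RDD into the DSL world
      (drmA.t %*% drmA).collect   // A'A computed distributed, collected in-core
    }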










Re: Tackling the "legacy dilemma"

2014-04-13 Thread Dmitriy Lyubimov
Unfortunately, not all things map gracefully into algebra. But hopefully
some of it still can be.

I am even a little bit worried that we may develop almost too much ML (is
there such a thing?) before we have a chance to crystallize data frames
and perhaps dictionary discussions. These are more tools to keep abstracted.

I just don't want Mahout to be yet another MLlib. I shudder every time
somebody says "we want to create a Spark version of (an|the) algorithm". I
know it will create the wrong talking points for somebody anxious to draw
parallels.


On Sun, Apr 13, 2014 at 10:51 PM, Sebastian Schelter  wrote:

> Andy, that would be awesome. Have you had a look at our new scala DSL [1]?
> Does it offer enough constructs for you to rewrite your implementation with
> it?
>
> --sebastian
>
>
> [1] https://mahout.apache.org/users/sparkbindings/home.html
>
>
> On 04/14/2014 07:47 AM, Andy Twigg wrote:
>
>>>  +1 to removing present Random Forests. Andy Twigg had provided a Spark-based
>>> Streaming Random Forests impl sometime last year. It's time to restart
>>> that conversation and integrate it into the codebase if the contributor
>>> is still willing.
>>>
>>
>> I'm happy to contribute this, but as it stands it's written against
>> spark, even forgetting the 'streaming' aspect. Do you have any advice
>> on how to proceed?
>>
>>
>


Re: Tackling the "legacy dilemma"

2014-04-13 Thread Sebastian Schelter
Andy, that would be awesome. Have you had a look at our new scala DSL 
[1]? Does it offer enough constructs for you to rewrite your 
implementation with it?


--sebastian


[1] https://mahout.apache.org/users/sparkbindings/home.html

On 04/14/2014 07:47 AM, Andy Twigg wrote:

  +1 to removing present Random Forests. Andy Twigg had provided a Spark-based
Streaming Random Forests impl sometime last year. It's time to restart
that conversation and integrate it into the codebase if the contributor
is still willing.


I'm happy to contribute this, but as it stands it's written against
spark, even forgetting the 'streaming' aspect. Do you have any advice
on how to proceed?





Re: Tackling the "legacy dilemma"

2014-04-13 Thread Andy Twigg
>  +1 to removing present Random Forests. Andy Twigg had provided a Spark-based
> Streaming Random Forests impl sometime last year. It's time to restart
> that conversation and integrate it into the codebase if the contributor
> is still willing.

I'm happy to contribute this, but as it stands it's written against
spark, even forgetting the 'streaming' aspect. Do you have any advice
on how to proceed?


[jira] [Commented] (MAHOUT-1502) Update Naive Bayes Webpage to Current Implementation

2014-04-13 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968009#comment-13968009
 ] 

Andrew Palumbo commented on MAHOUT-1502:


I'm going to submit a patch for MAHOUT-1504 tomorrow afternoon and will then 
begin working on the documentation.  Should be able to take care of this pretty 
quickly.



> Update Naive Bayes Webpage to Current Implementation 
> -
>
> Key: MAHOUT-1502
> URL: https://issues.apache.org/jira/browse/MAHOUT-1502
> Project: Mahout
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
>Priority: Minor
> Fix For: 1.0
>
>
> The current Naive Bayes page is for the pre-0.7 NB implementation:
> https://mahout.apache.org/users/classification/bayesian.html
> Post-0.7, TF-IDF calculations are performed outside of NB.
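
(For readers of that page: the weighting that now happens in the separate vectorization step is essentially standard TF-IDF. A purely illustrative sketch of the textbook formula follows; the exact weighting Mahout applies may differ slightly.)

    // Illustrative only: textbook TF-IDF weight applied during vectorization,
    // before Naive Bayes training sees the data. Not the exact Mahout formula.
    // numDocs = total documents, docFreq = documents containing the term.
    def tfidf(termFreq: Double, docFreq: Long, numDocs: Long): Double =
      termFreq * math.log(numDocs.toDouble / docFreq.toDouble)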



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1483) Organize links in web site navigation bar

2014-04-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967966#comment-13967966
 ] 

Hudson commented on MAHOUT-1483:


SUCCESS: Integrated in Mahout-Quality #2567 (See 
[https://builds.apache.org/job/Mahout-Quality/2567/])
MAHOUT-1483: Organize links in web site navigation bar (akm: rev 1587081)
* /mahout/trunk/CHANGELOG


> Organize links in web site navigation bar
> -
>
> Key: MAHOUT-1483
> URL: https://issues.apache.org/jira/browse/MAHOUT-1483
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Some links in the drop-down menus in the navigation bar are inconsistent.
> Under "Basics", there are some links whose path starts with '/users/basics', 
> one that starts with '/users/dim-reduction', one that starts with 
> '/users/clustering', and one that starts with '/users/sparkbindings'.  These 
> inconsistencies aren't terrible but are a little confusing for maintaining 
> the site.
> Under "Classification", there are some that start with 
> '/users/classification', some that start with '/users/stuff', and there's a 
> clustering example (20 newsgroups example).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1483) Organize links in web site navigation bar

2014-04-13 Thread Andrew Musselman (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Musselman resolved MAHOUT-1483.
--

Resolution: Fixed

Committed and published.

> Organize links in web site navigation bar
> -
>
> Key: MAHOUT-1483
> URL: https://issues.apache.org/jira/browse/MAHOUT-1483
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Some links in the drop-down menus in the navigation bar are inconsistent.
> Under "Basics", there are some links whose path starts with '/users/basics', 
> one that starts with '/users/dim-reduction', one that starts with 
> '/users/clustering', and one that starts with '/users/sparkbindings'.  These 
> inconsistencies aren't terrible but are a little confusing for maintaining 
> the site.
> Under "Classification", there are some that start with 
> '/users/classification', some that start with '/users/stuff', and there's a 
> clustering example (20 newsgroups example).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Work started] (MAHOUT-1483) Organize links in web site navigation bar

2014-04-13 Thread Andrew Musselman (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-1483 started by Andrew Musselman.

> Organize links in web site navigation bar
> -
>
> Key: MAHOUT-1483
> URL: https://issues.apache.org/jira/browse/MAHOUT-1483
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Some links in the drop-down menus in the navigation bar are inconsistent.
> Under "Basics", there are some links whose path starts with '/users/basics', 
> one that starts with '/users/dim-reduction', one that starts with 
> '/users/clustering', and one that starts with '/users/sparkbindings'.  These 
> inconsistencies aren't terrible but are a little confusing for maintaining 
> the site.
> Under "Classification", there are some that start with 
> '/users/classification', some that start with '/users/stuff', and there's a 
> clustering example (20 newsgroups example).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1483) Organize links in web site navigation bar

2014-04-13 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967918#comment-13967918
 ] 

Andrew Musselman commented on MAHOUT-1483:
--

Oh I get it, thanks [~smarthi]

> Organize links in web site navigation bar
> -
>
> Key: MAHOUT-1483
> URL: https://issues.apache.org/jira/browse/MAHOUT-1483
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Some links in the drop-down menus in the navigation bar are inconsistent.
> Under "Basics", there are some links whose path starts with '/users/basics', 
> one that starts with '/users/dim-reduction', one that starts with 
> '/users/clustering', and one that starts with '/users/sparkbindings'.  These 
> inconsistencies aren't terrible but are a little confusing for maintaining 
> the site.
> Under "Classification", there are some that start with 
> '/users/classification', some that start with '/users/stuff', and there's a 
> clustering example (20 newsgroups example).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Build failed in Jenkins: Mahout-Examples-Cluster-Reuters-II #813

2014-04-13 Thread Suneel Marthi
Nah, it happens most of the time and succeeds the next time when there are fewer 
jobs running.

Sent from my iPhone

> On Apr 13, 2014, at 2:58 PM, Andrew Musselman  
> wrote:
> 
> Looks like a disk space issue; shall we raise an INFRA ticket?
> 
> 
> On Sun, Apr 13, 2014 at 11:17 AM, Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
> 
>> See <
>> https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters-II/813/changes
>> 
>> Changes:
>> 
>> [ssc] MAHOUT-1450 Cleaning up clustering documentation
>> 
>> [akm] MAHOUT-1420: Add solr-recommender to examples; typo fix
>> 
>> [akm] MAHOUT-1420: Add solr-recommender to examples
>> 
>> --
>> [...truncated 1725 lines...]
>> A math/src/main/java/org/apache/mahout/math/decomposer/hebbian
>> AU
>> math/src/main/java/org/apache/mahout/math/decomposer/hebbian/EigenUpdater.java
>> AU
>> math/src/main/java/org/apache/mahout/math/decomposer/hebbian/TrainingState.java
>> AU
>> math/src/main/java/org/apache/mahout/math/decomposer/hebbian/HebbianUpdater.java
>> AU
>> math/src/main/java/org/apache/mahout/math/decomposer/hebbian/HebbianSolver.java
>> A math/src/main/java/org/apache/mahout/math/decomposer/lanczos
>> AU
>> math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>> A
>> math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosState.java
>> AU
>> math/src/main/java/org/apache/mahout/math/decomposer/EigenStatus.java
>> AU
>> math/src/main/java/org/apache/mahout/math/decomposer/SimpleEigenVerifier.java
>> AUmath/src/main/java/org/apache/mahout/math/SparseMatrix.java
>> A math/src/main/java/org/apache/mahout/math/QR.java
>> AU
>> math/src/main/java/org/apache/mahout/math/CardinalityException.java
>> A
>> math/src/main/java/org/apache/mahout/math/FunctionalMatrixView.java
>> AUmath/src/main/java/org/apache/mahout/math/DenseMatrix.java
>> A
>> math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java
>> A math/src/main/java/org/apache/mahout/math/DelegatingVector.java
>> A
>> math/src/main/java/org/apache/mahout/math/OrthonormalityVerifier.java
>> A math/src/main/java/org/apache/mahout/math/PersistentObject.java
>> AUmath/src/main/java/org/apache/mahout/math/AbstractMatrix.java
>> AUmath/src/main/java/org/apache/mahout/math/VectorView.java
>> A math/src/main/java/org/apache/mahout/math/PivotedMatrix.java
>> A math/src/main/java/org/apache/mahout/math/DiagonalMatrix.java
>> A math/src/main/java/org/apache/mahout/math/PermutedVectorView.java
>> A math/src/main/java/org/apache/mahout/math/function
>> AU
>> math/src/main/java/org/apache/mahout/math/function/IntIntDoubleFunction.java
>> A
>> math/src/main/java/org/apache/mahout/math/function/ObjectObjectProcedure.java
>> AU
>> math/src/main/java/org/apache/mahout/math/function/TimesFunction.java
>> AUmath/src/main/java/org/apache/mahout/math/function/Functions.java
>> A
>> math/src/main/java/org/apache/mahout/math/function/VectorFunction.java
>> AU
>> math/src/main/java/org/apache/mahout/math/function/SquareRootFunction.java
>> AU
>> math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java
>> A
>> math/src/main/java/org/apache/mahout/math/function/FloatFunction.java
>> A
>> math/src/main/java/org/apache/mahout/math/function/IntIntFunction.java
>> A
>> math/src/main/java/org/apache/mahout/math/function/ObjectProcedure.java
>> A
>> math/src/main/java/org/apache/mahout/math/function/DoubleFunction.java
>> AU
>> math/src/main/java/org/apache/mahout/math/function/IntFunction.java
>> AUmath/src/main/java/org/apache/mahout/math/function/Mult.java
>> A
>> math/src/main/java/org/apache/mahout/math/function/package-info.java
>> AUmath/src/main/java/org/apache/mahout/math/function/PlusMult.java
>> AU
>> math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java
>> A math/src/main/java/org/apache/mahout/math/ConstantVector.java
>> A math/src/main/java/org/apache/mahout/math/VectorBinaryAssign.java
>> A
>> math/src/main/java/org/apache/mahout/math/SingularValueDecomposition.java
>> A
>> math/src/main/java/org/apache/mahout/math/VectorBinaryAggregate.java
>> A math/src/main/java/org/apache/mahout/math/QRDecomposition.java
>> A math/src/main/java/org/apache/mahout/math/MatrixVectorView.java
>> A math/src/main/java/org/apache/mahout/math/WeightedVector.java
>> A math/src/main/java/org/apache/mahout/math/UpperTriangular.java
>> A math/src/main/java/org/apache/mahout/math/matrix
>> A math/src/main/java/org/apache/mahout/math/matrix/impl
>> A math/src/main/java/org/apache/mahout/math/matrix/linalg
>> A math/src/main/java/org/apache/mahout/math/package-info.java
>> A math/src/main/java/org/apache/mahout/math/Sorting.java
>> A math/src/main/java/org/apache/mahout/math/MatrixTimesOps.java
>> 

Re: Tackling the "legacy dilemma"

2014-04-13 Thread Andrew Musselman
I am okay with that, just suggesting a method for the future.


On Sun, Apr 13, 2014 at 10:40 AM, Sebastian Schelter <
ssc.o...@googlemail.com> wrote:

> I'd vote against a contrib area at the moment, because it would stand in
> the way of unifying, shrinking and stabilizing the codebase.
>
> --sebastian
> On 13.04.2014 19:36, "Andrew Musselman"  wrote:
>
> >
> > > On Apr 13, 2014, at 10:30 AM, Dmitriy Lyubimov 
> > wrote:
> > >
> > >> On Apr 13, 2014 10:22 AM, "Ted Dunning" 
> wrote:
> > >>
> > >> On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov  > >> wrote:
> > >>
> > >>> +1, but more importantly, reject any new author who doesn't agree to
> > >>> explicitly pledge multi-year support.
> > >>
> > >> I am a little bit negative about this requirement.  My feeling is that
> > it
> > >> will wind up with accepting naive optimists (the ones we don't want)
> and
> > >> rejecting realists because they know that a true multi-year commitment
> > is
> > >> subject to buffeting by real-life.
> > > It's true. I guess I mean more along the criteria lines, not about how we
> > make
> > > the inference. I meant if we really had a way to make reliable
> inference
> > > here. It may well be the case there's no such way. Usually the first
> good
> > > sign is that contributors are sticking to their issue in the first
> place
> > > for some time.
> >
> > This is where a contrib or piggybank-style sandbox could help, so people
> > could submit things "in probation" until they're proven out.
>


Re: Build failed in Jenkins: Mahout-Examples-Cluster-Reuters-II #813

2014-04-13 Thread Andrew Musselman
Looks like a disk space issue; shall we raise an INFRA ticket?


On Sun, Apr 13, 2014 at 11:17 AM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See <
> https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters-II/813/changes
> >
>
> Changes:
>
> [ssc] MAHOUT-1450 Cleaning up clustering documentation
>
> [akm] MAHOUT-1420: Add solr-recommender to examples; typo fix
>
> [akm] MAHOUT-1420: Add solr-recommender to examples
>
> --
> [...truncated 1725 lines...]
> A math/src/main/java/org/apache/mahout/math/decomposer/hebbian
> AU
>  
> math/src/main/java/org/apache/mahout/math/decomposer/hebbian/EigenUpdater.java
> AU
>  
> math/src/main/java/org/apache/mahout/math/decomposer/hebbian/TrainingState.java
> AU
>  
> math/src/main/java/org/apache/mahout/math/decomposer/hebbian/HebbianUpdater.java
> AU
>  
> math/src/main/java/org/apache/mahout/math/decomposer/hebbian/HebbianSolver.java
> A math/src/main/java/org/apache/mahout/math/decomposer/lanczos
> AU
>  
> math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
> A
> math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosState.java
> AU
>  math/src/main/java/org/apache/mahout/math/decomposer/EigenStatus.java
> AU
>  math/src/main/java/org/apache/mahout/math/decomposer/SimpleEigenVerifier.java
> AUmath/src/main/java/org/apache/mahout/math/SparseMatrix.java
> A math/src/main/java/org/apache/mahout/math/QR.java
> AU
>  math/src/main/java/org/apache/mahout/math/CardinalityException.java
> A
> math/src/main/java/org/apache/mahout/math/FunctionalMatrixView.java
> AUmath/src/main/java/org/apache/mahout/math/DenseMatrix.java
> A
> math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java
> A math/src/main/java/org/apache/mahout/math/DelegatingVector.java
> A
> math/src/main/java/org/apache/mahout/math/OrthonormalityVerifier.java
> A math/src/main/java/org/apache/mahout/math/PersistentObject.java
> AUmath/src/main/java/org/apache/mahout/math/AbstractMatrix.java
> AUmath/src/main/java/org/apache/mahout/math/VectorView.java
> A math/src/main/java/org/apache/mahout/math/PivotedMatrix.java
> A math/src/main/java/org/apache/mahout/math/DiagonalMatrix.java
> A math/src/main/java/org/apache/mahout/math/PermutedVectorView.java
> A math/src/main/java/org/apache/mahout/math/function
> AU
>  math/src/main/java/org/apache/mahout/math/function/IntIntDoubleFunction.java
> A
> math/src/main/java/org/apache/mahout/math/function/ObjectObjectProcedure.java
> AU
>  math/src/main/java/org/apache/mahout/math/function/TimesFunction.java
> AUmath/src/main/java/org/apache/mahout/math/function/Functions.java
> A
> math/src/main/java/org/apache/mahout/math/function/VectorFunction.java
> AU
>  math/src/main/java/org/apache/mahout/math/function/SquareRootFunction.java
> AU
>  math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java
> A
> math/src/main/java/org/apache/mahout/math/function/FloatFunction.java
> A
> math/src/main/java/org/apache/mahout/math/function/IntIntFunction.java
> A
> math/src/main/java/org/apache/mahout/math/function/ObjectProcedure.java
> A
> math/src/main/java/org/apache/mahout/math/function/DoubleFunction.java
> AU
>  math/src/main/java/org/apache/mahout/math/function/IntFunction.java
> AUmath/src/main/java/org/apache/mahout/math/function/Mult.java
> A
> math/src/main/java/org/apache/mahout/math/function/package-info.java
> AUmath/src/main/java/org/apache/mahout/math/function/PlusMult.java
> AU
>  math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java
> A math/src/main/java/org/apache/mahout/math/ConstantVector.java
> A math/src/main/java/org/apache/mahout/math/VectorBinaryAssign.java
> A
> math/src/main/java/org/apache/mahout/math/SingularValueDecomposition.java
> A
> math/src/main/java/org/apache/mahout/math/VectorBinaryAggregate.java
> A math/src/main/java/org/apache/mahout/math/QRDecomposition.java
> A math/src/main/java/org/apache/mahout/math/MatrixVectorView.java
> A math/src/main/java/org/apache/mahout/math/WeightedVector.java
> A math/src/main/java/org/apache/mahout/math/UpperTriangular.java
> A math/src/main/java/org/apache/mahout/math/matrix
> A math/src/main/java/org/apache/mahout/math/matrix/impl
> A math/src/main/java/org/apache/mahout/math/matrix/linalg
> A math/src/main/java/org/apache/mahout/math/package-info.java
> A math/src/main/java/org/apache/mahout/math/Sorting.java
> A math/src/main/java/org/apache/mahout/math/MatrixTimesOps.java
> A math/src/main/java/org/apache/mahout/math/list
> A math/src/main/java/org/apache/mahout/math/list/AbstractList.java
> A
> math/src/main/java/org/apache/mahout/math/list/ObjectArrayList.java
> A
> math/src/main/java/org/apache/mahout/math/l

Build failed in Jenkins: Mahout-Examples-Cluster-Reuters-II #813

2014-04-13 Thread Apache Jenkins Server
See 


Changes:

[ssc] MAHOUT-1450 Cleaning up clustering documentation

[akm] MAHOUT-1420: Add solr-recommender to examples; typo fix

[akm] MAHOUT-1420: Add solr-recommender to examples

--
[...truncated 1725 lines...]
A math/src/main/java/org/apache/mahout/math/decomposer/hebbian
AU
math/src/main/java/org/apache/mahout/math/decomposer/hebbian/EigenUpdater.java
AU
math/src/main/java/org/apache/mahout/math/decomposer/hebbian/TrainingState.java
AU
math/src/main/java/org/apache/mahout/math/decomposer/hebbian/HebbianUpdater.java
AU
math/src/main/java/org/apache/mahout/math/decomposer/hebbian/HebbianSolver.java
A math/src/main/java/org/apache/mahout/math/decomposer/lanczos
AU
math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
A 
math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosState.java
AUmath/src/main/java/org/apache/mahout/math/decomposer/EigenStatus.java
AU
math/src/main/java/org/apache/mahout/math/decomposer/SimpleEigenVerifier.java
AUmath/src/main/java/org/apache/mahout/math/SparseMatrix.java
A math/src/main/java/org/apache/mahout/math/QR.java
AUmath/src/main/java/org/apache/mahout/math/CardinalityException.java
A math/src/main/java/org/apache/mahout/math/FunctionalMatrixView.java
AUmath/src/main/java/org/apache/mahout/math/DenseMatrix.java
A 
math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java
A math/src/main/java/org/apache/mahout/math/DelegatingVector.java
A math/src/main/java/org/apache/mahout/math/OrthonormalityVerifier.java
A math/src/main/java/org/apache/mahout/math/PersistentObject.java
AUmath/src/main/java/org/apache/mahout/math/AbstractMatrix.java
AUmath/src/main/java/org/apache/mahout/math/VectorView.java
A math/src/main/java/org/apache/mahout/math/PivotedMatrix.java
A math/src/main/java/org/apache/mahout/math/DiagonalMatrix.java
A math/src/main/java/org/apache/mahout/math/PermutedVectorView.java
A math/src/main/java/org/apache/mahout/math/function
AU
math/src/main/java/org/apache/mahout/math/function/IntIntDoubleFunction.java
A 
math/src/main/java/org/apache/mahout/math/function/ObjectObjectProcedure.java
AUmath/src/main/java/org/apache/mahout/math/function/TimesFunction.java
AUmath/src/main/java/org/apache/mahout/math/function/Functions.java
A math/src/main/java/org/apache/mahout/math/function/VectorFunction.java
AU
math/src/main/java/org/apache/mahout/math/function/SquareRootFunction.java
AU
math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java
A math/src/main/java/org/apache/mahout/math/function/FloatFunction.java
A math/src/main/java/org/apache/mahout/math/function/IntIntFunction.java
A 
math/src/main/java/org/apache/mahout/math/function/ObjectProcedure.java
A math/src/main/java/org/apache/mahout/math/function/DoubleFunction.java
AUmath/src/main/java/org/apache/mahout/math/function/IntFunction.java
AUmath/src/main/java/org/apache/mahout/math/function/Mult.java
A math/src/main/java/org/apache/mahout/math/function/package-info.java
AUmath/src/main/java/org/apache/mahout/math/function/PlusMult.java
AUmath/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java
A math/src/main/java/org/apache/mahout/math/ConstantVector.java
A math/src/main/java/org/apache/mahout/math/VectorBinaryAssign.java
A 
math/src/main/java/org/apache/mahout/math/SingularValueDecomposition.java
A math/src/main/java/org/apache/mahout/math/VectorBinaryAggregate.java
A math/src/main/java/org/apache/mahout/math/QRDecomposition.java
A math/src/main/java/org/apache/mahout/math/MatrixVectorView.java
A math/src/main/java/org/apache/mahout/math/WeightedVector.java
A math/src/main/java/org/apache/mahout/math/UpperTriangular.java
A math/src/main/java/org/apache/mahout/math/matrix
A math/src/main/java/org/apache/mahout/math/matrix/impl
A math/src/main/java/org/apache/mahout/math/matrix/linalg
A math/src/main/java/org/apache/mahout/math/package-info.java
A math/src/main/java/org/apache/mahout/math/Sorting.java
A math/src/main/java/org/apache/mahout/math/MatrixTimesOps.java
A math/src/main/java/org/apache/mahout/math/list
A math/src/main/java/org/apache/mahout/math/list/AbstractList.java
A math/src/main/java/org/apache/mahout/math/list/ObjectArrayList.java
A 
math/src/main/java/org/apache/mahout/math/list/SimpleLongArrayList.java
A math/src/main/java/org/apache/mahout/math/list/package-info.java
A math/src/main/java/org/apache/mahout/mat

Re: Tackling the "legacy dilemma"

2014-04-13 Thread Sebastian Schelter
I'd vote against a contrib area at the moment, because it would stand in
the way of unifying, shrinking and stabilizing the codebase.

--sebastian
On 13.04.2014 19:36, "Andrew Musselman"  wrote:

>
> > On Apr 13, 2014, at 10:30 AM, Dmitriy Lyubimov 
> wrote:
> >
> >> On Apr 13, 2014 10:22 AM, "Ted Dunning"  wrote:
> >>
> >> On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov  >> wrote:
> >>
> >>> +1, but more importantly, reject any new author who doesn't agree to
> >>> explicitly pledge multi-year support.
> >>
> >> I am a little bit negative about this requirement.  My feeling is that
> it
> >> will wind up with accepting naive optimists (the ones we don't want) and
> >> rejecting realists because they know that a true multi-year commitment
> is
> >> subject to buffeting by real-life.
> > It's true. I guess I mean more along the criteria lines, not about how we
> make
> > the inference. I meant if we really had a way to make reliable inference
> > here. It may well be the case there's no such way. Usually the first good
> > sign is that contributors are sticking to their issue in the first place
> > for some time.
>
> This is where a contrib or piggybank-style sandbox could help, so people
> could submit things "in probation" until they're proven out.


Re: Tackling the "legacy dilemma"

2014-04-13 Thread Andrew Musselman

> On Apr 13, 2014, at 10:30 AM, Dmitriy Lyubimov  wrote:
> 
>> On Apr 13, 2014 10:22 AM, "Ted Dunning"  wrote:
>> 
>> On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov > wrote:
>> 
>>> +1, but more importantly, reject any new author who doesn't agree to
>>> explicitly pledge multi-year support.
>> 
>> I am a little bit negative about this requirement.  My feeling is that it
>> will wind up with accepting naive optimists (the ones we don't want) and
>> rejecting realists because they know that a true multi-year commitment is
>> subject to buffeting by real-life.
> It's true. I guess I mean more along the criteria lines, not about how we make
> the inference. I meant if we really had a way to make reliable inference
> here. It may well be the case there's no such way. Usually the first good
> sign is that contributors are sticking to their issue in the first place
> for some time.

This is where a contrib or piggybank-style sandbox could help, so people could 
submit things "in probation" until they're proven out.

[jira] [Commented] (MAHOUT-1483) Organize links in web site navigation bar

2014-04-13 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967895#comment-13967895
 ] 

Andrew Musselman commented on MAHOUT-1483:
--

Okay thanks.

[~smarthi] Yeah that's on my list.

> Organize links in web site navigation bar
> -
>
> Key: MAHOUT-1483
> URL: https://issues.apache.org/jira/browse/MAHOUT-1483
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Some links in the drop-down menus in the navigation bar are inconsistent.
> Under "Basics", there are some links whose path starts with '/users/basics', 
> one that starts with '/users/dim-reduction', one that starts with 
> '/users/clustering', and one that starts with '/users/sparkbindings'.  These 
> inconsistencies aren't terrible but are a little confusing for maintaining 
> the site.
> Under "Classification", there are some that start with 
> '/users/classification', some that start with '/users/stuff', and there's a 
> clustering example (20 newsgroups example).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Tackling the "legacy dilemma"

2014-04-13 Thread Dmitriy Lyubimov
On Apr 13, 2014 10:22 AM, "Ted Dunning"  wrote:
>
> On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov wrote:
>
> > +1, but more importantly, reject any new author who doesn't agree to
> > explicitly pledge multi-year support.
> >
>
> I am a little bit negative about this requirement.  My feeling is that it
> will wind up with accepting naive optimists (the ones we don't want) and
> rejecting realists because they know that a true multi-year commitment is
> subject to buffeting by real-life.
It's true. I guess I mean more along the criteria lines, not about how we make
the inference. I meant if we really had a way to make reliable inference
here. It may well be the case there's no such way. Usually the first good
sign is that contributors are sticking to their issue in the first place
for some time.


[jira] [Commented] (MAHOUT-1483) Organize links in web site navigation bar

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967894#comment-13967894
 ] 

Sebastian Schelter commented on MAHOUT-1483:


You have to explicitly publish the site. Go to https://cms.apache.org/mahout/ 
and click on the publish site link.

> Organize links in web site navigation bar
> -
>
> Key: MAHOUT-1483
> URL: https://issues.apache.org/jira/browse/MAHOUT-1483
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Some links in the drop-down menus in the navigation bar are inconsistent.
> Under "Basics", there are some links whose path starts with '/users/basics', 
> one that starts with '/users/dim-reduction', one that starts with 
> '/users/clustering', and one that starts with '/users/sparkbindings'.  These 
> inconsistencies aren't terrible but are a little confusing for maintaining 
> the site.
> Under "Classification", there are some that start with 
> '/users/classification', some that start with '/users/stuff', and there's a 
> clustering example (20 newsgroups example).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Tackling the "legacy dilemma"

2014-04-13 Thread Dmitriy Lyubimov
On Apr 13, 2014 10:21 AM, "Ted Dunning"  wrote:
>
> On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov wrote:
>
> > >  * move the MR algorithms into a separate maven module
> > You mean, move them out of mahout-core? So the core is for single-machine
> > stuff only? Plus utils? We probably need to refactor core so there's no
> > core at all, it seems. Our core, realistically, is utils, mahout-math &
> > math-scala (aka scalabindings), engine-agnostic logical layer of
> > mahout-spark. But for obvious reasons we probably don't want to put all that
> > in a single module. Maybe at some point later when these things become more
> > mainstream.
>
>
> This might be viewed as renaming core to be "mr-legacy" and then pulling
> those items we really need out of that.  Math is already separate as are
> scala bindings and similar.

Yes, that's what I meant. It looks like it means full dissolution of
mahout-core rather than just moving out MR stuff specifically. I am OK with
that, I guess.


[jira] [Commented] (MAHOUT-1483) Organize links in web site navigation bar

2014-04-13 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967892#comment-13967892
 ] 

Suneel Marthi commented on MAHOUT-1483:
---

Andrew, 20newsgroups is presently under the /clustering folder; that needs to move 
under /classification. FYI.

> Organize links in web site navigation bar
> -
>
> Key: MAHOUT-1483
> URL: https://issues.apache.org/jira/browse/MAHOUT-1483
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Some links in the drop-down menus in the navigation bar are inconsistent.
> Under "Basics", there are some links whose path starts with '/users/basics', 
> one that starts with '/users/dim-reduction', one that starts with 
> '/users/clustering', and one that starts with '/users/sparkbindings'.  These 
> inconsistencies aren't terrible but are a little confusing for maintaining 
> the site.
> Under "Classification", there are some that start with 
> '/users/classification', some that start with '/users/stuff', and there's a 
> clustering example (20 newsgroups example).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Tackling the "legacy dilemma"

2014-04-13 Thread Ted Dunning
On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov wrote:

> +1, but more importantly, reject any new author who doesn't agree to
> explicitly pledge multi-year support.
>

I am a little bit negative about this requirement.  My feeling is that it
will wind up with accepting naive optimists (the ones we don't want) and
rejecting realists because they know that a true multi-year commitment is
subject to buffeting by real-life.


[jira] [Commented] (MAHOUT-1483) Organize links in web site navigation bar

2014-04-13 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967891#comment-13967891
 ] 

Andrew Musselman commented on MAHOUT-1483:
--

Got it, so just check changes in and buildbot will release the site?

> Organize links in web site navigation bar
> -
>
> Key: MAHOUT-1483
> URL: https://issues.apache.org/jira/browse/MAHOUT-1483
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Some links in the drop-down menus in the navigation bar are inconsistent.
> Under "Basics", there are some links whose path starts with '/users/basics', 
> one that starts with '/users/dim-reduction', one that starts with 
> '/users/clustering', and one that starts with '/users/sparkbindings'.  These 
> inconsistencies aren't terrible but are a little confusing for maintaining 
> the site.
> Under "Classification", there are some that start with 
> '/users/classification', some that start with '/users/stuff', and there's a 
> clustering example (20 newsgroups example).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Tackling the "legacy dilemma"

2014-04-13 Thread Ted Dunning
On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov wrote:

> >  * move the MR algorithms into a separate maven module
> You mean, move them out of mahout-core? So the core is for single-machine
> stuff only? Plus utils? We probably need to refactor core so there's no
> core at all, it seems. Our core, realistically, is utils, mahout-math &
> math-scala (aka scalabindings), engine-agnostic logical layer of
> mahout-spark. But for obvious reasons we probably don't want to put all that
> in a single module. Maybe at some point later when these things become more
> mainstream.


This might be viewed as renaming core to be "mr-legacy" and then pulling
those items we really need out of that.  Math is already separate as are
scala bindings and similar.


Re: Tackling the "legacy dilemma"

2014-04-13 Thread Andrew Musselman
This is a good summary of how I feel too.

> On Apr 13, 2014, at 10:15 AM, Sebastian Schelter  wrote:
> 
> Unfortunately, it's not that easy to get enough voluntary work. I issued the 
> third call for working on the documentation today, as there are still lots of 
> open issues. That's why I'm trying to suggest a move that involves as little 
> work as possible.
> 
> We should get the MR codebase into a state that we all can live with and then 
> focus on new stuff like the scala DSL.
> 
> --sebastian
> 
> 
> 
> 
>> On 04/13/2014 07:09 PM, Giorgio Zoppi wrote:
>> The best thing would be to make a plan and see how much effort is needed for
>> this. Then find volunteers to accomplish the task. I am quite sure that
>> there are a lot of people out there who are willing to help out.
>> 
>> BR,
>> deneb.
>> 
>> 
>> 2014-04-13 18:45 GMT+02:00 Sebastian Schelter :
>> 
>>> Hi,
>>> 
>>> I took some days to let the latest discussion about the state and future
>>> of Mahout go through my head. I think the most important thing to address
>>> right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
>>> are currently unmaintained, documentation is outdated and the original
>>> authors have abandoned Mahout. For some algorithms it is hard to get even
>>> questions answered on the mailinglist (e.g. RandomForest). I agree with
>>> Sean's comments that letting the code linger around is no option and will
>>> continue to harm Mahout.
>>> 
>>> In the previous discussion, I suggested to make a radical move and aim to
>>> delete this codebase, but there were serious objections from committers and
>>> users that convinced me that there is still usage of and interest in that
>>> codebase.
>>> 
>>> That puts us into a "legacy dilemma". We cannot delete the code without
>>> harming our userbase. On the other hand, I don't see anyone willing to
>>> rework the codebase. Further, the code cannot linger around anymore as it
>>> is doing now, especially when we fail to answer questions or don't provide
>>> documentation.
>>> 
>>> *We have to make a move*!
>>> 
>>> I suggest the following actions with regard to the MR codebase. I hope
>>> that they find consensus. If there are objections, please give alternatives,
>>> *keeping everything as-is is not an option*:
>>> 
>>>  * reject any future MR algorithm contributions, prominently state this on
>>> the website and in talks
>>>  * make all existing algorithm code compatible with Hadoop 2, if there is
>>> no one willing to make an existing algorithm compatible, remove the
>>> algorithm
>>>  * deprecate the existing MR algorithms, yet still take bug fix
>>> contributions
>>>  * remove Random Forest as we cannot even answer questions to the
>>> implementation on the mailinglist
>>> 
>>> There are two more actions that I would like to see, but I'd be willing to
>>> give up if there are objections:
>>> 
>>>  * move the MR algorithms into a separate maven module
>>>  * remove Frequent Pattern Mining again (we already aimed for that in 0.9
>>> but had one user who shouted but never returned to us)
>>> 
>>> Let me know what you think.
>>> 
>>> --sebastian
> 


Re: Tackling the "legacy dilemma"

2014-04-13 Thread Ted Dunning
Deneb (Giorgio),

The code involved is really quite heinous and we haven't been able to find
volunteers to maintain this code in the past.

It might be possible to maintain a few selected algorithms, but we really
have to move forward.




On Sun, Apr 13, 2014 at 10:09 AM, Giorgio Zoppi wrote:

> The best thing would be to make a plan and see how much effort is needed for
> this. Then find volunteers to accomplish the task. I am quite sure that
> there are a lot of people out there who are willing to help out.
>
> BR,
> deneb.
>
>
> 2014-04-13 18:45 GMT+02:00 Sebastian Schelter :
>
> > Hi,
> >
> > I took some days to let the latest discussion about the state and future
> > of Mahout go through my head. I think the most important thing to address
> > right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
> > are currently unmaintained, documentation is outdated and the original
> > authors have abandoned Mahout. For some algorithms it is hard to get even
> > questions answered on the mailinglist (e.g. RandomForest). I agree with
> > Sean's comments that letting the code linger around is no option and will
> > continue to harm Mahout.
> >
> > In the previous discussion, I suggested to make a radical move and aim to
> > delete this codebase, but there were serious objections from committers
> and
> > users that convinced me that there is still usage of and interest in
> that
> > codebase.
> >
> > That puts us into a "legacy dilemma". We cannot delete the code without
> > harming our userbase. On the other hand, I don't see anyone willing to
> > rework the codebase. Further, the code cannot linger around anymore as it
> > is doing now, especially when we fail to answer questions or don't
> provide
> > documentation.
> >
> > *We have to make a move*!
> >
> > I suggest the following actions with regard to the MR codebase. I hope
> > that they find consensus. If there are objections, please give
> alternatives,
> > *keeping everything as-is is not an option*:
> >
> >  * reject any future MR algorithm contributions, prominently state this
> on
> > the website and in talks
> >  * make all existing algorithm code compatible with Hadoop 2, if there is
> > no one willing to make an existing algorithm compatible, remove the
> > algorithm
> >  * deprecate the existing MR algorithms, yet still take bug fix
> > contributions
> >  * remove Random Forest as we cannot even answer questions to the
> > implementation on the mailinglist
> >
> > There are two more actions that I would like to see, but I'd be willing to
> > give up if there are objections:
> >
> >  * move the MR algorithms into a separate maven module
> >  * remove Frequent Pattern Mining again (we already aimed for that in 0.9
> > but had one user who shouted but never returned to us)
> >
> > Let me know what you think.
> >
> > --sebastian
> >
>
>
>
> --
> I want to be the ray of sun that wakes you every day
> to make you breathe and live in me.
> "Favola - Moda".
>


Re: Tackling the "legacy dilemma"

2014-04-13 Thread Dmitriy Lyubimov
On Apr 13, 2014 9:45 AM, "Sebastian Schelter"  wrote:
>
> Hi,
>
> I took some days to let the latest discussion about the state and future
of Mahout go through my head. I think the most important thing to address
right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
are currently unmaintained, documentation is outdated and the original
authors have abandoned Mahout. For some algorithms it is hard to get even
questions answered on the mailinglist (e.g. RandomForest). I agree with
Sean's comments that letting the code linger around is no option and will
continue to harm Mahout.
>
> In the previous discussion, I suggested to make a radical move and aim to
delete this codebase, but there were serious objections from committers and
users that convinced me that there is still usage of and interest in that
codebase.
>
> That puts us into a "legacy dilemma". We cannot delete the code without
harming our userbase. On the other hand, I don't see anyone willing to
rework the codebase. Further, the code cannot linger around anymore as it
is doing now, especially when we fail to answer questions or don't provide
documentation.
>
> *We have to make a move*!
>
> I suggest the following actions with regard to the MR codebase. I hope
that they find consensus. If there are objections, please give alternatives,
*keeping everything as-is is not an option*:
>
>  * reject any future MR algorithm contributions, prominently state this
on the website and in talks
+1, but more importantly, reject any new author who doesn't agree to
explicitly pledge multi-year support.
>  * make all existing algorithm code compatible with Hadoop 2, if there is
no one willing to make an existing algorithm compatible, remove the
algorithm
OK, although my gut feeling is this would take some time.

>  * deprecate the existing MR algorithms, yet still take bug fix
contributions
I foresee a somewhat smoother MR transition. Deprecation means we lose them in
a release, that is, by the fall release. It would seem to me it would take
longer for us to provide a full replacement and convince ourselves of its
production worthiness.
Also, deprecation implies we can point a user to something else with "use
instead". So I wouldn't deprecate methods just now for which we cannot add
this phrase. As somebody mentioned, a long tail for deprecation is a good
policy here, IMO.

>  * remove Random Forest as we cannot even answer questions to the
implementation on the mailinglist

Do we know a direct email for the FPM and random forest authors? I'd suggest
pinging them one last time. They just may not be tuned in to the list. Both
algorithms are kind of in a bread-and-butter category; it would be a huge
hit in coverage to just lose them without any resuscitation attempt
whatsoever.

>
> There are two more actions that I would like to see, but I'd be willing to
give up if there are objections:
>
>  * move the MR algorithms into a separate maven module
You mean, move them out of mahout-core? So the core is for single-machine
stuff only? Plus utils? We probably need to refactor core so there's no
core at all, it seems. Our core, realistically, is utils, mahout-math &
math-scala (aka scalabindings), engine-agnostic logical layer of
mahout-spark. But for obvious reasons we probably don't want to put all that
in a single module. Maybe at some point later when these things become more
mainstream.

>  * remove Frequent Pattern Mining again (we already aimed for that in 0.9
but had one user who shouted but never returned to us)
>
> Let me know what you think.
>
> --sebastian


Re: Tackling the "legacy dilemma"

2014-04-13 Thread Sebastian Schelter
Unfortunately, it's not that easy to get enough voluntary work. I issued 
the third call for working on the documentation today, as there are still 
lots of open issues. That's why I'm trying to suggest a move that 
involves as little work as possible.


We should get the MR codebase into a state that we all can live with and 
then focus on new stuff like the scala DSL.


--sebastian




On 04/13/2014 07:09 PM, Giorgio Zoppi wrote:

The best thing would be to make a plan and see how much effort is needed for
this. Then find volunteers to accomplish the task. I am quite sure that
there are a lot of people out there who are willing to help out.

BR,
deneb.


2014-04-13 18:45 GMT+02:00 Sebastian Schelter :


Hi,

I took some days to let the latest discussion about the state and future
of Mahout go through my head. I think the most important thing to address
right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
are currently unmaintained, documentation is outdated and the original
authors have abandoned Mahout. For some algorithms it is hard to get even
questions answered on the mailinglist (e.g. RandomForest). I agree with
Sean's comments that letting the code linger around is no option and will
continue to harm Mahout.

In the previous discussion, I suggested to make a radical move and aim to
delete this codebase, but there were serious objections from committers and
users that convinced me that there is still usage of and interest in that
codebase.

That puts us into a "legacy dilemma". We cannot delete the code without
harming our userbase. On the other hand, I don't see anyone willing to
rework the codebase. Further, the code cannot linger around anymore as it
is doing now, especially when we fail to answer questions or don't provide
documentation.

*We have to make a move*!

I suggest the following actions with regard to the MR codebase. I hope
that they find consensus. If there are objections, please give alternatives,
*keeping everything as-is is not an option*:

  * reject any future MR algorithm contributions, prominently state this on
the website and in talks
  * make all existing algorithm code compatible with Hadoop 2, if there is
no one willing to make an existing algorithm compatible, remove the
algorithm
  * deprecate the existing MR algorithms, yet still take bug fix
contributions
  * remove Random Forest as we cannot even answer questions to the
implementation on the mailinglist

There are two more actions that I would like to see, but I'd be willing to
give up if there are objections:

  * move the MR algorithms into a separate maven module
  * remove Frequent Pattern Mining again (we already aimed for that in 0.9
but had one user who shouted but never returned to us)

Let me know what you think.

--sebastian









Re: Tackling the "legacy dilemma"

2014-04-13 Thread Suneel Marthi
I meant to deprecate first (and eventually remove) Canopy clustering. This
is in line with the conversation I had with Ted and Frank at AMS about
weaning users away from the old style Canopy->KMeans clustering to start
using Streaming KMeans. No point in keeping Canopy once users switch to
using Streaming KMeans.


On Sun, Apr 13, 2014 at 1:12 PM, Sebastian Schelter  wrote:

> Do you mean deprecating or removing Canopy clustering? I suggest
> deprecating all MR code anyway.
>
> --sebastian
>
>
>
> On 04/13/2014 07:11 PM, Suneel Marthi wrote:
>
>  If I may add deprecating Canopy clustering to the list once we get
>> Streaming KMeans working right.
>>
>> On Sun, Apr 13, 2014 at 12:45 PM, Sebastian Schelter 
>> wrote:
>>
>>  Hi,
>>>
>>> I took some days to let the latest discussion about the state and future
>>> of Mahout go through my head. I think the most important thing to address
>>> right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
>>> are currently unmaintained, documentation is outdated and the original
>>> authors have abandoned Mahout. For some algorithms it is hard to get even
>>> questions answered on the mailinglist (e.g. RandomForest). I agree with
>>> Sean's comments that letting the code linger around is no option and will
>>> continue to harm Mahout.
>>>
>>> In the previous discussion, I suggested to make a radical move and aim to
>>> delete this codebase, but there were serious objections from committers
>>> and
>>> users that convinced me that there is still usage of and interest in
>>> that
>>> codebase.
>>>
>>> That puts us into a "legacy dilemma". We cannot delete the code without
>>> harming our userbase. On the other hand, I don't see anyone willing to
>>> rework the codebase. Further, the code cannot linger around anymore as it
>>> is doing now, especially when we fail to answer questions or don't
>>> provide
>>> documentation.
>>>
>>> *We have to make a move*!
>>>
>>> I suggest the following actions with regard to the MR codebase. I hope
>>> that they find consensus. If there are objections, please give
>>> alternatives,
>>> *keeping everything as-is is not an option*:
>>>
>>>   * reject any future MR algorithm contributions, prominently state this
>>> on
>>> the website and in talks
>>>
>>  +1, this includes the new Frequent Pattern Mining impl, which is MR-based
>> and was provided as a patch a few months ago
>>
>>* make all existing algorithm code compatible with Hadoop 2, if there
>>> is
>>> no one willing to make an existing algorithm compatible, remove the
>>> algorithm
>>>
>>  +1. One of the questions I got asked when 0.9 was released was 'when
>> is Mahout gonna be compatible with Yarn and Hadoop 2?'  We should target
>> that for the next major/interim release.
>>
>>* deprecate the existing MR algorithms, yet still take bug fix
>>> contributions
>>>
>>  I guess we'll be removing these in some future release; until then we
>> keep absorbing bug fixes?
>>
>>
>>* remove Random Forest as we cannot even answer questions to the
>>> implementation on the mailinglist
>>>
>>  +1 to removing present Random Forests. Andy Twigg had provided a Spark-based
>> Streaming Random Forests impl sometime last year. It's time to restart
>> that conversation and integrate it into the codebase if the contributor
>> is still willing.
>>
>>
>>> There are two more actions that I would like to see, but I'd be willing to
>>> give up if there are objections:
>>>
>>>   * move the MR algorithms into a separate maven module
>>>
>>> +1
>>
>>* remove Frequent Pattern Mining again (we already aimed for that in
>>> 0.9
>>> but had one user who shouted but never returned to us)
>>>
>>  This thing annoys me the most. We had removed this from 0.9 but then
>> restored it only because some user wanted it and promised to support it. We
>> have not heard from the user again.
>>  It's got old MR code that we don't support anymore, and this should be
>> purged ASAP.
>>
>>
>>
>>  Let me know what you think.
>>>
>>> --sebastian
>>>
>>>
>>
>


Re: Tackling the "legacy dilemma"

2014-04-13 Thread Sebastian Schelter
Do you mean deprecating or removing Canopy clustering? I suggest 
deprecating all MR code anyway.


--sebastian


On 04/13/2014 07:11 PM, Suneel Marthi wrote:


If I may add deprecating Canopy clustering to the list once we get
Streaming KMeans working right.

On Sun, Apr 13, 2014 at 12:45 PM, Sebastian Schelter  wrote:


Hi,

I took some days to let the latest discussion about the state and future
of Mahout go through my head. I think the most important thing to address
right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
are currently unmaintained, documentation is outdated and the original
authors have abandoned Mahout. For some algorithms it is hard to get even
questions answered on the mailinglist (e.g. RandomForest). I agree with
Sean's comments that letting the code linger around is no option and will
continue to harm Mahout.

In the previous discussion, I suggested to make a radical move and aim to
delete this codebase, but there were serious objections from committers and
users that convinced me that there is still usage of and interest in that
codebase.

That puts us into a "legacy dilemma". We cannot delete the code without
harming our userbase. On the other hand, I don't see anyone willing to
rework the codebase. Further, the code cannot linger around anymore as it
is doing now, especially when we fail to answer questions or don't provide
documentation.

*We have to make a move*!

I suggest the following actions with regard to the MR codebase. I hope
that they find consensus. If there are objections, please give alternatives,
*keeping everything as-is is not an option*:

  * reject any future MR algorithm contributions, prominently state this on
the website and in talks


 +1, this includes the new Frequent Pattern Mining impl, which is MR-based
and was provided as a patch a few months ago


  * make all existing algorithm code compatible with Hadoop 2, if there is
no one willing to make an existing algorithm compatible, remove the
algorithm


  +1. One of the questions I got asked when 0.9 was released was 'when
is Mahout gonna be compatible with Yarn and Hadoop 2'?  We should target
that for the next major/interim release.


  * deprecate the existing MR algorithms, yet still take bug fix
contributions


  I guess we'll be removing these in some future release, until then we
keep absorbing bug fixes ??



  * remove Random Forest as we cannot even answer questions to the
implementation on the mailinglist


  +1 to removing present Random Forests. Andy Twigg had provided a Spark
based Streaming Random Forests impl sometime last year. It's time to restart
that conversation and integrate that into the codebase if the contributor
is still willing i.e.



There are two more actions that I would like to see, but I'd be willing to
give up if there are objections:

  * move the MR algorithms into a separate maven module


   +1


  * remove Frequent Pattern Mining again (we already aimed for that in 0.9
but had one user who shouted but never returned to us)


  This thing annoys me the most. We had removed this from 0.9 but yet
restored it only because some user wanted it and promised to support it. We
have not heard from the user again.
   It's got old MR code that we don't support anymore and this should be
purged ASAP.




Let me know what you think.

--sebastian







Re: Tackling the "legacy dilemma"

2014-04-13 Thread Suneel Marthi
If I may add deprecating Canopy clustering to the list once we get
Streaming KMeans working right.

On Sun, Apr 13, 2014 at 12:45 PM, Sebastian Schelter  wrote:

> Hi,
>
> I took some days to let the latest discussion about the state and future
> of Mahout go through my head. I think the most important thing to address
> right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
> are currently unmaintained, documentation is outdated and the original
> authors have abandoned Mahout. For some algorithms it is hard to get even
> questions answered on the mailinglist (e.g. RandomForest). I agree with
> Sean's comments that letting the code linger around is not an option and will
> continue to harm Mahout.
>
> In the previous discussion, I suggested to make a radical move and aim to
> delete this codebase, but there were serious objections from committers and
> users that convinced me that there is still usage of and interest in that
> codebase.
>
> That puts us into a "legacy dilemma". We cannot delete the code without
> harming our userbase. On the other hand, I don't see anyone willing to
> rework the codebase. Further, the code cannot linger around anymore as it
> is doing now, especially when we fail to answer questions or don't provide
> documentation.
>
> *We have to make a move*!
>
> I suggest the following actions with regard to the MR codebase. I hope
> that they find consent. If there are objections, please give alternatives,
> *keeping everything as-is is not an option*:
>
>  * reject any future MR algorithm contributions, prominently state this on
> the website and in talks
>
+1, this includes the new Frequent Pattern mining impl which is MR
based that was provided as a patch a few months ago

>  * make all existing algorithm code compatible with Hadoop 2, if there is
> no one willing to make an existing algorithm compatible, remove the
> algorithm
>
 +1. One of the questions I got asked when 0.9 was released was 'when
is Mahout gonna be compatible with Yarn and Hadoop 2'?  We should target
that for the next major/interim release.

>  * deprecate the existing MR algorithms, yet still take bug fix
> contributions
>
 I guess we'll be removing these in some future release, until then we
keep absorbing bug fixes ??


>  * remove Random Forest as we cannot even answer questions to the
> implementation on the mailinglist
>
 +1 to removing present Random Forests. Andy Twigg had provided a Spark
based Streaming Random Forests impl sometime last year. It's time to restart
that conversation and integrate that into the codebase if the contributor
is still willing i.e.

>
> There are two more actions that I would like to see, but I'd be willing to
> give up if there are objections:
>
>  * move the MR algorithms into a separate maven module
>
  +1

>  * remove Frequent Pattern Mining again (we already aimed for that in 0.9
> but had one user who shouted but never returned to us)
>
 This thing annoys me the most. We had removed this from 0.9 but yet
restored it only because some user wanted it and promised to support it. We
have not heard from the user again.
  It's got old MR code that we don't support anymore and this should be
purged ASAP.



> Let me know what you think.
>
> --sebastian
>


Re: Tackling the "legacy dilemma"

2014-04-13 Thread Giorgio Zoppi
The best thing would be to make a plan and see how much effort it needs.
Then find volunteers to accomplish the task. I'm quite sure that
there are a lot of people out there who are willing to help out.

BR,
deneb.


2014-04-13 18:45 GMT+02:00 Sebastian Schelter :

> Hi,
>
> I took some days to let the latest discussion about the state and future
> of Mahout go through my head. I think the most important thing to address
> right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
> are currently unmaintained, documentation is outdated and the original
> authors have abandoned Mahout. For some algorithms it is hard to get even
> questions answered on the mailinglist (e.g. RandomForest). I agree with
> Sean's comments that letting the code linger around is not an option and will
> continue to harm Mahout.
>
> In the previous discussion, I suggested to make a radical move and aim to
> delete this codebase, but there were serious objections from committers and
> users that convinced me that there is still usage of and interest in that
> codebase.
>
> That puts us into a "legacy dilemma". We cannot delete the code without
> harming our userbase. On the other hand, I don't see anyone willing to
> rework the codebase. Further, the code cannot linger around anymore as it
> is doing now, especially when we fail to answer questions or don't provide
> documentation.
>
> *We have to make a move*!
>
> I suggest the following actions with regard to the MR codebase. I hope
> that they find consent. If there are objections, please give alternatives,
> *keeping everything as-is is not an option*:
>
>  * reject any future MR algorithm contributions, prominently state this on
> the website and in talks
>  * make all existing algorithm code compatible with Hadoop 2, if there is
> no one willing to make an existing algorithm compatible, remove the
> algorithm
>  * deprecate the existing MR algorithms, yet still take bug fix
> contributions
>  * remove Random Forest as we cannot even answer questions to the
> implementation on the mailinglist
>
> There are two more actions that I would like to see, but I'd be willing to
> give up if there are objections:
>
>  * move the MR algorithms into a separate maven module
>  * remove Frequent Pattern Mining again (we already aimed for that in 0.9
> but had one user who shouted but never returned to us)
>
> Let me know what you think.
>
> --sebastian
>



-- 
I want to be the ray of sun that wakes you up every day
to make you breathe and live in me.
"Favola -Moda".


Tackling the "legacy dilemma"

2014-04-13 Thread Sebastian Schelter

Hi,

I took some days to let the latest discussion about the state and future 
of Mahout go through my head. I think the most important thing to 
address right now is the MapReduce "legacy" codebase. A lot of the MR 
algorithms are currently unmaintained, documentation is outdated and the 
original authors have abandoned Mahout. For some algorithms it is hard 
to get even questions answered on the mailinglist (e.g. RandomForest). I 
agree with Sean's comments that letting the code linger around is not an 
option and will continue to harm Mahout.


In the previous discussion, I suggested to make a radical move and aim 
to delete this codebase, but there were serious objections from 
committers and users that convinced me that there is still usage of and 
interest in that codebase.


That puts us into a "legacy dilemma". We cannot delete the code without 
harming our userbase. On the other hand, I don't see anyone willing to 
rework the codebase. Further, the code cannot linger around anymore as 
it is doing now, especially when we fail to answer questions or don't 
provide documentation.


*We have to make a move*!

I suggest the following actions with regard to the MR codebase. I hope 
that they find consent. If there are objections, please give 
alternatives, *keeping everything as-is is not an option*:


 * reject any future MR algorithm contributions, prominently state this 
on the website and in talks
 * make all existing algorithm code compatible with Hadoop 2, if there 
is no one willing to make an existing algorithm compatible, remove the 
algorithm
 * deprecate the existing MR algorithms, yet still take bug fix 
contributions
 * remove Random Forest as we cannot even answer questions to the 
implementation on the mailinglist


There are two more actions that I would like to see, but I'd be willing to 
give up if there are objections:


 * move the MR algorithms into a separate maven module
 * remove Frequent Pattern Mining again (we already aimed for that in 
0.9 but had one user who shouted but never returned to us)


Let me know what you think.

--sebastian


Re: [jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-04-13 Thread SriSatish Ambati
Feel free to borrow from the R DSL for H2O -
https://github.com/0xdata/h2o/tree/master/R/h2o-package


On Wed, Mar 26, 2014 at 9:34 PM, Saikat Kanjilal (JIRA) wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13948876#comment-13948876]
>
> Saikat Kanjilal commented on MAHOUT-1490:
> -
>
> Here's a list of features that can exist as part of a sample DSL for data
> frames:
> 1) Have some sample data frames as tutorials, like R's built-in data frame
> called mtcars
>
> The following items below can be associated with name, logical and numeric
> indexing
> 2) Retrieve a data frame column slice with the single square bracket []
> 3) Retrieve a data frame row slice with the single square bracket []
> 4) Retrieve a column vector from a built in data frame
>
>
> Of course an extension to the above would be expressing in core support
>
>
> A few other questions since I have not been as closely entrenched in the
> previous work that has already been done:
> 1) What version of scala should I use to implement the above?
> 2) Should I do the work in my github repo and at a later point merge it in?
> 3) Speaking of merging will this live in a separate package alongside the
> scala and spark bindings?
>
>
> Thanks for your insight and mentorship on this.
>
> > Data frame R-like bindings
> > --
> >
> > Key: MAHOUT-1490
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> > Project: Mahout
> >  Issue Type: New Feature
> >Reporter: Saikat Kanjilal
> >Assignee: Dmitriy Lyubimov
> >   Original Estimate: 20h
> >  Remaining Estimate: 20h
> >
> > Create Data frame R-like bindings for spark
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>
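
To make the R-like indexing Saikat lists above (items 2-4) concrete, here is a
minimal, purely hypothetical Scala sketch. None of these types or methods exist
in Mahout; it only illustrates the kind of surface such bindings could expose.

// Hypothetical sketch only; all names are made up for illustration.
case class DataFrame(colNames: Vector[String], rows: Vector[Vector[Double]]) {

  private val colIndex: Map[String, Int] = colNames.zipWithIndex.toMap

  // Feature 4: retrieve a single column as a vector, e.g. df("mpg")
  def apply(col: String): Vector[Double] = rows.map(_(colIndex(col)))

  // Feature 2: column slice by name, e.g. df.cols(Seq("mpg", "cyl"))
  def cols(names: Seq[String]): DataFrame =
    DataFrame(names.toVector, rows.map(r => names.toVector.map(n => r(colIndex(n)))))

  // Feature 3: row slice by numeric index, e.g. df.rowSlice(0 until 3)
  def rowSlice(idx: Range): DataFrame = DataFrame(colNames, idx.toVector.map(rows))

  // Logical row indexing, e.g. df.where(m => m("cyl") > 4.0)
  def where(p: Map[String, Double] => Boolean): DataFrame =
    DataFrame(colNames, rows.filter(r => p(colNames.zip(r).toMap)))
}

object DataFrameDemo extends App {
  // Feature 1: a tiny stand-in for R's built-in mtcars, two columns, three cars.
  val df = DataFrame(
    Vector("mpg", "cyl"),
    Vector(Vector(21.0, 6.0), Vector(22.8, 4.0), Vector(10.4, 8.0)))

  println(df("mpg"))                          // Vector(21.0, 22.8, 10.4)
  println(df.rowSlice(0 until 2).rows)        // first two rows
  println(df.where(m => m("cyl") > 4.0).rows) // rows with cyl greater than 4
}

A real implementation would have to handle non-numeric columns and back the
storage with Mahout's vector and matrix types; the sketch only pins down the
indexing surface.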


Re: KMeans|| opinions

2014-04-13 Thread Maciej Mazur
Yes, I looked at this one. The picture shows the general idea, but at a
very high level.
I'm rather interested in the implementation.
Initial clusters are modified in ClusterIterator.iterateMR.
CIMapper setup - does it load a regular HDFS file, or is it a cache, and why? It
looks as if it's a regular file. Isn't it a huge overhead to load this file at
setup? What's a reasonable size for this file?
Why is everything written during the cleanup call? Is it used instead of a
combiner?
General idea - during each iteration the output of the previous iteration
(reduce) is loaded at setup, then the model is updated and propagated
(cleanup) to the reducers. What happens if one cluster is very large, ~70% of
all elements in the data set? Does cleanup help with that?
Stop condition - isConverged - does it compare the outputs (2 files) from the
last two iterations, or is it encapsulated in the Cluster class?
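
For readers puzzled by the same setup/cleanup questions, here is a minimal
sketch of the pattern the mapper appears to follow, assuming (not quoting) the
actual CIMapper code: load the current centroids once in setup, fold every
input point into per-cluster running sums in map, and emit only k partial
(sum, count) records in cleanup, which acts as a built-in combiner.

// Sketch only, not the real CIMapper: shows why map() emits nothing and
// cleanup() writes everything.
class KMeansMapperSketch(centroids: Array[Array[Double]]) {

  private val k   = centroids.length
  private val dim = centroids.head.length
  private val sums   = Array.fill(k, dim)(0.0) // partial vector sums per cluster
  private val counts = Array.fill(k)(0L)       // points seen per cluster

  private def nearest(p: Array[Double]): Int =
    (0 until k).minBy { c =>
      var d = 0.0
      for (i <- 0 until dim) { val diff = p(i) - centroids(c)(i); d += diff * diff }
      d
    }

  // Called once per input point; nothing is written to the output here.
  def map(point: Array[Double]): Unit = {
    val c = nearest(point)
    for (i <- 0 until dim) sums(c)(i) += point(i)
    counts(c) += 1
  }

  // Called once at the end of the split; only k records leave the mapper,
  // no matter how many points it saw. The reduce side sums these partials
  // and divides to get the new centroids.
  def cleanup(): Seq[(Int, (Array[Double], Long))] =
    (0 until k).map(c => c -> (sums(c), counts(c)))
}

Emitting in cleanup rather than in map means each mapper shuffles at most k
records instead of one record per point, which is also why a single very large
cluster is not a problem: it still produces only one partial record per mapper.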



On Sun, Apr 13, 2014 at 4:32 PM, Sebastian Schelter  wrote:

> Did you check the website at https://mahout.apache.org/
> users/clustering/k-means-clustering.html ?
>
>
> On 04/13/2014 02:53 PM, Maciej Mazur wrote:
>
>> Recently I've been looking into K-means implementation.
>> I want to understand how it works, and why it was designed this way.
>> Could you give me some overview?
>> I see that during the setup clusters are read from the file. Is it a
>> distributed cache?  What's the maximal size of this file, what's the
>> maximum value of k?
>> There is nothing output during the call of the map function, everything is
>> saved at cleanup. Why?
>> Are there any docs concerning implementation?
>>
>> Thanks,
>> Maciej
>>
>>
>> On Wed, Apr 9, 2014 at 7:23 AM, Ted Dunning 
>> wrote:
>>
>>
>>> Well, you could view this as a performance bug in the implementation of
>>> the linear algebra.
>>>
>>> It certainly is, however, an odd interpretation of transpose.  I have
>>> used
>>> a similar trick in r to use sparse matrices as a counter but it always
>>> worried me a bit.
>>>
>>> Sent from my iPhone
>>>
>>>  On Apr 8, 2014, at 17:49, Dmitriy Lyubimov  wrote:

 Problem is, I want to use linear algebra to handle that, not
 combine().

>>>
>>>
>>
>


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-13 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967872#comment-13967872
 ] 

Pat Ferrel commented on MAHOUT-1464:


I have some time this week, so I'm working on it. 

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1450) Cleaning up clustering documentation on mahout website

2014-04-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967871#comment-13967871
 ] 

Hudson commented on MAHOUT-1450:


SUCCESS: Integrated in Mahout-Quality #2566 (See 
[https://builds.apache.org/job/Mahout-Quality/2566/])
MAHOUT-1450 Cleaning up clustering documentation (ssc: rev 1587003)
* /mahout/trunk/CHANGELOG


> Cleaning up clustering documentation on mahout website 
> ---
>
> Key: MAHOUT-1450
> URL: https://issues.apache.org/jira/browse/MAHOUT-1450
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
> Environment: This affects all mahout versions
>Reporter: Pavan Kumar N
>  Labels: documentation, newbie
> Fix For: 1.0
>
>
> In canopy clustering, the strategy for parallelization seems to have some 
> dead links. Need to clean them and replace with new links (if there are any). 
> Here is the link:
> http://mahout.apache.org/users/clustering/canopy-clustering.html
> Here are some details of the dead links for kmeans clustering page:
> On the k-Means clustering - basics page, 
> first line of the Quickstart part of the documentation, the hyperlink "Here"
> http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html
> first sentence of Strategy for parallelization part of documentation, the 
> hyperlink "Cluster computing and MapReduce", in the second sentence the 
> hyperlink "here" and last sentence the hyperlink 
> "http://www2.chass.ncsu.edu/garson/PA765/cluster.htm"; are dead.
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt
> http://www2.chass.ncsu.edu/garson/PA765/cluster.htm
> Under the page: 
> http://mahout.apache.org/users/clustering/visualizing-sample-clusters.html
> in the second sentence of Pre-prep part of this page, the hyperlink "setup 
> mahout" is dead.
> http://mahout.apache.org/users/clustering/users/basics/quickstart.html
> The existing documentation is too ambiguous and I recommend to make the 
> following changes so the new users can use it as tutorial.
> The Quickstart should be replaced with the following:
> Get the data from:
> wget 
> http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
> Place it within the example folder from the mahout home directory:
> mahout-0.7/examples/reuters
> mkdir reuters
> cd reuters
> mkdir reuters-out
> mv reuters21578.tar.gz reuters-out
> cd reuters-out
> tar -xzvf reuters21578.tar.gz
> cd ..
> Mahout specific Commands
> #1 run the org.apache.lucene.benchmark.utils.ExtractReuters class
> ${MAHOUT_HOME}/bin/mahout
> org.apache.lucene.benchmark.utils.ExtractReuters reuters-out
> reuters-text
> #2 copy the file to your HDFS
> bin/hadoop fs -copyFromLocal
> /home/bigdata/mahout-distribution-0.7/examples/reuters-text
> hdfs://localhost:54310/user/bigdata/
> #3 generate sequence-file
> mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text
> -o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
> -chunk → specifying the number of data blocks
> UTF-8 → specifying the appropriate input format
> #4 Check the generated sequence-file
> mahout-0.7$ ./bin/mahout seqdumper -i
> /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less
> #5 From sequence-file generate vector file
> mahout seq2sparse -i
> hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o
> hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
> -ow → overwrite
> #6 take a look at it should have 7 items by using this command
> bin/hadoop fs -ls
> reuters-vectors/df-count
> reuters-vectors/dictionary.file-0
> reuters-vectors/frequency.file-0
> reuters-vectors/tf-vectors
> reuters-vectors/tfidf-vectors
> reuters-vectors/tokenized-documents
> reuters-vectors/wordcount
> bin/hadoop fs -ls reuters-vectors
> #7 check the vector: reuters-vectors/tf-vectors/part-r-0
> mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors
> #8 Run canopy clustering to get optimal initial centroids for k-means
> mahout canopy -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors -o
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -t1 1500 -t2 2000
> -dm → specifying the distance measure to be used while clustering (here it is 
> cosine distance measure)
> #9 Run k-means clustering algorithm
> mahout kmeans -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors -c
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -o
> hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters -cd 0.1 -ow
> -x 20 -k 10
> -i → input
> -o → output
> -c → initial centroids for k-means (not defining this p

[jira] [Updated] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1464:
---

Attachment: MAHOUT-1464.patch

Updated patch. Removed the similarity form of LLR, removed math specific code 
that was addressed in MAHOUT-1508, added nicer output and did a few cosmetic 
changes.

I think this code is ready to be tested on a cluster, does anybody have time 
for that?
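
For reference, the LLR mentioned here is Dunning's G^2 statistic on a 2x2
cooccurrence table. A self-contained Scala sketch of that score (an
illustration, not code from the patch) looks like this:

// k11 = users who interacted with both items A and B, k12 = A but not B,
// k21 = B but not A, k22 = neither.
object LlrSketch {

  private def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)

  // "entropy" here is the unnormalized form: xLogX(total) minus the sum of
  // xLogX over the individual cells.
  private def entropy(cells: Long*): Double =
    xLogX(cells.sum) - cells.map(xLogX).sum

  def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy    = entropy(k11 + k12, k21 + k22)
    val columnEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    2.0 * (rowEntropy + columnEntropy - matrixEntropy)
  }

  def main(args: Array[String]): Unit = {
    // Pairs seen together far more often than independence predicts score high.
    println(logLikelihoodRatio(100, 10, 10, 10000)) // large
    println(logLikelihoodRatio(1, 100, 100, 10000)) // close to zero
  }
}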

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1391) Possibility to disable confusion matrix in naive bayes

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1391:
---

Resolution: Not a Problem
Status: Resolved  (was: Patch Available)

If you have labels in your testset that are not in your trainingset, then your 
setup is flawed and you should not run that test.
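
A minimal sketch of the sanity check implied here (a hypothetical helper, not
part of Mahout's API): verify that every label in the test split also occurs in
the training split before evaluating, instead of failing later while building
the confusion matrix.

object LabelCoverageCheck {
  // Throws with a clear message if the test split contains unseen labels.
  def requireCoveredLabels(trainLabels: Set[String], testLabels: Set[String]): Unit = {
    val unseen = testLabels -- trainLabels
    require(unseen.isEmpty,
      s"Test set contains labels missing from the training set: ${unseen.mkString(", ")}. " +
        "Re-split the data (e.g. stratified by label) before testing.")
  }

  def main(args: Array[String]): Unit = {
    requireCoveredLabels(Set("spam", "ham"), Set("spam", "ham"))          // fine
    requireCoveredLabels(Set("spam", "ham"), Set("spam", "ham", "other")) // throws
  }
}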

> Possibility to disable confusion matrix in naive bayes
> --
>
> Key: MAHOUT-1391
> URL: https://issues.apache.org/jira/browse/MAHOUT-1391
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.8
>Reporter: Mansur Iqbal
> Fix For: 1.0
>
> Attachments: MAHOUT-1391.patch
>
>
> Sometimes the confusion matrix is too big and not really necessary.
> And there is another case for this feature:
> If you split a dataset with many labels by a random selection percentage into a 
> testdataset and a trainingdataset, it can happen that there are 
> classes/labels in the testdata which do not appear in the trainingdataset. By 
> creating a model with the trainingdata, the created labelindex does not 
> include some labels from the testdata. Therefore, if you test on this model with 
> the testdata, mahout tries to create a confusion matrix with the labels from 
> the testdata which are not included in the labelindex and throws an exception.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MAHOUT-1483) Organize links in web site navigation bar

2014-04-13 Thread Andrew Musselman (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Musselman reassigned MAHOUT-1483:


Assignee: Andrew Musselman

> Organize links in web site navigation bar
> -
>
> Key: MAHOUT-1483
> URL: https://issues.apache.org/jira/browse/MAHOUT-1483
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Some links in the drop-down menus in the navigation bar are inconsistent.
> Under "Basics", there are some links whose path starts with '/users/basics', 
> one that starts with '/users/dim-reduction', one that starts with 
> '/users/clustering', and one that starts with '/users/sparkbindings'.  These 
> inconsistencies aren't terrible but are a little confusing for maintaining 
> the site.
> Under "Classification", there are some that start with 
> '/users/classification', some that start with '/users/stuff', and there's a 
> clustering example (20 newsgroups example).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1439) Update talks on Mahout

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1439:
---

Component/s: Documentation

> Update talks on Mahout
> --
>
> Key: MAHOUT-1439
> URL: https://issues.apache.org/jira/browse/MAHOUT-1439
> Project: Mahout
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
>
> The talks listed on our homepage seem to end somewhere in 2012.
> I know that there have been tons of other talks on Mahout since then, I've 
> added mine already. It would be great if everybody who knows of additional 
> talks would paste them here, so I can add them to the website.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-874.
---

Resolution: Won't Fix

Closing this issue as there has not been activity for more than half a year. 
The topic is still important though, if someone wants to restart work on that, 
this issue can be reopened.

> Extract Writables into a separate module to allow smaller dependencies
> --
>
> Key: MAHOUT-874
> URL: https://issues.apache.org/jira/browse/MAHOUT-874
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Ted Dunning
> Fix For: 1.0
>
>
> The theory is that we can have a smaller jar if we only include writable 
> classes and their exact dependencies.
> I have a prototype, but it has some funky characteristics which I would like 
> to discuss.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1421) Adapter package for all mahout tools

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1421.


Resolution: Won't Fix

I'm against keeping JIRAs open for too long. We've had very bad experiences 
with this (issues lingering around for ages). I think that the scope of this 
issue might simply be too broad.

I'm closing this one as it aims to create adapters for *all* mahout tools. It 
would be great if you could start with a specific single algorithm/family of 
algorithms and create a new issue for that. Hope that makes sense to you.

> Adapter package for all mahout tools
> 
>
> Key: MAHOUT-1421
> URL: https://issues.apache.org/jira/browse/MAHOUT-1421
> Project: Mahout
>  Issue Type: Improvement
>Reporter: jay vyas
> Fix For: 1.0
>
>
> Hi mahout.  I'd like to create an umbrella JIRA for allowing more runtime 
> flexibility for reading different types of input formats for all mahout 
> tasks. 
> Specifically, I'd like to start with the FreeTextRecommenderAdapter, which 
> typically requires:
> 1) Hashing text entries into numbers
> 2) Saving the large transformed file on disk
> 3) Feeding it into the classifier 
> Instead, we could build adapters into the classifier itself, so that the user
> 1) Specifies input file to recommender
> 2) Specifies transformation class which converts each record of input to 3 
> column recommender format
> 3) Runs internal mahout recommender directly against the data
> And thus the user could easily run mahout against existing data without 
> having to munge it too much.
> This package might be called something like "org.apache.mahout.adapters", and 
> would over time provide flexible adapters to the core mahout algorithm 
> implementations, so that folks wouldn't have to worry so much about 
> vectors/csv transformers/etc... 
> Any thoughts on this?  If positive feedback I can submit an initial patch to 
> get things started.
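
To make the proposal concrete, here is a purely hypothetical Scala sketch of
such an adapter. Neither the package nor these names exist in Mahout; the
sketch only illustrates the suggested contract: turn one raw input record into
the (userID, itemID, preference) triple the recommenders already consume.

// Hypothetical contract for the proposed adapter layer.
trait RecommenderInputAdapter {
  def toPreference(record: String): Option[(Long, Long, Float)]
}

// Free-text example: hash arbitrary user/item strings into longs, so no
// separate "hash the text, write a big intermediate file" step is needed.
class FreeTextAdapter(separator: String = ",") extends RecommenderInputAdapter {
  private def id(s: String): Long = s.trim.hashCode.toLong & 0x7fffffffL
  override def toPreference(record: String): Option[(Long, Long, Float)] =
    record.split(separator) match {
      case Array(user, item)        => Some((id(user), id(item), 1.0f))
      case Array(user, item, value) => Some((id(user), id(item), value.trim.toFloat))
      case _                        => None // skip malformed lines
    }
}

The hash-based mapping keeps the example short; a real adapter would need a
collision-safe, reversible ID mapping so recommendations can be translated back
to the original item names.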



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: KMeans|| opinions

2014-04-13 Thread Sebastian Schelter
Did you check the website at 
https://mahout.apache.org/users/clustering/k-means-clustering.html ?


On 04/13/2014 02:53 PM, Maciej Mazur wrote:

Recently I've been looking into K-means implementation.
I want to understand how it works, and why it was designed this way.
Could you give me some overview?
I see that during the setup clusters are read from the file. Is it a
distributed cache?  What's the maximal size of this file, what's the
maximum value of k?
There is nothing output during the call of the map function, everything is
saved at cleanup. Why?
Are there any docs concerning implementation?

Thanks,
Maciej


On Wed, Apr 9, 2014 at 7:23 AM, Ted Dunning  wrote:



Well, you could view this as a performance bug in the implementation of
the linear algebra.

It certainly is, however, an odd interpretation of transpose.  I have used
a similar trick in r to use sparse matrices as a counter but it always
worried me a bit.

Sent from my iPhone


On Apr 8, 2014, at 17:49, Dmitriy Lyubimov  wrote:

Problem is, I want to use linear algebra to handle that, not
combine().








[jira] [Resolved] (MAHOUT-1450) Cleaning up clustering documentation on mahout website

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1450.


Resolution: Fixed

Closing this, thanks for your help. Btw, there is already another ticket for 
Streaming k-Means, MAHOUT-1468; would you like to help with that?

> Cleaning up clustering documentation on mahout website 
> ---
>
> Key: MAHOUT-1450
> URL: https://issues.apache.org/jira/browse/MAHOUT-1450
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
> Environment: This affects all mahout versions
>Reporter: Pavan Kumar N
>  Labels: documentation, newbie
> Fix For: 1.0
>
>
> In canopy clustering, the strategy for parallelization seems to have some 
> dead links. Need to clean them and replace with new links (if there are any). 
> Here is the link:
> http://mahout.apache.org/users/clustering/canopy-clustering.html
> Here are some details of the dead links for kmeans clustering page:
> On the k-Means clustering - basics page, 
> first line of the Quickstart part of the documentation, the hyperlink "Here"
> http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html
> first sentence of Strategy for parallelization part of documentation, the 
> hyperlink "Cluster computing and MapReduce", in the second sentence the 
> hyperlink "here" and last sentence the hyperlink 
> "http://www2.chass.ncsu.edu/garson/PA765/cluster.htm"; are dead.
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt
> http://www2.chass.ncsu.edu/garson/PA765/cluster.htm
> Under the page: 
> http://mahout.apache.org/users/clustering/visualizing-sample-clusters.html
> in the second sentence of Pre-prep part of this page, the hyperlink "setup 
> mahout" is dead.
> http://mahout.apache.org/users/clustering/users/basics/quickstart.html
> The existing documentation is too ambiguous and I recommend to make the 
> following changes so the new users can use it as tutorial.
> The Quickstart should be replaced with the following:
> Get the data from:
> wget 
> http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
> Place it within the example folder from the mahout home directory:
> mahout-0.7/examples/reuters
> mkdir reuters
> cd reuters
> mkdir reuters-out
> mv reuters21578.tar.gz reuters-out
> cd reuters-out
> tar -xzvf reuters21578.tar.gz
> cd ..
> Mahout specific Commands
> #1 run the org.apache.lucene.benchmark.utils.ExtractReuters class
> ${MAHOUT_HOME}/bin/mahout
> org.apache.lucene.benchmark.utils.ExtractReuters reuters-out
> reuters-text
> #2 copy the file to your HDFS
> bin/hadoop fs -copyFromLocal
> /home/bigdata/mahout-distribution-0.7/examples/reuters-text
> hdfs://localhost:54310/user/bigdata/
> #3 generate sequence-file
> mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text
> -o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
> -chunk → specifying the number of data blocks
> UTF-8 → specifying the appropriate input format
> #4 Check the generated sequence-file
> mahout-0.7$ ./bin/mahout seqdumper -i
> /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less
> #5 From sequence-file generate vector file
> mahout seq2sparse -i
> hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o
> hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
> -ow → overwrite
> #6 take a look at it should have 7 items by using this command
> bin/hadoop fs -ls
> reuters-vectors/df-count
> reuters-vectors/dictionary.file-0
> reuters-vectors/frequency.file-0
> reuters-vectors/tf-vectors
> reuters-vectors/tfidf-vectors
> reuters-vectors/tokenized-documents
> reuters-vectors/wordcount
> bin/hadoop fs -ls reuters-vectors
> #7 check the vector: reuters-vectors/tf-vectors/part-r-0
> mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors
> #8 Run canopy clustering to get optimal initial centroids for k-means
> mahout canopy -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors -o
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -t1 1500 -t2 2000
> -dm → specifying the distance measure to be used while clustering (here it is 
> cosine distance measure)
> #9 Run k-means clustering algorithm
> mahout kmeans -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors -c
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -o
> hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters -cd 0.1 -ow
> -x 20 -k 10
> -i → input
> -o → output
> -c → initial centroids for k-means (not defining this parameter will
> trigger k-means to generate random initial centroid

[jira] [Commented] (MAHOUT-1421) Adapter package for all mahout tools

2014-04-13 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967840#comment-13967840
 ] 

jay vyas commented on MAHOUT-1421:
--

Hi Sebastian. The other docs JIRAs, like MAHOUT-1441, are sort of blocking 
me. For example:

- It's not clear from the code what the right way is to put adapters around 
existing Recommenders.
- There are still some remaining documentation holes in the clusterers (i.e. 
MAHOUT-1441). 

So I think it's good to keep this JIRA open, but first we will need to let the 
docs catch up. What do you think?

> Adapter package for all mahout tools
> 
>
> Key: MAHOUT-1421
> URL: https://issues.apache.org/jira/browse/MAHOUT-1421
> Project: Mahout
>  Issue Type: Improvement
>Reporter: jay vyas
> Fix For: 1.0
>
>
> Hi mahout.  I'd like to create an umbrella JIRA for allowing more runtime 
> flexibility for reading different types of input formats for all mahout 
> tasks. 
> Specifically, I'd like to start with the FreeTextRecommenderAdapter, which 
> typically requires:
> 1) Hashing text entries into numbers
> 2) Saving the large transformed file on disk
> 3) Feeding it into the classifier 
> Instead, we could build adapters into the classifier itself, so that the user
> 1) Specifies input file to recommender
> 2) Specifies transformation class which converts each record of input to 3 
> column recommender format
> 3) Runs internal mahout recommender directly against the data
> And thus the user could easily run mahout against existing data without 
> having to munge it too much.
> This package might be called something like "org.apache.mahout.adapters", and 
> would over time provide flexible adapters to the core mahout algorithm 
> implementations, so that folks wouldn't have to worry so much about 
> vectors/csv transformers/etc... 
> Any thoughts on this?  If positive feedback I can submit an initial patch to 
> get things started.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1450) Cleaning up clustering documentation on mahout website

2014-04-13 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967825#comment-13967825
 ] 

Ted Dunning commented on MAHOUT-1450:
-

Fine idea (with cross links).



> Cleaning up clustering documentation on mahout website 
> ---
>
> Key: MAHOUT-1450
> URL: https://issues.apache.org/jira/browse/MAHOUT-1450
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
> Environment: This affects all mahout versions
>Reporter: Pavan Kumar N
>  Labels: documentation, newbie
> Fix For: 1.0
>
>
> In canopy clustering, the strategy for parallelization seems to have some 
> dead links. Need to clean them and replace with new links (if there are any). 
> Here is the link:
> http://mahout.apache.org/users/clustering/canopy-clustering.html
> Here are some details of the dead links for kmeans clustering page:
> On the k-Means clustering - basics page, 
> first line of the Quickstart part of the documentation, the hyperlink "Here"
> http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html
> first sentence of Strategy for parallelization part of documentation, the 
> hyperlink "Cluster computing and MapReduce", in the second sentence the 
> hyperlink "here" and last sentence the hyperlink 
> "http://www2.chass.ncsu.edu/garson/PA765/cluster.htm"; are dead.
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt
> http://www2.chass.ncsu.edu/garson/PA765/cluster.htm
> Under the page: 
> http://mahout.apache.org/users/clustering/visualizing-sample-clusters.html
> in the second sentence of Pre-prep part of this page, the hyperlink "setup 
> mahout" is dead.
> http://mahout.apache.org/users/clustering/users/basics/quickstart.html
> The existing documentation is too ambiguous and I recommend to make the 
> following changes so the new users can use it as tutorial.
> The Quickstart should be replaced with the following:
> Get the data from:
> wget 
> http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
> Place it within the example folder from the mahout home directory:
> mahout-0.7/examples/reuters
> mkdir reuters
> cd reuters
> mkdir reuters-out
> mv reuters21578.tar.gz reuters-out
> cd reuters-out
> tar -xzvf reuters21578.tar.gz
> cd ..
> Mahout specific Commands
> #1 run the org.apache.lucene.benchmark.utils.ExtractReuters class
> ${MAHOUT_HOME}/bin/mahout
> org.apache.lucene.benchmark.utils.ExtractReuters reuters-out
> reuters-text
> #2 copy the file to your HDFS
> bin/hadoop fs -copyFromLocal
> /home/bigdata/mahout-distribution-0.7/examples/reuters-text
> hdfs://localhost:54310/user/bigdata/
> #3 generate sequence-file
> mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text
> -o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
> -chunk → specifying the number of data blocks
> UTF-8 → specifying the appropriate input format
> #4 Check the generated sequence-file
> mahout-0.7$ ./bin/mahout seqdumper -i
> /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less
> #5 From sequence-file generate vector file
> mahout seq2sparse -i
> hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o
> hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
> -ow → overwrite
> #6 take a look at it should have 7 items by using this command
> bin/hadoop fs -ls
> reuters-vectors/df-count
> reuters-vectors/dictionary.file-0
> reuters-vectors/frequency.file-0
> reuters-vectors/tf-vectors
> reuters-vectors/tfidf-vectors
> reuters-vectors/tokenized-documents
> reuters-vectors/wordcount
> bin/hadoop fs -ls reuters-vectors
> #7 check the vector: reuters-vectors/tf-vectors/part-r-0
> mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors
> #8 Run canopy clustering to get optimal initial centroids for k-means
> mahout canopy -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors -o
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -t1 1500 -t2 2000
> -dm → specifying the distance measure to be used while clustering (here it is 
> cosine distance measure)
> #9 Run k-means clustering algorithm
> mahout kmeans -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors -c
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -o
> hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters -cd 0.1 -ow
> -x 20 -k 10
> -i → input
> -o → output
> -c → initial centroids for k-means (not defining this parameter will
> trigger k-means to generate random initial centroids)
> -cd → convergence delta parameter
> -ow → overwrite
> -x → specifying number of k-mea

[jira] [Commented] (MAHOUT-1450) Cleaning up clustering documentation on mahout website

2014-04-13 Thread Pavan Kumar N (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967829#comment-13967829
 ] 

Pavan Kumar N commented on MAHOUT-1450:
---

Agreed. So I guess we can close this issue. If required we could have a link 
that directs to StreamingKMeans but that is your call.

> Cleaning up clustering documentation on mahout website 
> ---
>
> Key: MAHOUT-1450
> URL: https://issues.apache.org/jira/browse/MAHOUT-1450
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
> Environment: This affects all mahout versions
>Reporter: Pavan Kumar N
>  Labels: documentation, newbie
> Fix For: 1.0
>
>
> In canopy clustering, the strategy for parallelization seems to have some 
> dead links. Need to clean them and replace with new links (if there are any). 
> Here is the link:
> http://mahout.apache.org/users/clustering/canopy-clustering.html
> Here are some details of the dead links for kmeans clustering page:
> On the k-Means clustering - basics page, 
> first line of the Quickstart part of the documentation, the hyperlink "Here"
> http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html
> first sentence of Strategy for parallelization part of documentation, the 
> hyperlink "Cluster computing and MapReduce", in the second sentence the 
> hyperlink "here" and last sentence the hyperlink 
> "http://www2.chass.ncsu.edu/garson/PA765/cluster.htm"; are dead.
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt
> http://www2.chass.ncsu.edu/garson/PA765/cluster.htm
> Under the page: 
> http://mahout.apache.org/users/clustering/visualizing-sample-clusters.html
> in the second sentence of Pre-prep part of this page, the hyperlink "setup 
> mahout" is dead.
> http://mahout.apache.org/users/clustering/users/basics/quickstart.html
> The existing documentation is too ambiguous and I recommend to make the 
> following changes so the new users can use it as tutorial.
> The Quickstart should be replaced with the following:
> Get the data from:
> wget 
> http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
> Place it within the example folder from the mahout home directory:
> mahout-0.7/examples/reuters
> mkdir reuters
> cd reuters
> mkdir reuters-out
> mv reuters21578.tar.gz reuters-out
> cd reuters-out
> tar -xzvf reuters21578.tar.gz
> cd ..
> Mahout specific Commands
> #1 run the org.apache.lucene.benchmark.utils.ExtractReuters class
> ${MAHOUT_HOME}/bin/mahout
> org.apache.lucene.benchmark.utils.ExtractReuters reuters-out
> reuters-text
> #2 copy the file to your HDFS
> bin/hadoop fs -copyFromLocal
> /home/bigdata/mahout-distribution-0.7/examples/reuters-text
> hdfs://localhost:54310/user/bigdata/
> #3 generate sequence-file
> mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text
> -o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
> -chunk → specifying the number of data blocks
> UTF-8 → specifying the appropriate input format
> #4 Check the generated sequence-file
> mahout-0.7$ ./bin/mahout seqdumper -i
> /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less
> #5 From sequence-file generate vector file
> mahout seq2sparse -i
> hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o
> hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
> -ow → overwrite
> #6 take a look at it should have 7 items by using this command
> bin/hadoop fs -ls
> reuters-vectors/df-count
> reuters-vectors/dictionary.file-0
> reuters-vectors/frequency.file-0
> reuters-vectors/tf-vectors
> reuters-vectors/tfidf-vectors
> reuters-vectors/tokenized-documents
> reuters-vectors/wordcount
> bin/hadoop fs -ls reuters-vectors
> #7 check the vector: reuters-vectors/tf-vectors/part-r-0
> mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors
> #8 Run canopy clustering to get optimal initial centroids for k-means
> mahout canopy -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors -o
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -t1 1500 -t2 2000
> -dm → specifying the distance measure to be used while clustering (here it is 
> cosine distance measure)
> #9 Run k-means clustering algorithm
> mahout kmeans -i
> hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors -c
> hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -o
> hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters -cd 0.1 -ow
> -x 20 -k 10
> -i → input
> -o → output
> -c → initial centroids for k-means (not defining this parameter will
> trigger k-means to generate random i

Re: KMeans|| opinions

2014-04-13 Thread Maciej Mazur
Recently I've been looking into K-means implementation.
I want to understand how it works, and why it was designed this way.
Could you give me some overview?
I see that during the setup clusters are read from the file. Is it a
distributed cache?  What's the maximal size of this file, what's the
maximum value of k?
There is nothing output during the call of the map function, everything is
saved at cleanup. Why?
Are there any docs concerning implementation?

Thanks,
Maciej


On Wed, Apr 9, 2014 at 7:23 AM, Ted Dunning  wrote:

>
> Well, you could view this as a performance bug in the implementation of
> the linear algebra.
>
> It certainly is, however, an odd interpretation of transpose.  I have used
> a similar trick in r to use sparse matrices as a counter but it always
> worried me a bit.
>
> Sent from my iPhone
>
> > On Apr 8, 2014, at 17:49, Dmitriy Lyubimov  wrote:
> >
> > Problem is, I want to use linear algebra to handle that, not
> > combine().
>


[jira] [Commented] (MAHOUT-1278) Improve inheritance of Apache parent pom

2014-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967815#comment-13967815
 ] 

Sean Owen commented on MAHOUT-1278:
---

POM version 14 is out now too:

http://svn.apache.org/viewvc/maven/pom/tags/apache-14/pom.xml?r1=HEAD&r2=1434717&diff_format=h

> Improve inheritance of Apache parent pom
> 
>
> Key: MAHOUT-1278
> URL: https://issues.apache.org/jira/browse/MAHOUT-1278
> Project: Mahout
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.8
>Reporter: Stevo Slavic
>Assignee: Stevo Slavic
>Priority: Minor
> Fix For: 1.0
>
>
> We should update dependency on Apache parent pom (currently we depend on 
> version 9, while 13 is already released).
> With the upgrade we should make the most of inherited settings and plugin 
> versions from Apache parent pom, so we override only what is necessary, to 
> make Mahout POMs smaller and easier to maintain.
> Hopefully by the time this issue gets worked on, 
> maven-remote-resources-plugin with 
> [MRRESOURCES-53|http://jira.codehaus.org/browse/MRRESOURCES-53] fix will be 
> released (since we're affected by it - test jars are being resolved from 
> remote repository instead of from the current build / reactor repository), and 
> updated Apache parent pom released.
> Implementation note: Mahout parent module and mahout-buildtools module both 
> use Apache parent pom as parent, so both need to be updated. 
> mahout-buildtools module had to be separate from the mahout parent pom (not 
> inheriting it), so that buildtools module can be referenced as dependency of 
> various source quality check plugins.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1495) Create a website describing the distributed item-based recommender

2014-04-13 Thread Andrew Psaltis (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967813#comment-13967813
 ] 

Andrew Psaltis commented on MAHOUT-1495:


I have not had a chance to get started on this yet.

> Create a website describing the distributed item-based recommender
> --
>
> Key: MAHOUT-1495
> URL: https://issues.apache.org/jira/browse/MAHOUT-1495
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering, Documentation
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1496) Create a website describing the distributed ALS recommender

2014-04-13 Thread jian wang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967812#comment-13967812
 ] 

jian wang commented on MAHOUT-1496:
---

Will try to submit a patch next week.

> Create a website describing the distributed ALS recommender
> ---
>
> Key: MAHOUT-1496
> URL: https://issues.apache.org/jira/browse/MAHOUT-1496
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering, Documentation
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


Documentation, Documentation, Documentation

2014-04-13 Thread Sebastian Schelter

Hi,

this is another reminder that we still have to finish our documentation 
improvements! The website looks shiny now and there have been lots of 
discussions about new directions, but we still have some work to do in 
cleaning up webpages. We should especially make sure that the examples work.


Please help with that: anyone who is willing to sacrifice some time to go 
through a webpage and try out the steps described there is of great help to 
the project. It would also be awesome to get some help in creating a few 
new pages, especially for the recommenders.


Here's the list of documentation related jira's for 1.0:

https://issues.apache.org/jira/browse/MAHOUT-1441?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Documentation%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

Best,
Sebastian


[jira] [Commented] (MAHOUT-1249) Build tools around mahout to check the training error of factorization and automatically detect convergence

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967809#comment-13967809
 ] 

Sebastian Schelter commented on MAHOUT-1249:


[~bswangjian] what's the status here?

> Build tools around mahout to check the training error of factorization and 
> automatically detect convergence
> ---
>
> Key: MAHOUT-1249
> URL: https://issues.apache.org/jira/browse/MAHOUT-1249
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 1.0
>Reporter: Saikat Kanjilal
> Fix For: 1.0
>
>
> The goal of this task is to check the training error of the factorization 
> during the computation to make it automatically detect convergence.  The goal 
> is not to have to specify the number of iterations as a parameter needed for 
> convergence.
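
A minimal sketch of the idea, assuming one factorization sweep can report its
training RMSE (the driver and all names are hypothetical, not Mahout code):
iterate until the relative improvement falls below a tolerance instead of
running a user-supplied number of iterations.

object ConvergenceDrivenFactorization {
  // sweep() performs one full pass (e.g. one ALS update of both factors) and
  // returns the training RMSE afterwards.
  def run(sweep: () => Double, tol: Double = 1e-4, maxSweeps: Int = 50): Double = {
    var previous = Double.MaxValue
    var current  = sweep()
    var n        = 1
    // Stop when the relative drop in error is below tol, or error stops improving.
    while (n < maxSweeps && previous > 0 && (previous - current) / previous > tol) {
      previous = current
      current  = sweep()
      n += 1
    }
    current
  }
}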



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967810#comment-13967810
 ] 

Sebastian Schelter commented on MAHOUT-1178:


[~gokhancapan] what's the status here?

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.
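
As a rough illustration of constraints (a), (d) and (e), and leaving aside how
the term frequencies are actually pulled out of the Lucene index, a
hypothetical Scala sketch of the document-to-row conversion could look like
this:

object LuceneRowsSketch {
  // docs: (documentId, term -> frequency), however they were extracted.
  // Returns a shared term dictionary plus one sparse row per document, keyed
  // by its id so every row can be traced back to the source document.
  def toRows(docs: Seq[(String, Map[String, Long])])
      : (Map[String, Int], Seq[(String, Map[Int, Double])]) = {
    // One stable dictionary (term -> column index) shared by all rows.
    val dictionary = docs.flatMap(_._2.keys).distinct.sorted.zipWithIndex.toMap
    val rows = docs.map { case (docId, tf) =>
      docId -> tf.map { case (term, freq) => dictionary(term) -> freq.toDouble }
    }
    (dictionary, rows)
  }
}

Numeric and multiple text fields (constraints (b) and (c)) would still need a
decision about whether they become extra columns of the same matrix or separate
matrices, which is the part the issue deliberately leaves open.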



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1445) Create an intro for item based recommender

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1445:
---

Component/s: Documentation

> Create an intro for item based recommender
> --
>
> Key: MAHOUT-1445
> URL: https://issues.apache.org/jira/browse/MAHOUT-1445
> Project: Mahout
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Maciej Mazur
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1443) Update "How to release page"

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1443:
---

Component/s: Documentation

> Update "How to release page"
> 
>
> Key: MAHOUT-1443
> URL: https://issues.apache.org/jira/browse/MAHOUT-1443
> Project: Mahout
>  Issue Type: Task
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Suneel Marthi
> Fix For: 1.0
>
>
> I have a favor to ask. Could you have a look at the "How To Release" 
> page and tell if the information there is still correct? I'm asking you 
> this because you have done the latest release. After your OK, I'll go 
> and improve formatting and readability of that page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1446) Create an intro for matrix factorization

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1446:
---

Component/s: Documentation

> Create an intro for matrix factorization
> 
>
> Key: MAHOUT-1446
> URL: https://issues.apache.org/jira/browse/MAHOUT-1446
> Project: Mahout
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Maciej Mazur
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1468) Creating a new page for StreamingKMeans documentation on mahout website

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1468:
---

Component/s: Documentation

> Creating a new page for StreamingKMeans documentation on mahout website
> ---
>
> Key: MAHOUT-1468
> URL: https://issues.apache.org/jira/browse/MAHOUT-1468
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.0
>Reporter: Pavan Kumar N
>  Labels: Documentation
> Fix For: 1.0
>
>
> A separate page is required with a description and overview of the Streaming K-Means 
> algorithm, explaining the various parameters that can be used in streamingkmeans, 
> the strategy for parallelization, and a link to this paper: 
> http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1462) Cleaning up Random Forests documentation on Mahout website

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1462:
---

Component/s: Documentation

> Cleaning up Random Forests documentation on Mahout website
> --
>
> Key: MAHOUT-1462
> URL: https://issues.apache.org/jira/browse/MAHOUT-1462
> Project: Mahout
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Manoj Awasthi
> Fix For: 1.0
>
>
> Following are the items which need to be added or changed. 
> I think this page can be broken into two segments. The first can be the following: 
> 
> Introduction to Random Forests
> Random Forests are an ensemble machine learning technique originally proposed 
> by Leo Breiman (UCB) which uses classification and regression trees as the 
> underlying classification mechanism. The trademark for Random Forests is held 
> by Leo Breiman and Adele Cutler. 
> Official website for Random Forests: 
> http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
> Original paper published: http://oz.berkeley.edu/~breiman/randomforest2001.pdf
> 
> The second section can be the following: 
> 
> Classifying with random forests in Mahout
> 
> This section can be what it is right now on the website.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1495) Create a website describing the distributed item-based recommender

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967808#comment-13967808
 ] 

Sebastian Schelter commented on MAHOUT-1495:


[~apsaltis] what's the status here?

> Create a website describing the distributed item-based recommender
> --
>
> Key: MAHOUT-1495
> URL: https://issues.apache.org/jira/browse/MAHOUT-1495
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering, Documentation
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1496) Create a website describing the distributed ALS recommender

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1496:
---

Component/s: Documentation

> Create a website describing the distributed ALS recommender
> ---
>
> Key: MAHOUT-1496
> URL: https://issues.apache.org/jira/browse/MAHOUT-1496
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering, Documentation
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1495) Create a website describing the distributed item-based recommender

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1495:
---

Component/s: Documentation

> Create a website describing the distributed item-based recommender
> --
>
> Key: MAHOUT-1495
> URL: https://issues.apache.org/jira/browse/MAHOUT-1495
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering, Documentation
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1277) Lose dependency on custom commons-cli

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1277:
---

Fix Version/s: 1.0

> Lose dependency on custom commons-cli
> -
>
> Key: MAHOUT-1277
> URL: https://issues.apache.org/jira/browse/MAHOUT-1277
> Project: Mahout
>  Issue Type: Improvement
>  Components: build, CLI
>Affects Versions: 0.8
>Reporter: Stevo Slavic
>Assignee: Stevo Slavic
>Priority: Minor
> Fix For: 1.0
>
>
> In 0.8 we have a dependency on a custom commons-cli fork, 
> org.apache.mahout.commons:commons-cli. There are no sources for this under 
> Mahout version control. It's a risk to keep this as a dependency.
> We should either use the officially released and maintained commons-cli version, 
> or, if it's not sufficient for the Mahout project's needs, replace it completely 
> with something else (e.g. [JCommander|http://jcommander.org/]).
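If JCommander were chosen, option handling would look roughly like the following. This is only a sketch; the class name, option names and defaults are invented for illustration:

{code}
import com.beust.jcommander.JCommander;
import com.beust.jcommander.Parameter;

public class SplitArgsSketch {

  @Parameter(names = "--input", description = "input path", required = true)
  String input;

  @Parameter(names = "--numSplits", description = "number of splits to write")
  int numSplits = 1;

  public static void main(String[] argv) {
    SplitArgsSketch args = new SplitArgsSketch();
    JCommander jc = new JCommander(args);
    jc.parse(argv); // fills the annotated fields from the command line
    System.out.println(args.input + " -> " + args.numSplits + " splits");
  }
}
{code}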



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1497) mahout resplit not producing splited files

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967807#comment-13967807
 ] 

Sebastian Schelter commented on MAHOUT-1497:


What's the status here?

> mahout resplit not producing splited files
> --
>
> Key: MAHOUT-1497
> URL: https://issues.apache.org/jira/browse/MAHOUT-1497
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 0.8
>Reporter: Reinis Vicups
> Fix For: 1.0
>
>
> When I run "mahout resplit", I get the output below but no split files are 
> being produced.
> {code}
> support@hadoop1:~$ mahout resplit --input .../final/clusteredPoints/part-m-* 
> --output .../final/split --numSplits 4
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using 
> /opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/bin/../lib/hadoop/bin/hadoop 
> and HADOOP_CONF_DIR=/etc/hadoop/conf
> MAHOUT-JOB: 
> /opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/mahout/mahout-examples-0.8-cdh5.0.0-beta-2-job.jar
> 14/03/28 16:22:50 WARN driver.MahoutDriver: No resplit.props found on 
> classpath, will use command-line arguments only
> Writing 4 splits
> Writing split 0
> Writing split 1
> Writing split 2
> Writing split 3
> 14/03/28 16:22:52 INFO driver.MahoutDriver: Program took 2077 ms (Minutes: 
> 0.034614)
> {code}
> The folder "cluteredPoints" passed to --input of resplit contains clustered 
> points generated by k-means algorithm from mahout.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1497) mahout resplit not producing splited files

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1497:
---

Fix Version/s: 1.0

> mahout resplit not producing splited files
> --
>
> Key: MAHOUT-1497
> URL: https://issues.apache.org/jira/browse/MAHOUT-1497
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 0.8
>Reporter: Reinis Vicups
> Fix For: 1.0
>
>
> When I run "mahout resplit", I get the output below but no split files are 
> being produced.
> {code}
> support@hadoop1:~$ mahout resplit --input .../final/clusteredPoints/part-m-* 
> --output .../final/split --numSplits 4
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using 
> /opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/bin/../lib/hadoop/bin/hadoop 
> and HADOOP_CONF_DIR=/etc/hadoop/conf
> MAHOUT-JOB: 
> /opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/mahout/mahout-examples-0.8-cdh5.0.0-beta-2-job.jar
> 14/03/28 16:22:50 WARN driver.MahoutDriver: No resplit.props found on 
> classpath, will use command-line arguments only
> Writing 4 splits
> Writing split 0
> Writing split 1
> Writing split 2
> Writing split 3
> 14/03/28 16:22:52 INFO driver.MahoutDriver: Program took 2077 ms (Minutes: 
> 0.034614)
> {code}
> The folder "cluteredPoints" passed to --input of resplit contains clustered 
> points generated by k-means algorithm from mahout.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1432) Reduce Hadoop dependencies - NaiveBayesModel

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1432:
---

  Environment: (was: 
)
Fix Version/s: 1.0

> Reduce Hadoop dependencies - NaiveBayesModel
> 
>
> Key: MAHOUT-1432
> URL: https://issues.apache.org/jira/browse/MAHOUT-1432
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.9
>Reporter: Frank Scholten
>Assignee: Frank Scholten
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1432.patch, MAHOUT-1432.patch
>
>
> After the discussion on the mailing list I decided to clean up some Hadoop 
> dependencies and started with the NaiveBayesModel. 
> I'd like to commit this soon and move on to the next algorithm or model to 
> clean up.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1278) Improve inheritance of Apache parent pom

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967806#comment-13967806
 ] 

Sebastian Schelter commented on MAHOUT-1278:


[~sslavic] What's the progress here?

> Improve inheritance of Apache parent pom
> 
>
> Key: MAHOUT-1278
> URL: https://issues.apache.org/jira/browse/MAHOUT-1278
> Project: Mahout
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.8
>Reporter: Stevo Slavic
>Assignee: Stevo Slavic
>Priority: Minor
> Fix For: 1.0
>
>
> We should update dependency on Apache parent pom (currently we depend on 
> version 9, while 13 is already released).
> With the upgrade we should make the most of inherited settings and plugin 
> versions from Apache parent pom, so we override only what is necessary, to 
> make Mahout POMs smaller and easier to maintain.
> Hopefully by the time this issue gets worked on, 
> maven-remote-resources-plugin with 
> [MRRESOURCES-53|http://jira.codehaus.org/browse/MRRESOURCES-53] fix will be 
> released (since we're affected by it - test jars are being resolved from the 
> remote repository instead of from the current build / reactor repository), and an 
> updated Apache parent pom released.
> Implementation note: Mahout parent module and mahout-buildtools module both 
> use Apache parent pom as parent, so both need to be updated. 
> mahout-buildtools module had to be separate from the mahout parent pom (not 
> inheriting it), so that buildtools module can be referenced as dependency of 
> various source quality check plugins.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1387) Create page for release notes

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1387:
---

Fix Version/s: 1.0

> Create page for release notes
> -
>
> Key: MAHOUT-1387
> URL: https://issues.apache.org/jira/browse/MAHOUT-1387
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 0.8
>Reporter: Isabel Drost-Fromm
>Priority: Minor
> Fix For: 1.0
>
>
> Starting with 0.6, our release notes are published on our main web page - 
> interleaved with other news items.
> For reference it would be good to have one canonical go-to page for past 
> release notes on our main Apache CMS-powered web page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1432) Reduce Hadoop dependencies - NaiveBayesModel

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967805#comment-13967805
 ] 

Sebastian Schelter commented on MAHOUT-1432:


What's the progress here?

> Reduce Hadoop dependencies - NaiveBayesModel
> 
>
> Key: MAHOUT-1432
> URL: https://issues.apache.org/jira/browse/MAHOUT-1432
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.9
> Environment: 
>Reporter: Frank Scholten
>Assignee: Frank Scholten
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1432.patch, MAHOUT-1432.patch
>
>
> After the discussion on the mailing list I decided to clean up some Hadoop 
> dependencies and started with the NaiveBayesModel. 
> I'd like to commit this soon and move on to the next algorithm or model to 
> clean up.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1385) Caching Encoders don't cache

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1385:
---

Fix Version/s: 1.0

> Caching Encoders don't cache
> 
>
> Key: MAHOUT-1385
> URL: https://issues.apache.org/jira/browse/MAHOUT-1385
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Johannes Schulte
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1385-test.patch, MAHOUT-1385.patch
>
>
> The Caching... line of encoders contains code for caching the hash codes of terms 
> added to the vector. However, the method "hashForProbe" inside these classes 
> is never called, as its signature has String for the originalForm parameter 
> (instead of byte[] like the other encoders).
> Changing this to byte[], however, would lose the Java String's internal caching 
> of its hash code, which is used as a key in the cache map, triggering 
> another hash code calculation.
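A toy illustration of the mechanism (these are not the actual Mahout classes, just a minimal sketch of why an overload with a different parameter type is never dispatched to):

{code}
public class OverloadVsOverride {

  static class BaseEncoder {
    int hashForProbe(byte[] originalForm, int dataSize, String name, int probe) {
      return 1; // the "uncached" path
    }
  }

  static class CachingEncoder extends BaseEncoder {
    // String != byte[], so this is a new overload, not an override,
    // and callers holding a BaseEncoder reference never reach it.
    int hashForProbe(String originalForm, int dataSize, String name, int probe) {
      return 2; // the "cached" path
    }
  }

  public static void main(String[] args) {
    BaseEncoder encoder = new CachingEncoder();
    System.out.println(encoder.hashForProbe(new byte[0], 100, "field", 0)); // prints 1
  }
}
{code}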



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1278) Improve inheritance of Apache parent pom

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1278:
---

Fix Version/s: 1.0

> Improve inheritance of Apache parent pom
> 
>
> Key: MAHOUT-1278
> URL: https://issues.apache.org/jira/browse/MAHOUT-1278
> Project: Mahout
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.8
>Reporter: Stevo Slavic
>Assignee: Stevo Slavic
>Priority: Minor
> Fix For: 1.0
>
>
> We should update dependency on Apache parent pom (currently we depend on 
> version 9, while 13 is already released).
> With the upgrade we should make the most of inherited settings and plugin 
> versions from Apache parent pom, so we override only what is necessary, to 
> make Mahout POMs smaller and easier to maintain.
> Hopefully by the time this issue gets worked on, 
> maven-remote-resources-plugin with 
> [MRRESOURCES-53|http://jira.codehaus.org/browse/MRRESOURCES-53] fix will be 
> released (since we're affected by it - test jars are being resolved from the 
> remote repository instead of from the current build / reactor repository), and an 
> updated Apache parent pom released.
> Implementation note: Mahout parent module and mahout-buildtools module both 
> use Apache parent pom as parent, so both need to be updated. 
> mahout-buildtools module had to be separate from the mahout parent pom (not 
> inheriting it), so that buildtools module can be referenced as dependency of 
> various source quality check plugins.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1483) Organize links in web site navigation bar

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1483:
---

Fix Version/s: 1.0

> Organize links in web site navigation bar
> -
>
> Key: MAHOUT-1483
> URL: https://issues.apache.org/jira/browse/MAHOUT-1483
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Some links in the drop-down menus in the navigation bar are inconsistent.
> Under "Basics", there are some links whose path starts with '/users/basics', 
> one that starts with '/users/dim-reduction', one that starts with 
> '/users/clustering', and one that starts with '/users/sparkbindings'.  These 
> inconsistencies aren't terrible but are a little confusing for maintaining 
> the site.
> Under "Classification", there are some that start with 
> '/users/classification', some that start with '/users/stuff', and there's a 
> clustering example (20 newsgroups example).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1483) Organize links in web site navigation bar

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967804#comment-13967804
 ] 

Sebastian Schelter commented on MAHOUT-1483:


[~andrew.musselman] you should go ahead and change the link structure. You can 
check out the webpage from svn (the site folder).

> Organize links in web site navigation bar
> -
>
> Key: MAHOUT-1483
> URL: https://issues.apache.org/jira/browse/MAHOUT-1483
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Some links in the drop-down menus in the navigation bar are inconsistent.
> Under "Basics", there are some links whose path starts with '/users/basics', 
> one that starts with '/users/dim-reduction', one that starts with 
> '/users/clustering', and one that starts with '/users/sparkbindings'.  These 
> inconsistencies aren't terrible but are a little confusing for maintaining 
> the site.
> Under "Classification", there are some that start with 
> '/users/classification', some that start with '/users/stuff', and there's a 
> clustering example (20 newsgroups example).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1484) Spectral algorithm for HMMs

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967803#comment-13967803
 ] 

Sebastian Schelter commented on MAHOUT-1484:


Any progress here?

> Spectral algorithm for HMMs
> ---
>
> Key: MAHOUT-1484
> URL: https://issues.apache.org/jira/browse/MAHOUT-1484
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Emaad Manzoor
>Priority: Minor
>
> Following up with this 
> [comment|https://issues.apache.org/jira/browse/MAHOUT-396?focusedCommentId=12898284&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12898284]
>  by [~isabel] on the sequential HMM 
> [proposal|https://issues.apache.org/jira/browse/MAHOUT-396], is there any 
> interest in a spectral algorithm as described in: "A spectral algorithm for 
> learning hidden Markov models (D. Hsu, S. Kakade, T. Zhang)"?
> I would like to take up this effort.
> This will enable learning the parameters of and making predictions with an HMM 
> in a single step. At its core, the algorithm involves computing estimates 
> from triples of observations, performing an SVD and then some matrix 
> multiplications.
> This could also form the base for an implementation of "Hilbert Space 
> Embeddings of Hidden Markov Models (L. Song, B. Boots, S. Siddiqi, G. Gordon, 
> A. Smola)".



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1505) structure of clusterdump's JSON output

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1505:
---

Fix Version/s: 1.0

> structure of clusterdump's JSON output
> --
>
> Key: MAHOUT-1505
> URL: https://issues.apache.org/jira/browse/MAHOUT-1505
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Terry Blankers
>Assignee: Andrew Musselman
>  Labels: json
> Fix For: 1.0
>
>
> Hi all, I'm working on some automated analysis of the clusterdump output 
> using '-of = JSON'. While digging into the structure of the representation of 
> the data I've noticed something that seems a little odd to me.
> In order to access the data for a particular cluster, the 'cluster', 'n', 'c' 
> & 'r' values are all in one continuous string. For example:
> {noformat}
> {"cluster":"VL-10515{n=5924 c=[action:0.023, adherence:0.223, 
> administration:0.011 r=[action:0.446, adherence:1.501, 
> administration:0.306]}"}
> {noformat}
> This is also the case for the "point":
> {noformat}
> {"point":"013FFD34580BA31AECE5D75DE65478B3D691D138 = [body:6.904, 
> harm:10.101]","vector_name":"013FFD34580BA31AECE5D75DE65478B3D691D138","weight":"1.0"}
> {noformat}
> This leads me to believe that the only way I can get to the individual data 
> in these items is by string parsing. For JSON deserialization I would have 
> expected to see something along the lines of:
> {noformat}
> {
> "cluster":"VL-10515",
> "n":5924,
> "c":
> [
> {"action":0.023},
> {"adherence":0.223},
> {"administration":0.011}
> ],
> "r":
> [
> {"action":0.446},
> {"adherence":1.501},
> {"administration":0.306}
> ]
> }
> {noformat}
> and:
> {noformat}
> {
> "point": {
> "body": 6.904,
> "harm": 10.101
> },
> "vector_name": "013FFD34580BA31AECE5D75DE65478B3D691D138",
> "weight": 1.0
> } 
> {noformat}
> Andrew Musselman replied:
> {quote}
> Looks like a bug to me as well; I would have expected something similar to
> what you were expecting except maybe something like this which puts the "c"
> and "r" values in objects rather than arrays of single-element objects:
> {noformat}
> {
> "cluster":"VL-10515",
> "n":5924,
> "c":
> {
> "action":0.023,
> "adherence":0.223,
> "administration":0.011
> },
> "r":
> {
>"action":0.446,
>"adherence":1.501,
>"administration":0.306
> }
> }
> {noformat}
> {quote}
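For comparison, the object-style structure Andrew describes is what a plain serialization of nested maps would produce. A throwaway sketch (Jackson is used here purely for illustration; it is not necessarily what clusterdump uses internally):

{code}
import java.util.LinkedHashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;

public class ClusterJsonSketch {
  public static void main(String[] args) throws Exception {
    Map<String, Object> cluster = new LinkedHashMap<String, Object>();
    cluster.put("cluster", "VL-10515");
    cluster.put("n", 5924);

    Map<String, Double> c = new LinkedHashMap<String, Double>();
    c.put("action", 0.023);
    c.put("adherence", 0.223);
    c.put("administration", 0.011);
    cluster.put("c", c);

    Map<String, Double> r = new LinkedHashMap<String, Double>();
    r.put("action", 0.446);
    r.put("adherence", 1.501);
    r.put("administration", 0.306);
    cluster.put("r", r);

    // prints {"cluster":"VL-10515","n":5924,"c":{...},"r":{...}}
    System.out.println(new ObjectMapper().writeValueAsString(cluster));
  }
}
{code}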



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1355) Frequent Pattern Mining algorithms for Mahout

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1355.


Resolution: Won't Fix

I don't see value in adding new MR Code. Shout if you disagree.

> Frequent Pattern Mining algorithms for Mahout
> -
>
> Key: MAHOUT-1355
> URL: https://issues.apache.org/jira/browse/MAHOUT-1355
> Project: Mahout
>  Issue Type: New Feature
>  Components: Frequent Itemset/Association Rule Mining
>Affects Versions: 0.9
>Reporter: Sandy Moens
>Priority: Minor
> Attachments: MAHOUT-1355.patch, MAHOUT-1355_V2.patch
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> We implemented frequent pattern mining algorithms for Hadoop and adapted them 
> to Mahout. We used "PFP" (now deprecated) as a benchmark and these 
> implementations perform better in terms of speed and memory footprint. The 
> details of the implementations can be found in the paper Frequent Pattern 
> Mining for BigData ( http://adrem.ua.ac.be/bigfim )
> We have been maintaining the project for a while in GitLab ( 
> https://gitlab.com/adrem/bigfim ). Documentation for adaptation ( 
> Readme-Mahout.md ) and usage in mahout ( Mahout-wiki.md ) can be found there.
> We are open to any modification and/or improvement requests to make it more 
> worthwhile for the Mahout project. We, as the research group, volunteer to 
> maintain FPM algorithms as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1506) Creation of affinity matrix for spectral clustering

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1506:
---

Fix Version/s: 1.0

> Creation of affinity matrix for spectral clustering
> ---
>
> Key: MAHOUT-1506
> URL: https://issues.apache.org/jira/browse/MAHOUT-1506
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 1.0
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
> Fix For: 1.0
>
>
> I wanted to get this discussion going, since I think this is a critical 
> blocker for any kind of documentation update on spectral clustering (I can't 
> update the documentation until the algorithm is useful, and it won't be 
> useful until there's a built-in method for converting raw data to an affinity 
> matrix).
> Namely, I'm wondering what kind of "raw" data this algorithm should be 
> expecting (anything that k-means expects, basically?), and what data 
> structures are associated with it. I've created a proof-of-concept for how 
> pairwise affinity generation could work.
> https://github.com/magsol/Hadoop-Affinity
> It's a two-step job, but if the input data format 
> provides 1) the total number of data points, and 2) each data point's 
> index in the overall set, then the first job can be scrapped 
> entirely and affinity generation will consist of one MR task.
> (discussions on Spark / h2o pending, of course)
> Mainly this is an engineering problem at this point. Let me know your 
> thoughts and I'll get this done (I'm out of town the next 10 days for my 
> wedding/honeymoon, will get to this on my return).
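For the affinity itself, the usual choice in spectral clustering is a Gaussian (RBF) kernel over pairwise distances. A minimal sketch on Mahout vectors (the bandwidth sigma is a user-supplied parameter, and the method name is made up):

{code}
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class AffinitySketch {

  // Gaussian (RBF) affinity between two data points.
  static double affinity(Vector a, Vector b, double sigma) {
    double distSquared = a.getDistanceSquared(b);
    return Math.exp(-distSquared / (2.0 * sigma * sigma));
  }

  public static void main(String[] args) {
    Vector x = new DenseVector(new double[] {1, 0});
    Vector y = new DenseVector(new double[] {0, 1});
    System.out.println(affinity(x, y, 1.0)); // ~0.3679
  }
}
{code}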



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1412) Build warning due to multiple Scala versions

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1412.


Resolution: Not a Problem

> Build warning due to multiple Scala versions
> 
>
> Key: MAHOUT-1412
> URL: https://issues.apache.org/jira/browse/MAHOUT-1412
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.9
>Reporter: Frank Scholten
>Priority: Minor
>
> I see the following build warning:
> 22:42:07 [WARNING]  Expected all dependencies to require Scala version: 2.9.3
> 22:42:07 [WARNING]  org.apache.mahout:mahout-math-scala:1.0-SNAPSHOT requires 
> scala version: 2.9.3
> 22:42:07 [WARNING]  org.scalatest:scalatest_2.9.2:1.9.1 requires scala 
> version: 2.9.2
> 22:42:07 [WARNING] Multiple versions of scala libraries detected!
> Which version should we use?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967801#comment-13967801
 ] 

Sebastian Schelter commented on MAHOUT-1498:


Could you provide a patch for your changes?

> DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed 
> using oozie
> -
>
> Key: MAHOUT-1498
> URL: https://issues.apache.org/jira/browse/MAHOUT-1498
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
> Fix For: 1.0
>
>
> Hi, I get the following exception: 
> {code}
> <<< Invocation of Main class completed <<<
> Failing Oozie Launcher, Main class 
> [org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles], main() threw 
> exception, Job failed!
> java.lang.IllegalStateException: Job failed!
> at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
> at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
> at 
> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
> {code}
> The root cause is:
> {code}
> Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
> at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:247
> {code}
> Looks like it happens because of the 
> DictionaryVectorizer.makePartialVectors method.
> It has this code:
> {code}
> DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
> {code}
> which overwrites the jars pushed with the job by oozie:
> {code}
> public static void setCacheFiles(URI[] files, Configuration conf) {
>  String sfiles = StringUtils.uriToString(files);
>  conf.set("mapred.cache.files", sfiles);
> }
> {code}
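A possible direction for a fix, sketched here rather than taken from an actual patch: append the dictionary to whatever is already in the distributed cache instead of replacing mapred.cache.files wholesale.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class CacheFixSketch {

  // Unlike setCacheFiles, addCacheFile appends to mapred.cache.files,
  // so jars already registered by oozie are kept.
  static void addDictionaryToCache(Path dictionaryFilePath, Configuration conf) {
    DistributedCache.addCacheFile(dictionaryFilePath.toUri(), conf);
  }
}
{code}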



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1498:
---

Fix Version/s: 1.0

> DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed 
> using oozie
> -
>
> Key: MAHOUT-1498
> URL: https://issues.apache.org/jira/browse/MAHOUT-1498
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
> Fix For: 1.0
>
>
> Hi, I get the following exception: 
> {code}
> <<< Invocation of Main class completed <<<
> Failing Oozie Launcher, Main class 
> [org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles], main() threw 
> exception, Job failed!
> java.lang.IllegalStateException: Job failed!
> at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
> at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
> at 
> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
> {code}
> The root cause is:
> {code}
> Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
> at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:247
> {code}
> Looks like it happens because of the 
> DictionaryVectorizer.makePartialVectors method.
> It has this code:
> {code}
> DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
> {code}
> which overwrites the jars pushed with the job by oozie:
> {code}
> public static void setCacheFiles(URI[] files, Configuration conf) {
>  String sfiles = StringUtils.uriToString(files);
>  conf.set("mapred.cache.files", sfiles);
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1490) Data frame R-like bindings

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1490:
---

Fix Version/s: 1.0

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1421) Adapter package for all mahout tools

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967800#comment-13967800
 ] 

Sebastian Schelter commented on MAHOUT-1421:


[~jayunit100], what's the status here?

> Adapter package for all mahout tools
> 
>
> Key: MAHOUT-1421
> URL: https://issues.apache.org/jira/browse/MAHOUT-1421
> Project: Mahout
>  Issue Type: Improvement
>Reporter: jay vyas
> Fix For: 1.0
>
>
> Hi mahout.  I'd like to create an umbrella JIRA for allowing more runtime 
> flexibility for reading different types of input formats for all mahout 
> tasks. 
> Specifically, I'd like to start with the FreeTextRecommenderAdapter, which 
> typically requires:
> 1) Hashing text entries into numbers
> 2) Saving the large transformed file on disk
> 3) Feeding it into the classifier 
> Instead, we could build adapters into the classifier itself, so that the user
> 1) Specifies input file to recommender
> 2) Specifies transformation class which converts each record of input to 3 
> column recommender format
> 3) Runs internal mahout recommender directly against the data
> And thus the user could easily run mahout against existing data without 
> having to munge it too much.
> This package might be called something like "org.apache.mahout.adapters", and 
> would over time provide flexible adapters to the core mahout algorithm 
> implementations, so that folks wouldn't have to worry so much about 
> vectors/csv transformers/etc... 
> Any thoughts on this? If there's positive feedback, I can submit an initial patch to 
> get things started.
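One way the "transformation class" from step 2 could be expressed; all names here are hypothetical, nothing like this exists in the codebase yet:

{code}
// Hypothetical adapter contract: turn one raw input record into the
// (userID, itemID, preference) triple the recommenders expect.
public interface RecordAdapter {
  long userId(String rawRecord);
  long itemId(String rawRecord);
  float preference(String rawRecord);
}

// Example: a tab-separated log line "user<TAB>item<TAB>rating", ids hashed to longs.
class TsvLogAdapter implements RecordAdapter {
  public long userId(String rawRecord)      { return rawRecord.split("\t")[0].hashCode(); }
  public long itemId(String rawRecord)      { return rawRecord.split("\t")[1].hashCode(); }
  public float preference(String rawRecord) { return Float.parseFloat(rawRecord.split("\t")[2]); }
}
{code}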



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1421) Adapter package for all mahout tools

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1421:
---

Fix Version/s: 1.0

> Adapter package for all mahout tools
> 
>
> Key: MAHOUT-1421
> URL: https://issues.apache.org/jira/browse/MAHOUT-1421
> Project: Mahout
>  Issue Type: Improvement
>Reporter: jay vyas
> Fix For: 1.0
>
>
> Hi mahout.  I'd like to create an umbrella JIRA for allowing more runtime 
> flexibility for reading different types of input formats for all mahout 
> tasks. 
> Specifically, I'd like to start with the FreeTextRecommenderAdapter, which 
> typically requires:
> 1) Hashing text entries into numbers
> 2) Saving the large transformed file on disk
> 3) Feeding it into the classifier 
> Instead, we could build adapters into the classifier itself, so that the user
> 1) Specifies input file to recommender
> 2) Specifies transformation class which converts each record of input to 3 
> column recommender format
> 3) Runs internal mahout recommender directly against the data
> And thus the user could easily run mahout against existing data without 
> having to munge it too much.
> This package might be called something like "org.apache.mahout.adapters", and 
> would over time provide flexible adapters to the core mahout algorithm 
> implementations, so that folks wouldn't have to worry so much about 
> vectors/csv transformers/etc... 
> Any thoughts on this? If there's positive feedback, I can submit an initial patch to 
> get things started.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1428) Recommending already consumed items

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967799#comment-13967799
 ] 

Sebastian Schelter commented on MAHOUT-1428:


This is a good idea, would you like to contribute a patch for that?

> Recommending already consumed items
> ---
>
> Key: MAHOUT-1428
> URL: https://issues.apache.org/jira/browse/MAHOUT-1428
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Reporter: Mario Levitin
> Fix For: 1.0
>
>
> Mahout does not recommend items which are already consumed by the user.
> For example,
> In the getAllOtherItems method of the GenericUserBasedRecommender class there is 
> the following line:
> possibleItemIDs.removeAll(dataModel.getItemIDsFromUser(theUserID));  
> which removes the user's items from possibleItemIDs to prevent these items 
> from being recommended to the user. This is OK for many recommendation cases 
> but for many other cases it is not. 
> The Recommender classes (I mean all of them, NN-based and SVD-based, as well 
> as hadoop and non-hadoop versions) might have a parameter for 
> excluding or not excluding the user's items in the returned recommendations.
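A rough sketch of what such a switch could look like around getAllOtherItems; the method and flag names are invented here, this is not the actual class:

{code}
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.model.DataModel;

public class CandidateItemsSketch {

  // Collect candidate items from the neighborhood; only drop the user's own
  // items when includeKnownItems is false.
  static FastIDSet candidates(DataModel dataModel, long theUserID,
                              long[] neighborhood, boolean includeKnownItems)
      throws TasteException {
    FastIDSet possibleItemIDs = new FastIDSet();
    for (long userID : neighborhood) {
      possibleItemIDs.addAll(dataModel.getItemIDsFromUser(userID));
    }
    if (!includeKnownItems) {
      possibleItemIDs.removeAll(dataModel.getItemIDsFromUser(theUserID));
    }
    return possibleItemIDs;
  }
}
{code}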



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1428) Recommending already consumed items

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1428:
---

Fix Version/s: 1.0

> Recommending already consumed items
> ---
>
> Key: MAHOUT-1428
> URL: https://issues.apache.org/jira/browse/MAHOUT-1428
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Reporter: Mario Levitin
> Fix For: 1.0
>
>
> Mahout does not recommend items which are already consumed by the user.
> For example,
> In the getAllOtherItems method of the GenericUserBasedRecommender class there is 
> the following line:
> possibleItemIDs.removeAll(dataModel.getItemIDsFromUser(theUserID));  
> which removes the user's items from possibleItemIDs to prevent these items 
> from being recommended to the user. This is OK for many recommendation cases 
> but for many other cases it is not. 
> The Recommender classes (I mean all of them, NN-based and SVD-based, as well 
> as hadoop and non-hadoop versions) might have a parameter for 
> excluding or not excluding the user's items in the returned recommendations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1445) Create an intro for item based recommender

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1445:
---

Fix Version/s: 1.0

> Create an intro for item based recommender
> --
>
> Key: MAHOUT-1445
> URL: https://issues.apache.org/jira/browse/MAHOUT-1445
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Maciej Mazur
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1435) Website Redesign

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1435.


Resolution: Fixed

> Website Redesign 
> -
>
> Key: MAHOUT-1435
> URL: https://issues.apache.org/jira/browse/MAHOUT-1435
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 1.0
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Attachments: mahout2.jpg
>
>
> We decided to redesign the website to make it look nicer and easier to read.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1430) GSOC 2014 Proposal of implementing a new recommender

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1430.


Resolution: Won't Fix

GSoC is now running.

> GSOC 2014 Proposal of implementing a new recommender
> 
>
> Key: MAHOUT-1430
> URL: https://issues.apache.org/jira/browse/MAHOUT-1430
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Mihai Pitu
>  Labels: features, gsoc, mentor
>
> I would like to ask about the possibility of implementing a Sparse Linear Methods 
> (SLIM) recommender in Mahout during GSoC 2014.
> The SLIM algorithm generates efficient recommendations and its performance is 
> shown in the original paper 
> (http://glaros.dtc.umn.edu/gkhome/fetch/papers/SLIM2011icdm.pdf). The study 
> demonstrates that SLIM outperforms traditional algorithms (such as itemkNN, 
> userkNN, SVD or Matrix Factorization approaches) on various data-sets in 
> terms of run-time and recommendation quality. The algorithm can be 
> parallelized, and Map-Reduce can help us achieve that.
> I am aware of real world systems that are using SLIM as a recommendation 
> engine (e.g. Mendeley: http://www.slideshare.net/MarkLevy/efficient-slides) 
> and I think it represents the state-of-the-art in collaborative filtering 
> right now.
> Would this be an interesting addition to Mahout and is somebody interested in 
> mentoring this at Google Summer of Code 2014?
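For context, the per-item optimization at the heart of SLIM, as given in the cited paper (A is the user-item matrix, a_j its j-th column, and the learned sparse coefficient vectors w_j form the item-item model W, with recommendation scores given by A W):

{noformat}
\min_{w_j} \; \tfrac{1}{2} \lVert a_j - A w_j \rVert_2^2
          + \tfrac{\beta}{2} \lVert w_j \rVert_2^2
          + \lambda \lVert w_j \rVert_1
\quad \text{subject to } w_j \ge 0, \; w_{jj} = 0
{noformat}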



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1431) Comparison of Mahout 0.8 vs mahout 0.9 in EMR

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967797#comment-13967797
 ] 

Sebastian Schelter commented on MAHOUT-1431:


Any progress here? Otherwise I'll close the ticket soon.

> Comparison of Mahout 0.8 vs mahout 0.9 in EMR
> -
>
> Key: MAHOUT-1431
> URL: https://issues.apache.org/jira/browse/MAHOUT-1431
> Project: Mahout
>  Issue Type: Question
>  Components: Clustering
>Affects Versions: 0.8, 0.9
>Reporter: yannis ats
>  Labels: performance
>
> Hi all,
> I tested Mahout 0.8 and 0.9 with a large dataset as input and 
> performed kmeans experiments with both versions in Amazon EMR.
> What I found is that Mahout 0.8 is faster than Mahout 0.9.
> In particular, I observed that Mahout 0.8 performs fewer iterations and 
> every iteration of kmeans is faster than in Mahout 0.9. Every iteration in Mahout 
> 0.8 is twice as fast as that of 0.9.
> The Hadoop version was 1.0.x and the input data was roughly 2 million 
> data points with a dimensionality of 1800.
> The input parameters in both experiments were exactly the same, modulo the 
> initialization, which was random in both cases. I can understand that this 
> may affect the convergence (the number of iterations), but I am baffled by the 
> fact that every iteration takes almost twice the time in 0.9 vs 0.8.
> Is this normal? Is this expected?
> Thank you in advance for your time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1446) Create an intro for matrix factorization

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1446:
---

Fix Version/s: 1.0

> Create an intro for matrix factorization
> 
>
> Key: MAHOUT-1446
> URL: https://issues.apache.org/jira/browse/MAHOUT-1446
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Maciej Mazur
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1456) The wikipediaXMLSplitter example fails with "heap size" error

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1456:
---

Fix Version/s: 1.0

> The wikipediaXMLSplitter example fails with "heap size" error
> -
>
> Key: MAHOUT-1456
> URL: https://issues.apache.org/jira/browse/MAHOUT-1456
> Project: Mahout
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 0.9
> Environment: Solaris 11.1 \
> Hadoop 2.3.0 \
> Maven 3.2.1 \
> JDK 1.7.0_07-b10 \
>Reporter: mahmood
>  Labels: Heap,, mahout,, wikipediaXMLSplitter
> Fix For: 1.0
>
>
> 1- The XML file is 
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
> 2- When I run "mahout wikipediaXMLSplitter -d 
> enwiki-latest-pages-articles.xml -o wikipedia/chunks -c 64", it gets stuck at 
> chunk #571 and after 30 minutes it fails to continue with a Java heap size 
> error. Previous chunks are created rapidly (10 chunks per second).
> 3- Increasing the heap size via "-Xmx4096m" option doesn't work.
> 4- No matter what the configuration is, it seems that there is a memory leak 
> that eats all the space.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1444) Check whether the examples on the website still work

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1444.


Resolution: Won't Fix

We should have individual tickets for these examples.

> Check whether the examples on the website still work
> 
>
> Key: MAHOUT-1444
> URL: https://issues.apache.org/jira/browse/MAHOUT-1444
> Project: Mahout
>  Issue Type: Test
>Reporter: Maciej Mazur
>
> https://mahout.apache.org/users/clustering/clustering-of-synthetic-control-data.html
> https://mahout.apache.org/users/classification/wikipedia-bayes-example.html
> https://mahout.apache.org/users/clustering/twenty-newsgroups.html
> https://mahout.apache.org/users/classification/breiman-example.html



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1461) The tour

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1461:
---

Fix Version/s: 1.0

Any progress on this issue?

> The tour
> 
>
> Key: MAHOUT-1461
> URL: https://issues.apache.org/jira/browse/MAHOUT-1461
> Project: Mahout
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 0.8, 0.9
> Environment: Discovered while working in .8 of Mahout
>Reporter: Scott
>Assignee: Sebastian Schelter
>  Labels: documentation
> Fix For: 1.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The documentation at 
> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>  is out of date.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1456) The wikipediaXMLSplitter example fails with "heap size" error

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967793#comment-13967793
 ] 

Sebastian Schelter commented on MAHOUT-1456:


The steps for contributing a patch are described here: 
https://mahout.apache.org/developers/how-to-contribute.html

> The wikipediaXMLSplitter example fails with "heap size" error
> -
>
> Key: MAHOUT-1456
> URL: https://issues.apache.org/jira/browse/MAHOUT-1456
> Project: Mahout
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 0.9
> Environment: Solaris 11.1 \
> Hadoop 2.3.0 \
> Maven 3.2.1 \
> JDK 1.7.0_07-b10 \
>Reporter: mahmood
>  Labels: Heap,, mahout,, wikipediaXMLSplitter
> Fix For: 1.0
>
>
> 1- The XML file is 
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
> 2- When I run "mahout wikipediaXMLSplitter -d 
> enwiki-latest-pages-articles.xml -o wikipedia/chunks -c 64", it gets stuck at 
> chunk #571 and after 30 minutes it fails to continue with a Java heap size 
> error. Previous chunks are created rapidly (10 chunks per second).
> 3- Increasing the heap size via "-Xmx4096m" option doesn't work.
> 4- No matter what the configuration is, it seems that there is a memory leak 
> that eats all the space.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1462) Cleaning up Random Forests documentation on Mahout website

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1462:
---

Fix Version/s: 1.0

> Cleaning up Random Forests documentation on Mahout website
> --
>
> Key: MAHOUT-1462
> URL: https://issues.apache.org/jira/browse/MAHOUT-1462
> Project: Mahout
>  Issue Type: Bug
>Reporter: Manoj Awasthi
> Fix For: 1.0
>
>
> Following are the items which need to be added or changed. 
> I think this page can be broken into two segments. The first can be the following: 
> 
> Introduction to Random Forests
> Random Forests are an ensemble machine learning technique originally proposed 
> by Leo Breiman (UCB) which uses classification and regression trees as the 
> underlying classification mechanism. The trademark for Random Forests is held 
> by Leo Breiman and Adele Cutler. 
> Official website for Random Forests: 
> http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
> Original paper published: http://oz.berkeley.edu/~breiman/randomforest2001.pdf
> 
> The second section can be the following: 
> 
> Classifying with random forests in Mahout
> 
> This section can be what it is right now on the website.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1462) Cleaning up Random Forests documentation on Mahout website

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967791#comment-13967791
 ] 

Sebastian Schelter commented on MAHOUT-1462:


What's the status here?

> Cleaning up Random Forests documentation on Mahout website
> --
>
> Key: MAHOUT-1462
> URL: https://issues.apache.org/jira/browse/MAHOUT-1462
> Project: Mahout
>  Issue Type: Bug
>Reporter: Manoj Awasthi
> Fix For: 1.0
>
>
> Following are the items which need to be added or changed. 
> I think this page can be broken into two segments. The first can be the following: 
> 
> Introduction to Random Forests
> Random Forests are an ensemble machine learning technique originally proposed 
> by Leo Breiman (UCB) which uses classification and regression trees as the 
> underlying classification mechanism. The trademark for Random Forests is held 
> by Leo Breiman and Adele Cutler. 
> Official website for Random Forests: 
> http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
> Original paper published: http://oz.berkeley.edu/~breiman/randomforest2001.pdf
> 
> The second section can be the following: 
> 
> Classifying with random forests in Mahout
> 
> This section can be what it is right now on the website.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1310) Mahout support windows

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1310:
---

Fix Version/s: 1.0

> Mahout support windows
> --
>
> Key: MAHOUT-1310
> URL: https://issues.apache.org/jira/browse/MAHOUT-1310
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
> Environment: Operation system: Windows
>Reporter: Sergey Svinarchuk
>Assignee: Sergey Svinarchuk
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1310.patch
>
>
> Update Mahout so it builds on Windows with Hadoop 2, and add bin/mahout.cmd



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1310) Mahout support windows

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967790#comment-13967790
 ] 

Sebastian Schelter commented on MAHOUT-1310:


Can someone with a Windows system have a look at the patch?

> Mahout support windows
> --
>
> Key: MAHOUT-1310
> URL: https://issues.apache.org/jira/browse/MAHOUT-1310
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
> Environment: Operation system: Windows
>Reporter: Sergey Svinarchuk
>Assignee: Sergey Svinarchuk
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1310.patch
>
>
> Update Mahout so it builds on Windows with Hadoop 2, and add bin/mahout.cmd



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1348) Why does Mahout RecommenderJob does not support User Based Recommendation, It only Item based similarity recommendation

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1348.


Resolution: Won't Fix

Currently there are no plans to extend the functionality of the MR recommenders.

> Why does Mahout RecommenderJob does not support User Based Recommendation, It 
> only Item based similarity recommendation
> ---
>
> Key: MAHOUT-1348
> URL: https://issues.apache.org/jira/browse/MAHOUT-1348
> Project: Mahout
>  Issue Type: Question
>  Components: Collaborative Filtering
> Environment: Mahout in distributed environment
>Reporter: Anusha R
>
> Why does Mahout's RecommenderJob not support user-based similarity 
> recommendation? It only supports item-based similarity recommendation. 
> Is it not possible to do in a Hadoop setup,
> or will it appear in an upcoming version?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1331) Class AbstractVector.java index is an int, increase to long

2014-04-13 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1331.


Resolution: Won't Fix

No activity for half a year, closing.

> Class AbstractVector.java index is an int, increase to long
> ---
>
> Key: MAHOUT-1331
> URL: https://issues.apache.org/jira/browse/MAHOUT-1331
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Max Weule
>
> While using Mahout, the following problem appears: 
> org.apache.mahout.math.IndexException: Index -1469322379 is outside allowable 
> range of [0,2147483647) at 
> org.apache.mahout.math.AbstractVector.set(AbstractVector.java:395).
> We have indexes above 2147483647 and they exceed the limit of 
> integers. Please consider increasing the index from int to long.
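The negative index in the stack trace looks like an int hash that wrapped around. Until a long-indexed vector exists, one common workaround is to fold hashes into the valid int range before calling set(); a sketch (the helper name is made up):

{code}
public class IndexFold {

  // Map an arbitrary (possibly negative) hash into [0, cardinality).
  static int fold(int hash, int cardinality) {
    return (hash & Integer.MAX_VALUE) % cardinality;
  }

  public static void main(String[] args) {
    System.out.println(fold(-1469322379, Integer.MAX_VALUE)); // non-negative, in range
  }
}
{code}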



--
This message was sent by Atlassian JIRA
(v6.2#6252)

