Re: [ANNOUNCE] Andrew Musselman, New Mahout PMC Chair
Congrats! 2018-07-19 9:31 GMT+02:00 Peng Zhang : > Congrats Andrew! > > On Thu, Jul 19, 2018 at 04:01 Andrew Musselman > > wrote: > > > Thanks Andy, looking forward to it! Thank you too for your support and > > dedication over the past two years; here's to continued progress! > > > > Best > > Andrew > > > > On Wed, Jul 18, 2018 at 1:30 PM, Andrew Palumbo > > wrote: > > > Please join me in congratulating Andrew Musselman as the new Chair of > > > the > > > Apache Mahout Project Management Committee. I would like to thank > > > Andrew > > > for stepping up; all of us who have worked with him over the years > > > know his > > > dedication to the project to be invaluable. I look forward to Andrew > > > taking the project into the future. > > > > > > Thank you, > > > > > > Andy > > >
[jira] [Commented] (MAHOUT-1884) Allow specification of dimensions of a DRM
[ https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546112#comment-15546112 ] Sebastian Schelter commented on MAHOUT-1884: I know that this is already supported internally; I want to expose it as optional parameters to drmDfsRead. I disagree that caching an input matrix on read is always what users intend; at the least, I want to retain control over what is cached and what is not. > Allow specification of dimensions of a DRM > -- > > Key: MAHOUT-1884 > URL: https://issues.apache.org/jira/browse/MAHOUT-1884 > Project: Mahout > Issue Type: Improvement > Affects Versions: 0.12.2 > Reporter: Sebastian Schelter > Assignee: Sebastian Schelter > Priority: Minor > > Currently, in many cases, a DRM must be read to compute its dimensions when a > user calls nrow or ncol. This also implicitly caches the corresponding DRM. > In some cases, the user actually knows the matrix dimensions (e.g., when the > matrices are synthetically generated, or when some metadata about them is > known). In such cases, the user should be able to specify the dimensions upon > creating the DRM and the caching should be avoided. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAHOUT-1884) Allow specification of dimensions of a DRM
Sebastian Schelter created MAHOUT-1884: -- Summary: Allow specification of dimensions of a DRM Key: MAHOUT-1884 URL: https://issues.apache.org/jira/browse/MAHOUT-1884 Project: Mahout Issue Type: Improvement Affects Versions: 0.12.2 Reporter: Sebastian Schelter Assignee: Sebastian Schelter Priority: Minor Currently, in many cases, a DRM must be read to compute its dimensions when a user calls nrow or ncol. This also implicitly caches the corresponding DRM. In some cases, the user actually knows the matrix dimensions (e.g., when the matrices are synthetically generated, or when some metadata about them is known). In such cases, the user should be able to specify the dimensions upon creating the DRM and the caching should be avoided. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
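To make the proposal concrete, here is a hedged sketch of what the optional dimension hints might look like. drmDfsRead exists in the Samsara DSL, but the nrowHint/ncolHint parameters below are a hypothetical illustration of the ticket's idea, not a committed API:

    import org.apache.mahout.math.drm._

    // Hypothetical signature for the proposal (NOT the current API):
    // def drmDfsRead(path: String, nrowHint: Long = -1L, ncolHint: Int = -1)
    //               (implicit dc: DistributedContext): CheckpointedDrm[Int]
    //
    // A caller who generated the matrix synthetically could then write:
    // val drmA = drmDfsRead("hdfs:///tmp/A", nrowHint = 1000000L, ncolHint = 500)
    //
    // drmA.nrow and drmA.ncol would return the hints immediately, without
    // scanning the data and without implicitly caching the DRM.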
[jira] [Commented] (MAHOUT-1748) Mahout DSL for Flink: switch to Flink Scala API
[ https://issues.apache.org/jira/browse/MAHOUT-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599247#comment-14599247 ] Sebastian Schelter commented on MAHOUT-1748: +1, makes sense Mahout DSL for Flink: switch to Flink Scala API --- Key: MAHOUT-1748 URL: https://issues.apache.org/jira/browse/MAHOUT-1748 Project: Mahout Issue Type: Task Components: Math Affects Versions: 0.10.2 Reporter: Alexey Grigorev Priority: Minor In Flink-Mahout (MAHOUT-1570), the Flink Java API is used because the Scala API caused various strange compilation problems. But the Scala API handles types better than the Flink Java API, so it's better to switch to the Scala API. It may also solve MAHOUT-1747. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly
[ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585002#comment-14585002 ] Sebastian Schelter commented on MAHOUT-1739: The FileItemSimilarity class reads the output of ItemSimilarityJob. You can then use the resulting ItemSimilarity with Mahout's recommenders. [1] https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/cf/taste/impl/similarity/file/FileItemSimilarity.java maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly Key: MAHOUT-1739 URL: https://issues.apache.org/jira/browse/MAHOUT-1739 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.10.0 Reporter: lariven Labels: easyfix, patch Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch The similar items that ItemSimilarityJob outputs for each target item may exceed the number we set via the maxSimilarItemsPerItem parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:

if (itemID < otherItemID) {
  ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
} else {
  ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
}

I don't know why itemID needs to be swapped with otherItemID, but I think a single line is enough:

ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
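For readers unfamiliar with the class, a minimal sketch of the wiring Sebastian describes, using the standard Taste APIs (FileDataModel, FileItemSimilarity, GenericItemBasedRecommender); the file paths are hypothetical:

    import java.io.File
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
    import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender

    // Hypothetical paths: the raw preferences plus the output of ItemSimilarityJob.
    val model = new FileDataModel(new File("/data/preferences.csv"))
    val similarity = new FileItemSimilarity(new File("/data/item-similarities.txt"))

    // Plug the precomputed similarities into an item-based recommender.
    val recommender = new GenericItemBasedRecommender(model, similarity)
    val topTen = recommender.recommend(42L, 10) // 10 recommendations for user 42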
[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly
[ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584970#comment-14584970 ] Sebastian Schelter commented on MAHOUT-1739: Actually, this is exactly what we want. All the similarity measures used in Mahout are symmetric, so the upper triangular part of the similarity matrix already contains all the information. I think I also know where this bug comes from. It's actually not a bug, but the parameter maxSimilarItemsPerItem is not named very well. Let's say maxSimilarItemsPerItem is 10. Now for an item A, we compute the 10 most similar items. There might be an item B for which A is in its 10 most similar items, but B is not in the 10 most similar items of A. In order to guarantee that we have the 10 most similar items for B, we unfortunately must output 11 similar items for A. Does that make sense? maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly Key: MAHOUT-1739 URL: https://issues.apache.org/jira/browse/MAHOUT-1739 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.10.0 Reporter: lariven Labels: easyfix, patch Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch The similar items that ItemSimilarityJob outputs for each target item may exceed the number we set via the maxSimilarItemsPerItem parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:

if (itemID < otherItemID) {
  ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
} else {
  ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
}

I don't know why itemID needs to be swapped with otherItemID, but I think a single line is enough:

ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly
[ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584986#comment-14584986 ] Sebastian Schelter commented on MAHOUT-1739: We have code that takes this triangular matrix and uses it as an ItemSimilarity for our recommenders. In that case, users don't even have to care about the internal data representation. maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly Key: MAHOUT-1739 URL: https://issues.apache.org/jira/browse/MAHOUT-1739 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.10.0 Reporter: lariven Labels: easyfix, patch Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch The similar items that ItemSimilarityJob outputs for each target item may exceed the number we set via the maxSimilarItemsPerItem parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:

if (itemID < otherItemID) {
  ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
} else {
  ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
}

I don't know why itemID needs to be swapped with otherItemID, but I think a single line is enough:

ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly
[ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1739: --- Resolution: Not A Problem Status: Resolved (was: Patch Available) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly Key: MAHOUT-1739 URL: https://issues.apache.org/jira/browse/MAHOUT-1739 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.10.0 Reporter: lariven Labels: easyfix, patch Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch The similar items that ItemSimilarityJob outputs for each target item may exceed the number we set via the maxSimilarItemsPerItem parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:

if (itemID < otherItemID) {
  ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
} else {
  ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
}

I don't know why itemID needs to be swapped with otherItemID, but I think a single line is enough:

ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly
[ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584474#comment-14584474 ] Sebastian Schelter commented on MAHOUT-1739: Could you supply a unit test that clearly shows that this is not working? maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly Key: MAHOUT-1739 URL: https://issues.apache.org/jira/browse/MAHOUT-1739 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.10.0 Reporter: lariven Labels: easyfix, patch Fix For: 0.10.0, 0.10.1 The output of ItemSimilarityJob may exceed the number of similar items we set to this parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:

if (itemID < otherItemID) {
  ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
} else {
  ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
}

I don't know why itemID needs to be swapped with otherItemID, but I think a single line is enough:

ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly
[ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584479#comment-14584479 ] Sebastian Schelter commented on MAHOUT-1739: Could you supply a unit test that shows a case where maxSimilarItemsPerItem is not correctly handled by the current code? maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correctly Key: MAHOUT-1739 URL: https://issues.apache.org/jira/browse/MAHOUT-1739 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.10.0 Reporter: lariven Labels: easyfix, patch Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch The similar items that ItemSimilarityJob outputs for each target item may exceed the number we set via the maxSimilarItemsPerItem parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:

if (itemID < otherItemID) {
  ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
} else {
  ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
}

I don't know why itemID needs to be swapped with otherItemID, but I think a single line is enough:

ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1570: --- Comment: was deleted (was: I don't think it makes sense to issue pull requests with unfinished code.) Adding support for Apache Flink as a backend for the Mahout DSL --- Key: MAHOUT-1570 URL: https://issues.apache.org/jira/browse/MAHOUT-1570 Project: Mahout Issue Type: Improvement Reporter: Till Rohrmann Assignee: Suneel Marthi Labels: DSL, flink, scala With the finalized abstraction of the Mahout DSL plans from the backend operations (MAHOUT-1529), it should be possible to integrate further backends for the Mahout DSL. Apache Flink would be a suitable candidate to act as a good execution backend. With respect to the implementation, the biggest difference between Spark and Flink at the moment is probably the incremental rollout of plans, which is triggered by Spark's actions and which is not supported by Flink yet. However, the Flink community is working on this issue. For the moment, it should be possible to circumvent this problem by writing intermediate results required by an action to HDFS and reading from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1570: --- Comment: was deleted (was: I don't think it makes sense to issue pull requests with unfinished code.) Adding support for Apache Flink as a backend for the Mahout DSL --- Key: MAHOUT-1570 URL: https://issues.apache.org/jira/browse/MAHOUT-1570 Project: Mahout Issue Type: Improvement Reporter: Till Rohrmann Assignee: Suneel Marthi Labels: DSL, flink, scala With the finalized abstraction of the Mahout DSL plans from the backend operations (MAHOUT-1529), it should be possible to integrate further backends for the Mahout DSL. Apache Flink would be a suitable candidate to act as a good execution backend. With respect to the implementation, the biggest difference between Spark and Flink at the moment is probably the incremental rollout of plans, which is triggered by Spark's actions and which is not supported by Flink yet. However, the Flink community is working on this issue. For the moment, it should be possible to circumvent this problem by writing intermediate results required by an action to HDFS and reading from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578440#comment-14578440 ] Sebastian Schelter commented on MAHOUT-1570: I don't think it makes sense to issue pull requests with unfinished code. Adding support for Apache Flink as a backend for the Mahout DSL --- Key: MAHOUT-1570 URL: https://issues.apache.org/jira/browse/MAHOUT-1570 Project: Mahout Issue Type: Improvement Reporter: Till Rohrmann Assignee: Suneel Marthi Labels: DSL, flink, scala With the finalized abstraction of the Mahout DSL plans from the backend operations (MAHOUT-1529), it should be possible to integrate further backends for the Mahout DSL. Apache Flink would be a suitable candidate to act as a good execution backend. With respect to the implementation, the biggest difference between Spark and Flink at the moment is probably the incremental rollout of plans, which is triggered by Spark's actions and which is not supported by Flink yet. However, the Flink community is working on this issue. For the moment, it should be possible to circumvent this problem by writing intermediate results required by an action to HDFS and reading from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578438#comment-14578438 ] Sebastian Schelter commented on MAHOUT-1570: I don't think it makes sense to issue pull requests with unfinished code. Adding support for Apache Flink as a backend for the Mahout DSL --- Key: MAHOUT-1570 URL: https://issues.apache.org/jira/browse/MAHOUT-1570 Project: Mahout Issue Type: Improvement Reporter: Till Rohrmann Assignee: Suneel Marthi Labels: DSL, flink, scala With the finalized abstraction of the Mahout DSL plans from the backend operations (MAHOUT-1529), it should be possible to integrate further backends for the Mahout DSL. Apache Flink would be a suitable candidate to act as a good execution backend. With respect to the implementation, the biggest difference between Spark and Flink at the moment is probably the incremental rollout of plans, which is triggered by Spark's actions and which is not supported by Flink yet. However, the Flink community is working on this issue. For the moment, it should be possible to circumvent this problem by writing intermediate results required by an action to HDFS and reading from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578439#comment-14578439 ] Sebastian Schelter commented on MAHOUT-1570: I don't think it makes sense to issue pull requests with unfinished code. Adding support for Apache Flink as a backend for the Mahout DSL --- Key: MAHOUT-1570 URL: https://issues.apache.org/jira/browse/MAHOUT-1570 Project: Mahout Issue Type: Improvement Reporter: Till Rohrmann Assignee: Suneel Marthi Labels: DSL, flink, scala With the finalized abstraction of the Mahout DSL plans from the backend operations (MAHOUT-1529), it should be possible to integrate further backends for the Mahout DSL. Apache Flink would be a suitable candidate to act as a good execution backend. With respect to the implementation, the biggest difference between Spark and Flink at the moment is probably the incremental rollout of plans, which is triggered by Spark's actions and which is not supported by Flink yet. However, the Flink community is working on this issue. For the moment, it should be possible to circumvent this problem by writing intermediate results required by an action to HDFS and reading from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521581#comment-14521581 ] Sebastian Schelter commented on MAHOUT-1570: great to see this finally happening Adding support for Apache Flink as a backend for the Mahout DSL --- Key: MAHOUT-1570 URL: https://issues.apache.org/jira/browse/MAHOUT-1570 Project: Mahout Issue Type: Improvement Reporter: Till Rohrmann Assignee: Sebastian Schelter Labels: DSL, flink, scala With the finalized abstraction of the Mahout DSL plans from the backend operations (MAHOUT-1529), it should be possible to integrate further backends for the Mahout DSL. Apache Flink would be a suitable candidate to act as a good execution backend. With respect to the implementation, the biggest difference between Spark and Flink at the moment is probably the incremental rollout of plans, which is triggered by Spark's actions and which is not supported by Flink yet. However, the Flink community is working on this issue. For the moment, it should be possible to circumvent this problem by writing intermediate results required by an action to HDFS and reading from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: co-occurrence paper and code
I chose against porting all the similarity measures to the DSL version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code super hard to read. Second, in practice, I have never seen anything give better results than LLR. As Ted pointed out, a lot of the foundation for using similarity measures comes from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at the LLR paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of the p-value if it had been a classic test). LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
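To make the LLR discussion concrete, here is a small sketch of computing LLR from a 2x2 co-occurrence contingency table, using the LogLikelihood class that the cooccurrence code already imports (the counts are made-up illustration values):

    import org.apache.mahout.math.stats.LogLikelihood

    // Hypothetical counts for a pair of items A and B over all users:
    // k11 = users who interacted with both A and B
    // k12 = users who interacted with A but not B
    // k21 = users who interacted with B but not A
    // k22 = users who interacted with neither
    val llr = LogLikelihood.logLikelihoodRatio(100L, 900L, 400L, 8600L)

    // Higher scores flag co-occurrence that is unlikely to be chance,
    // i.e. LLR acts as a similarity here, as Dmitriy concludes above.
    println(f"LLR = $llr%.2f")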
Re: co-occurrence paper and code
Sounds good to me. -s On 06.08.2014 17:15, Dmitriy Lyubimov dlie...@gmail.com wrote: What I mean here is that I probably need to refactor it a little, so that there's a part of the algorithm that accepts co-occurrence input directly and is somewhat decoupled from the part that accepts user x item input and does the downsampling and co-occurrence construction. That way I could do some customization of my own to the co-occurrence construction. Would that be reasonable if I do that? On Wed, Aug 6, 2014 at 5:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Asking because I am considering pulling this implementation, but for some (mostly political) reasons people want to try different things here. I may also have to start with a different way of constructing co-occurrences, and may do a few optimizations there (e.g. the priority queue queuing/enqueuing does twice the work it really needs to do, etc.) On Wed, Aug 6, 2014 at 5:05 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: I chose against porting all the similarity measures to the DSL version of the cooccurrence analysis for two reasons. First, adding the measures in a generalizable way makes the code super hard to read. Second, in practice, I have never seen anything give better results than LLR. As Ted pointed out, a lot of the foundation for using similarity measures comes from wanting to predict ratings, which people never do in practice. I think we should restrict ourselves to approaches that work with implicit, count-like data. -s On 06.08.2014 16:58, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I suppose in that context LLR is considered a distance (higher scores mean more `distant` items, co-occurring by chance only)? Self-correction on this one -- having given a quick look at the LLR paper again, it looks like it is actually a similarity (higher scores meaning more stable co-occurrences, i.e. it moves in the opposite direction of the p-value if it had been a classic test). LLR is a classic test. It is essentially Pearson's chi^2 test without the normal approximation. See my papers[1][2] introducing the test into computational linguistics (which ultimately brought it into all kinds of fields including recommendations) and also references for the G^2 test[3]. [1] http://www.aclweb.org/anthology/J93-1003 [2] http://arxiv.org/abs/1207.1847 [3] http://en.wikipedia.org/wiki/G-test
Re: Mahout V2
Nice. There is still huge potential for optimization in the spark bindings. -s On 05.07.2014 15:21, Andrew Musselman andrew.mussel...@gmail.com wrote: Crazy awesome. On Jul 5, 2014, at 4:19 PM, Pat Ferrel p...@occamsmachete.com wrote: I compared spark-itemsimilarity to the Hadoop version on sample data that is 8.7 M entries, 49290 x 139738, using my little 2-machine cluster and got the following speedup.

Platform        Elapsed Time
Mahout Hadoop   0:20:37
Mahout Spark    0:02:19

This isn't quite apples to apples because the Spark version does all the dictionary management, which is usually two extra jobs tacked on before and after the Hadoop job. I've done the complete pipeline using Hadoop and Spark now and can say that not only is it faster, but the old Hadoop way required keeping track of 10x more intermediate data and connecting up many more jobs to get the pipeline working. Now it's just one job. You don't need to worry about ID translation anymore and you get over 10x faster completion — this is one of those times when speed meets ease-of-use.
Re: H2O integration - intermediate progress update
I share the impression that the tone of conversation has not been very welcoming lately, be it intentional or not. I think that we should remind ourselves why we are working on open source and try to improve our ways of communication. I think we should try to get as many people as possible together to sit at a table and have some face-to-face discussion over a beer or coffee. --sebastian On 06/19/2014 07:18 AM, Dmitriy Lyubimov wrote: On Wed, Jun 18, 2014 at 10:03 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I did not mean to discourage sincere search for answers. The tone of answers has lately been very discouraging for those sincerely searching for answers. I think we as a community have a responsibility to do better about this. There is no need to be insulting to people asking honest questions in a civil tone. Ted, we've been at this already. There have been more arguments than questions. I am just providing my counterarguments. Do you insist on the term insulting? Cause this is, you know, insulting. You are heading in the ad hominem direction again.
Re: cf/cooccurrence code
Hi Anand, Yes, this should not contain anything Spark-specific. +1 for moving it. --sebastian On 06/19/2014 08:38 PM, Anand Avati wrote: Hi Pat and others, I see that cf/CooccurrenceAnalysis.scala is currently under spark. Is there a specific reason? I see that the code itself is completely Spark-agnostic. I tried moving the code under math-scala/src/main/scala/org/apache/mahout/math/cf/ with the following trivial patch:

diff --git a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
index ee44f90..bd20956 100644
--- a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
+++ b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
@@ -22,7 +22,6 @@
 import scalabindings._
 import RLikeOps._
 import drm._
 import RLikeDrmOps._
-import org.apache.mahout.sparkbindings._
 import scala.collection.JavaConversions._
 import org.apache.mahout.math.stats.LogLikelihood

and it seems to work just fine. From what I see, this should work just fine on H2O as well with no changes. Why give up generality and make it Spark-specific? Thanks
[jira] [Resolved] (MAHOUT-1580) Optimize getNumNonZeroElements
[ https://issues.apache.org/jira/browse/MAHOUT-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1580. Resolution: Fixed Optimize getNumNonZeroElements -- Key: MAHOUT-1580 URL: https://issues.apache.org/jira/browse/MAHOUT-1580 Project: Mahout Issue Type: Improvement Components: Math Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 getNumNonZeroElements in AbstractVector uses the nonZeroes iterator internally, which adds a lot of overhead for certain types of vectors, e.g. dense ones. We should add custom implementations here. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: Engine specific algos
I think rejecting that contribution is the right thing to do. I think it's very important to narrow our focus. Let us put our efforts into finishing and polishing what we are working on right now. A big problem of the old Mahout was that we set the barrier for contributions too low and ended up with lots of non-integrated, hard-to-use algorithms of varying quality. What is the problem with not accepting a contribution? We agreed with Andy that this might be better suited for inclusion in Spark's codebase, and I think that was the right decision. -s On 06/18/2014 10:29 PM, Pat Ferrel wrote: Taken from: Re: [jira] [Resolved] (MAHOUT-1153) Implement streaming random forests Also, we don't have any mappings for Spark Streaming -- so if your implementation heavily relies on Spark Streaming, I think Spark itself is the right place for it to be a part of. We are discouraging engine-specific work? Even dismissing Spark Streaming as a whole? As it stands we don't have purely (c) methods, and indeed I believe these methods may be totally engine-specific, in which case MLlib is possibly one of the good homes for them. Adherence to a specific incarnation of an engine-neutral DSL has become a requirement for inclusion in Mahout? The current DSL cannot be extended? Or it can't be extended in engine-specific ways? Or it can't be extended with Spark Streaming? I would have thought all of these things desirable; otherwise we are limiting ourselves to a subset of what an engine can do, or a subset of problems that the current DSL supports. I hope I'm misreading this, but it looks like we just discouraged a contributor from adding post-Hadoop code in an interesting area to Mahout?
[jira] [Commented] (MAHOUT-1579) Implement a datamodel which can load data from hadoop filesystem directly
[ https://issues.apache.org/jira/browse/MAHOUT-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030429#comment-14030429 ] Sebastian Schelter commented on MAHOUT-1579: Xiaomeng, could you create a pull request to https://github.com/apache/mahout on github? That would make it easier to review your code. Implement a datamodel which can load data from hadoop filesystem directly - Key: MAHOUT-1579 URL: https://issues.apache.org/jira/browse/MAHOUT-1579 Project: Mahout Issue Type: Improvement Reporter: Xiaomeng Huang Priority: Minor Attachments: Mahout-1579.patch As we all know, FileDataModel can only load data from the local filesystem. But big data is usually stored in a hadoop filesystem (e.g. hdfs). If we want to deal with data in hdfs, we must run a mapred job. It's necessary to implement a data model which can load data from the hadoop filesystem directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
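Pending such a data model, a minimal sketch of the workaround idea, assuming the standard Hadoop and Taste APIs (the HDFS path is hypothetical): copy the file out of HDFS and hand it to the existing FileDataModel. A native HDFS-backed DataModel, as the ticket proposes, would stream the file directly instead.

    import java.io.File
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel

    // Pull the preference file out of HDFS into a local temp file...
    val fs = FileSystem.get(new Configuration())
    val local = File.createTempFile("prefs", ".csv")
    fs.copyToLocalFile(new Path("hdfs:///data/preferences.csv"),
                       new Path(local.getAbsolutePath))

    // ...and feed it to the existing local-filesystem DataModel.
    val model = new FileDataModel(local)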
[jira] [Created] (MAHOUT-1580) Optimize getNumNonZeroElements
Sebastian Schelter created MAHOUT-1580: -- Summary: Optimize getNumNonZeroElements Key: MAHOUT-1580 URL: https://issues.apache.org/jira/browse/MAHOUT-1580 Project: Mahout Issue Type: Improvement Components: Math Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 getNumNonZeroElements in AbstractVector uses the nonZeroes iterator internally, which adds a lot of overhead for certain types of vectors, e.g. dense ones. We should add custom implementations here. -- This message was sent by Atlassian JIRA (v6.2#6252)
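A hedged sketch of the kind of specialization the ticket asks for: counting non-zeros directly over a dense backing array instead of going through the iterator machinery. The denseValues parameter stands in for DenseVector's internal array, which a real override would access directly:

    // Illustration only: the real fix would override getNumNonZeroElements()
    // inside DenseVector; this free function just shows the core loop.
    def numNonZeroesDense(denseValues: Array[Double]): Int = {
      var count = 0
      var i = 0
      while (i < denseValues.length) {   // tight loop, no iterator allocation
        if (denseValues(i) != 0.0) count += 1
        i += 1
      }
      count
    }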
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
I'm a bit lost in this discussion. Why do we assume that getNumNonZeroElements() on a Vector only returns an upper bound? The code in AbstractVector clearly returns the non-zeros only:

int count = 0;
Iterator<Element> it = iterateNonZero();
while (it.hasNext()) {
  if (it.next().get() != 0.0) {
    count++;
  }
}
return count;

On the other hand, the internal code seems broken here: why does iterateNonZero potentially return 0's? --sebastian On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029345#comment-14029345 ] ASF GitHub Bot commented on MAHOUT-1464: Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/12#issuecomment-45915940 fix header to say MAHOUT-1464, then hit close and reopen, it will restart the echo. Cooccurrence Analysis on Spark -- Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Environment: hadoop, spark Reporter: Pat Ferrel Assignee: Pat Ferrel Fix For: 1.0 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with the Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MAHOUT-1578) Optimizations in matrix serialization
Sebastian Schelter created MAHOUT-1578: -- Summary: Optimizations in matrix serialization Key: MAHOUT-1578 URL: https://issues.apache.org/jira/browse/MAHOUT-1578 Project: Mahout Issue Type: Bug Components: Math Reporter: Sebastian Schelter Fix For: 1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028360#comment-14028360 ] Sebastian Schelter commented on MAHOUT-1464: Hi, The computation of A'A is usually done without explicitly forming A'. Instead, A'A is computed as the sum of outer products of the rows of A. --sebastian Cooccurrence Analysis on Spark -- Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Environment: hadoop, spark Reporter: Pat Ferrel Assignee: Pat Ferrel Fix For: 1.0 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with the Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
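In symbols, the observation in the comment above is the standard identity (with $a_i$ denoting the $i$-th row of $A$, treated as a column vector):

    $A^{\top}A = \sum_{i} a_i\, a_i^{\top}$

so each worker can accumulate the outer products of the rows it holds locally, and no physical transpose of the distributed matrix is ever materialized.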
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
Oh good catch! I had an extra binarize method before, so that the data was already binary. I merged that into the downsample code and must have overlooked that thing. You are right, numNonZeros is the way to go! On 06/10/2014 01:11 AM, Ted Dunning wrote: Sounds like a very plausible root cause. On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893 ] Pat Ferrel commented on MAHOUT-1464: seems like the downsampleAndBinarize method is returning the wrong values. It is actually summing the values where it should be counting the non-zero elements?

// Downsample the interaction vector of each user
for (userIndex <- 0 until keys.size) {
  val interactionsOfUser = block(userIndex, ::) // this is a Vector
  // if the values are non-boolean, the sum will not be the number of interactions, it will be a sum of strength-of-interaction, right?
  // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
  val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think
  val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser
  interactionsOfUser.nonZeroes().foreach { elem =>
    val numInteractionsWithThing = numInteractions(elem.index)
    val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing
    if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
      // We ignore the original interaction value and create a binary 0-1 matrix
      // as we only consider whether interactions happened or did not happen
      downsampledBlock(userIndex, elem.index) = 1
    }
  }
}

Cooccurrence Analysis on Spark -- Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Environment: hadoop, spark Reporter: Pat Ferrel Assignee: Pat Ferrel Fix For: 1.0 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with the Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: TreeBasedRecommenders(Deprecated?)
Hi Sahil, don't worry, you're not breaking any rules. We removed the tree-based recommenders because we have never heard of anyone using them over the years. --sebastian On 06/10/2014 09:01 AM, Sahil Sharma wrote: Hi, Firstly, I apologize if I'm breaking any rules by mailing this way; I'm new to this and would appreciate any help I can get. I was just playing around with the tree-based recommender (which seems to be deprecated in the current version for lack of use). Why was it deprecated? Also, I just looked at the code, and it seems to be doing a lot of redundant computation. For example, we could store a matrix of cluster-cluster distances (and hence avoid recomputing the closest clusters every time, by updating the matrix whenever we merge two clusters), and also, when trying to determine the farthest-distance-based similarity between two clusters, the pair which realizes it could be stored and updated upon merging, so that this computation need not be repeated again and again. Just wondering if this repeated computation was not a reason for deprecating the class (since people might have found a slow recommender lacking use). Would be glad to hear the thoughts of others on this, and also to implement an efficient version if the community agrees.
SparkBindings on a real cluster
Hi, I did some experimentation with the spark bindings on a real cluster yesterday, as I had to run some experiments for a paper (unrelated to Mahout) that I'm currently writing. The experiment basically consists of multiplying a sparse data matrix by a super-sparse permutation-like matrix from the left. It took me the whole day to get it working, up to matrices with 500M entries. I ran into lots of issues that we have to fix asap; unfortunately I don't have much time in the next weeks, so I'm just sharing a list of the issues that I ran into (maybe I'll find some time to create issues for these things on the weekend). I think the major challenge for us will be to get the choice of dense/sparse correct and to put lots of work into memory efficiency. This could be a great hook for collaborating with the h2o folks, as they know how to make vector-like data small and computations fast. Here's the list:

* our matrix serialization in MatrixWritable is seriously flawed, I ran into the following errors:
  - the type information is stored with every vector, although a matrix always contains only vectors of the same type
  - all entries of a TransposeView (and possibly other views) of a sparse matrix are serialized, resulting in OOM
  - for sparse row matrices, the vectors are set using assign instead of via constructor injection; this results in huge memory consumption and long creation times, as some implementations use binary search for assignment
* a dense matrix is converted into a SparseRowMatrix with dense row vectors by blockify(); after serialization this becomes a dense matrix in sparse format (triggering OOMs)!
* drmFromHDFS does not have an option to set the number of desired partitions
* SparseRowMatrix with sequential vectors times SparseRowMatrix with sequential vectors is totally broken; it uses three nested loops and calls get(row, col) on the matrices, which internally uses binary search...
* the At operator adds the column vectors it creates; this is unnecessary, as we don't need the addition, we can just merge the vectors
* we need a dedicated operator for inCoreA %*% drmB; currently this gets rewritten to (drmB.t %*% inCoreA.t).t, which is highly inefficient (I have a prototype of that operator; see the identity below)

Best, Sebastian
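For reference, the rewrite mentioned in the last bullet relies on nothing more than the transpose identity

    $AB = (B^{\top} A^{\top})^{\top}$

which lets the optimizer express an in-core-matrix-times-DRM product through the existing DRM-times-in-core operator, at the cost of transposing the distributed operand twice; that double transposition is exactly the inefficiency a dedicated operator would avoid.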
Contributions coming
Hi, as you know we still have a lot of open documentation tickets. Therefore, I decided to offer these tickets as projects to students in a university lecture that I'm giving with some colleagues: MAHOUT-1495 MAHOUT-1470 MAHOUT-1462 MAHOUT-1477 MAHOUT-1485 MAHOUT-1423 MAHOUT-1427 MAHOUT-1536 MAHOUT-1551 MAHOUT-1493 In the next weeks, the students will join the mailing list and start working on the documentation and examples. Let's give them a warm welcome and help them learn how to produce open source software. Best, Sebastian
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
The important thing here is that we test the code on a sufficiently large dataset on a real cluster. Take that on, if you want! On 02.06.2014 20:08, Pat Ferrel (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015667#comment-14015667 ] Pat Ferrel commented on MAHOUT-1464: [~ssc] Should I reassign to me for now so we can get this committed? Cooccurrence Analysis on Spark -- Key: MAHOUT-1464 URL: https://issues.apache.org/jira/browse/MAHOUT-1464 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Environment: hadoop, spark Reporter: Pat Ferrel Assignee: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with the Mahout Spark DRM DSL so a DRM can be used as input. Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications including cross-action recommendations. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: mlib versus spark
Hi Saikat, The differences are that MLlib offers a different set of algorithms (e.g., you won't find cooccurrence analysis or stochastic SVD there) and that their codebase consists of hand-tuned, Spark-specific implementations. Mahout, on the other hand, allows algorithms to be implemented in an engine-agnostic, declarative way. This allows for automatic optimization of our algorithms, as well as for running the same code on multiple backends (there has been interest from H2O as well as Apache Flink in integrating with our DSL). --sebastian On 06/01/2014 01:41 AM, Saikat Kanjilal wrote: Actually the subject of my email should say spark-mlib versus mahout-spark :) From: sxk1...@hotmail.com To: dev@mahout.apache.org Subject: mlib versus spark Date: Sat, 31 May 2014 16:38:13 -0700 Ok, I'll admit I'm not seeing what the obvious differences are. I'm a bit confused when I think of Mahout using Spark: since Spark already ships an embedded machine learning library (MLlib), what would be the impetus to use Mahout instead? It seems like you should be able to write or add algorithms to MLlib and use Spark. Has someone from Mahout looked at MLlib to see if there will be a strong use case for using one versus the other? http://spark.apache.org/mllib/
[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.
[ https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014918#comment-14014918 ] Sebastian Schelter commented on MAHOUT-1566: If it's a mere showcase, could we maybe add it as an example in an examples package, rather than as a full-fledged algorithm implementation? Regular ALS factorizer with convergence test. - Key: MAHOUT-1566 URL: https://issues.apache.org/jira/browse/MAHOUT-1566 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Priority: Trivial Fix For: 1.0 ALS-related: let's start with an unweighted, unregularized implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
Problems with mapBlock()
I've updated the codebase to work on the cooccurrence analysis algo, but I always run into this error now: error: value mapBlock is not a member of org.apache.mahout.math.drm.DrmLike[Int] I have the feeling that an implicit conversion might be missing, but I couldn't figure out where to put it without producing even more errors. --sebastian
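A guess at the fix, grounded in the imports that CooccurrenceAnalysis.scala itself pulls in (see the diff in the cf/cooccurrence thread above): mapBlock is added to DrmLike via implicits from the drm bindings, so a file that only imports the base types will not see it. Whether these exact imports suffice in Sebastian's file is an assumption:

    // Fully qualified equivalents of the imports used by CooccurrenceAnalysis.scala:
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // With these implicits in scope, mapBlock should resolve on a DrmLike[Int]:
    // val drmB = drmA.mapBlock() { case (keys, block) => (keys, block) }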
[jira] [Updated] (MAHOUT-1524) Script to auto-generate and view the Mahout website on a local machine
[ https://issues.apache.org/jira/browse/MAHOUT-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1524: --- Fix Version/s: 1.0 Script to auto-generate and view the Mahout website on a local machine --- Key: MAHOUT-1524 URL: https://issues.apache.org/jira/browse/MAHOUT-1524 Project: Mahout Issue Type: New Feature Components: Documentation Reporter: Saleem Ansari Fix For: 1.0 Attachments: mahout-website.sh Attached with this ticket is a script that creates a simple setup for editing Mahout Website on a local machine. It is useful in the sense that, we can edit the source and the changes are automatically reflected in the generated site. All we need to do is refresh the browser. No further steps required. So now one can review the website changes ( the complete website ), on a developer's machine. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1551) Add document to describe how to use mlp with command line
[ https://issues.apache.org/jira/browse/MAHOUT-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1551: --- Fix Version/s: 1.0 Add document to describe how to use mlp with command line - Key: MAHOUT-1551 URL: https://issues.apache.org/jira/browse/MAHOUT-1551 Project: Mahout Issue Type: Documentation Components: Classification, CLI, Documentation Affects Versions: 0.9 Reporter: Yexi Jiang Labels: documentation Fix For: 1.0 Add documentation about the usage of multi-layer perceptron in command line. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1552) Avoid new Configuration() instantiation
[ https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1552: --- Fix Version/s: 1.0 Avoid new Configuration() instantiation --- Key: MAHOUT-1552 URL: https://issues.apache.org/jira/browse/MAHOUT-1552 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 0.7 Environment: CDH 4.4, CDH 4.6 Reporter: Sergey Fix For: 1.0 Hi, it's related to MAHOUT-1498. You get into trouble when you run Mahout stuff from an Oozie java action. {code}
java.lang.InterruptedException: Cluster Classification Driver Job failed processing /tmp/sku/tfidf/90453
  at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
  at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
  at org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
  at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
  at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
  at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)
{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1543) JSON output format for classifying with random forests
[ https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014570#comment-14014570 ] Sebastian Schelter commented on MAHOUT-1543: Could you create a pull request against the current mahout codebase? JSON output format for classifying with random forests -- Key: MAHOUT-1543 URL: https://issues.apache.org/jira/browse/MAHOUT-1543 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 0.7, 0.8, 0.9 Reporter: larryhu Labels: patch Fix For: 0.7 Attachments: MAHOUT-1543.patch This patch adds a JSON output format for building random forests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1552) Avoid new Configuration() instantiation
[ https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014571#comment-14014571 ] Sebastian Schelter commented on MAHOUT-1552: Could you suggest a way to fix the bug? Avoid new Configuration() instantiation --- Key: MAHOUT-1552 URL: https://issues.apache.org/jira/browse/MAHOUT-1552 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 0.7 Environment: CDH 4.4, CDH 4.6 Reporter: Sergey Fix For: 1.0 Hi, it's related to MAHOUT-1498. You get into trouble when you run Mahout stuff from an Oozie java action. {code}
java.lang.InterruptedException: Cluster Classification Driver Job failed processing /tmp/sku/tfidf/90453
  at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
  at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
  at org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
  at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
  at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
  at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)
{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
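A hedged sketch of the usual fix for this class of problem, assuming the drivers follow Hadoop's Configured/Tool pattern (which the stack trace above suggests, given ToolRunner is on it): reuse the configuration injected by the runner instead of instantiating a fresh one, so settings passed in by Oozie are not silently dropped.

    import org.apache.hadoop.conf.{Configuration, Configured}
    import org.apache.hadoop.util.Tool

    // HypotheticalDriver is an illustration, not Mahout's actual class.
    class HypotheticalDriver extends Configured with Tool {
      override def run(args: Array[String]): Int = {
        // Anti-pattern: `new Configuration()` ignores the caller's settings.
        // val conf = new Configuration()

        // Fix: use the configuration that ToolRunner (and thus Oozie) injected.
        val conf: Configuration = getConf
        // ... set up and submit the job with `conf` ...
        0
      }
    }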
[jira] [Commented] (MAHOUT-1564) Naive Bayes Classifier for New Text Documents
[ https://issues.apache.org/jira/browse/MAHOUT-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014572#comment-14014572 ] Sebastian Schelter commented on MAHOUT-1564: I don't see any reason to veto this, as it will make stuff that we have more useful. Naive Bayes Classifier for New Text Documents - Key: MAHOUT-1564 URL: https://issues.apache.org/jira/browse/MAHOUT-1564 Project: Mahout Issue Type: Improvement Affects Versions: 0.9 Reporter: Andrew Palumbo Fix For: 1.0 MapReduce Naive Bayes implementation currently lacks the ability to classify a new document (outside of the training/holdout corpus). I've begun some work on a ClassifyNew job which will do the following: 1. Vectorize a new text document using the dictionary and document frequencies from the training/holdout corpus - assume the original corpus was vectorized using `seq2sparse`; step (1) will use all of the same parameters. 2. Score and label a new document using a previously trained model. I think that it will be a useful addition to the NB package. Unfortunately, this is going to be mostly MR workhorse code and doesn't really introduce much new logic. I will try to keep any new logic separate from MR code so that it can be called from scala for MAHOUT-1493. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1565: --- Fix Version/s: 1.0 add MR2 options to MAHOUT_OPTS in bin/mahout Key: MAHOUT-1565 URL: https://issues.apache.org/jira/browse/MAHOUT-1565 Project: Mahout Issue Type: Improvement Affects Versions: 1.0, 0.9 Reporter: Nishkam Ravi Fix For: 1.0 Attachments: MAHOUT-1565.patch MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.
[ https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014573#comment-14014573 ] Sebastian Schelter commented on MAHOUT-1566: I'm not sure whether we should really include the standard ALS in the new codebase. It is optimized for rating prediction on Netflix-like data, which rarely exists outside of academia. I think we should rather focus on the ALS version targeted at implicit data (clicks, views, etc). Regular ALS factorizer with convergence test. - Key: MAHOUT-1566 URL: https://issues.apache.org/jira/browse/MAHOUT-1566 Project: Mahout Issue Type: Task Affects Versions: 0.9 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Priority: Trivial Fix For: 1.0 ALS-related: let's start with an unweighted, unregularized implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
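For context, the implicit-feedback variant Sebastian refers to is commonly formulated (following Hu, Koren & Volinsky, "Collaborative Filtering for Implicit Feedback Datasets") as minimizing

    $\min_{X,Y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$

where $p_{ui}$ is a 0/1 preference indicating whether user $u$ interacted with item $i$, and the confidence $c_{ui}$ grows with the interaction count; this targets exactly the click/view-style data the comment argues Mahout should focus on.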
[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012136#comment-14012136 ] Sebastian Schelter commented on MAHOUT-1565: I'd favor removing that add MR2 options to MAHOUT_OPTS in bin/mahout Key: MAHOUT-1565 URL: https://issues.apache.org/jira/browse/MAHOUT-1565 Project: Mahout Issue Type: Improvement Affects Versions: 1.0, 0.9 Reporter: Nishkam Ravi Attachments: MAHOUT-1565.patch MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations
[ https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010124#comment-14010124 ] Sebastian Schelter commented on MAHOUT-1529: Hi Dmitriy, the PR looks good, +1 from me, go ahead! Best, Sebastian Finalize abstraction of distributed logical plans from backend operations - Key: MAHOUT-1529 URL: https://issues.apache.org/jira/browse/MAHOUT-1529 Project: Mahout Issue Type: Improvement Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 We have a few situations when algorithm-facing API has Spark dependencies creeping in. In particular, we know of the following cases: -(1) checkpoint() accepts Spark constant StorageLevel directly;- (2) certain things in CheckpointedDRM; (3) drmParallelize etc. routines in the drm and sparkbindings package. (5) drmBroadcast returns a Spark-specific Broadcast object *Current tracker:* https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529. *Pull requests are welcome*. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1536) Update Creating vectors from text page
[ https://issues.apache.org/jira/browse/MAHOUT-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008319#comment-14008319 ] Sebastian Schelter commented on MAHOUT-1536: Added the changes. Can someone have a look at the Lucene part of the site? We should post the currently used Lucene version there rather than requiring users to look into the POM, for example. Update Creating vectors from text page Key: MAHOUT-1536 URL: https://issues.apache.org/jira/browse/MAHOUT-1536 Project: Mahout Issue Type: Bug Components: Documentation Affects Versions: 0.9 Reporter: Andrew Palumbo Priority: Minor Fix For: 1.0 Attachments: MAHOUT-1536_edit1.patch, MAHOUT-1536_edit2.patch At least the seq2sparse section of the Creating vectors from text page is out of date. https://mahout.apache.org/users/basics/creating-vectors-from-text.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1480) Clean up website on 20 newsgroups
[ https://issues.apache.org/jira/browse/MAHOUT-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1480: --- Resolution: Fixed Status: Resolved (was: Patch Available) committed, thank you very much Clean up website on 20 newsgroups - Key: MAHOUT-1480 URL: https://issues.apache.org/jira/browse/MAHOUT-1480 Project: Mahout Issue Type: Improvement Components: Documentation Reporter: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1480_edit1.patch, MAHOUT-1480_edit2.patch The website on the twenty newsgroups example needs clean up. We need to go through the text, remove dead links and check whether the information is still consistent with the current code. https://mahout.apache.org/users/clustering/twenty-newsgroups.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1446) Create an intro for matrix factorization
[ https://issues.apache.org/jira/browse/MAHOUT-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1446. Resolution: Fixed Assignee: Sebastian Schelter Jian, thank you very much, you did a great job. I put the page online, could you have a look at it? Thx, Sebastian Create an intro for matrix factorization Key: MAHOUT-1446 URL: https://issues.apache.org/jira/browse/MAHOUT-1446 Project: Mahout Issue Type: New Feature Components: Documentation Reporter: Maciej Mazur Assignee: Sebastian Schelter Fix For: 1.0 Attachments: matrix-factorization.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1560) Last batch is not filled correctly in MultithreadedBatchItemSimilarities
[ https://issues.apache.org/jira/browse/MAHOUT-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1560. Resolution: Fixed Fix Version/s: 1.0 Assignee: Sebastian Schelter committed, thank you for the contribution Last batch is not filled correctly in MultithreadedBatchItemSimilarities Key: MAHOUT-1560 URL: https://issues.apache.org/jira/browse/MAHOUT-1560 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Jarosław Bojar Assignee: Sebastian Schelter Priority: Minor Fix For: 1.0 Attachments: Corrected_last_batch_size_calculation.patch, MultithreadedBatchItemSimilaritiesTest.patch In {{MultithreadedBatchItemSimilarities}} method {{queueItemIDsInBatches}} handles last batch incorrectly. Last batch length is calculated incorrectly. As a result last batch is either truncated or too long with superfluous positions filled with item indexes from previous batch (or zeros if it is also the first batch as in attached test). Attached test fails for very short model (with only 4 items) with NoSuchItemException. Attached patch corrects this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
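For reference, a minimal sketch of the corrected calculation (variable and function names are assumptions, not the patch itself): the last batch should hold only the remainder, so a 4-item model yields a single batch of length 4 rather than a batch padded with stale indexes.
{code}
// Split numItems into batches; the last batch holds the remainder only.
def batchSizes(numItems: Int, batchSize: Int): Seq[Int] = {
  val fullBatches = numItems / batchSize
  val remainder = numItems % batchSize
  Seq.fill(fullBatches)(batchSize) ++ (if (remainder > 0) Seq(remainder) else Nil)
}

batchSizes(4, 100)   // Seq(4) -- the 4-item model from the attached test
{code}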
[jira] [Updated] (MAHOUT-1558) Clean up classify-wiki.sh and add in a binary classification problem
[ https://issues.apache.org/jira/browse/MAHOUT-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1558: --- Resolution: Fixed Assignee: Sebastian Schelter Status: Resolved (was: Patch Available) committed, thank you for your great work Clean up classify-wiki.sh and add in a binary classification problem -- Key: MAHOUT-1558 URL: https://issues.apache.org/jira/browse/MAHOUT-1558 Project: Mahout Issue Type: Improvement Components: Classification, Examples Affects Versions: 1.0 Reporter: Andrew Palumbo Assignee: Sebastian Schelter Priority: Minor Fix For: 1.0 Attachments: MAHOUT-1558.patch Some minor cleanups to classify-wiki.sh. Added in a 2 class problem: United States and United Kingdom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1561) cluster-syntheticcontrol.sh not running locally with MAHOUT_LOCAL=true
[ https://issues.apache.org/jira/browse/MAHOUT-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1561: --- Resolution: Fixed Assignee: Sebastian Schelter Status: Resolved (was: Patch Available) committed, thank you very much cluster-syntheticcontrol.sh not running locally with MAHOUT_LOCAL=true -- Key: MAHOUT-1561 URL: https://issues.apache.org/jira/browse/MAHOUT-1561 Project: Mahout Issue Type: Bug Components: Clustering, Examples Affects Versions: 0.9 Reporter: Andrew Palumbo Assignee: Sebastian Schelter Priority: Minor Fix For: 1.0 Attachments: MAHOUT-1561.patch cluster-syntheticcontrol.sh is not running locally with MAHOUT_LOCAL set. Patch adds a check for MAHOUT_LOCAL. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: Hadoop 2 support in a real release?
Big +1 On 23.05.2014 15:33, Ted Dunning ted.dunn...@gmail.com wrote: What do folks think about spinning out a new version of 0.9 that only changes which version of Hadoop the build uses? There have been quite a few questions lately on this topic. My suggestion would be that we use minor version numbering to maintain this and the normal 0.9 release simultaneously if we decide to do a bug fix release. Any thoughts?
[jira] [Commented] (MAHOUT-1557) Add support for sparse training vectors in MLP
[ https://issues.apache.org/jira/browse/MAHOUT-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007235#comment-14007235 ] Sebastian Schelter commented on MAHOUT-1557: Karol, your patch contains some errors, e.g. the variable position is set but never read in RunMultilayerPerceptron. Furthermore, NeuralNetwork converts the input to a DenseVector internally in getOutput(), so you also have to modify that code. Add support for sparse training vectors in MLP -- Key: MAHOUT-1557 URL: https://issues.apache.org/jira/browse/MAHOUT-1557 Project: Mahout Issue Type: Improvement Components: Classification Reporter: Karol Grzegorczyk Priority: Minor Labels: mlp Fix For: 1.0 Attachments: mlp_sparse.diff When the number of input units of MLP is big, it is likely that input vector will be sparse. It should be possible to read input files in a sparse format. -- This message was sent by Atlassian JIRA (v6.2#6252)
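A minimal sketch of the point about getOutput(): if the first layer's affine transform iterates only the non-zero entries of the input, nothing needs to be densified. Plain Scala stand-ins below, not the NeuralNetwork API:
{code}
// Forward one layer from a sparse input: Map(index -> value) stands in
// for a sparse vector; only non-zeros contribute to each unit's sum.
def layerOutput(weights: Array[Array[Double]],   // [units][inputDim]
                bias: Array[Double],
                sparseInput: Map[Int, Double]): Array[Double] =
  weights.zip(bias).map { case (row, b) =>
    val z = b + sparseInput.iterator.map { case (j, v) => row(j) * v }.sum
    1.0 / (1.0 + math.exp(-z))                   // sigmoid squashing function
  }
{code}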
[jira] [Commented] (MAHOUT-1555) Exception thrown when a test example has the label not present in training examples
[ https://issues.apache.org/jira/browse/MAHOUT-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007243#comment-14007243 ] Sebastian Schelter commented on MAHOUT-1555: Hi Karol, Could you update the patch to at least log a warning in such a case? Exception thrown when a test example has the label not present in training examples --- Key: MAHOUT-1555 URL: https://issues.apache.org/jira/browse/MAHOUT-1555 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 1.0 Reporter: Karol Grzegorczyk Priority: Minor Fix For: 1.0 Attachments: test_label_not_present_in_training_examples.diff Currently an IllegalArgumentException is thrown when a test example has the label (belongs to the class) not present in training examples. When the number of labels is big, such a situation is likely and valid. The example of course will be misclassified, but exception should not be thrown. -- This message was sent by Atlassian JIRA (v6.2#6252)
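A hedged sketch of what such a patch could look like (not the attached diff; names are assumptions): resolve the label leniently, warn, and let the caller count the example as misclassified.
{code}
import org.slf4j.LoggerFactory

val log = LoggerFactory.getLogger("ResultAnalyzer")

// None means "unseen label": the caller counts the example as
// misclassified instead of aborting the whole test run.
def labelIndex(label: String, trainedLabels: Map[String, Int]): Option[Int] =
  trainedLabels.get(label) match {
    case some @ Some(_) => some
    case None =>
      log.warn("Test label '{}' not present in training examples; counting as misclassified", label)
      None
  }
{code}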
[jira] [Updated] (MAHOUT-1554) Provide more comprehensive classification statistics
[ https://issues.apache.org/jira/browse/MAHOUT-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1554: --- Resolution: Fixed Status: Resolved (was: Patch Available) committed with a few cosmetic changes, thank you for the contribution Provide more comprehensive classification statistics Key: MAHOUT-1554 URL: https://issues.apache.org/jira/browse/MAHOUT-1554 Project: Mahout Issue Type: Improvement Components: Classification Reporter: Karol Grzegorczyk Priority: Minor Fix For: 1.0 Attachments: statistics.diff Currently only limited classification statistics are provided. To better understand classification results, it would be worth providing at least average precision, recall and F1 score. -- This message was sent by Atlassian JIRA (v6.2#6252)
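For reference, a minimal sketch of such statistics computed from a confusion matrix (assuming m(i)(j) counts instances of true class i predicted as class j; macro averages are simply the means of the per-class values):
{code}
case class Stats(precision: Double, recall: Double, f1: Double)

def perClassStats(m: Array[Array[Long]]): Seq[Stats] =
  m.indices.map { i =>
    val tp = m(i)(i).toDouble
    val fp = m.indices.map(r => m(r)(i)).sum - tp   // column sum minus diagonal
    val fn = m(i).sum - tp                          // row sum minus diagonal
    val p  = if (tp + fp > 0) tp / (tp + fp) else 0.0
    val r  = if (tp + fn > 0) tp / (tp + fn) else 0.0
    val f1 = if (p + r > 0) 2 * p * r / (p + r) else 0.0
    Stats(p, r, f1)
  }
{code}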
[jira] [Updated] (MAHOUT-1553) Fix for run Mahout stuff as oozie java action
[ https://issues.apache.org/jira/browse/MAHOUT-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1553: --- Resolution: Not a Problem Status: Resolved (was: Patch Available) closing this; as Suneel said, it's already fixed Fix for run Mahout stuff as oozie java action - Key: MAHOUT-1553 URL: https://issues.apache.org/jira/browse/MAHOUT-1553 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 0.7 Environment: mahout-core-0.7-cdh4.4.0.jar Reporter: Sergey Attachments: MAHOUT-1553.patch Related to MAHOUT-1498, the problem is the same. The mapred.job.classpath.files property is not correctly pushed down to Mahout MR stuff because of new Configuration usage at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276) at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135) at org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372) at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158) at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005681#comment-14005681 ] Sebastian Schelter commented on MAHOUT-1534: Looks good, I think we should also mention the skipTests option for packaging and add a news entry for that. Add documentation for using Mahout with Hadoop2 to the website -- Key: MAHOUT-1534 URL: https://issues.apache.org/jira/browse/MAHOUT-1534 Project: Mahout Issue Type: Task Components: Documentation Reporter: Sebastian Schelter Assignee: Gokhan Capan Fix For: 1.0 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1556) Mahout for Hadoop2 - HDP2.1.1
[ https://issues.apache.org/jira/browse/MAHOUT-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005720#comment-14005720 ] Sebastian Schelter commented on MAHOUT-1556: You have to use the trunk version, 0.9 does not have the support for Hadoop 2 yet. This page has infos on how to build mahout for Hadoop 2: https://mahout.apache.org/developers/buildingmahout.html Let us know if that doesn't work for you. Mahout for Hadoop2 - HDP2.1.1 - Key: MAHOUT-1556 URL: https://issues.apache.org/jira/browse/MAHOUT-1556 Project: Mahout Issue Type: Dependency upgrade Components: Integration Affects Versions: 0.9 Environment: Ubuntu 12.04, Centos6, Java Oracle 1.7 Reporter: Prabhat K Singh Labels: hadoop2 Fix For: 0.9 Hi, I tried build and install of Mahout0.9 for hadoop HDP2.1.1 as per given methods in https://issues.apache.org/jira/browse/MAHOUT-1329, but I get errors as mentioned below. Method: mvn clean package -Dhadoop.profile=200 -Dhadoop2.version=2.2.0 -Dhbase.version=0.98 mvn clean install -Dhadoop2 -Dhadoop.2.version=2.2.0 mvn clean package -Dhadoop2 -Dhadoop.profile=200 -Dhadoop2.version=2.4.0 -Dhbase.version=0.98 [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project mahout-integration: Compilation failure: Compilation failure: [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[30,31] cannot find symbol [ERROR] symbol: class HBaseConfiguration [ERROR] location: package org.apache.hadoop.hbase [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[33,31] cannot find symbol [ERROR] symbol: class KeyValue [ERROR] location: package org.apache.hadoop.hbase [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[47,36] cannot find symbol [ERROR] symbol: class Bytes [ERROR] location: package org.apache.hadoop.hbase.util [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[91,42] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[92,42] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[107,26] cannot find symbol [ERROR] symbol: variable HBaseConfiguration [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[138,51] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[206,26] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] 
/home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[207,25] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[233,15] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[265,26] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR] /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[266,25] cannot find symbol [ERROR] symbol: variable Bytes [ERROR] location: class org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel [ERROR
[jira] [Updated] (MAHOUT-1554) Provide more comprehensive classification statistics
[ https://issues.apache.org/jira/browse/MAHOUT-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1554: --- Fix Version/s: 1.0 Provide more comprehensive classification statistics Key: MAHOUT-1554 URL: https://issues.apache.org/jira/browse/MAHOUT-1554 Project: Mahout Issue Type: Improvement Components: Classification Reporter: Karol Grzegorczyk Priority: Minor Fix For: 1.0 Attachments: statistics.diff Currently only limited classification statistics are provided. To better understand classification results, it would be worth providing at least average precision, recall and F1 score. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: consensus statement?
Big +1, very nicely captures what I also think --sebastian On 21.05.2014 14:27, Gokhan Capan gkhn...@gmail.com wrote: I want to express my opinions for the vision, too. I tried to capture those words from various discussions in the dev-list, and hope that most of them support the common sense of excitement the new Mahout arouses. To me, the fundamental benefit of the shift that Mahout is undergoing is a better separation of the distributed execution engine, distributed data structures, matrix computations, and algorithms layers, which will allow the users/devs of Mahout with different roles to focus on the relevant parts of the framework: 1. A machine learning scientist, independent from the underlying distributed execution engine, can utilize the matrix language and the decompositions to implement new algorithms (which implies that the current distributed Mahout algorithms are to be rewritten in the matrix language) 2. A math-scala module contributor, for the benefit of higher level algorithms, can add new, or improve existing, functions (the set of decompositions is an example) with optimization plans (such as if two matrices are partitioned in the same way, ...), where the concrete implementations of those optimizations are delegated to the distributed execution engine layer 3. A distributed execution engine author can add machine learning capabilities to her platform with i) a concrete Matrix and Matrix I/O implementation, ii) partitioning, checkpointing, broadcasting behaviors, iii) BLAS 4. A Mahout user with access to a cluster operated by a Mahout-supporting distributed execution engine can run machine learning algorithms implemented on top of the matrix language Best Gokhan On Tue, May 20, 2014 at 8:30 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: inline On Tue, May 20, 2014 at 12:42 AM, Sebastian Schelter s...@apache.org wrote: Let's take the text from our homepage as a starting point. What should we add/remove/modify? The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them. We are building our future implementations on top of a Scala DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark. More platforms are to be added in the future. Furthermore, there is an experimental contribution underway which aims to integrate the h2o platform into Mahout.
[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004958#comment-14004958 ] Sebastian Schelter commented on MAHOUT-1534: I somehow cannot see the staged version unfortunately. Just publish it and I'll have a look. Maybe we should even add an extra page and navigation point for that site, what do you think? Add documentation for using Mahout with Hadoop2 to the website -- Key: MAHOUT-1534 URL: https://issues.apache.org/jira/browse/MAHOUT-1534 Project: Mahout Issue Type: Task Components: Documentation Reporter: Sebastian Schelter Fix For: 1.0 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: consensus statement?
On 05/18/2014 09:28 PM, Ted Dunning wrote: On Sun, May 18, 2014 at 11:33 AM, Sebastian Schelter s...@apache.org wrote: I suggest we start with a specific draft that someone prepares (maybe Ted as he started the thread) This is a good strategy, and I am happy to start the discussion, but I wonder if it might help build consensus if somebody else started the ball rolling. Let's take the text from our homepage as a starting point. What should we add/remove/modify? The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them. We are building our future implementations on top of a DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark. Furthermore, there is an experimental contribution underway which aims to integrate the h2o platform into Mahout.
Re: [jira] [Commented] (MAHOUT-1543) JSON output format for classifying with random forests
Can you create it in an svn-compatible way and check that it works with the current trunk? Thx, Sebastian On 19.05.2014 12:01, larryhu (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001554#comment-14001554] larryhu commented on MAHOUT-1543: - I'm so sorry for your trouble, this patch was created by git; I cloned it from github, tag: mahout-0.7. JSON output format for classifying with random forests -- Key: MAHOUT-1543 URL: https://issues.apache.org/jira/browse/MAHOUT-1543 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 0.7, 0.8, 0.9 Reporter: larryhu Labels: patch Fix For: 0.7 Attachments: MAHOUT-1543.patch This patch adds JSON output format to build random forests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1542) Tutorial for playing with Mahout's Spark shell
[ https://issues.apache.org/jira/browse/MAHOUT-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002271#comment-14002271 ] Sebastian Schelter commented on MAHOUT-1542: No, go ahead, that's a great idea. Tutorial for playing with Mahout's Spark shell -- Key: MAHOUT-1542 URL: https://issues.apache.org/jira/browse/MAHOUT-1542 Project: Mahout Issue Type: Improvement Components: Documentation, Math Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 I have created a tutorial for setting up the spark shell and implementing a simple linear regression algorithm. I'd love to make this part of the website, could someone give it a review? https://github.com/sscdotopen/krams/blob/master/linear-regression-cereals.md PS: If you wanna try out the code, you have to add the patch from MAHOUT-1532 to your sources. -- This message was sent by Atlassian JIRA (v6.2#6252)
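The tutorial's core fits in a few DSL lines. A sketch under assumptions (drmX holds the feature matrix and drmY the single-column target, as in the linked tutorial; solve() is the in-core solver from MAHOUT-1532):
{code}
val drmXtX = drmX.t %*% drmX                       // X'X, computed distributedly
val drmXty = drmX.t %*% drmY                       // X'y
val beta   = solve(drmXtX.collect, drmXty.collect) // solve (X'X) beta = X'y in-core
{code}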
[jira] [Commented] (MAHOUT-1439) Update talks on Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001013#comment-14001013 ] Sebastian Schelter commented on MAHOUT-1439: @tdunning [~dlyubimov] could you add your talks from last month? Update talks on Mahout -- Key: MAHOUT-1439 URL: https://issues.apache.org/jira/browse/MAHOUT-1439 Project: Mahout Issue Type: Bug Components: Documentation Reporter: Sebastian Schelter Fix For: 1.0 The talks listed on our homepage seem to end somewhere in 2012. I know that there have been tons of other talks on Mahout since then, I've added mine already. It would be great if everybody who knows of additional talks would paste them here, so I can add them to the website. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1521) lucene2seq - Error trying to load data from stored field (when non-indexed)
[ https://issues.apache.org/jira/browse/MAHOUT-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001026#comment-14001026 ] Sebastian Schelter commented on MAHOUT-1521: [~frankscholten] what's the status here? lucene2seq - Error trying to load data from stored field (when non-indexed) Key: MAHOUT-1521 URL: https://issues.apache.org/jira/browse/MAHOUT-1521 Project: Mahout Issue Type: Bug Affects Versions: 0.9 Reporter: Terry Blankers Assignee: Frank Scholten Labels: lucene2seq Fix For: 1.0 When using lucene2seq to load data from a field that is stored but not indexed I receive the following error: {noformat}IllegalArgumentException: Field 'body' does not exist in the index{noformat} Field is described in schema.xml as: {noformat}<field name="body" type="string" stored="true" indexed="false"/>{noformat} BTW, field is copied to 'content' field for searching, schema.xml snippet: {noformat}<copyField source="body" dest="content"/>{noformat} Copy field is described in schema.xml as: {noformat}<field name="content" type="text" stored="false" indexed="true" multiValued="true"/>{noformat} If I try to load data from the copy field, lucene2seq runs with no errors but I receive empty data for each key/doc: {noformat}Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text Key: 96C4C76CF9D7449C724CA77CB8F650EAFD33E31C: Value: Key: D6842B81B8D09733B50BEDB4767C2A5C49E43B20: Value:{noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1153) Implement streaming random forests
[ https://issues.apache.org/jira/browse/MAHOUT-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1153. Resolution: Won't Fix no activity for more than a month Implement streaming random forests -- Key: MAHOUT-1153 URL: https://issues.apache.org/jira/browse/MAHOUT-1153 Project: Mahout Issue Type: New Feature Components: Classification Reporter: Andy Twigg Labels: features Fix For: 1.0 The current random forest implementations are in-core and not scalable. This issue is to add an out-of-core, scalable, streaming implementation. Initially it could be based on [1], and using mappers in a master-worker style. [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1544) make Mahout DSL shell depend dynamically on Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001029#comment-14001029 ] Sebastian Schelter commented on MAHOUT-1544: [~avati] What's the status here? make Mahout DSL shell depend dynamically on Spark - Key: MAHOUT-1544 URL: https://issues.apache.org/jira/browse/MAHOUT-1544 Project: Mahout Issue Type: Improvement Reporter: Anand Avati Fix For: 1.0 Attachments: 0001-spark-shell-rename-to-shell.patch, 0002-shell-make-dependency-on-Spark-optional-and-dynamic.patch, 0002-shell-make-dependency-on-Spark-optional-and-dynamic.patch, 0002-shell-make-dependency-on-Spark-optional-and-dynamic.patch Today, Mahout's Scala shell depends on Spark. Create a cleaner separation between the shell and Spark. For example, the in-core scalabindings and operators do not need Spark, so make Spark a runtime add-on to the shell. Similarly, in the future, new distributed backend engines can transparently (dynamically) be made available through the DSL shell. The new shell works, looks and feels exactly like the shell before, but has a cleaner modular architecture. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie
[ https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001014#comment-14001014 ] Sebastian Schelter commented on MAHOUT-1498: [~serega_sheypak] what's the status here? DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie - Key: MAHOUT-1498 URL: https://issues.apache.org/jira/browse/MAHOUT-1498 Project: Mahout Issue Type: Bug Affects Versions: 0.7 Environment: mahout-core-0.7-cdh4.4.0.jar Reporter: Sergey Fix For: 1.0 Hi, I get exception {code} Invocation of Main class completed Failing Oozie Launcher, Main class [org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles], main() threw exception, Job failed! java.lang.IllegalStateException: Job failed! at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329) at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199) at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271) {code} The root cause is: {code} Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247 {code} Looks like it happens because of the DictionaryVectorizer.makePartialVectors method. It has this code: {code} DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf); {code} which overwrites jars pushed with the job by oozie: {code} public static void setCacheFiles(URI[] files, Configuration conf) { String sfiles = StringUtils.uriToString(files); conf.set(mapred.cache.files, sfiles); } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
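One hedged fix (not necessarily what was committed upstream): append the dictionary with addCacheFile instead of replacing the whole list with setCacheFiles, so jars already registered by oozie survive.
{code}
import org.apache.hadoop.filecache.DistributedCache

// Appends to mapred.cache.files instead of overwriting it.
DistributedCache.addCacheFile(dictionaryFilePath.toUri(), conf)
{code}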
[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001020#comment-14001020 ] Sebastian Schelter commented on MAHOUT-1534: Anybody willing to help here? This is important, as a lot of users keep asking about using Mahout with Hadoop2 Add documentation for using Mahout with Hadoop2 to the website -- Key: MAHOUT-1534 URL: https://issues.apache.org/jira/browse/MAHOUT-1534 Project: Mahout Issue Type: Task Components: Documentation Reporter: Sebastian Schelter Fix For: 1.0 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1480) Clean up website on 20 newsgroups
[ https://issues.apache.org/jira/browse/MAHOUT-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001027#comment-14001027 ] Sebastian Schelter commented on MAHOUT-1480: [~Andrew_Palumbo] did you have time yet to make the confusion matrix fit? Clean up website on 20 newsgroups - Key: MAHOUT-1480 URL: https://issues.apache.org/jira/browse/MAHOUT-1480 Project: Mahout Issue Type: Improvement Components: Documentation Reporter: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1480_edit1.patch The website on the twenty newsgroups example needs clean up. We need to go through the text, remove dead links and check whether the information is still consistent with the current code. https://mahout.apache.org/users/clustering/twenty-newsgroups.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1532) Add solve() function to the Scala DSL
[ https://issues.apache.org/jira/browse/MAHOUT-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1532. Resolution: Fixed Add solve() function to the Scala DSL -- Key: MAHOUT-1532 URL: https://issues.apache.org/jira/browse/MAHOUT-1532 Project: Mahout Issue Type: Bug Components: Math Reporter: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1532.patch, MAHOUT-1532.patch We should add a solve() function to the Scala DSL which helps with solving Ax = b for in-core matrices and vectors. -- This message was sent by Atlassian JIRA (v6.2#6252)
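A small usage sketch (import path assumed to follow the math-scala bindings):
{code}
import org.apache.mahout.math.scalabindings._

val a = dense((2.0, 1.0), (1.0, 3.0))  // A
val b = dvec(3.0, 5.0)                 // b
val x = solve(a, b)                    // x such that Ax = b
{code}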
[jira] [Resolved] (MAHOUT-1514) Contact the original Random Forest author
[ https://issues.apache.org/jira/browse/MAHOUT-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1514. Resolution: Won't Fix no answer in four weeks Contact the original Random Forest author - Key: MAHOUT-1514 URL: https://issues.apache.org/jira/browse/MAHOUT-1514 Project: Mahout Issue Type: Task Reporter: Sebastian Schelter Priority: Critical Fix For: 1.0 We should contact the original Random Forest author to ask about maintenance of the implementation. Otherwise, this becomes a candidate for removal. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1522) Handle logging levels via log4j.xml
[ https://issues.apache.org/jira/browse/MAHOUT-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001010#comment-14001010 ] Sebastian Schelter commented on MAHOUT-1522: [~andrew.musselman] what's the status here? Handle logging levels via log4j.xml --- Key: MAHOUT-1522 URL: https://issues.apache.org/jira/browse/MAHOUT-1522 Project: Mahout Issue Type: Bug Affects Versions: 0.9 Reporter: Andrew Musselman Assignee: Andrew Musselman Priority: Critical Fix For: 1.0 We don't have a properties file to tell log4j what to do, so we inherit other frameworks' settings. Suggestion is to add a log4j.xml file in a canonical place and set up logging levels, maybe separating out components for ease of setting levels during debugging. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1252) Add support for Finite State Transducers (FST) as a DictionaryType.
[ https://issues.apache.org/jira/browse/MAHOUT-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001022#comment-14001022 ] Sebastian Schelter commented on MAHOUT-1252: [~drew.farris] what's the status here? Add support for Finite State Transducers (FST) as a DictionaryType. --- Key: MAHOUT-1252 URL: https://issues.apache.org/jira/browse/MAHOUT-1252 Project: Mahout Issue Type: Improvement Components: Integration Affects Versions: 0.7 Reporter: Suneel Marthi Assignee: Suneel Marthi Fix For: 1.0 Add support for Finite State Transducers (FST) as a DictionaryType, this should result in an order of magnitude speedup of seq2sparse. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1545) Creating holdout sets with seq2sparse and split
[ https://issues.apache.org/jira/browse/MAHOUT-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1545. Resolution: Later Fix Version/s: 1.0 Closing this as it is a reminder for things to do in the future. Creating holdout sets with seq2sparse and split --- Key: MAHOUT-1545 URL: https://issues.apache.org/jira/browse/MAHOUT-1545 Project: Mahout Issue Type: Bug Components: Classification, CLI, Examples Affects Versions: 0.9 Reporter: Andrew Palumbo Fix For: 1.0 The current method for vectorizing data using seq2sparse and then split allows for a large amount of information to spill over from the training sets to the test sets, especially in the case of TF-IDF transformations. The IDF transform provides a lot of information on the holdout set to the training set if calculated prior to splitting them up. I'm not sure, given the current seq2sparse implementation's status as Legacy and the relatively minor advantages that it might give, whether or not it's worth adding something like a split option to SparseVectorsFromSequenceFiles.java. But I know that I saw a new implementation being discussed and think that it would be worth it to have an option like this built in. I think that this issue may have been raised before, but I wanted to bring it up again in light of the current move away from MapReduce and the new implementations of Mahout tools that will be coming along. -- This message was sent by Atlassian JIRA (v6.2#6252)
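The leakage is visible in the IDF term itself. With document frequencies computed over the full corpus of $N$ documents before splitting,

$$\mathrm{tfidf}(t,d) \;=\; tf(t,d)\cdot\log\frac{N}{df(t)},$$

every training vector's weights already encode $df(t)$ counted over the held-out documents as well, so the holdout set is not truly unseen.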
[jira] [Commented] (MAHOUT-1470) Topic dump
[ https://issues.apache.org/jira/browse/MAHOUT-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001035#comment-14001035 ] Sebastian Schelter commented on MAHOUT-1470: [~andrew.musselman] what's the status here? Topic dump -- Key: MAHOUT-1470 URL: https://issues.apache.org/jira/browse/MAHOUT-1470 Project: Mahout Issue Type: New Feature Components: Clustering Affects Versions: 1.0 Reporter: Andrew Musselman Assignee: Andrew Musselman Priority: Minor Fix For: 1.0 Per http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCAMc_qaL2DCgbVbam2miNsLpa4qvaA9sMy1-arccF9Nz6ApcsvQ%40mail.gmail.com%3E The script needs to be corrected to not call vectordump for LDA as vectordump utility (or even clusterdump) are presently not capable of displaying topics and relevant documents. I recall this issue was previously reported by Peyman Faratin post 0.9 release. Mahout's missing a clusterdump utility that reads in LDA topics, Document - DocumentId mapping and displays a report of the topics and the documents that belong to a topic. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1453) ImplicitFeedbackAlternatingLeastSquaresSolver add support for user supplied confidence functions
[ https://issues.apache.org/jira/browse/MAHOUT-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1453. Resolution: Won't Fix no activity in four weeks ImplicitFeedbackAlternatingLeastSquaresSolver add support for user supplied confidence functions Key: MAHOUT-1453 URL: https://issues.apache.org/jira/browse/MAHOUT-1453 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Adam Ilardi Assignee: Sebastian Schelter Priority: Minor Labels: newbie, patch, performance Fix For: 1.0 double confidence(double rating) { return 1 + alpha * rating; } The original paper mentions other functions that could be used as well. At the moment it's not easy for a user to change this without recompiling the source. -- This message was sent by Atlassian JIRA (v6.2#6252)
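A hedged sketch of what user-supplied confidence functions could look like (trait and class names are assumptions, not the solver's API): the linear form quoted above plus the logarithmic alternative mentioned in the Hu/Koren/Volinsky paper.
{code}
trait ConfidenceFunction {
  def confidence(rating: Double): Double
}

// c = 1 + alpha * r, the form currently hard-coded in the solver
class LinearConfidence(alpha: Double) extends ConfidenceFunction {
  def confidence(rating: Double): Double = 1 + alpha * rating
}

// c = 1 + alpha * log(1 + r / eps), the alternative from the paper
class LogConfidence(alpha: Double, eps: Double) extends ConfidenceFunction {
  def confidence(rating: Double): Double = 1 + alpha * math.log(1 + rating / eps)
}
{code}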
[jira] [Commented] (MAHOUT-1427) Convert old .mapred API to new .mapreduce
[ https://issues.apache.org/jira/browse/MAHOUT-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001034#comment-14001034 ] Sebastian Schelter commented on MAHOUT-1427: [~smarthi] what's the status here? Convert old .mapred API to new .mapreduce - Key: MAHOUT-1427 URL: https://issues.apache.org/jira/browse/MAHOUT-1427 Project: Mahout Issue Type: Bug Components: Collaborative Filtering, Integration Affects Versions: 0.9 Reporter: Suneel Marthi Assignee: Suneel Marthi Priority: Minor Fix For: 1.0 Attachments: Mahout-1427.patch Migrate code still using the old .mapred to .mapreduce API -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1495) Create a website describing the distributed item-based recommender
[ https://issues.apache.org/jira/browse/MAHOUT-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001017#comment-14001017 ] Sebastian Schelter commented on MAHOUT-1495: [~apsaltis] what's the status here? Create a website describing the distributed item-based recommender -- Key: MAHOUT-1495 URL: https://issues.apache.org/jira/browse/MAHOUT-1495 Project: Mahout Issue Type: Bug Components: Collaborative Filtering, Documentation Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1549) Extracting tfidf-vectors by key
[ https://issues.apache.org/jira/browse/MAHOUT-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001030#comment-14001030 ] Sebastian Schelter commented on MAHOUT-1549: [~Pilgrim] has your question been answered yet? Extracting tfidf-vectors by key --- Key: MAHOUT-1549 URL: https://issues.apache.org/jira/browse/MAHOUT-1549 Project: Mahout Issue Type: Question Components: Classification Affects Versions: 0.7, 0.8, 0.9 Reporter: Richard Scharrer Labels: documentation, features, newbie Hi, I have about 20 tfidf-vectors and I need to extract 500 of them of which I have the keys. Is there some kind of magical option which allows me something like taking the output of mahout seqdumper and transforming it back into a sequencefile that I can use for trainnb/testnb? The sequencefiles of tfidf use the Text class for the keys and the VectorWritable class for the values. I tried https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java with different settings but the output always gives me the Text class for both key and value, which can't be used in trainnb and testnb. I posted this question on: http://stackoverflow.com/questions/23502362/extracting-tfidf-vectors-by-key-without-destroying-the-fileformat I ask this question in here because I've seen similar questions on stackoverflow that were asked last year and still didn't get an answer. I really need this information, so in case you know anything, please tell me. Regards, Richard -- This message was sent by Atlassian JIRA (v6.2#6252)
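In the absence of such an option, a small hedged sketch of what the asker seems to need (Hadoop 1-style SequenceFile API; function name is an assumption): copy only the entries whose keys are in a given set, keeping the Text/VectorWritable format intact so trainnb/testnb accept the result.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{SequenceFile, Text}
import org.apache.mahout.math.VectorWritable

def extractByKey(in: Path, out: Path, keys: Set[String], conf: Configuration): Unit = {
  val fs = FileSystem.get(conf)
  val reader = new SequenceFile.Reader(fs, in, conf)
  val writer = SequenceFile.createWriter(fs, conf, out, classOf[Text], classOf[VectorWritable])
  val key = new Text()
  val value = new VectorWritable()
  try {
    while (reader.next(key, value)) {
      if (keys.contains(key.toString)) writer.append(key, value)  // format preserved
    }
  } finally {
    reader.close()
    writer.close()
  }
}
{code}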
[jira] [Commented] (MAHOUT-1425) SGD classifier example with bank marketing dataset
[ https://issues.apache.org/jira/browse/MAHOUT-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001024#comment-14001024 ] Sebastian Schelter commented on MAHOUT-1425: [~frankscholten] what's the status here? SGD classifier example with bank marketing dataset -- Key: MAHOUT-1425 URL: https://issues.apache.org/jira/browse/MAHOUT-1425 Project: Mahout Issue Type: Improvement Components: Examples Affects Versions: 1.0 Reporter: Frank Scholten Assignee: Frank Scholten Fix For: 1.0 Attachments: MAHOUT-1425.patch As discussed on the mailing list a few weeks back I started working on an SGD classifier example with the bank marketing dataset from UCI: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing See https://github.com/frankscholten/mahout-sgd-bank-marketing Ted has also made further changes that were very useful so I suggest to add this example to Mahout Ted: can you tell a bit more about the log transforms? Some of them are just Math.log while others are more complex expressions. What else is needed to contribute it to Mahout? Anything that could improve the example? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1487) More understandable error message when attempt to use wrong FileSystem
[ https://issues.apache.org/jira/browse/MAHOUT-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1487. Resolution: Won't Fix no activity in four weeks More understandable error message when attempt to use wrong FileSystem -- Key: MAHOUT-1487 URL: https://issues.apache.org/jira/browse/MAHOUT-1487 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.9 Environment: Amazon S3, Amazon EMR, Local file system Reporter: Konstantin Priority: Trivial Fix For: 1.0 RandomSeedGenerator has the following code: FileSystem fs = FileSystem.get(output.toUri(), conf); ... fs.getFileStatus(input).isDir() If the output path is specified correctly but the input path is not, Mahout throws a hard-to-understand error message. Exception in thread main java.lang.IllegalArgumentException: This file system object (hdfs://172.31.41.65:9000) does not support access to the request path 's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path This happens because the FileSystem object was created from the output path, while getFileStatus was given the input path, which makes the resulting message confusing. To prevent this misunderstanding, I propose to improve the error message by adding the following details: 1. Specify which filesystem type is used (DistributedFileSystem, NativeS3FileSystem, etc., using fs.getClass().getName()) 2. Then specify which path cannot be processed correctly. This can be done by a validation utility which can be applied in many places in Mahout. When we use Mahout we need to specify many paths, and we can also use many types of file systems: local for debugging, distributed on Hadoop, and s3 on Amazon. In this case better error messages can save a lot of time. -- This message was sent by Atlassian JIRA (v6.2#6252)
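A hedged sketch of the proposed validation utility (method name is an assumption): open the file system from the path's own URI and put the concrete class and the offending path into the message.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def validatedFileSystem(path: Path, conf: Configuration): FileSystem = {
  val fs = FileSystem.get(path.toUri, conf)  // derived from this path, not another one
  if (!fs.exists(path)) {
    throw new IllegalArgumentException(
      s"${fs.getClass.getName} cannot access '$path'; " +  // concrete filesystem type
        "check that the path's scheme matches the configured file system")
  }
  fs
}
{code}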
Re: Build failed in Jenkins: Mahout-Quality #2608
Could someone check why the build is still failing? On 05/13/2014 01:14 AM, Apache Jenkins Server wrote: See https://builds.apache.org/job/Mahout-Quality/2608/ -- [...truncated 8432 lines...] } Q= { 0 = {0:0.40273861426601687,1:-0.9153150324187648} 1 = {0:0.9153150324227656,1:0.40273861426427493} } - C = A %*% B mapBlock {} - C = A %*% B incompatible B keys 36495 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtB$ - A and B for A'B are not identically partitioned, performing inner join. - C = At %*% B , join 37989 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtB$ - A and B for A'B are not identically partitioned, performing inner join. - C = At %*% B , join, String-keyed 39452 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtB$ - A and B for A'B are identically distributed, performing row-wise zip. - C = At %*% B , zippable, String-keyed { 2 = {0:62.0,1:86.0,3:132.0,2:115.0} 1 = {0:50.0,1:69.0,3:105.0,2:92.0} 3 = {0:74.0,1:103.0,3:159.0,2:138.0} 0 = {0:26.0,1:35.0,3:51.0,2:46.0} } - C = A %*% inCoreB { 0 = {0:26.0,1:35.0,2:46.0,3:51.0} 1 = {0:50.0,1:69.0,2:92.0,3:105.0} 2 = {0:62.0,1:86.0,2:115.0,3:132.0} 3 = {0:74.0,1:103.0,2:138.0,3:159.0} } - C = inCoreA %*%: B 43683 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtA$ - Applying slim A'A. - C = A.t %*% A 45370 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtA$ - Applying non-slim non-graph A'A. 70680 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings - test done. - C = A.t %*% A fat non-graph 71986 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtA$ - Applying slim A'A. - C = A.t %*% A non-int key - C = A + B - C = A + B side test 1 - C = A + B side test 2 - C = A + B side test 3 ArrayBuffer(0, 1, 2, 3, 4) ArrayBuffer(0, 1, 2, 3, 4) - general side - Ax - A'x - colSums, colMeans Run completed in 1 minute, 31 seconds. Total number of tests run: 38 Suites: completed 9, aborted 0 Tests: succeeded 38, failed 0, canceled 0, ignored 0, pending 0 All tests passed. [INFO] [INFO] --- build-helper-maven-plugin:1.8:remove-project-artifact (remove-old-mahout-artifacts) @ mahout-spark --- [INFO] /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark removed.
[INFO] [INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ mahout-spark --- [INFO] Building jar: /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT.jar [INFO] [INFO] --- maven-jar-plugin:2.4:test-jar (default) @ mahout-spark --- [INFO] Building jar: /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-tests.jar [INFO] [INFO] --- maven-source-plugin:2.2.1:jar-no-fork (attach-sources) @ mahout-spark --- [INFO] Building jar: /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-sources.jar [INFO] [INFO] --- maven-install-plugin:2.5.1:install (default-install) @ mahout-spark --- [INFO] Installing /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT.jar to /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT.jar [INFO] Installing /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/pom.xml to /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT.pom [INFO] Installing /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-tests.jar to /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT-tests.jar [INFO] Installing /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-sources.jar to /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT-sources.jar [INFO] [INFO] maven-javadoc-plugin:2.9.1:javadoc (default-cli) @ mahout-spark [INFO] [INFO] --- build-helper-maven-plugin:1.8:add-source (add-source) @ mahout-spark --- [INFO] Source directory: /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/generated-sources/mahout added. [INFO] [INFO] --- build-helper-maven-plugin:1.8:add-test-source (add-test-source) @ mahout-spark --- [INFO] Test Source directory: /x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/generated-test-sources/mahout added. [INFO] [INFO] maven-javadoc-plugin:2.9.1:javadoc (default-cli) @
[jira] [Resolved] (MAHOUT-1484) Spectral algorithm for HMMs
[ https://issues.apache.org/jira/browse/MAHOUT-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1484. Resolution: Won't Fix no activity in four weeks Spectral algorithm for HMMs --- Key: MAHOUT-1484 URL: https://issues.apache.org/jira/browse/MAHOUT-1484 Project: Mahout Issue Type: New Feature Reporter: Emaad Manzoor Priority: Minor Following up with this [comment|https://issues.apache.org/jira/browse/MAHOUT-396?focusedCommentId=12898284page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12898284] by [~isabel] on the sequential HMM [proposal|https://issues.apache.org/jira/browse/MAHOUT-396], is there any interest in a spectral algorithm as described in: A spectral algorithm for learning hidden Markov models (D. Hsu, S. Kakade, T. Zhang)? I would like to take up this effort. This will enable learning the parameters of and making predictions with a HMM in a single step. At its core, the algorithm involves computing estimates from triples of observations, performing an SVD and then some matrix multiplications. This could also form the base for an implementation of Hilbert Space Embeddings of Hidden Markov Models (L. Song, B. Boots, S. Saddiqi, G. Gordon, A. Smola). -- This message was sent by Atlassian JIRA (v6.2#6252)
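Sketched from the Hsu-Kakade-Zhang paper (their notation, reproduced here only for orientation): estimate the unigram vector $\hat P_1$, the bigram co-occurrence matrix $\hat P_{2,1}$ and the trigram slices $\hat P_{3,x,1}$ from observation triples, take $\hat U$ as the top-$m$ left singular vectors of $\hat P_{2,1}$, and form the observable operators

$$\hat b_1 = \hat U^\top \hat P_1,\qquad \hat b_\infty = \bigl(\hat P_{2,1}^\top \hat U\bigr)^{+}\hat P_1,\qquad \hat B_x = \bigl(\hat U^\top \hat P_{3,x,1}\bigr)\bigl(\hat U^\top \hat P_{2,1}\bigr)^{+},$$

which suffice to compute sequence probabilities without explicitly recovering the HMM's transition and emission matrices.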
[jira] [Commented] (MAHOUT-1536) Update Creating vectors from text page
[ https://issues.apache.org/jira/browse/MAHOUT-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001036#comment-14001036 ] Sebastian Schelter commented on MAHOUT-1536: [~Andrew_Palumbo] did you have time to work on this yet? Update Creating vectors from text page Key: MAHOUT-1536 URL: https://issues.apache.org/jira/browse/MAHOUT-1536 Project: Mahout Issue Type: Bug Components: Documentation Affects Versions: 0.9 Reporter: Andrew Palumbo Priority: Minor Fix For: 1.0 Attachments: MAHOUT-1536_edit1.patch At least the seq2sparse section of the Creating vectors from text page is out of date. https://mahout.apache.org/users/basics/creating-vectors-from-text.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1528) Source tag and source release tarball for 0.9 don't exactly match
[ https://issues.apache.org/jira/browse/MAHOUT-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1528. Resolution: Later Thank you for raising this issue, we will keep that in mind for the next release. Especially the CHANGELOG file should be part of the distribution! Source tag and source release tarball for 0.9 don't exactly match - Key: MAHOUT-1528 URL: https://issues.apache.org/jira/browse/MAHOUT-1528 Project: Mahout Issue Type: Bug Components: build Affects Versions: 0.9 Reporter: Mark Grover If you download the source tarball for the Apache Mahout 0.9 release, you'd notice that it doesn't contain CHANGELOG or .gitignore file. However, if you look at the tag for the release in the github repo (https://github.com/apache/mahout/tree/mahout-0.9), you'd notice both the files there. I think, both as a best practice and to make life of downstream integrators less miserable, it would be fantastic if we could have the release tag in the source match one to one with the source code in the released source tarball. A test to do this in particular, would be awesome! Thanks in advance! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1388) Add command line support and logging for MLP
[ https://issues.apache.org/jira/browse/MAHOUT-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001021#comment-14001021 ] Sebastian Schelter commented on MAHOUT-1388: [~yxjiang] what's the status here? Add command line support and logging for MLP Key: MAHOUT-1388 URL: https://issues.apache.org/jira/browse/MAHOUT-1388 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 1.0 Reporter: Yexi Jiang Assignee: Suneel Marthi Labels: mlp, sgd Fix For: 1.0 Attachments: Mahout-1388.patch, Mahout-1388.patch The user should have the ability to run the Perceptron from the command line. There are two programs to execute MLP, the training and labeling. The first one takes the data as input and outputs the model, the second one takes the model and unlabeled data as input and outputs the results. The parameters for training are as follows: --input -i (input data) --skipHeader -sk // whether to skip the first row, this parameter is optional --labels -labels // the labels of the instances, separated by whitespace. Take the iris dataset for example, the labels are 'setosa versicolor virginica'. --model -mo // in training mode, this is the location to store the model (if the specified location has an existing model, it will update the model through incremental learning), in labeling mode, this is the location to store the result --update -u // whether to incrementally update the model; if this parameter is not given, train the model from scratch --output -o // this is only useful in labeling mode --layersize -ls (no. of units per hidden layer) // use whitespace separated numbers to indicate the number of neurons in each layer (including input layer and output layer), e.g. '5 3 2'. --squashingFunction -sf // currently only supports Sigmoid --momentum -m --learningrate -l --regularizationweight -r --costfunction -cf // the type of cost function. For example, to train a 3-layer (including input, hidden, and output) MLP with 0.1 learning rate, 0.1 momentum rate, and 0.01 regularization weight, the command would be: mlp -i /tmp/training-data.csv -labels setosa versicolor virginica -o /tmp/model.model -ls 5,3,1 -l 0.1 -m 0.1 -r 0.01 This command would read the training data from /tmp/training-data.csv and write the trained model to /tmp/model.model. The parameters for labeling are as follows: - --input -i // input file path --columnRange -cr // the range of columns used for features, starting from 0 and separated by whitespace, e.g. 0 5 --format -f // the format of the input file, currently only supports csv --model -mo // the file path of the model --output -o // the output path for the results - If a user needs to use an existing model, they would use the following command: mlp -i /tmp/unlabel-data.csv -m /tmp/model.model -o /tmp/label-result Moreover, we should be providing default values if the user does not specify any. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1385) Caching Encoders don't cache
[ https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001032#comment-14001032 ] Sebastian Schelter commented on MAHOUT-1385: [~awmanoj] what's the status here? Caching Encoders don't cache Key: MAHOUT-1385 URL: https://issues.apache.org/jira/browse/MAHOUT-1385 Project: Mahout Issue Type: Bug Affects Versions: 0.8 Reporter: Johannes Schulte Priority: Minor Fix For: 1.0 Attachments: MAHOUT-1385-test.patch, MAHOUT-1385.patch The Caching... line of encoders contains code for caching the hash codes of terms added to the vector. However, the method hashForProbe inside these classes is never called, as its signature takes a String for the original form parameter (instead of byte[] like the other encoders). Changing this to byte[], however, would lose Java's internal caching of a String's hash code, which is used as a key in the cache map, triggering another hash code calculation. -- This message was sent by Atlassian JIRA (v6.2#6252)
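A self-contained toy (deliberately not the real Mahout classes) that shows the mechanism behind this bug: a subclass method with the same name but a String parameter is an overload, not an override, so the superclass keeps calling its own byte[] variant and the caching path is never reached.
{code}
class Encoder {
  protected int hashForProbe(byte[] originalForm, int probe) {
    return 1; // the uncached path the superclass always uses
  }

  int probe(byte[] originalForm) {
    return hashForProbe(originalForm, 0); // statically binds to the byte[] variant
  }
}

class CachingEncoder extends Encoder {
  // Same name, different parameter type: an overload the superclass never calls.
  protected int hashForProbe(String originalForm, int probe) {
    return 2; // the "caching" path, effectively dead code
  }
}

public class OverloadDemo {
  public static void main(String[] args) {
    System.out.println(new CachingEncoder().probe("x".getBytes())); // prints 1, not 2
  }
}
{code}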
[jira] [Commented] (MAHOUT-1485) Clean up Recommender Overview page
[ https://issues.apache.org/jira/browse/MAHOUT-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001089#comment-14001089 ] Sebastian Schelter commented on MAHOUT-1485: [~yash...@gmail.com] Yash, the documentation looks great. Could you create a Markdown version of it so that we can add it to the Mahout website? Clean up Recommender Overview page -- Key: MAHOUT-1485 URL: https://issues.apache.org/jira/browse/MAHOUT-1485 Project: Mahout Issue Type: Improvement Components: Documentation Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 Clean up the recommender overview page, remove outdated content, and make sure the examples work. https://mahout.apache.org/users/recommender/recommender-documentation.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1542) Tutorial for playing with Mahout's Spark shell
[ https://issues.apache.org/jira/browse/MAHOUT-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1542. Resolution: Fixed Added to the website. I also added a new top navigation point called Spark. Shout if you don't like that naming. Tutorial for playing with Mahout's Spark shell -- Key: MAHOUT-1542 URL: https://issues.apache.org/jira/browse/MAHOUT-1542 Project: Mahout Issue Type: Improvement Components: Documentation, Math Reporter: Sebastian Schelter Assignee: Sebastian Schelter Fix For: 1.0 I have created a tutorial for setting up the Spark shell and implementing a simple linear regression algorithm. I'd love to make this part of the website; could someone give it a review? https://github.com/sscdotopen/krams/blob/master/linear-regression-cereals.md PS: If you want to try out the code, you have to add the patch from MAHOUT-1532 to your sources. -- This message was sent by Atlassian JIRA (v6.2#6252)
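For readers who want the gist without clicking through: assuming the tutorial takes the usual normal-equations route to ordinary least squares (fitting $y \approx X\beta$), the estimator it computes is

\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y

Only $X^{\top} X$ and $X^{\top} y$ need to be computed over the distributed matrix; the resulting small system can then be solved locally.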
[jira] [Commented] (MAHOUT-1543) JSON output format for classifying with random forests
[ https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001088#comment-14001088 ] Sebastian Schelter commented on MAHOUT-1543: [~larryhu] I'm having trouble applying your patch to the sources checked out from SVN. Could you check that the patch is SVN-compatible? Sorry for the trouble. JSON output format for classifying with random forests -- Key: MAHOUT-1543 URL: https://issues.apache.org/jira/browse/MAHOUT-1543 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 0.7, 0.8, 0.9 Reporter: larryhu Labels: patch Fix For: 0.7 Attachments: MAHOUT-1543.patch This patch adds a JSON output format for building random forests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1527) Fix wikipedia classifier example
[ https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1527: --- Resolution: Fixed Status: Resolved (was: Patch Available) Committed this with minor changes (fixing a few typos and adding a check that MAHOUT_HOME is set). Thank you, Andrew, keep up the outstanding work. Fix wikipedia classifier example Key: MAHOUT-1527 URL: https://issues.apache.org/jira/browse/MAHOUT-1527 Project: Mahout Issue Type: Task Components: Classification, Documentation, Examples Affects Versions: 0.7, 0.8, 0.9 Reporter: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1527.patch The examples package has a classification showcase for predicting the labels of Wikipedia pages. Unfortunately, the example is totally broken: it relies on the old NB implementation, which has been removed; it suggests using the whole of Wikipedia as input, which will not work well on a single machine; and the documentation uses commands that have long been removed from bin/mahout. The example needs to be updated to use the current naive Bayes implementation, and documentation on the website needs to be written. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1385) Caching Encoders don't cache
[ https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1385: --- Resolution: Fixed Status: Resolved (was: Patch Available) I agree, Johannes is right that ideally we would want to leverage the hash-code caching of Strings. But the current code is a non-working implementation, which this patch fixes, so I'm committing this for now. Caching Encoders don't cache Key: MAHOUT-1385 URL: https://issues.apache.org/jira/browse/MAHOUT-1385 Project: Mahout Issue Type: Bug Affects Versions: 0.8 Reporter: Johannes Schulte Priority: Minor Fix For: 1.0 Attachments: MAHOUT-1385-test.patch, MAHOUT-1385.patch The Caching... line of encoders contains code for caching the hash codes of terms added to the vector. However, the method hashForProbe inside these classes is never called, as its signature takes a String for the original form parameter (instead of byte[] like the other encoders). Changing this to byte[], however, would lose Java's internal caching of a String's hash code, which is used as a key in the cache map, triggering another hash code calculation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie
[ https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1498: --- Resolution: Fixed Assignee: Sebastian Schelter Status: Resolved (was: Patch Available) Committed with a few cosmetic changes, thank you for the contribution! DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie - Key: MAHOUT-1498 URL: https://issues.apache.org/jira/browse/MAHOUT-1498 Project: Mahout Issue Type: Bug Affects Versions: 0.7 Environment: mahout-core-0.7-cdh4.4.0.jar Reporter: Sergey Assignee: Sebastian Schelter Labels: patch Fix For: 1.0 Attachments: MAHOUT-1498.patch Hi, I get this exception:
{code}
Invocation of Main class completed Failing Oozie Launcher, Main class [org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles], main() threw exception, Job failed!
java.lang.IllegalStateException: Job failed!
at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
{code}
The root cause is:
{code}
Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247
{code}
Looks like it happens because of the DictionaryVectorizer.makePartialVectors method. It has this code:
{code}
DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
{code}
which overwrites the jars pushed with the job by Oozie:
{code}
public static void setCacheFiles(URI[] files, Configuration conf) {
  String sfiles = StringUtils.uriToString(files);
  conf.set("mapred.cache.files", sfiles);
}
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
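A sketch of one way to avoid the overwrite (an illustration of the approach, not necessarily the exact change that was committed): Hadoop's DistributedCache.addCacheFile appends a single URI to mapred.cache.files instead of replacing the whole list, so jars registered earlier (for example by Oozie) survive. The class and method names here are hypothetical wrappers around that call.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class DictionaryCacheFix {

  // Register the dictionary file without clobbering previously registered
  // cache files (e.g. jars that Oozie pushed with the job).
  static void registerDictionary(Path dictionaryFilePath, Configuration conf) {
    // was: DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
    DistributedCache.addCacheFile(dictionaryFilePath.toUri(), conf);
  }
}
{code}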
Re: consensus statement?
I think it is important to formulate such a statement and send it out to the outside world. But we should focus the discussion. I suggest we start with a specific draft that someone prepares (maybe Ted, as he started the thread) and then we can discuss and reformulate the individual sentences. I also think the formulation "the committers work on Spark" is not precise enough (and neglects a lot of our goals), but I also don't think it was meant to be part of an official statement in that exact wording. --sebastian

On 05/18/2014 07:44 PM, Pat Ferrel wrote: Not sure why you address this to me. I agree with most of your statements. I think Ted’s intent was to find a simple consensus statement that addresses where the project is going in a general way. I look at it as something to communicate to the outside world. Why? We are rejecting new MapReduce code. This was announced as a project-wide rule and has already been used to reject one contribution I know of. OK, what replaces Hadoop MapReduce? What, therefore, should contributors look to as a model if not Hadoop MapReduce? Do we give no advice or comment on this question? For example, I’m doing drivers that read and write text files. This is quite tightly coupled to Spark. Possible contributors should know that this is OK, that it will not be rejected, and that this is indeed where most of the engine-specific work is being done by committers. You are right, most of us know what we are doing, but simply to say “no more MapReduce” without offering an alternative isn’t quite fair to everyone else. You are abstracting your code away from a specific engine, and that is great, but in practice anyone running it currently must run Spark. This also needs to be communicated. It’s as practical as answering, “What do I need to install to make Mahout 1.0-snapshot work?”

On May 15, 2014, at 7:17 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Pat, it can be as high-level or as detailed as it needs to be, I don't care, as long as it doesn't contain misstatements. It can simply state that we adhere to Apache's "power of doing" principle and accept new contributions. This is OK with me. But, as offered, it does try to enumerate strategic directions, and in doing so its wording is either vague, or incomplete, or just wrong. For example, it says "it is clear that what the committers are working on is Spark". This is less than accurate. First, if I interpret it literally, it is wrong, as our committers for the most part are not working on Spark, and even if they do, to whatever negligible degree that exists, why would Mahout care? Second, if it is meant to say we develop algorithms for Spark, this is also wrong, because whatever algorithms we have added to date have 0 Spark dependencies. Third, if it is meant to say that the majority of what we are working on is Spark bindings, this is still incorrect. Headcount-wise, the Mahout-math tweaks and Scala enablement were at least as big an effort. The Hadoop 2.0 work was at least as big. Documentation and tutorial work was the absolute leader, headcount-wise, to date. The problem I am trying to explain here is that we obviously know internally what we are doing; but this is for external consumption, so we have to be careful to avoid miscommunication. It is easy for us to pass on less-than-accurate information precisely because we already know what we are doing, and therefore our brain is happy to jump to conclusions and make up the missing connections between what is stated and what is implied.
But for an outsider, this would sound vague or lead him to make the wrong connections.

On Wed, May 7, 2014 at 9:54 AM, Pat Ferrel pat.fer...@gmail.com wrote: This doesn’t seem to be a vision statement. I was +1 to a simple consensus statement. The vision is up to you. We have an interactive shell that scales to huge datasets without resorting to massive subsampling. One that allows you to deal with the exact data your black-box algos work on. Every data tool has an interactive mode except Mahout; now it does. Virtually every complex transform, as well as basic linear algebra, works on massive datasets. The interactivity will allow people to do things with Mahout they could never do before. We also have the building blocks to make the fastest, most flexible, cutting-edge collaborative filtering + metadata recommenders in the world. Honestly, I don’t see anything like this elsewhere. We will also be able to fit into virtually any workflow and directly consume data produced in those systems with no intermediate scrubbing. This has never happened before in Mahout, and I don’t see it in MLlib either. Even the interactive shell will benefit from this. Other feature champions will be able to add to this list. Seems like the vision comes from feature champions. I may not use Mahout in the same way you do, but I rely on your code. Maybe I serve a different user type than you. I don’t see a problem with that, do you? On May 6, 2014, at 2:32 PM, Dmitriy Lyubimov
[jira] [Commented] (MAHOUT-1527) Fix wikipedia classifier example
[ https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001165#comment-14001165 ] Sebastian Schelter commented on MAHOUT-1527: Definitely. More examples and more documentation are always welcome :) Fix wikipedia classifier example Key: MAHOUT-1527 URL: https://issues.apache.org/jira/browse/MAHOUT-1527 Project: Mahout Issue Type: Task Components: Classification, Documentation, Examples Affects Versions: 0.7, 0.8, 0.9 Reporter: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1527.patch The examples package has a classification showcase for predicting the labels of Wikipedia pages. Unfortunately, the example is totally broken: it relies on the old NB implementation, which has been removed; it suggests using the whole of Wikipedia as input, which will not work well on a single machine; and the documentation uses commands that have long been removed from bin/mahout. The example needs to be updated to use the current naive Bayes implementation, and documentation on the website needs to be written. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MAHOUT-1388) Add command line support and logging for MLP
[ https://issues.apache.org/jira/browse/MAHOUT-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1388: --- Resolution: Fixed Status: Resolved (was: Patch Available) Committed your patch with cosmetic changes, thank you. Could you open another JIRA for adding documentation on how to use the MLP from the command line? Add command line support and logging for MLP Key: MAHOUT-1388 URL: https://issues.apache.org/jira/browse/MAHOUT-1388 Project: Mahout Issue Type: Improvement Components: Classification Affects Versions: 1.0 Reporter: Yexi Jiang Assignee: Suneel Marthi Labels: mlp, sgd Fix For: 1.0 Attachments: Mahout-1388.patch, Mahout-1388.patch The user should have the ability to run the Perceptron from the command line. There are two programs to execute the MLP: training and labeling. The first takes the data as input and outputs the model; the second takes the model and unlabeled data as input and outputs the results. The parameters for training are as follows:
--input -i (input data)
--skipHeader -sk // whether to skip the first row; this parameter is optional
--labels -labels // the labels of the instances, separated by whitespace. Taking the iris dataset as an example, the labels are 'setosa versicolor virginica'.
--model -mo // in training mode, this is the location to store the model (if the specified location has an existing model, it will update the model through incremental learning); in labeling mode, this is the location to store the result
--update -u // whether to incrementally update the model; if this parameter is not given, the model is trained from scratch
--output -o // this is only useful in labeling mode
--layersize -ls (no. of units per hidden layer) // use whitespace-separated numbers to indicate the number of neurons in each layer (including the input and output layers), e.g. '5 3 2'.
--squashingFunction -sf // currently only supports Sigmoid
--momentum -m
--learningrate -l
--regularizationweight -r
--costfunction -cf // the type of cost function
For example, to train a 3-layer (including input, hidden, and output) MLP with a 0.1 learning rate, 0.1 momentum rate, and 0.01 regularization weight, the parameters would be: mlp -i /tmp/training-data.csv -labels setosa versicolor virginica -o /tmp/model.model -ls 5,3,1 -l 0.1 -m 0.1 -r 0.01 This command would read the training data from /tmp/training-data.csv and write the trained model to /tmp/model.model. The parameters for labeling are as follows:
--input -i // input file path
--columnRange -cr // the range of columns used as features, starting from 0 and separated by whitespace, e.g. 0 5
--format -f // the format of the input file; currently only csv is supported
--model -mo // the file path of the model
--output -o // the output path for the results
If a user needs to use an existing model, they can use the following command: mlp -i /tmp/unlabel-data.csv -m /tmp/model.model -o /tmp/label-result Moreover, we should provide default values if the user does not specify any. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1527) Fix wikipedia classifier example
[ https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998664#comment-13998664 ] Sebastian Schelter commented on MAHOUT-1527: I'll have a look at this on the weekend. Fix wikipedia classifier example Key: MAHOUT-1527 URL: https://issues.apache.org/jira/browse/MAHOUT-1527 Project: Mahout Issue Type: Task Components: Classification, Documentation, Examples Affects Versions: 0.7, 0.8, 0.9 Reporter: Sebastian Schelter Fix For: 1.0 Attachments: MAHOUT-1527.patch The examples package has a classification showcase for predicting the labels of Wikipedia pages. Unfortunately, the example is totally broken: it relies on the old NB implementation, which has been removed; it suggests using the whole of Wikipedia as input, which will not work well on a single machine; and the documentation uses commands that have long been removed from bin/mahout. The example needs to be updated to use the current naive Bayes implementation, and documentation on the website needs to be written. -- This message was sent by Atlassian JIRA (v6.2#6252)