Re: [ANNOUNCE] Andrew Musselman, New Mahout PMC Chair

2018-07-19 Thread Sebastian Schelter
Congrats!

2018-07-19 9:31 GMT+02:00 Peng Zhang :

> Congrats Andrew!
>
> On Thu, Jul 19, 2018 at 04:01 Andrew Musselman 
> wrote:
>
> > Thanks Andy, looking forward to it! Thank you too for your support and
> > dedication the past two years; here's to continued progress!
> >
> > Best
> > Andrew
> >
> > On Wed, Jul 18, 2018 at 1:30 PM, Andrew Palumbo 
> > wrote:
> > > Please join me in congratulating Andrew Musselman as the new Chair of
> > > the
> > > Apache Mahout Project Management Committee. I would like to thank
> > > Andrew
> > > for stepping up; all of us who have worked with him over the years know
> > > his dedication to the project to be invaluable. I look forward to Andrew
> > > taking the project into the future.
> > >
> > > Thank you,
> > >
> > > Andy
> >
>


Samsara at Sigmod

2017-05-16 Thread Sebastian

Hi,

Samsara is mentioned in a tutorial on large-scale ML at Sigmod:

https://dl.acm.org/citation.cfm?id=3054775

Best,
Sebastian


[jira] [Commented] (MAHOUT-1884) Allow specification of dimensions of a DRM

2016-10-04 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546112#comment-15546112
 ] 

Sebastian Schelter commented on MAHOUT-1884:


I know that this is already supported internally; I want to expose it as 
optional parameters to drmDfsRead. I disagree that caching an input matrix upon 
reading is always intended by the user; at the very least, I want to retain 
control over what is cached and what is not.
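
A sketch of the proposed extension (hypothetical signature and parameter names; 
the current drmDfsRead only takes a path and a parallelism hint):

    import org.apache.mahout.math.drm._

    // Hypothetical extension -- nrow/ncol are what this ticket proposes,
    // not the existing API. Negative defaults mean "compute as before".
    def drmDfsRead(path: String, nrow: Long = -1L, ncol: Int = -1)
                  (implicit ctx: DistributedContext): CheckpointedDrm[_] = ???

    // The caller already knows the dimensions, so no pass over the data
    // (and no implicit caching) is needed:
    // val drm = drmDfsRead("hdfs:///data/A", nrow = 1000000L, ncol = 1000)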

> Allow specification of dimensions of a DRM
> --
>
> Key: MAHOUT-1884
> URL: https://issues.apache.org/jira/browse/MAHOUT-1884
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.12.2
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
>Priority: Minor
>
> Currently, in many cases, a DRM must be read to compute its dimensions when a 
> user calls nrow or ncol. This also implicitly caches the corresponding DRM.
> In some cases, the user actually knows the matrix dimensions (e.g., when the 
> matrices are synthetically generated, or when some metadata about them is 
> known). In such cases, the user should be able to specify the dimensions upon 
> creating the DRM and the caching should be avoided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1884) Allow specification of dimensions of a DRM

2016-10-02 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1884:
--

 Summary: Allow specification of dimensions of a DRM
 Key: MAHOUT-1884
 URL: https://issues.apache.org/jira/browse/MAHOUT-1884
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.12.2
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
Priority: Minor


Currently, in many cases, a DRM must be read to compute its dimensions when a 
user calls nrow or ncol. This also implicitly caches the corresponding DRM.

In some cases, the user actually knows the matrix dimensions (e.g., when the 
matrices are synthetically generated, or when some metadata about them is 
known). In such cases, the user should be able to specify the dimensions upon 
creating the DRM and the caching should be avoided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Traits for a mahout algorithm Library.

2016-07-21 Thread Sebastian

Hi Andrew,

I think this topic is broader than just defining a few traits. A popular 
way of integrating ML algorithms is via the combination of dataframes 
and pipelines, similar to what scikit-learn and SparkML are offering at 
the moment. Maybe it would make sense to integrate with what they have 
instead of starting our own effort?


Best,
Sebastian



On 21.07.2016 04:35, Andrew Palumbo wrote:

Hi All,


I'd like to draw your attention to MAHOUT-1856:  
https://issues.apache.org/jira/browse/MAHOUT-1856


This is a discussion that has popped up several times over the last couple of 
years. As we move towards building out our algorithm library, it would be great 
to nail this down now.


Most importantly, we want to avoid being criticized as "a loose bag of 
algorithms", as we sometimes have been in the past.


The main point is that it would be good to lay out common traits for 
Classification, Clustering, and Optimization algorithms.


This is just a start. I created this issue a few months back and intentionally 
left off Recommenders, because I was unsure whether there were common traits 
across them. By traits, I am referring both to the literal meaning and, more 
specifically, to actual Scala traits.
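
To make this concrete, a minimal sketch of what such traits might look like 
(hypothetical names and signatures, just to seed the discussion):

    import org.apache.mahout.math.Vector
    import org.apache.mahout.math.drm.DrmLike

    // Hypothetical placeholders, not an agreed-upon API.
    trait Model {
      def predict(instance: Vector): Double
    }

    trait SupervisedFitter[M <: Model] {
      def fit(features: DrmLike[Int], targets: DrmLike[Int]): M
    }

    trait UnsupervisedFitter[M <: Model] {
      def fit(data: DrmLike[Int]): M
    }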


@pat, @tdunning, @ssc, could you give your thoughts on this?


As well, it would be good to add online flavors of different algorithm classes 
into the mix.


@tdunning could you share some thoughts here?


Trevor Grant will be heading up this effort, and it would be great if we all, as a 
team, could come up with abstract design plans for each class of algorithm, as well 
as determine the current "classes of algorithms", since each of us has our own 
unique blend of specializations and can give our thoughts on this.


Currently this is really the opening of the conversation.


It would be best to post thoughts on: 
https://issues.apache.org/jira/browse/MAHOUT-1856


Any feedback is welcomed.


Thanks,


Andy





[jira] [Commented] (MAHOUT-1748) Mahout DSL for Flink: switch to Flink Scala API

2015-06-24 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599247#comment-14599247
 ] 

Sebastian Schelter commented on MAHOUT-1748:


+1 makes sense

> Mahout DSL for Flink: switch to Flink Scala API
> ---
>
> Key: MAHOUT-1748
> URL: https://issues.apache.org/jira/browse/MAHOUT-1748
> Project: Mahout
>  Issue Type: Task
>  Components: Math
>Affects Versions: 0.10.2
>Reporter: Alexey Grigorev
>Priority: Minor
>
> In Flink-Mahout (MAHOUT-1570), the Flink Java API is used because the Scala API 
> caused various strange compilation problems. 
> But the Scala API handles types better than the Flink Java API, so it's better 
> to switch to it. This may also solve MAHOUT-1747.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: JIRA's with no commits

2015-06-18 Thread Sebastian

The ASF mandates that all relevant discussions happen on the mailing list.

Best,
Sebastian

On 18.06.2015 10:44, Andrew Musselman wrote:

What do you mean no-go, that there's no reasonable way to incorporate
discussion from other channels to the list?

On Thu, Jun 18, 2015 at 1:21 AM, Sebastian  wrote:


Having these discussions in a non-public environment prevents all
non-invited people (e.g. all non-committers) from participating in the
development. I think this is a huge no-go.

Best,
Sebastian


On 18.06.2015 09:43, Ted Dunning wrote:


On Thu, Jun 18, 2015 at 12:36 AM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

Capturing discussion in a public format and archiving the discussion would
be preferable to fragmenting across lists, PR comments, and Slack, but the
tools are all valuable, and until we find a way to build a digest for the
archives I support using them all.



Actually, capturing the design discussion on the list is not just
preferable.

It is required.

Using alternative tools is fine and all, but not if it compromises that
core requirement.






Re: JIRA's with no commits

2015-06-18 Thread Sebastian
Having these discussions in a non-public environment prevents all 
non-invited people (e.g. all non-committers) from participating in the 
development. I think this is a huge no-go.


Best,
Sebastian

On 18.06.2015 09:43, Ted Dunning wrote:

On Thu, Jun 18, 2015 at 12:36 AM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:


Capturing discussion in a public format and archiving the discussion would
be preferable to fragmenting across lists, PR comments, and Slack, but the
tools are all valuable, and until we find a way to build a digest for the
archives I support using them all.



Actually, capturing the design discussion on the list is not just
preferable.

It is required.

Using alternative tools is fine and all, but not if it compromises that
core requirement.



[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct

2015-06-14 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585002#comment-14585002
 ] 

Sebastian Schelter commented on MAHOUT-1739:


The FileItemSimilarity class reads the output of ItemSimilarityJob. You can 
then use the resulting ItemSimilarity with Mahout's recommenders.

[1] 
https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/cf/taste/impl/similarity/file/FileItemSimilarity.java
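
For example (a minimal sketch using the Taste APIs from Scala; file paths are 
placeholders):

    import java.io.File
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender
    import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity

    // Load the similarities precomputed by ItemSimilarityJob and plug them
    // into an item-based recommender.
    val model = new FileDataModel(new File("/path/to/interactions.csv"))
    val similarity = new FileItemSimilarity(new File("/path/to/item-similarities.txt"))
    val recommender = new GenericItemBasedRecommender(model, similarity)
    val topFive = recommender.recommend(42L, 5) // top-5 items for user 42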

> maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct
> 
>
> Key: MAHOUT-1739
> URL: https://issues.apache.org/jira/browse/MAHOUT-1739
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.10.0
>Reporter: lariven
>  Labels: easyfix, patch
> Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch
>
>
> the similar items output by ItemSimilarityJob for each target item may exceed 
> the number of similar items we set via the maxSimilarItemsPerItem parameter. The 
> following code in ItemSimilarityJob.java, around line 200, may be the cause:
> if (itemID < otherItemID) {
>   ctx.write(new EntityEntityWritable(itemID, otherItemID), new 
> DoubleWritable(similarItem.getSimilarity()));
> } else {
>   ctx.write(new EntityEntityWritable(otherItemID, itemID), new 
> DoubleWritable(similarItem.getSimilarity()));
> }
> I don't know why we need to switch itemID with otherItemID, but I think a 
> single line is enough:
>   ctx.write(new EntityEntityWritable(itemID, otherItemID), new 
> DoubleWritable(similarItem.getSimilarity()));



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct

2015-06-14 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1739:
---
Resolution: Not A Problem
Status: Resolved  (was: Patch Available)

> maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct
> 
>
> Key: MAHOUT-1739
> URL: https://issues.apache.org/jira/browse/MAHOUT-1739
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.10.0
>Reporter: lariven
>  Labels: easyfix, patch
> Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch
>
>
> the similar items output by ItemSimilarityJob for each target item may exceed 
> the number of similar items we set via the maxSimilarItemsPerItem parameter. The 
> following code in ItemSimilarityJob.java, around line 200, may be the cause:
> if (itemID < otherItemID) {
>   ctx.write(new EntityEntityWritable(itemID, otherItemID), new 
> DoubleWritable(similarItem.getSimilarity()));
> } else {
>   ctx.write(new EntityEntityWritable(otherItemID, itemID), new 
> DoubleWritable(similarItem.getSimilarity()));
> }
> I don't know why we need to switch itemID with otherItemID, but I think a 
> single line is enough:
>   ctx.write(new EntityEntityWritable(itemID, otherItemID), new 
> DoubleWritable(similarItem.getSimilarity()));



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct

2015-06-14 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584986#comment-14584986
 ] 

Sebastian Schelter commented on MAHOUT-1739:


We have code that takes this triangular matrix and uses it as an ItemSimilarity 
for our recommenders. In that case, users don't even have to care about the 
internal data representation.

> maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct
> 
>
> Key: MAHOUT-1739
> URL: https://issues.apache.org/jira/browse/MAHOUT-1739
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.10.0
>Reporter: lariven
>  Labels: easyfix, patch
> Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch
>
>
> the similar items output by ItemSimilarityJob for each target item may exceed 
> the number of similar items we set via the maxSimilarItemsPerItem parameter. The 
> following code in ItemSimilarityJob.java, around line 200, may be the cause:
> if (itemID < otherItemID) {
>   ctx.write(new EntityEntityWritable(itemID, otherItemID), new 
> DoubleWritable(similarItem.getSimilarity()));
> } else {
>   ctx.write(new EntityEntityWritable(otherItemID, itemID), new 
> DoubleWritable(similarItem.getSimilarity()));
> }
> I don't know why we need to switch itemID with otherItemID, but I think a 
> single line is enough:
>   ctx.write(new EntityEntityWritable(itemID, otherItemID), new 
> DoubleWritable(similarItem.getSimilarity()));



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct

2015-06-14 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584970#comment-14584970
 ] 

Sebastian Schelter commented on MAHOUT-1739:


Actually, this is exactly what we want. All the similarity measures used in 
Mahout are symmetric, so the upper triangular part of the similarity matrix 
already contains all information.

I think I also know where this "bug" comes from. It's actually not a bug, but 
the parameter maxSimilarItemsPerItem is not named very well.

Let's say maxSimilarItemsPerItem is 10. Now, for an item A, we compute the 10 
most similar items. There might be an item B for which A is in B's 10 most 
similar items, but B is not in the 10 most similar items of A. In order to 
guarantee that we have the 10 most similar items for B, we unfortunately must 
output 11 similar items for A.

Does that make sense?


> maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct
> 
>
> Key: MAHOUT-1739
> URL: https://issues.apache.org/jira/browse/MAHOUT-1739
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.10.0
>Reporter: lariven
>  Labels: easyfix, patch
> Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch
>
>
> the similar items output by ItemSimilarityJob for each target item may exceed 
> the number of similar items we set via the maxSimilarItemsPerItem parameter. The 
> following code in ItemSimilarityJob.java, around line 200, may be the cause:
> if (itemID < otherItemID) {
>   ctx.write(new EntityEntityWritable(itemID, otherItemID), new 
> DoubleWritable(similarItem.getSimilarity()));
> } else {
>   ctx.write(new EntityEntityWritable(otherItemID, itemID), new 
> DoubleWritable(similarItem.getSimilarity()));
> }
> I don't know why we need to switch itemID with otherItemID, but I think a 
> single line is enough:
>   ctx.write(new EntityEntityWritable(itemID, otherItemID), new 
> DoubleWritable(similarItem.getSimilarity()));



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct

2015-06-12 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584479#comment-14584479
 ] 

Sebastian Schelter commented on MAHOUT-1739:


Could you supply a unit test that shows a case where maxSimilarItemsPerItem 
is not correctly handled by the current code?

> maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct
> 
>
> Key: MAHOUT-1739
> URL: https://issues.apache.org/jira/browse/MAHOUT-1739
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.10.0
>Reporter: lariven
>  Labels: easyfix, patch
> Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch
>
>
> the similar items output by ItemSimilarityJob for each target item may exceed 
> the number of similar items we set via the maxSimilarItemsPerItem parameter. The 
> following code in ItemSimilarityJob.java, around line 200, may be the cause:
> if (itemID < otherItemID) {
>   ctx.write(new EntityEntityWritable(itemID, otherItemID), new 
> DoubleWritable(similarItem.getSimilarity()));
> } else {
>   ctx.write(new EntityEntityWritable(otherItemID, itemID), new 
> DoubleWritable(similarItem.getSimilarity()));
> }
> I don't know why we need to switch itemID with otherItemID, but I think a 
> single line is enough:
>   ctx.write(new EntityEntityWritable(itemID, otherItemID), new 
> DoubleWritable(similarItem.getSimilarity()));



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct

2015-06-12 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584474#comment-14584474
 ] 

Sebastian Schelter commented on MAHOUT-1739:


Could you supply a unit test that clearly shows that this is not working?

> maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct
> 
>
> Key: MAHOUT-1739
> URL: https://issues.apache.org/jira/browse/MAHOUT-1739
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.10.0
>Reporter: lariven
>  Labels: easyfix, patch
> Fix For: 0.10.0, 0.10.1
>
>
> the output may exceed the number of similar items we set via this 
> parameter. The following code in ItemSimilarityJob.java, around line 200, may 
> be the cause:
> if (itemID < otherItemID) {
>   ctx.write(new EntityEntityWritable(itemID, otherItemID), new 
> DoubleWritable(similarItem.getSimilarity()));
> } else {
>   ctx.write(new EntityEntityWritable(otherItemID, itemID), new 
> DoubleWritable(similarItem.getSimilarity()));
> }
> I don't know why we need to switch itemID with otherItemID, but I think a 
> single line is enough:
>   ctx.write(new EntityEntityWritable(itemID, otherItemID), new 
> DoubleWritable(similarItem.getSimilarity()));



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL

2015-06-08 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1570:
---
Comment: was deleted

(was: I don't think it makes sense to issue pull requests with unfinished code.)

> Adding support for Apache Flink as a backend for the Mahout DSL
> ---
>
> Key: MAHOUT-1570
> URL: https://issues.apache.org/jira/browse/MAHOUT-1570
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Till Rohrmann
>Assignee: Suneel Marthi
>  Labels: DSL, flink, scala
>
> With the finalized abstraction of the Mahout DSL plans from the backend 
> operations (MAHOUT-1529), it should be possible to integrate further backends 
> for the Mahout DSL. Apache Flink would be a suitable candidate to act as a 
> good execution backend. 
> With respect to the implementation, the biggest difference between Spark and 
> Flink at the moment is probably the incremental rollout of plans, which is 
> triggered by Spark's actions and which is not supported by Flink yet. 
> However, the Flink community is working on this issue. For the moment, it 
> should be possible to circumvent this problem by writing intermediate results 
> required by an action to HDFS and reading from there.
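
In Samsara DSL terms, that workaround could look roughly like this (a sketch; 
dfsWrite and drmDfsRead are the existing persistence primitives, the path is a 
placeholder, and an implicit DistributedContext is assumed to be in scope):

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // Materialize an intermediate result to (H)DFS and read it back,
    // standing in for an engine-side action.
    val drmA = drmParallelize(dense((1.0, 2.0), (3.0, 4.0)))
    val intermediate = (drmA.t %*% drmA).checkpoint()
    intermediate.dfsWrite("hdfs:///tmp/ata")
    val resumed = drmDfsRead("hdfs:///tmp/ata") // continue the pipeline from here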



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL

2015-06-08 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1570:
---
Comment: was deleted

(was: I don't think it makes sense to issue pull requests with unfinished code.)

> Adding support for Apache Flink as a backend for the Mahout DSL
> ---
>
> Key: MAHOUT-1570
> URL: https://issues.apache.org/jira/browse/MAHOUT-1570
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Till Rohrmann
>Assignee: Suneel Marthi
>  Labels: DSL, flink, scala
>
> With the finalized abstraction of the Mahout DSL plans from the backend 
> operations (MAHOUT-1529), it should be possible to integrate further backends 
> for the Mahout DSL. Apache Flink would be a suitable candidate to act as a 
> good execution backend. 
> With respect to the implementation, the biggest difference between Spark and 
> Flink at the moment is probably the incremental rollout of plans, which is 
> triggered by Spark's actions and which is not supported by Flink yet. 
> However, the Flink community is working on this issue. For the moment, it 
> should be possible to circumvent this problem by writing intermediate results 
> required by an action to HDFS and reading from there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL

2015-06-08 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578440#comment-14578440
 ] 

Sebastian Schelter commented on MAHOUT-1570:


I don't think it makes sense to issue pull requests with unfinished code.

> Adding support for Apache Flink as a backend for the Mahout DSL
> ---
>
> Key: MAHOUT-1570
> URL: https://issues.apache.org/jira/browse/MAHOUT-1570
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Till Rohrmann
>Assignee: Suneel Marthi
>  Labels: DSL, flink, scala
>
> With the finalized abstraction of the Mahout DSL plans from the backend 
> operations (MAHOUT-1529), it should be possible to integrate further backends 
> for the Mahout DSL. Apache Flink would be a suitable candidate to act as a 
> good execution backend. 
> With respect to the implementation, the biggest difference between Spark and 
> Flink at the moment is probably the incremental rollout of plans, which is 
> triggered by Spark's actions and which is not supported by Flink yet. 
> However, the Flink community is working on this issue. For the moment, it 
> should be possible to circumvent this problem by writing intermediate results 
> required by an action to HDFS and reading from there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL

2015-06-08 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578439#comment-14578439
 ] 

Sebastian Schelter commented on MAHOUT-1570:


I don't think it makes sense to issue pull requests with unfinished code.

> Adding support for Apache Flink as a backend for the Mahout DSL
> ---
>
> Key: MAHOUT-1570
> URL: https://issues.apache.org/jira/browse/MAHOUT-1570
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Till Rohrmann
>Assignee: Suneel Marthi
>  Labels: DSL, flink, scala
>
> With the finalized abstraction of the Mahout DSL plans from the backend 
> operations (MAHOUT-1529), it should be possible to integrate further backends 
> for the Mahout DSL. Apache Flink would be a suitable candidate to act as a 
> good execution backend. 
> With respect to the implementation, the biggest difference between Spark and 
> Flink at the moment is probably the incremental rollout of plans, which is 
> triggered by Spark's actions and which is not supported by Flink yet. 
> However, the Flink community is working on this issue. For the moment, it 
> should be possible to circumvent this problem by writing intermediate results 
> required by an action to HDFS and reading from there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL

2015-06-08 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578438#comment-14578438
 ] 

Sebastian Schelter commented on MAHOUT-1570:


I don't think it makes sense to issue pull requests with unfinished code.

> Adding support for Apache Flink as a backend for the Mahout DSL
> ---
>
> Key: MAHOUT-1570
> URL: https://issues.apache.org/jira/browse/MAHOUT-1570
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Till Rohrmann
>Assignee: Suneel Marthi
>  Labels: DSL, flink, scala
>
> With the finalized abstraction of the Mahout DSL plans from the backend 
> operations (MAHOUT-1529), it should be possible to integrate further backends 
> for the Mahout DSL. Apache Flink would be a suitable candidate to act as a 
> good execution backend. 
> With respect to the implementation, the biggest difference between Spark and 
> Flink at the moment is probably the incremental rollout of plans, which is 
> triggered by Spark's actions and which is not supported by Flink yet. 
> However, the Flink community is working on this issue. For the moment, it 
> should be possible to circumvent this problem by writing intermediate results 
> required by an action to HDFS and reading from there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Streaming and incremental cooccurrence

2015-05-09 Thread Sebastian
Co-occurrence matrices should be fairly easy to partition over many 
machines, so you would not be constrained by the memory available on a 
single machine.


On 06.05.2015 18:29, Pat Ferrel wrote:

100GB of RAM is practically common. Recently I’ve seen many indicators and item 
metadata stored with cooccurrence and indexed. This produces extremely flexible 
results since the query determines the result, not the model. But it does 
increase the number of cooccurrences linearly with # of indicator types.

As to DB, any suggestions? It would need to have a very high performance memory 
cached implementation. I wonder if the search engine itself would work. This 
would at least reduce the number of subsystems to deal with.

On Apr 24, 2015, at 4:13 PM, Ted Dunning  wrote:

Sounds about right.

My guess is that memory is now large enough, especially on a cluster, that
the cooccurrence matrix will fit into memory quite often.  Taking a large example
of 10 million items and 10,000 cooccurrences each, there will be 100
billion cooccurrences to store which shouldn't take more than about half a
TB of data if fully populated.  This isn't that outrageous any more.  With
SSD's as backing store, even 100GB of RAM or less might well produce very
nice results.  Depending on incoming transaction rates, using spinning disk
as a backing store might also work with small memory.

Experiments are in order.
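
For reference, a quick back-of-envelope check of those numbers (assuming
roughly 5 bytes per stored cooccurrence, which is of course a rough guess):

    // 10 million items x 10,000 cooccurrences each = 100 billion cells
    val cells = 10000000L * 10000L            // 1.0e11 cooccurrences
    val bytesPerCell = 5L                     // assumed amortized cost per cell
    val totalTB = cells * bytesPerCell / 1e12
    // ~0.5 TB, matching the estimate above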



On Fri, Apr 24, 2015 at 8:12 AM, Pat Ferrel  wrote:


Ok, seems right.

So now to data structures. The input frequency vectors need to be paired
with each input interaction type, and it would be nice to have them as
something that can be copied very quickly as they get updated. Random access
would also be nice, but iteration is not needed. Over time they will get
larger as all items get interactions, and users will get more actions and
appear in more vectors (with multi-interaction data). Seems like hashmaps?

The cooccurrence matrix is more of a question to me. It needs to be
updatable at the row and column level, and random access for both row and
column would be nice. It needs to be expandable. To keep it small the keys
should be integers, not full blown ID strings. There will have to be one
matrix per interaction type. It should be simple to update the Search
Engine to either mirror the matrix or use it directly for index updates.
Each indicator update should cause an index update.

Putting aside speed and size issues, this sounds like a NoSQL DB table that
is cached in-memory.

On Apr 23, 2015, at 3:04 PM, Ted Dunning  wrote:

On Thu, Apr 23, 2015 at 8:53 AM, Pat Ferrel  wrote:


This seems to violate the random choice of interactions to cut, but now
that I think about it, does a random choice really matter?



It hasn't ever mattered such that I could see.  There is also some reason
to claim that earliest is best if items are very focussed in time.  Of
course, the opposite argument also applies.  That leaves us with empiricism
where the results are not definitive.

So I don't think that it matters, but I can't say for certain that it doesn't.






[jira] [Commented] (MAHOUT-1570) Adding support for Apache Flink as a backend for the Mahout DSL

2015-04-30 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521581#comment-14521581
 ] 

Sebastian Schelter commented on MAHOUT-1570:


great to see this finally happening

> Adding support for Apache Flink as a backend for the Mahout DSL
> ---
>
> Key: MAHOUT-1570
> URL: https://issues.apache.org/jira/browse/MAHOUT-1570
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Till Rohrmann
>Assignee: Sebastian Schelter
>  Labels: DSL, flink, scala
>
> With the finalized abstraction of the Mahout DSL plans from the backend 
> operations (MAHOUT-1529), it should be possible to integrate further backends 
> for the Mahout DSL. Apache Flink would be a suitable candidate to act as a 
> good execution backend. 
> With respect to the implementation, the biggest difference between Spark and 
> Flink at the moment is probably the incremental rollout of plans, which is 
> triggered by Spark's actions and which is not supported by Flink yet. 
> However, the Flink community is working on this issue. For the moment, it 
> should be possible to circumvent this problem by writing intermediate results 
> required by an action to HDFS and reading from there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: co-occurrence paper and code

2014-08-06 Thread Sebastian Schelter
Sounds good to me.

-s
Am 06.08.2014 17:15 schrieb "Dmitriy Lyubimov" :

> What I mean here is that I probably need to refactor it a little so that there's
> a part of the algorithm that accepts co-occurrence input directly and which is
> somewhat decoupled from the part that accepts user x item input and does
> downsampling and co-occurrence construction. That way I could do some
> customization of my own to the co-occurrence construction. Would that be
> reasonable if I do that?
>
>
> On Wed, Aug 6, 2014 at 5:12 PM, Dmitriy Lyubimov 
> wrote:
>
> > Asking because I am considering pulling this implementation, but for some
> > (mostly political) reasons people want to try different things here.
> >
> > I may also have to start with a different way of constructing
> > co-occurrences, and may do a few optimizations there (e.g. the priority queue
> > queuing/enqueuing does twice the work it really needs to do, etc.)
> >
> >
> >
> >
> > On Wed, Aug 6, 2014 at 5:05 PM, Sebastian Schelter <
> > ssc.o...@googlemail.com> wrote:
> >
> >> I chose against porting all the similarity measures to the dsl version
> of
> >> the cooccurrence analysis for two reasons. First, adding the measures
> in a
> >> generalizable way makes the code superhard to read. Second, in
> practice, I
> >> have never seen something giving better results than llr. As Ted pointed
> >> out, a lot of the foundations of using similarity measures comes from
> >> wanting to predict ratings, which people never do in practice. I think
> we
> >> should restrict ourselves to approaches that work with implicit,
> >> count-like
> >> data.
> >>
> >> -s
> >> Am 06.08.2014 16:58 schrieb "Ted Dunning" :
> >>
> >> > On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov 
> >> > wrote:
> >> >
> >> > > On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov  >
> >> > > wrote:
> >> > >
> >> > > I suppose in that context LLR is considered a distance (higher scores
> >> > > mean more `distant` items, co-occurring by chance only)?
> >> > > >
> >> > >
> >> > > Self-correction on this one -- having given a quick look at the llr
> >> > > paper again, it looks like it is actually a similarity (higher scores
> >> > > meaning more stable co-occurrences, i.e. it moves in the opposite
> >> > > direction of the p-value if it had been a classic chi^2 test).
> >> > >
> >> >
> >> > LLR is a classic test.  It is essentially Pearson's chi^2 test without
> >> the
> >> > normal approximation.  See my papers[1][2] introducing the test into
> >> > computational linguistics (which ultimately brought it into all kinds
> of
> >> > fields including recommendations) and also references for the G^2
> >> test[3].
> >> >
> >> > [1] http://www.aclweb.org/anthology/J93-1003
> >> > [2] http://arxiv.org/abs/1207.1847
> >> > [3] http://en.wikipedia.org/wiki/G-test
> >> >
> >>
> >
> >
>


Re: co-occurrence paper and code

2014-08-06 Thread Sebastian Schelter
I chose against porting all the similarity measures to the dsl version of
the cooccurrence analysis for two reasons. First, adding the measures in a
generalizable way makes the code superhard to read. Second, in practice, I
have never seen something giving better results than llr. As Ted pointed
out, a lot of the foundations of using similarity measures comes from
wanting to predict ratings, which people never do in practice. I think we
should restrict ourselves to approaches that work with implicit, count-like
data.

-s
Am 06.08.2014 16:58 schrieb "Ted Dunning" :

> On Wed, Aug 6, 2014 at 5:49 PM, Dmitriy Lyubimov 
> wrote:
>
> > On Wed, Aug 6, 2014 at 4:21 PM, Dmitriy Lyubimov 
> > wrote:
> >
> > > I suppose in that context LLR is considered a distance (higher scores mean
> > > more `distant` items, co-occurring by chance only)?
> > >
> >
> > Self-correction on this one -- having given a quick look at llr paper
> > again, it looks like it is actually a similarity (higher scores meaning
> > more stable co-occurrences, i.e. it moves in the opposite direction of
> > the p-value if it had been a classic chi^2 test).
> >
>
> LLR is a classic test.  It is essentially Pearson's chi^2 test without the
> normal approximation.  See my papers[1][2] introducing the test into
> computational linguistics (which ultimately brought it into all kinds of
> fields including recommendations) and also references for the G^2 test[3].
>
> [1] http://www.aclweb.org/anthology/J93-1003
> [2] http://arxiv.org/abs/1207.1847
> [3] http://en.wikipedia.org/wiki/G-test
>
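
For reference, a sketch of the LLR (G^2) computation for a 2x2 cooccurrence
contingency table, following the unnormalized-entropy formulation used in
Mahout's LogLikelihood class (a re-derivation for illustration, not the class
itself):

    // k11: A and B together; k12: B without A; k21: A without B; k22: neither.
    def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)

    // "Unnormalized entropy" of a set of counts.
    def entropy(counts: Long*): Double =
      xLogX(counts.sum) - counts.map(xLogX).sum

    def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
      val rowEntropy    = entropy(k11 + k12, k21 + k22)
      val columnEntropy = entropy(k11 + k21, k12 + k22)
      val matrixEntropy = entropy(k11, k12, k21, k22)
      2.0 * (rowEntropy + columnEntropy - matrixEntropy)
    }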


Re: Mahout V2

2014-07-05 Thread Sebastian Schelter
Nice. There is even still a huge potential for optimization in the spark
bindings.

-s
Am 05.07.2014 15:21 schrieb "Andrew Musselman" :

> Crazy awesome.
>
> > On Jul 5, 2014, at 4:19 PM, Pat Ferrel  wrote:
> >
> > I compared spark-itemsimilarity to the Hadoop version on sample data
> > that is 8.7 M, 49290 x 139738, using my little 2-machine cluster and got the
> > following speedup.
> >
> > Platform       Elapsed Time
> > Mahout Hadoop  0:20:37
> > Mahout Spark   0:02:19
> >
> > This isn’t quite apples to apples because the Spark version does all the
> dictionary management, which is usually two extra jobs tacked on before and
> after the Hadoop job. I’ve done the complete pipeline using Hadoop and
> Spark now and can say that not only is it faster now but the old Hadoop way
> required keeping track of 10x more intermediate data and connecting up many
> more jobs to get the pipeline working. Now it’s just one job. You don’t
> need to worry about ID translation anymore and you get over 10x faster
> completion — this is one of those times when speed meets ease-of-use.
>


Re: cf/couccurence code

2014-06-19 Thread Sebastian Schelter

Hi Anand,

Yes, this should not contain anything spark-specific. +1 for moving it.

--sebastian



On 06/19/2014 08:38 PM, Anand Avati wrote:

Hi Pat and others,
I see that cf/CooccurrenceAnalysis.scala is currently under spark. Is there
a specific reason? I see that the code itself is completely spark agnostic.
I tried moving the code under
math-scala/src/main/scala/org/apache/mahout/math/cf/ with the following
trivial patch:

diff --git 
a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
index ee44f90..bd20956 100644
--- a/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
+++ b/spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
@@ -22,7 +22,6 @@ import scalabindings._
  import RLikeOps._
  import drm._
  import RLikeDrmOps._
-import org.apache.mahout.sparkbindings._
  import scala.collection.JavaConversions._
  import org.apache.mahout.math.stats.LogLikelihood


and it seems to work just fine. From what I see, this should work just fine
on H2O as well with no changes. Why give up generality and make it spark
specific?

Thanks





Re: H2O integration - intermediate progress update

2014-06-19 Thread Sebastian Schelter
I share the impression that the tone of conversation has not been very 
welcoming lately, be it intentional or not. I think that we should 
remind ourselves why we are working on open source and try to improve 
our ways of communication.


I think we should try to get as many people as possible together to sit at 
a table and have some face-to-face discussion over a beer or coffee.


--sebastian

On 06/19/2014 07:18 AM, Dmitriy Lyubimov wrote:

On Wed, Jun 18, 2014 at 10:03 PM, Ted Dunning  wrote:


On Wed, Jun 18, 2014 at 9:39 PM, Dmitriy Lyubimov 
wrote:


I did not mean to discourage
sincere search for answers.



The tone of answers has lately been very discouraging for those sincerely
searching for answers.  I think we as a community have a responsibility to
do better about this.  There is no need to be insulting to people asking
honest questions in a civil tone.



Ted, we've been at this already. There have been more arguments than
questions; I am just providing my counterarguments. Do you insist on the term
"insulting"? Because that is, you know, insulting. You are heading in an ad
hominem direction again.





Re: Engine specific algos

2014-06-18 Thread Sebastian Schelter
I think rejecting that contribution is the right thing to do. I think 
it's very important to narrow our focus. Let us put our efforts into 
finishing and polishing what we are working on right now.


A big problem of the "old" mahout was that we set the barrier for 
contributions too low and ended up with lots of non-integrated, 
hard-to-use algorithms of varying quality.


What is the problem with not accepting a contribution? We agreed with 
Andy that this might be better suited for inclusion in Spark's codebase 
and I think that was the right decision.


-s

On 06/18/2014 10:29 PM, Pat Ferrel wrote:

Taken from: Re: [jira] [Resolved] (MAHOUT-1153) Implement streaming random 
forests


Also, we don't have any mappings for Spark Streaming -- so if your
implementation heavily relies on Spark Streaming, I think Spark itself is
the right place for it to be a part of.


We are discouraging engine-specific work? Even dismissing Spark Streaming as a 
whole?


As it stands we don't have purely (c) methods, and indeed I believe these
methods may be totally engine-specific, in which case mllib is possibly one
of the good homes for them.


Adherence to a specific incarnation of an engine-neutral DSL has become a 
requirement for inclusion in Mahout? The current DSL cannot be extended? Or it 
can’t be extended in engine-specific ways? Or it can’t be extended with Spark 
Streaming? I would have thought all of these things desirable; otherwise we are 
limiting ourselves to a subset of what an engine can do, or a subset of problems 
that the current DSL supports.

I hope I’m misreading this, but it looks like we just discouraged a contributor 
from adding post-Hadoop code in an interesting area to Mahout?





[jira] [Resolved] (MAHOUT-1580) Optimize getNumNonZeroElements

2014-06-18 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1580.


Resolution: Fixed

> Optimize getNumNonZeroElements
> --
>
> Key: MAHOUT-1580
> URL: https://issues.apache.org/jira/browse/MAHOUT-1580
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Reporter: Sebastian Schelter
>    Assignee: Sebastian Schelter
> Fix For: 1.0
>
>
> getNumNonZeroElements in AbstractVector uses the nonZeroes iterator 
> internally, which adds a lot of overhead for certain types of vectors, e.g. 
> the dense ones. We should add custom implementations here.
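
The idea of such a specialization, sketched in Scala for illustration (the
actual Vector classes are Java; this is not the committed code):

    // For a dense vector, count non-zeros directly over the backing array
    // instead of allocating an Element iterator per entry.
    def numNonZerosDense(values: Array[Double]): Int = {
      var count = 0
      var i = 0
      while (i < values.length) {
        if (values(i) != 0.0) count += 1
        i += 1
      }
      count
    }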



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: H2O integration - intermediate progress update

2014-06-18 Thread Sebastian Schelter
Very cool to hear that!
Am 18.06.2014 02:38 schrieb "Ted Dunning" :

> Very cool, Anand.
>
> Very exciting as it makes the multi-engine story make much more sense.
>
>
> On Tue, Jun 17, 2014 at 5:22 PM, Anand Avati  wrote:
>
> > Still incomplete, everything does NOT work. But lots of progress, and the
> > end is in sight.
> >
> > - Development happening at
> > https://github.com/avati/mahout/commits/MAHOUT-1500. Note that I'm still
> > doing lots of commit --amend and git push --force as this is my private
> > tree.
> >
> > - Ground level build issues and classloader incompatibilities fixed.
> >
> > - Can load a matrix into H2O either from in-core (through drmParallelize())
> > or HDFS (parser does not support seqfile yet)
> >
> > - Only Long type support for Row Keys so far.
> >
> > - mapBlock() works. This was the trickiest, other ops seem trivial in
> > comparison.
> >
> > Everything else yet to be done. However I will be putting in more time into
> > this over the coming days (was working less than part time on this so far.)
> >
> > Questions/comments welcome.
> >
>


[jira] [Created] (MAHOUT-1580) Optimize getNumNonZeroElements

2014-06-13 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1580:
--

 Summary: Optimize getNumNonZeroElements
 Key: MAHOUT-1580
 URL: https://issues.apache.org/jira/browse/MAHOUT-1580
 Project: Mahout
  Issue Type: Improvement
  Components: Math
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
 Fix For: 1.0


getNumNonZeroElements in AbstractVector uses the nonZeroes iterator internally, 
which adds a lot of overhead for certain types of vectors, e.g. the dense ones. 
We should add custom implementations here.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1579) Implement a datamodel which can load data from hadoop filesystem directly

2014-06-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030429#comment-14030429
 ] 

Sebastian Schelter commented on MAHOUT-1579:


Xiaomeng, could you create a pull request to https://github.com/apache/mahout 
on github? That would make it easier to review your code. 

> Implement a datamodel which can load data from hadoop filesystem directly
> -
>
> Key: MAHOUT-1579
> URL: https://issues.apache.org/jira/browse/MAHOUT-1579
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Xiaomeng Huang
>Priority: Minor
> Attachments: Mahout-1579.patch
>
>
> As we all know, FileDataModel can only load data from local filesystem.
> But the big-data are usually stored in hadoop filesystem(e.g. hdfs).
> If we want to deal with the data in hdfs, we must run a mapred job. 
> It's necessary to implement a data model which can load data from the hadoop 
> filesystem directly.
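
A minimal sketch of the idea (hypothetical, not the attached patch): stage the
HDFS file locally and reuse FileDataModel, or stream it directly in a custom
DataModel. Paths are placeholders.

    import java.io.File
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel

    // Copy the interaction file out of HDFS, then feed it to FileDataModel.
    def dataModelFromHdfs(hdfsPath: String, localPath: String): FileDataModel = {
      val fs = FileSystem.get(new Configuration())
      fs.copyToLocalFile(new Path(hdfsPath), new Path(localPath))
      new FileDataModel(new File(localPath))
    }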



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Sebastian Schelter
Ok, but the current implementation still gives the correct number, as it 
checks for accidental zeros.


I think we should add some custom implementations here to not have to go 
through the non-zeroes iterator.


--sebastian

On 06/12/2014 07:00 PM, Ted Dunning wrote:

The reason is that sparse implementations may have recorded a non-zero that
later got assigned a zero, but they didn't bother to remove the memory cell.




On Thu, Jun 12, 2014 at 9:50 AM, Sebastian Schelter  wrote:


I'm a bit lost in this discussion. Why do we assume that
getNumNonZeroElements() on a Vector only returns an upper bound? The code
in AbstractVector clearly returns the non-zeros only:

 int count = 0;
 Iterator<Element> it = iterateNonZero();
 while (it.hasNext()) {
   if (it.next().get() != 0.0) {
     count++;
   }
 }
 return count;

On the other hand, the internal code seems broken here; why does
iterateNonZero potentially return 0's?

--sebastian






On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote:



  [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029345#comment-14029345 ]

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

  https://github.com/apache/mahout/pull/12#issuecomment-45915940

  fix header to say MAHOUT-1464, then hit close and reopen, it will
restart the echo.


  Cooccurrence Analysis on Spark

--

  Key: MAHOUT-1464
  URL: https://issues.apache.org/jira/browse/MAHOUT-1464
  Project: Mahout
   Issue Type: Improvement
   Components: Collaborative Filtering
  Environment: hadoop, spark
 Reporter: Pat Ferrel
 Assignee: Pat Ferrel
  Fix For: 1.0

  Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
run-spark-xrsj.sh


Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
a DRM can be used as input.
Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
has several applications including cross-action recommendations.





--
This message was sent by Atlassian JIRA
(v6.2#6252)










Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Sebastian Schelter
I'm a bit lost in this discussion. Why do we assume that 
getNumNonZeroElements() on a Vector only returns an upper bound? The 
code in AbstractVector clearly returns the non-zeros only:


int count = 0;
Iterator<Element> it = iterateNonZero();
while (it.hasNext()) {
  if (it.next().get() != 0.0) {
    count++;
  }
}
return count;

On the other hand, the internal code seems broken here; why does 
iterateNonZero potentially return 0's?


--sebastian





On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote:


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029345#comment-14029345
 ]

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

 https://github.com/apache/mahout/pull/12#issuecomment-45915940

 fix header to say MAHOUT-1464, then hit close and reopen, it will restart 
the echo.



Cooccurrence Analysis on Spark
--

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
 Environment: hadoop, spark
Reporter: Pat Ferrel
Assignee: Pat Ferrel
 Fix For: 1.0

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh


Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs 
on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM can be 
used as input.
Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
several applications including cross-action recommendations.




--
This message was sent by Atlassian JIRA
(v6.2#6252)





[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-11 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028360#comment-14028360
 ] 

Sebastian Schelter commented on MAHOUT-1464:


Hi,

The computation of A'A is usually done without explicitly forming A'. 
Instead A'A is computed as the sum of outer products of rows of A.

--sebastian
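
That is, with rows a_i of A, A'A = sum_i a_i a_i'. A sketch with in-core
matrices (illustration only):

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    val a = dense((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))

    // Accumulate the outer products of the rows of A; A itself is never
    // transposed explicitly.
    val ata = a(0, ::) cross a(0, ::)
    for (i <- 1 until a.nrow) ata += (a(i, ::) cross a(i, ::))
    // ata now equals a.t %*% a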




> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1578) Optimizations in matrix serialization

2014-06-11 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1578:
---

Description: 
MatrixWritable contains inefficient code in a few places:
 
 * type and size are stored with every vector, although they are the same for 
every vector
 * in some places, vectors are added to the matrix via assign() where we could 
directly use the instance
 
 Issue Type: Improvement  (was: Bug)

> Optimizations in matrix serialization
> -
>
> Key: MAHOUT-1578
> URL: https://issues.apache.org/jira/browse/MAHOUT-1578
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
>
> MatrixWritable contains inefficient code in a few places:
>  
>  * type and size are stored with every vector, although they are the same for 
> every vector
>  * in some places, vectors are added to the matrix via assign() where we 
> could directly use the instance
>  
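
The intended layout, sketched for illustration (hypothetical header format, not
MatrixWritable's actual wire protocol):

    import java.io.DataOutput

    // Write the shared type flag and row length once as a header, then only
    // the raw values per row -- instead of repeating type and size with
    // every vector.
    def writeDenseMatrix(out: DataOutput, rows: Array[Array[Double]]): Unit = {
      out.writeByte(0)                                          // "all rows dense"
      out.writeInt(rows.length)                                 // number of rows
      out.writeInt(rows.headOption.map(_.length).getOrElse(0))  // shared row size
      for (row <- rows; v <- row) out.writeDouble(v)
    }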



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAHOUT-1578) Optimizations in matrix serialization

2014-06-11 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1578:
--

 Summary: Optimizations in matrix serialization
 Key: MAHOUT-1578
 URL: https://issues.apache.org/jira/browse/MAHOUT-1578
 Project: Mahout
  Issue Type: Bug
  Components: Math
Reporter: Sebastian Schelter
 Fix For: 1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Sebastian Schelter

Hi Pat,

We truncate the indicators to the top-k and you don't want the 
self-comparison in there. So I don't see a reason to not exclude it as 
early as possible.


--sebastian

On 06/10/2014 05:28 PM, Pat Ferrel wrote:

Still getting the wrong values with non-boolean input, so I’ll continue to look 
into it.

Another question is: computeIndicators seems to exclude self-comparison during 
A’A and, of course, not for B’A. Since this returns the indicator matrix for 
the general case, shouldn’t it include those values? Seems like they should be 
filtered out in the output phase if anywhere and that by option. If we were 
actually returning a multiply we’d include those.

 // exclude co-occurrences of the item with itself
 if (crossCooccurrence || thingB != thingA) {

On Jun 10, 2014, at 1:49 AM, Sebastian Schelter  wrote:

Oh good catch! I had an extra binarize method before, so that the data was 
already binary. I merged that into the downsample code and must have overlooked 
that thing. You are right, numNonZeros is the way to go!


On 06/10/2014 01:11 AM, Ted Dunning wrote:

Sounds like a very plausible root cause.





On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA)  wrote:



 [
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
]

Pat Ferrel commented on MAHOUT-1464:


seems like the downsampleAndBinarize method is returning the wrong values.
It is actually summing the values where it should be counting the non-zero
elements?

 // Downsample the interaction vector of each user
 for (userIndex <- 0 until keys.size) {

   val interactionsOfUser = block(userIndex, ::) // this is a Vector
   // if the values are non-boolean the sum will not be the number
of interactions it will be a sum of strength-of-interaction, right?
   // val numInteractionsOfUser = interactionsOfUser.sum // doesn't
this sum strength of interactions?
   val numInteractionsOfUser =
interactionsOfUser.getNumNonZeroElements()  // should do this I think

   val perUserSampleRate = math.min(maxNumInteractions,
numInteractionsOfUser) / numInteractionsOfUser

   interactionsOfUser.nonZeroes().foreach { elem =>
 val numInteractionsWithThing = numInteractions(elem.index)
 val perThingSampleRate = math.min(maxNumInteractions,
numInteractionsWithThing) / numInteractionsWithThing

 if (random.nextDouble() <= math.min(perUserSampleRate,
perThingSampleRate)) {
   // We ignore the original interaction value and create a
binary 0-1 matrix
   // as we only consider whether interactions happened or did
not happen
   downsampledBlock(userIndex, elem.index) = 1
 }
   }



Cooccurrence Analysis on Spark
--

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
 Environment: hadoop, spark
Reporter: Pat Ferrel
Assignee: Pat Ferrel
 Fix For: 1.0

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,

MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
run-spark-xrsj.sh



Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)

that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
a DRM can be used as input.

Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence

has several applications including cross-action recommendations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)










Re: TreeBasedRecommenders(Deprecated?)

2014-06-10 Thread Sebastian Schelter

Hi Sahil,

don't worry, you're not breaking any rules. We removed the tree-based 
recommenders because we have never heard of anyone using them over the 
years.


--sebastian

On 06/10/2014 09:01 AM, Sahil Sharma wrote:

Hi,

Firstly, I apologize if I'm breaking any rules by mailing this way; I'm
new to this and would appreciate any help I can get.

I was just playing around with the tree-based Recommender (which seems to
be deprecated in the current version "for the lack of use").

Why was it deprecated?

Also, I just looked at the code, and it seems to be doing a lot of
redundant computation. For example, we could store a matrix of
cluster-cluster distances (and hence avoid recomputing the closest
clusters every time, by updating the matrix whenever we merge two clusters),
and also, when trying to determine the farthest-distance-based similarity
between two clusters, the pair which realizes it could again be stored
and updated upon merging, so that this computation need not be repeated
again and again.
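
A sketch of that caching idea (hypothetical, not the deprecated implementation):

    import scala.collection.mutable

    // Pairwise cluster distances, keyed with the smaller id first.
    val dist = mutable.Map[(Int, Int), Double]()

    def key(a: Int, b: Int): (Int, Int) = (math.min(a, b), math.max(a, b))

    // The closest pair is found over cached entries instead of recomputing
    // every distance from scratch.
    def closestPair(clusters: Seq[Int]): (Int, Int) =
      (for (a <- clusters; b <- clusters if a < b) yield (a, b)).minBy(dist)

    // After merging clusters a and b into `merged`, only entries involving
    // the merged cluster need refreshing; all other entries stay valid.
    def updateAfterMerge(merged: Int, others: Seq[Int],
                         d: (Int, Int) => Double): Unit =
      for (o <- others) dist(key(merged, o)) = d(merged, o)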

Just wondering if this repeated computation was not a reason for
deprecating the class (since people might have found a slow recommender
"lacking use").

Would be glad to hear the thoughts of others on this, and also implement an
efficient version if the community agrees.





Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Sebastian Schelter
Oh good catch! I had an extra binarize method before, so that the data 
was already binary. I merged that into the downsample code and must have 
overlooked that thing. You are right, numNonZeros is the way to go!



On 06/10/2014 01:11 AM, Ted Dunning wrote:

Sounds like a very plausible root cause.





On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA)  wrote:



 [
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
]

Pat Ferrel commented on MAHOUT-1464:


seems like the downsampleAndBinarize method is returning the wrong values.
It is actually summing the values where it should be counting the non-zero
elements?

 // Downsample the interaction vector of each user
 for (userIndex <- 0 until keys.size) {

   val interactionsOfUser = block(userIndex, ::) // this is a Vector
   // if the values are non-boolean the sum will not be the number of interactions it will be a sum of strength-of-interaction, right?
   // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
   val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think

   val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser

   interactionsOfUser.nonZeroes().foreach { elem =>
     val numInteractionsWithThing = numInteractions(elem.index)
     val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing

     if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
       // We ignore the original interaction value and create a binary 0-1 matrix
       // as we only consider whether interactions happened or did not happen
       downsampledBlock(userIndex, elem.index) = 1
     }
   }



Cooccurrence Analysis on Spark
--

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
 Environment: hadoop, spark
Reporter: Pat Ferrel
Assignee: Pat Ferrel
 Fix For: 1.0

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,

MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
run-spark-xrsj.sh



Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)

that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
a DRM can be used as input.

Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence

has several applications including cross-action recommendations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)







Contributions coming

2014-06-04 Thread Sebastian Schelter

Hi,

as you know we still have a lot of open documentation tickets.

Therefore, I decided to offer these tickets as projects to students in a 
university lecture that I'm giving with some colleagues:


MAHOUT-1495
MAHOUT-1470
MAHOUT-1462
MAHOUT-1477
MAHOUT-1485
MAHOUT-1423
MAHOUT-1427
MAHOUT-1536
MAHOUT-1551
MAHOUT-1493

In the next weeks, the students will join the mailinglist and start 
working on the documentation and examples. Let's give them a warm 
welcome and help them learn how to produce open source software.


Best,
Sebastian



SparkBindings on a real cluster

2014-06-04 Thread Sebastian Schelter

Hi,

I did some experimentation with the spark bindings on a real cluster 
yesterday, as I had to run some experiments for a paper (unrelated to 
Mahout) that I'm currently writing. The experiment basically consists of 
multiplying a sparse data matrix by a super-sparse permutation-like 
matrix from the left. It took me the whole day to get it working, up to 
matrices with 500M entries.


I ran into lots of issues that we have to fix asap, unfortunately I 
don't have much time in the next weeks, so I'm just sharing a list of 
the issues that I ran into (maybe I'll find some time to create issues 
for these things on the weekend).


I think the major challenge for us will be to get choice of dense/sparse 
correct and put lots of work into memory efficiency. This could be a 
great hook for collaborating with the h2o folks, as they know how to 
make vector-like data small and computations fast.


Here's the list:

* our matrix serialization in MatrixWritable is seriously flawed, I ran 
into the following errors


  - the type information is stored with every vector although a matrix 
always only contains vectors of the same type
  - all entries of a TransposeView (and possibly other views) of a 
sparse matrix are serialized, resulting in OOM
  - for sparse row matrices, the vectors are set using assign instead 
of via constructor injection; this results in huge memory consumption 
and long creation times, as some implementations use binary search 
for assignment


* a dense matrix is converted into a SparseRowMatrix with dense row 
vectors by blockify(); after serialization this becomes a dense matrix 
in sparse format (triggering OOMs)!


* drmFromHDFS does not have an option to set the number of desired 
partitions


* SparseRowMatrix with sequential vectors times SparseRowMatrix with 
sequential vectors is totally broken: it uses three nested loops and 
calls get(row, col) on the matrices, which internally uses binary search...


* the At operator adds up the column vectors it creates; this is 
unnecessary, as we don't need the addition, we can just merge the vectors


* we need a dedicated operator for inCoreA %*% drmB, currently this gets 
rewritten to (drmB.t %*% inCoreA.t).t which is highly inefficient (I 
have a prototype of that operator); a rough sketch of the idea follows below
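
For illustration, an engine-free sketch (plain Scala, hypothetical names, not
the actual prototype) of why a dedicated left-multiply works: each horizontal
block of drmB contributes inCoreA restricted to that block's rows times the
block, and the partial products simply sum up, with no transposes involved.

  object LeftMultiplySketch {
    // row-major dense matrices, purely for illustration
    type Matrix = Array[Array[Double]]

    // contribution of one horizontal block of B (holding B's rows in `range`):
    // A restricted to the matching columns, times the block
    def partialProduct(a: Matrix, block: Matrix, range: Range): Matrix = {
      val m = a.length
      val n = block(0).length
      val c = Array.fill(m, n)(0.0)
      for (i <- 0 until m; (k, kLocal) <- range.zipWithIndex; j <- 0 until n)
        c(i)(j) += a(i)(k) * block(kLocal)(j)
      c
    }

    def add(x: Matrix, y: Matrix): Matrix =
      x.zip(y).map { case (rx, ry) => rx.zip(ry).map { case (u, v) => u + v } }

    def main(args: Array[String]): Unit = {
      val a = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)) // 2 x 3, in-core
      // a 3 x 2 "DRM" split into two horizontal row blocks
      val bBlocks = Seq(
        (0 to 1, Array(Array(1.0, 0.0), Array(0.0, 1.0))),
        (2 to 2, Array(Array(1.0, 1.0))))
      // in the distributed setting this sum is a single reduce over the blocks
      val product = bBlocks.map { case (r, blk) => partialProduct(a, blk, r) }.reduce(add)
      product.foreach(row => println(row.mkString(" "))) // 4.0 5.0 / 10.0 11.0
    }
  }

In the distributed case the only shuffle is the final reduce, versus the two
transpose shuffles the current rewrite incurs.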


Best,
Sebastian




Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-02 Thread Sebastian Schelter
The important thing here is that we test the code on a sufficiently large
dataset on a real cluster. Take that on, if you want!
Am 02.06.2014 20:08 schrieb "Pat Ferrel (JIRA)" :

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015667#comment-14015667
> ]
>
> Pat Ferrel commented on MAHOUT-1464:
> 
>
> [~ssc] Should I reassign to me for now so we can get this committed?
>
> > Cooccurrence Analysis on Spark
> > --
> >
> > Key: MAHOUT-1464
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> > Project: Mahout
> >  Issue Type: Improvement
> >  Components: Collaborative Filtering
> > Environment: hadoop, spark
> >Reporter: Pat Ferrel
> >Assignee: Sebastian Schelter
> > Fix For: 1.0
> >
> > Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
> run-spark-xrsj.sh
> >
> >
> > Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
> that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
> a DRM can be used as input.
> > Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
> has several applications including cross-action recommendations.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>


Problems with mapBlock()

2014-05-31 Thread Sebastian Schelter
I've updated the codebase to work on the cooccurrence analysis algo, but 
I always run into this error now:


error: value mapBlock is not a member of 
org.apache.mahout.math.drm.DrmLike[Int]


I have the feeling that an implicit conversion might be missing, but I 
couldn't figure out where to put it without producing even more errors.


--sebastian
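
For the archive, a guess at the missing piece (an assumption, names per the
math-scala layout): mapBlock is added to DrmLike through the R-like DRM ops
implicits, so something like the following should compile once they are
imported.

  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._
  import org.apache.mahout.math.scalabindings.RLikeOps._

  // with the RLikeDrmOps implicits in scope, mapBlock resolves on DrmLike[Int]
  def plusOne(drmA: DrmLike[Int]): DrmLike[Int] =
    drmA.mapBlock() { case (keys, block) =>
      keys -> (block + 1.0) // add 1.0 to every entry, block-wise
    }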


[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.

2014-05-31 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014918#comment-14014918
 ] 

Sebastian Schelter commented on MAHOUT-1566:


If it's a mere showcase, could we maybe add it as an example in an examples 
package, rather than as a full-fledged algorithm implementation?

> Regular ALS factorizer with convergence test.
> -
>
> Key: MAHOUT-1566
> URL: https://issues.apache.org/jira/browse/MAHOUT-1566
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
>Priority: Trivial
> Fix For: 1.0
>
>
> ALS-related: let's start with an unweighted, unregularized implementation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: mlib versus spark

2014-05-31 Thread Sebastian Schelter

Hi Saikat,

The differences are that MLlib offers a different set of algorithms 
(e.g. you won't find cooccurrence analysis or stochastic SVD there) and that 
their codebase consists of hand-tuned, Spark-specific implementations.


Mahout, on the other hand, allows implementing algorithms in an 
engine-agnostic, declarative way. This allows for the automatic 
optimization of our algorithms, as well as for running the same code on 
multiple backends (there has been interest from h2o as well as Apache 
Flink in integrating with our DSL).


--sebastian
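
For a flavor of the DSL, a minimal sketch (imports assume the math-scala
module is on the classpath):

  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._

  // engine-agnostic: no Spark types appear here, only the DRM abstraction
  def gramian(drmA: DrmLike[Int]) = (drmA.t %*% drmA).collect

Only the creation of the distributed context is engine-specific; the
algorithm code itself stays the same across backends.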

On 06/01/2014 01:41 AM, Saikat Kanjilal wrote:

Actually the subject of my email should say spark->mlib versus mahout->spark :)


From: sxk1...@hotmail.com
To: dev@mahout.apache.org
Subject: mlib versus spark
Date: Sat, 31 May 2014 16:38:13 -0700

Ok, I'll admit I'm not seeing what the obvious differences are. I'm a bit 
confused when I think of mahout using spark: since spark already ships an 
embedded machine learning library (mlib), what would be the impetus to use 
mahout instead? Seems like you should be able to write or add algorithms to 
mlib and use spark. Has someone from mahout looked at mlib to see if there will 
be a strong use case for using one versus the other?
http://spark.apache.org/mllib/







[jira] [Commented] (MAHOUT-1566) Regular ALS factorizer with convergence test.

2014-05-31 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014573#comment-14014573
 ] 

Sebastian Schelter commented on MAHOUT-1566:


I'm not sure whether we should really include the "standard" ALS in the new 
codebase. It is optimized for rating prediction on Netflix-like data which 
rarely exists outside of academia. I think we should rather focus on the ALS 
version targeted for implicit data (clicks, views, etc).
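
For reference, the implicit-feedback variant referred to here (Hu, Koren &
Volinsky, "Collaborative Filtering for Implicit Feedback Datasets", ICDM 2008)
minimizes

  \min_{X,Y} \sum_{u,i} c_{ui} \, (p_{ui} - x_u^\top y_i)^2
    + \lambda \Big( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \Big),
  \qquad p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & \text{otherwise} \end{cases},
  \qquad c_{ui} = 1 + \alpha \, r_{ui},

where r_{ui} is the raw interaction strength and \alpha controls how fast
confidence grows with it.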

> Regular ALS factorizer with convergence test.
> -
>
> Key: MAHOUT-1566
> URL: https://issues.apache.org/jira/browse/MAHOUT-1566
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
>Priority: Trivial
> Fix For: 1.0
>
>
> ALS-related: let's start with an unweighted, unregularized implementation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-05-31 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1565:
---

Fix Version/s: 1.0

> add MR2 options to MAHOUT_OPTS in bin/mahout
> 
>
> Key: MAHOUT-1565
> URL: https://issues.apache.org/jira/browse/MAHOUT-1565
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0, 0.9
>Reporter: Nishkam Ravi
> Fix For: 1.0
>
> Attachments: MAHOUT-1565.patch
>
>
> MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
> those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1564) Naive Bayes Classifier for New Text Documents

2014-05-31 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014572#comment-14014572
 ] 

Sebastian Schelter commented on MAHOUT-1564:


I don't see any reason to veto this, as it will make the stuff that we 
already have more useful.

> Naive Bayes Classifier for New Text Documents
> -
>
> Key: MAHOUT-1564
> URL: https://issues.apache.org/jira/browse/MAHOUT-1564
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
> Fix For: 1.0
>
>
> MapReduce Naive Bayes implementation currently lacks the ability to classify 
> a new document (outside of the training/holdout corpus).  I've begun some 
> work on a "ClassifyNew" job which will do the following:
> 1. Vectorize a new text document using the dictionary and document 
> frequencies from the training/holdout corpus 
> - assume the original corpus was vectorized using `seq2sparse`; step (1) 
> will use all of the same parameters. 
> 2. Score and label a new document using a previously trained model.
> I think that it will be a useful addition to the NB package.  Unfortunately, 
> this is going to be mostly MR workhorse code and doesn't really introduce 
> much new logic. I will try to keep any new logic separate from MR code so 
> that it can be called from scala for MAHOUT-1493.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1543) JSON output format for classifying with random forests

2014-05-31 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014570#comment-14014570
 ] 

Sebastian Schelter commented on MAHOUT-1543:


Could you create a pull request to the current mahout codebase?

> JSON output format for classifying with random forests
> --
>
> Key: MAHOUT-1543
> URL: https://issues.apache.org/jira/browse/MAHOUT-1543
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: larryhu
>  Labels: patch
> Fix For: 0.7
>
> Attachments: MAHOUT-1543.patch
>
>
> This patch adds JSON output format to build random forests, 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1552) Avoid new Configuration() instantiation

2014-05-31 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014571#comment-14014571
 ] 

Sebastian Schelter commented on MAHOUT-1552:


Could you suggest a way to fix the bug?

> Avoid new Configuration() instantiation
> ---
>
> Key: MAHOUT-1552
> URL: https://issues.apache.org/jira/browse/MAHOUT-1552
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7
> Environment: CDH 4.4, CDH 4.6
>Reporter: Sergey
> Fix For: 1.0
>
>
> Hi, it's related to MAHOUT-1498
> You get troubles when run mahout stuff from oozie java action.
> {code}
> java.lang.InterruptedException: Cluster Classification Driver Job failed 
> processing /tmp/sku/tfidf/90453
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)  
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1552) Avoid new Configuration() instantiation

2014-05-31 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1552:
---

Fix Version/s: 1.0

> Avoid new Configuration() instantiation
> ---
>
> Key: MAHOUT-1552
> URL: https://issues.apache.org/jira/browse/MAHOUT-1552
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7
> Environment: CDH 4.4, CDH 4.6
>Reporter: Sergey
> Fix For: 1.0
>
>
> Hi, it's related to MAHOUT-1498
> You get troubles when run mahout stuff from oozie java action.
> {code}
> java.lang.InterruptedException: Cluster Classification Driver Job failed 
> processing /tmp/sku/tfidf/90453
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)  
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1551) Add document to describe how to use mlp with command line

2014-05-31 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1551:
---

Fix Version/s: 1.0

> Add document to describe how to use mlp with command line
> -
>
> Key: MAHOUT-1551
> URL: https://issues.apache.org/jira/browse/MAHOUT-1551
> Project: Mahout
>  Issue Type: Documentation
>  Components: Classification, CLI, Documentation
>Affects Versions: 0.9
>Reporter: Yexi Jiang
>  Labels: documentation
> Fix For: 1.0
>
>
> Add documentation about the usage of multi-layer perceptron in command line.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1524) Script to auto-generate and view the Mahout website on a local machine

2014-05-31 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1524:
---

Fix Version/s: 1.0

> Script to auto-generate and view the Mahout website on a local machine 
> ---
>
> Key: MAHOUT-1524
> URL: https://issues.apache.org/jira/browse/MAHOUT-1524
> Project: Mahout
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Saleem Ansari
> Fix For: 1.0
>
> Attachments: mahout-website.sh
>
>
> Attached with this ticket is a script that creates a simple setup for editing 
> Mahout Website on a local machine.
> It is useful in the sense that, we can edit the source and the changes are 
> automatically reflected in the generated site. All we need to do is refresh 
> the browser. No further steps required.
> So now one can review the website changes ( the complete website ), on a 
> developer's machine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-05-29 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012136#comment-14012136
 ] 

Sebastian Schelter commented on MAHOUT-1565:


I'd favor removing that

> add MR2 options to MAHOUT_OPTS in bin/mahout
> 
>
> Key: MAHOUT-1565
> URL: https://issues.apache.org/jira/browse/MAHOUT-1565
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0, 0.9
>Reporter: Nishkam Ravi
> Attachments: MAHOUT-1565.patch
>
>
> MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
> those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-27 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010124#comment-14010124
 ] 

Sebastian Schelter commented on MAHOUT-1529:


Hi Dmitriy,

the PR looks good, +1 from me, go ahead!

Best,
Sebastian




> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1446) Create an intro for matrix factorization

2014-05-25 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1446.


Resolution: Fixed
  Assignee: Sebastian Schelter

Jian, thank you very much, you did a great job. I put the page online, could 
you have a look at it?

Thx,
Sebastian

> Create an intro for matrix factorization
> 
>
> Key: MAHOUT-1446
> URL: https://issues.apache.org/jira/browse/MAHOUT-1446
> Project: Mahout
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Maciej Mazur
>    Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: matrix-factorization.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1480) Clean up website on 20 newsgroups

2014-05-25 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1480:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

committed, thank you very much

> Clean up website on 20 newsgroups
> -
>
> Key: MAHOUT-1480
> URL: https://issues.apache.org/jira/browse/MAHOUT-1480
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1480_edit1.patch, MAHOUT-1480_edit2.patch
>
>
> The website on the twenty newsgroups example needs clean up. We need to go 
> through the text, remove dead links and check whether the information is 
> still consistent with the current code.
> https://mahout.apache.org/users/clustering/twenty-newsgroups.html



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1536) Update "Creating vectors from text" page

2014-05-25 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008319#comment-14008319
 ] 

Sebastian Schelter commented on MAHOUT-1536:


Added the changes. Can someone have a look at the lucene part of the site? We 
should post the currently used lucene version there and not require users to 
look into the POM for example.

> Update "Creating vectors from text" page
> 
>
> Key: MAHOUT-1536
> URL: https://issues.apache.org/jira/browse/MAHOUT-1536
> Project: Mahout
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1536_edit1.patch, MAHOUT-1536_edit2.patch
>
>
> At least the seq2sparse section of the "Creating vectors from text" page is 
> out of date.  
> https://mahout.apache.org/users/basics/creating-vectors-from-text.html
>   



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1561) cluster-syntheticcontrol.sh not running locally with MAHOUT_LOCAL=true

2014-05-24 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1561:
---

Resolution: Fixed
  Assignee: Sebastian Schelter
Status: Resolved  (was: Patch Available)

committed, thank you very much

> cluster-syntheticcontrol.sh not running locally with MAHOUT_LOCAL=true
> --
>
> Key: MAHOUT-1561
> URL: https://issues.apache.org/jira/browse/MAHOUT-1561
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering, Examples
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
>Assignee: Sebastian Schelter
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1561.patch
>
>
> cluster-syntheticcontrol.sh is not running locally with MAHOUT_LOCAL set.  
> Patch adds a check for MAHOUT_LOCAL.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1558) Clean up classify-wiki.sh and add in a binary classification problem

2014-05-24 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1558:
---

Resolution: Fixed
  Assignee: Sebastian Schelter
Status: Resolved  (was: Patch Available)

committed, thank you for your great work

> Clean up classify-wiki.sh and add in a binary classification problem  
> --
>
> Key: MAHOUT-1558
> URL: https://issues.apache.org/jira/browse/MAHOUT-1558
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification, Examples
>Affects Versions: 1.0
>Reporter: Andrew Palumbo
>Assignee: Sebastian Schelter
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1558.patch
>
>
> Some minor cleanups to classify-wiki.sh.   Added in a 2 class problem: United 
> States and United Kingdom.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1560) Last batch is not filled correctly in MultithreadedBatchItemSimilarities

2014-05-24 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1560.


   Resolution: Fixed
Fix Version/s: 1.0
 Assignee: Sebastian Schelter

committed, thank you for the contribution

> Last batch is not filled correctly in MultithreadedBatchItemSimilarities
> 
>
> Key: MAHOUT-1560
> URL: https://issues.apache.org/jira/browse/MAHOUT-1560
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Jarosław Bojar
>Assignee: Sebastian Schelter
>Priority: Minor
> Fix For: 1.0
>
> Attachments: Corrected_last_batch_size_calculation.patch, 
> MultithreadedBatchItemSimilaritiesTest.patch
>
>
> In {{MultithreadedBatchItemSimilarities}}, the method {{queueItemIDsInBatches}} 
> handles the last batch incorrectly: its length is miscalculated. As a result, 
> the last batch is either truncated or too long, with the superfluous positions 
> filled with item indexes from the previous batch (or zeros if it is also the 
> first batch, as in the attached test).
> The attached test fails for a very short model (with only 4 items) with a 
> NoSuchItemException.
> The attached patch corrects this issue.
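
A minimal sketch (plain Scala, hypothetical names) of the sizing rule the fix
needs: with 4 items and a batch size of 100 there is a single batch of length
4, not 100.

  val numItems = 4
  val batchSize = 100
  val numBatches = math.ceil(numItems / batchSize.toDouble).toInt
  val lastBatchSize = numItems - (numBatches - 1) * batchSize
  println((numBatches, lastBatchSize)) // (1,4)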



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1553) Fix for run Mahout stuff as oozie java action

2014-05-23 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1553:
---

Resolution: Not a Problem
Status: Resolved  (was: Patch Available)

closing this, as suneel said its already fixed

> Fix for run Mahout stuff as oozie java action
> -
>
> Key: MAHOUT-1553
> URL: https://issues.apache.org/jira/browse/MAHOUT-1553
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
> Attachments: MAHOUT-1553.patch
>
>
> Related to MAHOUT-1498, the problem is the same. mapred.job.classpath.files 
> property is not correctly pushed down to Mahout MR stuff because of new 
> Configuration usage
> at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
>   at 
> org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:372)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:158)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at 
> org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1554) Provide more comprehensive classification statistics

2014-05-23 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1554:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

committed with a few cosmetic changes, thank you for the contribution

> Provide more comprehensive classification statistics
> 
>
> Key: MAHOUT-1554
> URL: https://issues.apache.org/jira/browse/MAHOUT-1554
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Reporter: Karol Grzegorczyk
>Priority: Minor
> Fix For: 1.0
>
> Attachments: statistics.diff
>
>
> Currently only limited classification statistics are provided. To better 
> understand classification results, it would be worth it to provide at least 
> average precision, recall and F1 score.
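
For reference, the standard definitions (TP, FP, FN = true positives, false
positives, false negatives):

  \text{precision} = \frac{TP}{TP + FP}, \qquad
  \text{recall} = \frac{TP}{TP + FN}, \qquad
  F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}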



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1555) Exception thrown when a test example has the label not present in training examples

2014-05-23 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007243#comment-14007243
 ] 

Sebastian Schelter commented on MAHOUT-1555:


Hi Karol,

Could you update the patch to at least log a warning in such a case?



> Exception thrown when a test example has the label not present in training 
> examples
> ---
>
> Key: MAHOUT-1555
> URL: https://issues.apache.org/jira/browse/MAHOUT-1555
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 1.0
>Reporter: Karol Grzegorczyk
>Priority: Minor
> Fix For: 1.0
>
> Attachments: test_label_not_present_in_training_examples.diff
>
>
> Currently an IllegalArgumentException is thrown when a test example has the 
> label (belongs to the class) not present in training examples. When the 
> number of labels is big, such a situation is likely and valid. The example of 
> course will be misclassified, but an exception should not be thrown. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1557) Add support for sparse training vectors in MLP

2014-05-23 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007235#comment-14007235
 ] 

Sebastian Schelter commented on MAHOUT-1557:


Karol, your patch contains some errors, e.g. the variable position is set but 
never read in RunMultilayerPerceptron.

Furthermore, NeuralNetwork converts the input to a DenseVector internally in 
getOutput(), so you also have to modify that code.

> Add support for sparse training vectors in MLP
> --
>
> Key: MAHOUT-1557
> URL: https://issues.apache.org/jira/browse/MAHOUT-1557
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Reporter: Karol Grzegorczyk
>Priority: Minor
>  Labels: mlp
> Fix For: 1.0
>
> Attachments: mlp_sparse.diff
>
>
> When the number of input units of MLP is big, it is likely that input vector 
> will be sparse. It should be possible to read input files in a sparse format.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Hadoop 2 support in a real release?

2014-05-23 Thread Sebastian Schelter
Big +1
Am 23.05.2014 15:33 schrieb "Ted Dunning" :

> What do folks think about spinning out a new version of 0.9 that only
> changes which version of Hadoop the build uses?
>
> There have been quite a few questions lately on this topic.
>
> My suggestion would be that we use minor version numbering to maintain this
> and the normal 0.9 release simultaneously if we decide to do a bug fix
> release.
>
> Any thoughts?
>


[jira] [Commented] (MAHOUT-1556) Mahout for Hadoop2 - HDP2.1.1

2014-05-22 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005720#comment-14005720
 ] 

Sebastian Schelter commented on MAHOUT-1556:


You have to use the trunk version, 0.9 does not have the support for Hadoop 2 
yet.

This page has infos on how to build mahout for Hadoop 2: 
https://mahout.apache.org/developers/buildingmahout.html

Let us know if that doesn't work for you.

> Mahout for Hadoop2 - HDP2.1.1
> -
>
> Key: MAHOUT-1556
> URL: https://issues.apache.org/jira/browse/MAHOUT-1556
> Project: Mahout
>  Issue Type: Dependency upgrade
>  Components: Integration
>Affects Versions: 0.9
> Environment: Ubuntu 12.04, Centos6, Java Oracle 1.7
>Reporter: Prabhat K Singh
>  Labels: hadoop2
> Fix For: 0.9
>
>
> Hi, 
> I tried to build and install Mahout 0.9 for Hadoop HDP 2.1.1 as per the 
> methods given in https://issues.apache.org/jira/browse/MAHOUT-1329, but I get 
> the errors mentioned below.
> Method:
> mvn clean package  -Dhadoop.profile=200  -Dhadoop2.version=2.2.0 
> -Dhbase.version=0.98
> mvn clean install -Dhadoop2 -Dhadoop.2.version=2.2.0
> mvn clean package -Dhadoop2 -Dhadoop.profile=200  -Dhadoop2.version=2.4.0 
> -Dhbase.version=0.98
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) 
> on project mahout-integration: Compilation failure: Compilation failure:
> [ERROR] 
> /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[30,31]
>  cannot find symbol
> [ERROR] symbol:   class HBaseConfiguration
> [ERROR] location: package org.apache.hadoop.hbase
> [ERROR] 
> /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[33,31]
>  cannot find symbol
> [ERROR] symbol:   class KeyValue
> [ERROR] location: package org.apache.hadoop.hbase
> [ERROR] 
> /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[47,36]
>  cannot find symbol
> [ERROR] symbol:   class Bytes
> [ERROR] location: package org.apache.hadoop.hbase.util
> [ERROR] 
> /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[91,42]
>  cannot find symbol
> [ERROR] symbol:   variable Bytes
> [ERROR] location: class 
> org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel
> [ERROR] 
> /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[92,42]
>  cannot find symbol
> [ERROR] symbol:   variable Bytes
> [ERROR] location: class 
> org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel
> [ERROR] 
> /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[107,26]
>  cannot find symbol
> [ERROR] symbol:   variable HBaseConfiguration
> [ERROR] location: class 
> org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel
> [ERROR] 
> /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[138,51]
>  cannot find symbol
> [ERROR] symbol:   variable Bytes
> [ERROR] location: class 
> org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel
> [ERROR] 
> /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[206,26]
>  cannot find symbol
> [ERROR] symbol:   variable Bytes
> [ERROR] location: class 
> org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel
> [ERROR] 
> /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[207,25]
>  cannot find symbol
> [ERROR] symbol:   variable Bytes
> [ERROR] location: class 
> org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel
> [ERROR] 
> /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[233,15]
>  cannot find symbol
> [ERROR] symbol:   variable Bytes
> [ERROR] location: class 
> org.apache.mahout.cf.taste.impl.model.hbase.HBaseDataModel
> [ERROR] 
> /home/user1/aws/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/hbase/HBaseDataModel.java:[265,26]
>  cannot find symbol
> [ERROR] symbol:   variable Bytes
> [ERROR] l

[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website

2014-05-22 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005681#comment-14005681
 ] 

Sebastian Schelter commented on MAHOUT-1534:


Looks good, I think we should also mention the skipTests option for packaging 
and add a news entry for that.

> Add documentation for using Mahout with Hadoop2 to the website
> --
>
> Key: MAHOUT-1534
> URL: https://issues.apache.org/jira/browse/MAHOUT-1534
> Project: Mahout
>  Issue Type: Task
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Gokhan Capan
> Fix For: 1.0
>
>
> MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. 
> We should have a page on the website describing this for our users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-05-21 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005631#comment-14005631
 ] 

Sebastian Schelter commented on MAHOUT-1464:


[~pferrel] Great, how large was your test dataset?

I'd vote against other similarity types for the sake of simplicity; LLR also works 
best in my experience
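
The LLR score discussed throughout this thread is computed from a 2x2
contingency table of cooccurrence counts; a minimal sketch using mahout-math
(assuming it is on the classpath):

  import org.apache.mahout.math.stats.LogLikelihood

  // counts over all users: k11 = interacted with both A and B,
  // k12 = A without B, k21 = B without A, k22 = neither
  val llr = LogLikelihood.logLikelihoodRatio(13, 1000, 1000, 100000)
  println(llr) // large values indicate cooccurrence beyond chance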

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website

2014-05-21 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004958#comment-14004958
 ] 

Sebastian Schelter commented on MAHOUT-1534:


I somehow cannot see the staged version unfortunately. Just publish it and I'll 
have a look. Maybe we should even add an extra page and navigation point for 
that site, what do you think?

> Add documentation for using Mahout with Hadoop2 to the website
> --
>
> Key: MAHOUT-1534
> URL: https://issues.apache.org/jira/browse/MAHOUT-1534
> Project: Mahout
>  Issue Type: Task
>  Components: Documentation
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
>
> MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. 
> We should have a page on the website describing this for our users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: consensus statement?

2014-05-21 Thread Sebastian Schelter
Big +1, very nicely captures what I also think

--sebastian
Am 21.05.2014 14:27 schrieb "Gokhan Capan" :

> I want to express my opinions on the vision, too. I tried to capture these
> words from various discussions on the dev-list, and hope that most of them
> support the common sense of excitement the new Mahout arouses
>
> To me, the fundamental benefit of the shift that Mahout is undergoing is a
> better separation of the distributed execution engine, distributed data
> structures, matrix computations, and algorithms layers, which will allow
> the users/devs of Mahout with different roles focus on the relevant parts
> of the framework:
>
>1. A machine learning scientist, independent from the underlying
>distributed execution engine, can utilize the matrix language and the
>decompositions to implement new algorithms (which implies that the
> current
>distributed mahout algorithms are to be rewritten in the matrix
> language)
>2. A math-scala module contributor, for the benefit of higher level
>algorithms, can add new, or improve existing functions (the set of
>decompositions is an example) with optimization plans (such as if two
>matrices are partitioned in the same way, ...), where the concrete
>implementations of those optimizations are delegated to the distributed
>execution engine layer
>3. A distributed execution engine author can add machine learning
>capabilities to her platform with i)concrete Matrix and Matrix I/O
>implementation  ii)partitioning, checkpointing, broadcasting behaviors,
>iii)BLAS
>4. A Mahout user with access to a cluster operated by a
>Mahout-supporting distributed execution engine can run machine learning
>algorithms implemented on top of the matrix language
>
> Best
>
> Gokhan
>
>
> On Tue, May 20, 2014 at 8:30 PM, Dmitriy Lyubimov 
> wrote:
>
> > inline
> >
> >
> > On Tue, May 20, 2014 at 12:42 AM, Sebastian Schelter 
> > wrote:
> >
> > >
> > >>
> > > Let's take the text from our homepage as a starting point. What should we
> > > add/remove/modify?
> > >
> > > 
> > > 
> > > The Mahout community decided to move its codebase onto modern data
> > > processing systems that offer a richer programming model and more
> > efficient
> > > execution than Hadoop MapReduce. Mahout will therefore reject new
> > MapReduce
> > > algorithm implementations from now on. We will however keep our widely
> > used
> > > MapReduce algorithms in the codebase and maintain them.
> > >
> > > We are building our future implementations on top of a
> >
> > Scala
> >
> > > DSL for linear algebraic operations which has been developed over the
> > last
> > > months. Programs written in this DSL are automatically optimized and
> > > executed in parallel for Apache Spark.
> >
> > More platforms to be added in the future.
> >
> > >
> > > Furthermore, there is an experimental contribution underway which
> aims
> > > to integrate the h2o platform into Mahout.
> > > 
> > > 
> > >
> >
>


[jira] [Updated] (MAHOUT-1554) Provide more comprehensive classification statistics

2014-05-21 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1554:
---

Fix Version/s: 1.0

> Provide more comprehensive classification statistics
> 
>
> Key: MAHOUT-1554
> URL: https://issues.apache.org/jira/browse/MAHOUT-1554
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Reporter: Karol Grzegorczyk
>Priority: Minor
> Fix For: 1.0
>
> Attachments: statistics.diff
>
>
> Currently only limited classification statistics are provided. To better 
> understand classification results, it would be worth to provide at lease 
> average precision, recall and F1 score.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: consensus statement?

2014-05-20 Thread Sebastian Schelter

On 05/18/2014 09:28 PM, Ted Dunning wrote:

On Sun, May 18, 2014 at 11:33 AM, Sebastian Schelter  wrote:


I suggest we start with a specific draft that someone prepares (maybe Ted
as he started the thread)



This is a good strategy, and I am happy to start the discussion, but I
wonder if it might help build consensus if somebody else started the ball
rolling.



Let's take the text from our homepage as a starting point. What should we 
add/remove/modify?



The Mahout community decided to move its codebase onto modern data 
processing systems that offer a richer programming model and more 
efficient execution than Hadoop MapReduce. Mahout will therefore reject 
new MapReduce algorithm implementations from now on. We will however 
keep our widely used MapReduce algorithms in the codebase and maintain them.


We are building our future implementations on top of a DSL for linear 
algebraic operations which has been developed over the last months. 
Programs written in this DSL are automatically optimized and executed in 
parallel on Apache Spark.


Furthermore, there is an experimental contribution underway which aims 
to integrate the h2o platform into Mahout.




[jira] [Commented] (MAHOUT-1542) Tutorial for playing with Mahout's Spark shell

2014-05-19 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002271#comment-14002271
 ] 

Sebastian Schelter commented on MAHOUT-1542:


No, go ahead, that's a great idea.

> Tutorial for playing with Mahout's Spark shell
> --
>
> Key: MAHOUT-1542
> URL: https://issues.apache.org/jira/browse/MAHOUT-1542
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation, Math
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>
> I have a created a tutorial for setting up the spark shell and implementing a 
> simple linear regression algorithm. I'd love to make this part of the 
> website, could someone give it a review?
> https://github.com/sscdotopen/krams/blob/master/linear-regression-cereals.md
> PS: If you wanna try out the code, you have to add the patch from MAHOUT-1532 
> to your sources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [jira] [Commented] (MAHOUT-1543) JSON output format for classifying with random forests

2014-05-19 Thread Sebastian Schelter
Can you create it in an svn-compatible way and check that it works with the
current trunk?

Thx, sebastian
Am 19.05.2014 12:01 schrieb "larryhu (JIRA)" :

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001554#comment-14001554]
>
> larryhu commented on MAHOUT-1543:
> -
>
> I'm so sorry for your trouble; this patch was created by git, I cloned it
> from github, tag: mahout-0.7.
>
> > JSON output format for classifying with random forests
> > --
> >
> > Key: MAHOUT-1543
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1543
> > Project: Mahout
> >  Issue Type: Improvement
> >  Components: Classification
> >Affects Versions: 0.7, 0.8, 0.9
> >Reporter: larryhu
> >  Labels: patch
> > Fix For: 0.7
> >
> > Attachments: MAHOUT-1543.patch
> >
> >
> > This patch adds JSON output format to build random forests,
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>


[jira] [Updated] (MAHOUT-1388) Add command line support and logging for MLP

2014-05-18 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1388:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed your patch with cosmetic changes, thank you. Could you open another 
JIRA for adding documentation on how to use MLP from the commandline?

> Add command line support and logging for MLP
> 
>
> Key: MAHOUT-1388
> URL: https://issues.apache.org/jira/browse/MAHOUT-1388
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 1.0
>Reporter: Yexi Jiang
>Assignee: Suneel Marthi
>  Labels: mlp, sgd
> Fix For: 1.0
>
> Attachments: Mahout-1388.patch, Mahout-1388.patch
>
>
> The user should have the ability to run the Perceptron from the command line.
> There are two programs to execute MLP, the training and labeling. The first 
> one takes the data as input and outputs the model, the second one takes the 
> model and unlabeled data as input and outputs the results.
> The parameters for training are as follows:
> 
> --input -i (input data)
> --skipHeader -sk // whether to skip the first row, this parameter is optional
> --labels -labels // the labels of the instances, separated by whitespace. 
> Take the iris dataset for example, the labels are 'setosa versicolor 
> virginica'.
> --model -mo  // in training mode, this is the location to store the model (if 
> the specified location has an existing model, it will update the model 
> through incremental learning), in labeling mode, this is the location to 
> store the result
> --update -u // whether to incremental update the model, if this parameter is 
> not given, train the model from scratch
> --output -o   // this is only useful in labeling mode
> --layersize -ls (no. of units per hidden layer) // use whitespace separated 
> number to indicate the number of neurons in each layer (including input layer 
> and output layer), e.g. '5 3 2'.
> --squashingFunction -sf // currently only supports Sigmoid
> --momentum -m 
> --learningrate -l
> --regularizationweight -r
> --costfunction -cf   // the type of cost function,
> 
> For example, train a 3-layer (including input, hidden, and output) MLP with 
> 0.1 learning rate, 0.1 momentum rate, and 0.01 regularization weight, the 
> parameter would be:
> mlp -i /tmp/training-data.csv -labels setosa versicolor virginica -o 
> /tmp/model.model -ls 5,3,1 -l 0.1 -m 0.1 -r 0.01
> This command would read the training data from /tmp/training-data.csv and 
> write the trained model to /tmp/model.model.
> The parameters for labeling is as follows:
> -
> --input -i // input file path
> --columnRange -cr // the range of column used for feature, start from 0 and 
> separated by whitespace, e.g. 0 5
> --format -f // the format of input file, currently only supports csv
> --model -mo // the file path of the model
> --output -o // the output path for the results
> -
> If a user need to use an existing model, it will use the following command:
> mlp -i /tmp/unlabel-data.csv -m /tmp/model.model -o /tmp/label-result
> Moreover, we should be providing default values if the user does not specify 
> any. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1527) Fix wikipedia classifier example

2014-05-18 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001165#comment-14001165
 ] 

Sebastian Schelter commented on MAHOUT-1527:


Definitely. More examples and more documentation are always welcome :)

> Fix wikipedia classifier example
> 
>
> Key: MAHOUT-1527
> URL: https://issues.apache.org/jira/browse/MAHOUT-1527
> Project: Mahout
>  Issue Type: Task
>  Components: Classification, Documentation, Examples
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1527.patch
>
>
> The examples package has a classification showcase for prediciting the labels 
> of wikipedia  pages. Unfortunately, the example is totally broken:
> It relies on the old NB implementation which has been removed, suggests to 
> use the whole wikipedia as input, which will not work well on a single 
> machine and the documentation uses commands that have long been removed from 
> bin/mahout. 
> The example needs to be updated to use the current naive bayes implementation 
> and documentation on the website needs to be written.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: consensus statement?

2014-05-18 Thread Sebastian Schelter
I think it is important to formulate such a statement and send it out to 
the "outside world". But we should focus the discussion. I suggest we 
start with a specific draft that someone prepares (maybe Ted as he 
started the thread) and then we can discuss and reformulate the 
individual sentences. I also think the formulation "the committers work 
on Spark" is not concise enough (and neglects a lot of our goals), but I 
also don't think it was meant to be part of an official statement in 
that exact wording.


--sebastian




On 05/18/2014 07:44 PM, Pat Ferrel wrote:

Not sure why you address this to me. I agree with most of your statements.

I think Ted’s intent was to find a simple consensus statement that addresses 
where the project is going in a general way. I look at it as something to 
communicate to the outside world. Why? We are rejecting new mapreduce code. 
This was announced as a project-wide rule and has already been used to reject 
one contribution I know of. OK, what replaces Hadoop mapreduce?  What therefore 
should contributors look to as a model if not Hadoop mapreduce? Do we give no 
advice or comment on this question?

For example, I’m doing drivers that read and write text files. This is quite 
tightly coupled to Spark. Possible contributors should know that this is OK, 
that it will not be rejected and is indeed where most of the engine specific 
work is being done by committers. You are right, most of us know what we are 
doing, but simply to say “no more mapreduce” without offering an alternative 
isn’t quite fair to everyone else.

You are abstracting your code away from a specific engine, and that is great, but in 
practice anyone running it currently must run Spark. This also needs to be 
communicated. It’s as practical as answering, “What do I need to install to make 
Mahout 1.0-snapshot work?"

On May 15, 2014, at 7:17 AM, Dmitriy Lyubimov  wrote:

Pat, it can be as high-level or as detailed as need be, I don't care,
as long as it doesn't contain misstatements. It simply can state "we adhere
to the "Apache's power of doing" principle and accept new contributions".
This is ok with me. But, as offered, it does try to enumerate strategic
directions, and in doing so, its wording is either vague, or incomplete, or
just wrong.


For example, it says "it is clear that what the committers are working on
is Spark". This is less than accurate.

First, if I interpret it literally, it is wrong, as our committers for the most
part are not working on Spark, and even if they do, to whatever negligible
degree that exists, why would Mahout care.

Second, if it is meant to say "we develop algorithms for Spark", this is
also wrong, because whatever algorithms we have added to date have 0 Spark
dependencies.

Third, if it is meant to say that majority of what we are working on is
Spark bindings, this is still incorrect. Head count-wise, Mahout-math
tweaks and Scala enablement were at least a big effort. Hadoop 2.0 stuff
was at least as big. Documentation and tutorial work engagement was
absolute leader headcount-wise to date.

The problem i am trying to explain here is that we obviously internally
know what we are doing; but this is for external consumption so we have to
be careful to avoid miscommunication here. It is easy for us to pass on
less than accurate info delivery exactly because we already know what we
are doing and therefore our brain is happy to jump to conclusions and make
up the missing connections between stated and implied as we see it. But for
an outsider, this would sound vague or make him make wrong connections.



On Wed, May 7, 2014 at 9:54 AM, Pat Ferrel  wrote:


This doesn’t seem to be a vision statement. I was +1 to a simple consensus
statement.

The vision is up to you.

We have an interactive shell that scales to huge datasets without
resorting to massive subsampling, one that allows you to deal with the
exact data your black-box algos work on. Every data tool has an interactive
mode except Mahout; now it does too. Virtually every complex transform as well
as basic linear algebra works on massive datasets. The interactivity will
allow people to do things with Mahout they could never do before.

We also have the building blocks to make the fastest, most flexible,
cutting-edge collaborative filtering + metadata recommenders in the world.
Honestly, I don't see anything like this elsewhere. We will also be able to fit
into virtually any workflow and directly consume data produced in those systems
with no intermediate scrubbing. This has never happened before in Mahout,
and I don't see it in MLlib either. Even the interactive shell will benefit
from this.

Other feature champions will be able to add to this list.

Seems like the vision comes from feature champions. I may not use Mahout
in the same way you do, but I rely on your code. Maybe I serve a different
user type than you. I don’t see a pr

[jira] [Updated] (MAHOUT-1498) DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie

2014-05-18 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1498:
---

Resolution: Fixed
  Assignee: Sebastian Schelter
Status: Resolved  (was: Patch Available)

Committed with a few cosmetic changes; thank you for the contribution!

> DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed 
> using oozie
> -
>
> Key: MAHOUT-1498
> URL: https://issues.apache.org/jira/browse/MAHOUT-1498
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
> Environment: mahout-core-0.7-cdh4.4.0.jar
>Reporter: Sergey
>Assignee: Sebastian Schelter
>  Labels: patch
> Fix For: 1.0
>
> Attachments: MAHOUT-1498.patch
>
>
> Hi, I get this exception:
> {code}
> <<< Invocation of Main class completed <<<
> Failing Oozie Launcher, Main class 
> [org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles], main() threw 
> exception, Job failed!
> java.lang.IllegalStateException: Job failed!
> at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
> at 
> org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
> at 
> org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
> {code}
> The root cause is:
> {code}
> Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
> at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:247
> {code}
> It looks like it happens because of the
> DictionaryVectorizer.makePartialVectors method.
> It has code:
> {code}
> DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf);
> {code}
> which overwrites the jars pushed with the job by oozie:
> {code}
> public static void setCacheFiles(URI[] files, Configuration conf) {
>  String sfiles = StringUtils.uriToString(files);
>  conf.set("mapred.cache.files", sfiles);
> }
> {code}
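> For illustration, a minimal sketch of a possible fix (assuming the standard
> Hadoop DistributedCache API; untested): addCacheFile appends to
> mapred.cache.files instead of replacing it, so the jars already registered by
> oozie would survive.
> {code}
> // append the dictionary to the cache instead of overwriting existing entries
> DistributedCache.addCacheFile(dictionaryFilePath.toUri(), conf);
> {code}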



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1385) Caching Encoders don't cache

2014-05-18 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1385:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I agree, Johannes is right that ideally we would want to leverage hashcode 
caching of Strings. But the current code is a non-working implementation, which 
this patch fixes. So I'm committing this for now.

> Caching Encoders don't cache
> 
>
> Key: MAHOUT-1385
> URL: https://issues.apache.org/jira/browse/MAHOUT-1385
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Johannes Schulte
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1385-test.patch, MAHOUT-1385.patch
>
>
> The Caching... line of encoders contains code for caching the hash codes of
> terms added to the vector. However, the method "hashForProbe" inside these
> classes is never called, as its signature has String for the originalForm
> parameter (instead of byte[] like the other encoders).
> Changing this to byte[], however, would lose the Java String's internal
> caching of its hash code, which is used as the key in the cache map,
> triggering another hash code calculation.
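> For illustration, a hedged sketch of what the fix amounts to (method
> signature and superclass behavior assumed from the other encoders, not taken
> verbatim from the patch): override the byte[]-based hashForProbe so that it
> is actually invoked, and cache the computed hashes per term and probe.
> {code}
> private final Map<String, Integer> hashCache = new HashMap<String, Integer>();
> 
> @Override
> protected int hashForProbe(byte[] originalForm, int dataSize, String name, int probe) {
>   // the cache key includes the probe index, since each probe yields a different hash
>   String key = new String(originalForm, Charset.forName("UTF-8")) + '\u0000' + probe;
>   Integer hash = hashCache.get(key);
>   if (hash == null) {
>     hash = super.hashForProbe(originalForm, dataSize, name, probe);
>     hashCache.put(key, hash);
>   }
>   return hash;
> }
> {code}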



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1527) Fix wikipedia classifier example

2014-05-18 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1527:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed this with minor changes (fixing a few typos, adding a check for 
MAHOUT_HOME to be set).

Thank you Andrew, keep up the outstanding work.

> Fix wikipedia classifier example
> 
>
> Key: MAHOUT-1527
> URL: https://issues.apache.org/jira/browse/MAHOUT-1527
> Project: Mahout
>  Issue Type: Task
>  Components: Classification, Documentation, Examples
>Affects Versions: 0.7, 0.8, 0.9
>    Reporter: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1527.patch
>
>
> The examples package has a classification showcase for predicting the labels
> of wikipedia pages. Unfortunately, the example is totally broken:
> it relies on the old NB implementation, which has been removed; it suggests
> using the whole of wikipedia as input, which will not work well on a single
> machine; and the documentation uses commands that have long been removed from
> bin/mahout.
> The example needs to be updated to use the current naive bayes implementation,
> and documentation on the website needs to be written.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1543) JSON output format for classifying with random forests

2014-05-18 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001088#comment-14001088
 ] 

Sebastian Schelter commented on MAHOUT-1543:


[~larryhu] I have trouble applying your patch to the sources checked out from 
SVN. Could you check that the patch is svn compatible? Sorry for the trouble.

> JSON output format for classifying with random forests
> --
>
> Key: MAHOUT-1543
> URL: https://issues.apache.org/jira/browse/MAHOUT-1543
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: larryhu
>  Labels: patch
> Fix For: 0.7
>
> Attachments: MAHOUT-1543.patch
>
>
> This patch adds a JSON output format for building random forests.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1542) Tutorial for playing with Mahout's Spark shell

2014-05-18 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1542.


Resolution: Fixed

added to the website. I also added a new top navigation point called "Spark". 
Shout if you don't like that naming.

> Tutorial for playing with Mahout's Spark shell
> --
>
> Key: MAHOUT-1542
> URL: https://issues.apache.org/jira/browse/MAHOUT-1542
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation, Math
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>
> I have created a tutorial for setting up the spark shell and implementing a 
> simple linear regression algorithm. I'd love to make this part of the 
> website; could someone give it a review?
> https://github.com/sscdotopen/krams/blob/master/linear-regression-cereals.md
> PS: If you wanna try out the code, you have to add the patch from MAHOUT-1532 
> to your sources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1485) Clean up Recommender Overview page

2014-05-18 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001089#comment-14001089
 ] 

Sebastian Schelter commented on MAHOUT-1485:


[~yash...@gmail.com] yash, the documentation looks great. Could you create a 
markdown version of it so that we can add it to the Mahout website?

> Clean up Recommender Overview page
> --
>
> Key: MAHOUT-1485
> URL: https://issues.apache.org/jira/browse/MAHOUT-1485
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>
> Clean up the recommender overview page, remove outdated content and make sure 
> the examples work.
> https://mahout.apache.org/users/recommender/recommender-documentation.html



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1385) Caching Encoders don't cache

2014-05-18 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001032#comment-14001032
 ] 

Sebastian Schelter commented on MAHOUT-1385:


[~awmanoj] what's the status here?

> Caching Encoders don't cache
> 
>
> Key: MAHOUT-1385
> URL: https://issues.apache.org/jira/browse/MAHOUT-1385
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Johannes Schulte
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1385-test.patch, MAHOUT-1385.patch
>
>
> The Caching... line of encoders contains code for caching the hash codes of
> terms added to the vector. However, the method "hashForProbe" inside these
> classes is never called, as its signature has String for the originalForm
> parameter (instead of byte[] like the other encoders).
> Changing this to byte[], however, would lose the Java String's internal
> caching of its hash code, which is used as the key in the cache map,
> triggering another hash code calculation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1528) Source tag and source release tarball for 0.9 don't exactly match

2014-05-18 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1528.


Resolution: Later

Thank you for raising this issue; we will keep it in mind for the next 
release. The CHANGELOG file in particular should be part of the distribution!

> Source tag and source release tarball for 0.9 don't exactly match
> -
>
> Key: MAHOUT-1528
> URL: https://issues.apache.org/jira/browse/MAHOUT-1528
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.9
>Reporter: Mark Grover
>
> If you download the source tarball for the Apache Mahout 0.9 release, you'd
> notice that it doesn't contain the CHANGELOG or .gitignore files. However, if
> you look at the tag for the release in the github repo
> (https://github.com/apache/mahout/tree/mahout-0.9), you'd notice both
> files there.
> I think, both as a best practice and to make the life of downstream integrators
> less miserable, it would be fantastic if we could have the release tag in the
> source match one-to-one with the source code in the released source tarball.
> A test to check this, in particular, would be awesome!
> Thanks in advance!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1388) Add command line support and logging for MLP

2014-05-18 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001021#comment-14001021
 ] 

Sebastian Schelter commented on MAHOUT-1388:


[~yxjiang] what's the status here?

> Add command line support and logging for MLP
> 
>
> Key: MAHOUT-1388
> URL: https://issues.apache.org/jira/browse/MAHOUT-1388
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 1.0
>Reporter: Yexi Jiang
>Assignee: Suneel Marthi
>  Labels: mlp, sgd
> Fix For: 1.0
>
> Attachments: Mahout-1388.patch, Mahout-1388.patch
>
>
> The user should have the ability to run the perceptron from the command line.
> There are two programs for executing the MLP: training and labeling. The first
> one takes the data as input and outputs the model; the second one takes the
> model and unlabeled data as input and outputs the results.
> The parameters for training are as follows:
> 
> --input -i (input data)
> --skipHeader -sk // whether to skip the first row; this parameter is optional
> --labels -labels // the labels of the instances, separated by whitespace. 
> Take the iris dataset for example: the labels are 'setosa versicolor 
> virginica'.
> --model -mo  // in training mode, this is the location to store the model (if 
> the specified location has an existing model, it will update the model 
> through incremental learning); in labeling mode, this is the location to 
> store the result
> --update -u // whether to incrementally update the model; if this parameter is 
> not given, train the model from scratch
> --output -o   // this is only useful in labeling mode
> --layersize -ls (no. of units per layer) // use whitespace-separated 
> numbers to indicate the number of neurons in each layer (including input layer 
> and output layer), e.g. '5 3 2'.
> --squashingFunction -sf // currently only supports Sigmoid
> --momentum -m 
> --learningrate -l
> --regularizationweight -r
> --costfunction -cf   // the type of cost function
> 
> For example, to train a 3-layer (including input, hidden, and output) MLP with 
> 0.1 learning rate, 0.1 momentum rate, and 0.01 regularization weight, the 
> command would be:
> mlp -i /tmp/training-data.csv -labels setosa versicolor virginica -mo 
> /tmp/model.model -ls 5 3 1 -l 0.1 -m 0.1 -r 0.01
> This command would read the training data from /tmp/training-data.csv and 
> write the trained model to /tmp/model.model.
> The parameters for labeling are as follows:
> -
> --input -i // input file path
> --columnRange -cr // the range of columns used as features, starting from 0 and 
> separated by whitespace, e.g. 0 5
> --format -f // the format of the input file; currently only csv is supported
> --model -mo // the file path of the model
> --output -o // the output path for the results
> -
> If a user needs to use an existing model, the command would be:
> mlp -i /tmp/unlabel-data.csv -mo /tmp/model.model -o /tmp/label-result
> Moreover, we should provide default values if the user does not specify any.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1536) Update "Creating vectors from text" page

2014-05-18 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001036#comment-14001036
 ] 

Sebastian Schelter commented on MAHOUT-1536:


[~Andrew_Palumbo] did you have time to work on this yet?

> Update "Creating vectors from text" page
> 
>
> Key: MAHOUT-1536
> URL: https://issues.apache.org/jira/browse/MAHOUT-1536
> Project: Mahout
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
>Priority: Minor
> Fix For: 1.0
>
> Attachments: MAHOUT-1536_edit1.patch
>
>
> At least the seq2sparse section of the "Creating vectors from text" page is 
> out of date.  
> https://mahout.apache.org/users/basics/creating-vectors-from-text.html
>   



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1484) Spectral algorithm for HMMs

2014-05-18 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1484.


Resolution: Won't Fix

no activity in four weeks

> Spectral algorithm for HMMs
> ---
>
> Key: MAHOUT-1484
> URL: https://issues.apache.org/jira/browse/MAHOUT-1484
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Emaad Manzoor
>Priority: Minor
>
> Following up with this 
> [comment|https://issues.apache.org/jira/browse/MAHOUT-396?focusedCommentId=12898284&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12898284]
>  by [~isabel] on the sequential HMM 
> [proposal|https://issues.apache.org/jira/browse/MAHOUT-396], is there any 
> interest in a spectral algorithm as described in: "A spectral algorithm for 
> learning hidden Markov models (D. Hsu, S. Kakade, T. Zhang)"?
> I would like to take up this effort.
> This will enable learning the parameters of and making predictions with a HMM 
> in a single step. At its core, the algorithm involves computing estimates 
> from triples of observations, performing an SVD and then some matrix 
> multiplications.
> This could also form the basis for an implementation of "Hilbert Space 
> Embeddings of Hidden Markov Models (L. Song, B. Boots, S. Siddiqi, G. Gordon, 
> A. Smola)".
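> For reference, a sketch of the observable-operator form at the core of the
> Hsu/Kakade/Zhang algorithm (notation as in the paper; U holds the top left
> singular vectors of the bigram probability matrix P_{2,1}):
> {code}
> b_1 = U^\top P_1, \qquad
> b_\infty = (P_{2,1}^\top U)^{+} P_1, \qquad
> B_x = (U^\top P_{3,x,1}) (U^\top P_{2,1})^{+}
> 
> \Pr[x_1, \dots, x_t] = b_\infty^\top B_{x_t} \cdots B_{x_1} b_1
> {code}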



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Build failed in Jenkins: Mahout-Quality #2608

2014-05-18 Thread Sebastian Schelter

Can someone check why the build is still failing?

On 05/13/2014 01:14 AM, Apache Jenkins Server wrote:

See 

--
[...truncated 8432 lines...]
}

Q=
{
   0  => {0:0.40273861426601687,1:-0.9153150324187648}
   1  => {0:0.9153150324227656,1:0.40273861426427493}
}
- C = A %*% B mapBlock {}
- C = A %*% B incompatible B keys
36495 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtB$  - A and B for A'B are not 
identically partitioned, performing inner join.
- C = At %*% B , join
37989 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtB$  - A and B for A'B are not 
identically partitioned, performing inner join.
- C = At %*% B , join, String-keyed
39452 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtB$  - A and B for A'B are identically 
distributed, performing row-wise zip.
- C = At %*% B , zippable, String-keyed
{
   2  => {0:62.0,1:86.0,3:132.0,2:115.0}
   1  => {0:50.0,1:69.0,3:105.0,2:92.0}
   3  => {0:74.0,1:103.0,3:159.0,2:138.0}
   0  => {0:26.0,1:35.0,3:51.0,2:46.0}
}
- C = A %*% inCoreB
{
   0  => {0:26.0,1:35.0,2:46.0,3:51.0}
   1  => {0:50.0,1:69.0,2:92.0,3:105.0}
   2  => {0:62.0,1:86.0,2:115.0,3:132.0}
   3  => {0:74.0,1:103.0,2:138.0,3:159.0}
}
- C = inCoreA %*%: B
43683 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtA$  - Applying slim A'A.
- C = A.t %*% A
45370 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtA$  - Applying non-slim non-graph A'A.
70680 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings  - test done.
- C = A.t %*% A fat non-graph
71986 [ScalaTest-main-running-RLikeDrmOpsSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtA$  - Applying slim A'A.
- C = A.t %*% A non-int key
- C = A + B
- C = A + B side test 1
- C = A + B side test 2
- C = A + B side test 3
ArrayBuffer(0, 1, 2, 3, 4)
ArrayBuffer(0, 1, 2, 3, 4)
- general side
- Ax
- A'x
- colSums, colMeans
Run completed in 1 minute, 31 seconds.
Total number of tests run: 38
Suites: completed 9, aborted 0
Tests: succeeded 38, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO]
[INFO] --- build-helper-maven-plugin:1.8:remove-project-artifact 
(remove-old-mahout-artifacts) @ mahout-spark ---
[INFO] /home/jenkins/.m2/repository/org/apache/mahout/mahout-spark removed.
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ mahout-spark ---
[INFO] Building jar: 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT.jar
[INFO]
[INFO] --- maven-jar-plugin:2.4:test-jar (default) @ mahout-spark ---
[INFO] Building jar: 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-tests.jar
[INFO]
[INFO] --- maven-source-plugin:2.2.1:jar-no-fork (attach-sources) @ 
mahout-spark ---
[INFO] Building jar: 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-sources.jar
[INFO]
[INFO] --- maven-install-plugin:2.5.1:install (default-install) @ mahout-spark 
---
[INFO] Installing 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT.jar
 to 
/home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT.jar
[INFO] Installing 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/pom.xml to 
/home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT.pom
[INFO] Installing 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-tests.jar
 to 
/home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT-tests.jar
[INFO] Installing 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/mahout-spark-1.0-SNAPSHOT-sources.jar
 to 
/home/jenkins/.m2/repository/org/apache/mahout/mahout-spark/1.0-SNAPSHOT/mahout-spark-1.0-SNAPSHOT-sources.jar
[INFO]
[INFO] >>> maven-javadoc-plugin:2.9.1:javadoc (default-cli) @ mahout-spark >>>
[INFO]
[INFO] --- build-helper-maven-plugin:1.8:add-source (add-source) @ mahout-spark 
---
[INFO] Source directory: 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/generated-sources/mahout
 added.
[INFO]
[INFO] --- build-helper-maven-plugin:1.8:add-test-source (add-test-source) @ 
mahout-spark ---
[INFO] Test Source directory: 
/x1/jenkins/jenkins-slave/workspace/Mahout-Quality/trunk/spark/target/generated-test-sources/mahout
 added.
[INFO]
[INFO] <<< maven-javadoc-plugin:2.9.1:javadoc (defau

[jira] [Commented] (MAHOUT-1549) Extracting tfidf-vectors by key

2014-05-18 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001030#comment-14001030
 ] 

Sebastian Schelter commented on MAHOUT-1549:


[~Pilgrim] has your question been answered yet?

> Extracting tfidf-vectors by key
> ---
>
> Key: MAHOUT-1549
> URL: https://issues.apache.org/jira/browse/MAHOUT-1549
> Project: Mahout
>  Issue Type: Question
>  Components: Classification
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: Richard Scharrer
>  Labels: documentation, features, newbie
>
> Hi,
> I have about 20 tfidf-vectors and I need to extract 500 of them, for which 
> I have the keys. Is there some kind of magical option which allows me to do 
> something like taking the output of mahout seqdumper and transforming it back 
> into a sequencefile that I can use for trainnb/testnb? The sequencefiles of 
> tfidf use the Text class for the keys and the VectorWritable class for the 
> values. I tried 
> https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java
> with different settings, but the output always gives me the Text class for 
> both key and value, which can't be used in trainnb and testnb.
> I posted this question on:
> http://stackoverflow.com/questions/23502362/extracting-tfidf-vectors-by-key-without-destroying-the-fileformat
> I ask this question here because I've seen similar questions on 
> stackoverflow that were asked last year and still didn't get an answer.
> I really need this information, so if you know anything please tell me.
> Regards,
> Richard
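> A hedged sketch of one way to do this in plain Hadoop (not an existing Mahout
> tool; paths and keys are illustrative): read the (Text, VectorWritable)
> sequence file and copy only the wanted keys, preserving the writable types so
> trainnb/testnb can still consume the result.
> {code}
> import java.util.Arrays;
> import java.util.HashSet;
> import java.util.Set;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> import org.apache.mahout.math.VectorWritable;
> 
> public class FilterVectorsByKey {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.get(conf);
>     // the keys to extract (illustrative; in practice read them from a file)
>     Set<String> wantedKeys = new HashSet<String>(Arrays.asList("/doc-1", "/doc-2"));
>     SequenceFile.Reader reader =
>         new SequenceFile.Reader(fs, new Path("tfidf-vectors/part-r-00000"), conf);
>     SequenceFile.Writer writer = SequenceFile.createWriter(
>         fs, conf, new Path("filtered-vectors/part-r-00000"), Text.class, VectorWritable.class);
>     Text key = new Text();
>     VectorWritable value = new VectorWritable();
>     while (reader.next(key, value)) {
>       // keep only the vectors whose keys we want to extract
>       if (wantedKeys.contains(key.toString())) {
>         writer.append(key, value);
>       }
>     }
>     reader.close();
>     writer.close();
>   }
> }
> {code}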



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1425) SGD classifier example with bank marketing dataset

2014-05-18 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001024#comment-14001024
 ] 

Sebastian Schelter commented on MAHOUT-1425:


[~frankscholten] what's the status here?

> SGD classifier example with bank marketing dataset
> --
>
> Key: MAHOUT-1425
> URL: https://issues.apache.org/jira/browse/MAHOUT-1425
> Project: Mahout
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.0
>Reporter: Frank Scholten
>Assignee: Frank Scholten
> Fix For: 1.0
>
> Attachments: MAHOUT-1425.patch
>
>
> As discussed on the mailing list a few weeks back, I started working on an SGD 
> classifier example with the bank marketing dataset from UCI: 
> http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
> See https://github.com/frankscholten/mahout-sgd-bank-marketing
> Ted has also made further changes that were very useful, so I suggest adding 
> this example to Mahout.
> Ted: can you tell us a bit more about the log transforms? Some of them are just 
> Math.log while others are more complex expressions. 
> What else is needed to contribute it to Mahout? Anything that could improve 
> the example?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1495) Create a website describing the distributed item-based recommender

2014-05-18 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001017#comment-14001017
 ] 

Sebastian Schelter commented on MAHOUT-1495:


[~apsaltis] what's the status here?

> Create a website describing the distributed item-based recommender
> --
>
> Key: MAHOUT-1495
> URL: https://issues.apache.org/jira/browse/MAHOUT-1495
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering, Documentation
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1487) More understandable error message when attempt to use wrong FileSystem

2014-05-18 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1487.


Resolution: Won't Fix

no activity in four weeks

> More understandable error message when attempt to use wrong FileSystem
> --
>
> Key: MAHOUT-1487
> URL: https://issues.apache.org/jira/browse/MAHOUT-1487
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.9
> Environment: Amazon S3, Amazon EMR, Local file system
>Reporter: Konstantin
>Priority: Trivial
> Fix For: 1.0
>
>
> RandomSeedGenerator has the following code:
> FileSystem fs = FileSystem.get(output.toUri(), conf);
> ...
> fs.getFileStatus(input).isDir() 
> If the output path is specified correctly but the input path is not, Mahout 
> throws a hard-to-understand error message: "Exception in thread "main" 
> java.lang.IllegalArgumentException: This file system object 
> (hdfs://172.31.41.65:9000) does not support access to the request path 
> 's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors' 
> You possibly called FileSystem.get(conf) when you should have called 
> FileSystem.get(uri, conf) to obtain a file system supporting your path"
> This happens because the FileSystem object was created from the output path, 
> while getFileStatus was given the input path. This causes confusion when 
> trying to work out what the error message means.
> To prevent this misunderstanding, I propose to improve the error message by 
> adding the following details:
> 1. Specify which filesystem type is used (DistributedFileSystem, 
> NativeS3FileSystem, etc., using fs.getClass().getName())
> 2. Specify which path cannot be processed correctly.
> This could be done by a validation utility applicable in many places in 
> Mahout. When we use Mahout we need to specify many paths, and we can also use 
> many types of file systems: local for debugging, distributed on Hadoop, and 
> s3 on Amazon. Better error messages here can save a lot of time.
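> A hedged sketch of such a validation utility (class and method names are
> illustrative, not existing Mahout code):
> {code}
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> 
> public final class PathValidator {
>   private PathValidator() {}
> 
>   public static void validate(Path path, Configuration conf) throws IOException {
>     // resolve the filesystem from the path itself, never from another path
>     FileSystem fs = FileSystem.get(path.toUri(), conf);
>     if (!fs.exists(path)) {
>       throw new IllegalArgumentException("Path '" + path + "' cannot be accessed via "
>           + fs.getClass().getName() + "; check the URI scheme and filesystem configuration");
>     }
>   }
> }
> {code}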



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1453) ImplicitFeedbackAlternatingLeastSquaresSolver add support for user supplied confidence functions

2014-05-18 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1453.


Resolution: Won't Fix

no activity in four weeks

> ImplicitFeedbackAlternatingLeastSquaresSolver add support for user supplied 
> confidence functions
> 
>
> Key: MAHOUT-1453
> URL: https://issues.apache.org/jira/browse/MAHOUT-1453
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Adam Ilardi
>Assignee: Sebastian Schelter
>Priority: Minor
>  Labels: newbie, patch, performance
> Fix For: 1.0
>
>
> double confidence(double rating) {
>   return 1 + alpha * rating;
> }
> The original paper mentions other functions that could be used as well. At the 
> moment it's not easy for a user to change this without compiling the source.
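> As an illustration, the log-based confidence mentioned in the same paper is
> one alternative a user might want to plug in (a sketch; alpha and epsilon are
> user-chosen parameters, and the method name is made up):
> {code}
> double logConfidence(double rating) {
>   return 1 + alpha * Math.log(1 + rating / epsilon);
> }
> {code}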



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1427) Convert old .mapred API to new .mapreduce

2014-05-18 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001034#comment-14001034
 ] 

Sebastian Schelter commented on MAHOUT-1427:


[~smarthi] what's the status here?

> Convert old .mapred API to new .mapreduce
> -
>
> Key: MAHOUT-1427
> URL: https://issues.apache.org/jira/browse/MAHOUT-1427
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering, Integration
>Affects Versions: 0.9
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 1.0
>
> Attachments: Mahout-1427.patch
>
>
> Migrate code still using the old .mapred API to the new .mapreduce API.
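> A hedged sketch of the typical shape of such a migration (a word-count-style
> mapper; class names are illustrative):
> {code}
> // old API (org.apache.hadoop.mapred)
> public class OldStyleMapper extends MapReduceBase
>     implements Mapper<LongWritable, Text, Text, IntWritable> {
>   public void map(LongWritable key, Text value,
>       OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
>     output.collect(value, new IntWritable(1));
>   }
> }
> 
> // new API (org.apache.hadoop.mapreduce)
> public class NewStyleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
>   @Override
>   protected void map(LongWritable key, Text value, Context context)
>       throws IOException, InterruptedException {
>     context.write(value, new IntWritable(1));
>   }
> }
> {code}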



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1470) Topic dump

2014-05-18 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001035#comment-14001035
 ] 

Sebastian Schelter commented on MAHOUT-1470:


[~andrew.musselman] what's the status here?

> Topic dump
> --
>
> Key: MAHOUT-1470
> URL: https://issues.apache.org/jira/browse/MAHOUT-1470
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 1.0
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 1.0
>
>
> Per 
> http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCAMc_qaL2DCgbVbam2miNsLpa4qvaA9sMy1-arccF9Nz6ApcsvQ%40mail.gmail.com%3E
> > The script needs to be corrected to not call vectordump for LDA, as the
> > vectordump utility (and even clusterdump) is presently not capable of
> > displaying topics and relevant documents. I recall this issue was
> > previously reported by Peyman Faratin after the 0.9 release.
> >
> > Mahout is missing a clusterdump utility that reads in LDA
> > topics and the document-to-documentId mapping, and displays a report of the
> > topics and the documents that belong to each topic.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1545) Creating holdout sets with seq2sparse and split

2014-05-18 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1545.


   Resolution: Later
Fix Version/s: 1.0

Closing this for now; it stands as a reminder of things to do in the future.

> Creating holdout sets with seq2sparse and split
> ---
>
> Key: MAHOUT-1545
> URL: https://issues.apache.org/jira/browse/MAHOUT-1545
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, CLI, Examples
>Affects Versions: 0.9
>Reporter: Andrew Palumbo
> Fix For: 1.0
>
>
> The current method for vectorizing data using seq2sparse and then "split" 
> allows a large amount of information to spill over from the training sets 
> to the test sets, especially in the case of TF-IDF transformations. The IDF 
> transform provides a lot of information about the holdout set to the training 
> set if it is calculated prior to splitting them up.
> I'm not sure whether, given the current seq2sparse implementation's status as 
> legacy and the relatively minor advantage it might give, it is worth adding 
> something like a "split" option to SparseVectorsFromSequenceFiles.java. But I 
> know I saw a new implementation being discussed, and I think it would be 
> worthwhile to have an option like this built in.
> I think this issue may have been raised before, but I wanted to bring it up 
> again in light of the current move away from MapReduce and the new 
> implementations of Mahout tools that will be coming along.
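> Concretely, with the standard weighting (a sketch; N is the total number of
> documents and df(t) the number of documents containing term t):
> {code}
> \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
> {code}
> If N and df(t) are computed before the split, the weights of every training
> vector encode document frequencies from the held-out documents.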



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1514) Contact the original Random Forest author

2014-05-18 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1514.


Resolution: Won't Fix

no answer in four weeks

> Contact the original Random Forest author
> -
>
> Key: MAHOUT-1514
> URL: https://issues.apache.org/jira/browse/MAHOUT-1514
> Project: Mahout
>  Issue Type: Task
>    Reporter: Sebastian Schelter
>Priority: Critical
> Fix For: 1.0
>
>
> We should contact the original Random Forest author to ask about maintenance 
> of the implementation. Otherwise, this becomes a candidate for removal.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1522) Handle logging levels via log4j.xml

2014-05-18 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001010#comment-14001010
 ] 

Sebastian Schelter commented on MAHOUT-1522:


[~andrew.musselman] what's the status here?

> Handle logging levels via log4j.xml
> ---
>
> Key: MAHOUT-1522
> URL: https://issues.apache.org/jira/browse/MAHOUT-1522
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.9
>Reporter: Andrew Musselman
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 1.0
>
>
> We don't have a properties file to tell log4j what to do, so we inherit other 
> frameworks' settings.
> The suggestion is to add a log4j.xml file in a canonical place and set up 
> logging levels, perhaps separating out components for ease of setting levels 
> during debugging.
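> A minimal sketch of such a file (levels and logger categories are
> illustrative, not a settled proposal):
> {code}
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
> <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
>   <appender name="console" class="org.apache.log4j.ConsoleAppender">
>     <layout class="org.apache.log4j.PatternLayout">
>       <param name="ConversionPattern" value="%d %-5p %c - %m%n"/>
>     </layout>
>   </appender>
>   <!-- separate component levels for easier debugging -->
>   <logger name="org.apache.mahout.clustering">
>     <level value="info"/>
>   </logger>
>   <logger name="org.apache.mahout">
>     <level value="warn"/>
>   </logger>
>   <root>
>     <priority value="warn"/>
>     <appender-ref ref="console"/>
>   </root>
> </log4j:configuration>
> {code}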



--
This message was sent by Atlassian JIRA
(v6.2#6252)

