[jira] [Comment Edited] (MAHOUT-1490) Data frame R-like bindings

2014-05-03 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988750#comment-13988750
 ] 

Saikat Kanjilal edited comment on MAHOUT-1490 at 5/4/14 5:22 AM:
-

Added mltable operator functionality to the integration API section; also added 
an initial section on dplyr operations that could be useful within the context 
of a dataframe.


was (Author: kanjilal):
Added mltable operator functionality into integration API section

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis

2014-05-03 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988916#comment-13988916
 ] 

Saikat Kanjilal commented on MAHOUT-1541:
-

Pat,
Just one comment on the "no progress around the dataframes JIRA": I assume you 
are referring to 1490. There is in fact quite a bit of progress presenting APIs 
for a set of generic operations on a dataFrame. On Dmitriy's recommendation I 
took the path of creating a proposal rather than blasting off writing code that 
would be heavily criticized and fall short of committable expectations; this 
way the design will be in place and have general consensus before any coding 
begins. I'd love feedback from you and others to move 1490 along; please see 
the blog and comment on the JIRA if you'd like.

Regards

> Create CLI Driver for Spark Cooccurrence Analysis
> -
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an 
> IndexedDataset with BiMap ID translation dictionaries, call the Spark 
> CooccurrenceAnalysis with the appropriate params, then write output with 
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy mr does but will 
> support reading externally defined IDs and flexible formats. Output will be 
> of the legacy format or text files of the user's specification with 
> reattached Item IDs. 
> Support for legacy formats is a question, users can always use the legacy 
> code if they want this. Internal to the IndexedDataset is a Spark DRM so 
> pipelining can be accomplished without any writing to an actual file so the 
> legacy sequence file output may not be needed.
> Opinions?





RE: Helping out on spark efforts

2014-05-03 Thread Saikat Kanjilal
Me again :). As promised, I added a subset of the definitions from the dplyr 
functionality to the integration API section; examples include compute, filter, 
chain, etc. My next steps will be adding concrete examples under each of the 
newly created integration APIs. At a high level, here are the domain objects I 
think will need to exist and be referenced in the DataFrame world:

- DataFrame (self explanatory)
- Query (a generalized abstraction around a query; could represent a SQL/NoSQL 
  query or an HDFS query)
- RDD (an important domain object that could be returned by one or more of our 
  APIs)
- Destination (a remote data destination; could be a table, a location in HDFS, 
  etc.)
- Connection (a remote database connection used to perform transactional 
  operations between the dataFrame and a remote database)
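As a hedged illustration, those domain objects might be sketched in Scala 
roughly as follows. None of these types exist in Mahout today; every name here 
is hypothetical, intended only to show how the pieces could relate:

```scala
// Illustrative Scala sketch only: none of these types exist in Mahout today.
// They model the domain objects proposed above for the dataframe work.

// A generalized query abstraction: could wrap SQL, NoSQL, or an HDFS path.
sealed trait Query
case class SqlQuery(statement: String) extends Query
case class HdfsQuery(path: String) extends Query

// A remote data destination, e.g. a table or an HDFS location.
case class Destination(uri: String)

// A remote database connection for transactional operations between a
// dataframe and an external store.
case class Connection(url: String, user: String)

// A toy stand-in for a distributed dataset handle (in practice a Spark RDD).
case class Rdd[A](rows: Seq[A])

// The dataframe itself, with a couple of dplyr-flavoured operations
// (filter and a chainable compute step) to show the intended call style.
case class DataFrame(columns: Seq[String], rows: Seq[Seq[Any]]) {
  def filter(p: Seq[Any] => Boolean): DataFrame = copy(rows = rows.filter(p))
  def compute[A](f: Seq[Seq[Any]] => A): A = f(rows)
  def toRdd: Rdd[Seq[Any]] = Rdd(rows)
}

object DataFrameDemo {
  def main(args: Array[String]): Unit = {
    val df = DataFrame(Seq("user", "score"), Seq(Seq("u1", 3), Seq("u2", 7)))
    val high = df.filter(_(1).asInstanceOf[Int] > 5) // dplyr-style filter
    println(high.rows) // List(List(u2, 7))
  }
}
```

The point of the sketch is the shape of the API, not the implementation: each 
operation returns a value that can be chained into the next, which is what the 
dplyr-style compute/filter/chain verbs would need.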
An additional thought: at some point we might want to operate on matrices and 
perform mathematical operations combining matrices and dataFrames. I'd love to 
hear from committers whether this would be useful, and I can add APIs around it 
as well.
One thing I've also been pondering is whether and how to handle errors in these 
APIs. One thought is to introduce a generalized error object that can be reused 
across all of the APIs, perhaps containing a message and an error code; an 
alternative is to leverage something already existing in the Spark bindings, if 
possible.
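For instance, the generalized error object might look like this minimal Scala 
sketch (the names are hypothetical, not an existing Mahout or Spark API):

```scala
// Illustrative sketch of the generalized error object floated above; the
// names are hypothetical, not an existing Mahout or Spark API.
case class ApiError(code: Int, message: String)

object ErrorDemo {
  // APIs could return Either[ApiError, A], so every operation reports
  // failures uniformly instead of each API inventing its own scheme.
  def parseColumnCount(header: String): Either[ApiError, Int] =
    if (header.isEmpty) Left(ApiError(400, "empty header"))
    else Right(header.split(",").length)

  def main(args: Array[String]): Unit = {
    println(parseColumnCount("user,item,score")) // Right(3)
    println(parseColumnCount(""))                // Left(ApiError(400,empty header))
  }
}
```

Returning `Either` keeps error handling composable across chained dataframe 
operations, which seems closer in spirit to the dplyr-style API than throwing 
exceptions from each verb.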
I'd love for folks to look through the APIs as I expand them and add more 
examples, and to leave comments on the JIRA ticket. Also, since the 
slicing/CRUD functionality around dataFrames is pretty commonly understood, I 
may take those examples out and put more examples in around the APIs for dplyr 
and mltables.
Blog: http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
JIRA: https://issues.apache.org/jira/browse/MAHOUT-1490

Regards



> From: sxk1...@hotmail.com
> To: dev@mahout.apache.org
> Subject: RE: Helping out on spark efforts
> Date: Sat, 3 May 2014 10:09:51 -0700
> 
> I've taken a stab at adding a subset of the functionality used by MLTable 
> operators into the blog on top of the R CRUD functionality I listed earlier 
> into the integration API section of the blog, please review and let me know 
> your thoughts, will be tackling the dplyr functionality next and adding that 
> in , blog is shown below, again please see the integration API section for 
> details:
> 
> http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
> 
> Look forward to hearing comments either on the list or on the JIRA ticket itself:
> https://issues.apache.org/jira/browse/MAHOUT-1490
> Thanks in advance.
> 
> > Date: Wed, 30 Apr 2014 17:13:52 +0200
> > From: s...@apache.org
> > To: ted.dunn...@gmail.com; dev@mahout.apache.org
> > Subject: Re: Helping out on spark efforts
> > 
> > I think getting the design right for MAHOUT-1490 is tough. Dmitriy 
> > suggested to update the design example to Scala code and try to work in 
> > things that fit from dplyr in R and MLTable. I'd love to see such a 
> > design doc.
> > 
> > --sebastian
> > 
> > On 04/30/2014 05:02 PM, Ted Dunning wrote:
> > > +1 for foundations first.
> > >
> > > There are bunches of algorithms just behind that.  K-means.  SGD+Adagrad
> > > regression.  Autoencoders.  K-sparse encoding.  Lots of stuff.
> > >
> > >
> > >
> > > On Wed, Apr 30, 2014 at 4:52 PM, Sebastian Schelter  
> > > wrote:
> > >
> > >> I think you should concentrate on MAHOUT-1490, that is a highly important
> > >> task that will be the foundation for a lot of stuff to be built on top.
> > >> Let's focus on getting this thing right and then move on to other things.
> > >>
> > >> --sebastian
> > >>
> > >>
> > >> On 04/30/2014 04:44 PM, Saikat Kanjilal wrote:
> > >>
> > >>> Sebastian/Dmitriy, in looking through the current list of issues I didn't
> > >>> see other algorithms in Mahout being discussed for porting to Spark.
> > >>> I was wondering if there's any interest/need in porting or writing things
> > >>> like LR/KMeans/SVM to use Spark; I'd like to help out in this area while
> > >>> working on 1490.  Also, are we planning to port the distributed versions
> > >>> of Taste to use Spark as well at some point?
> > >>> Thanks in advance.
> > >>>
> > >>>
> > >>
> > >
> > 
> 
  

[jira] [Resolved] (MAHOUT-1518) Preprocessing for collaborative filtering with the Scala DSL

2014-05-03 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-1518.


Resolution: Fixed

Work on this is being generalized to support other job types and flexible 
import/export. A version for cooccurrence is in MAHOUT-1541.

> Preprocessing for collaborative filtering with the Scala DSL
> 
>
> Key: MAHOUT-1518
> URL: https://issues.apache.org/jira/browse/MAHOUT-1518
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of 
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented 
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_, 
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, “view"
> timestamp2, userIdString2, itemIdString1, “like"
> {code}
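Reading such a file into interaction tuples might look roughly like the 
following Scala sketch (illustrative only; this is not the actual MAHOUT-1518 
code, and the names are made up):

```scala
// A minimal sketch (not the actual MAHOUT-1518 code) of reading the CSV
// schema above into (timestamp, user, item, action) interaction tuples.
object CsvDemo {
  case class Interaction(timestamp: String, userId: String, itemId: String, action: String)

  def parse(lines: Seq[String]): Seq[Interaction] =
    lines.map { line =>
      // Split on the comma and trim the optional spaces after it.
      val Array(ts, user, item, action) = line.split(",").map(_.trim)
      Interaction(ts, user, item, action)
    }

  def main(args: Array[String]): Unit = {
    val interactions = parse(Seq(
      "timestamp1, userIdString1, itemIdString1, view",
      "timestamp2, userIdString2, itemIdString1, like"))
    // Keep only "like" actions, as a preprocessing step might before
    // handing the data to cooccurrence analysis.
    val likes = interactions.filter(_.action == "like")
    println(likes.map(i => (i.userId, i.itemId)))
    // List((userIdString2,itemIdString1))
  }
}
```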





[jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis

2014-05-03 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988768#comment-13988768
 ] 

Pat Ferrel commented on MAHOUT-1541:


Good. The only reason to do this that I can think of is so it will fit into 
existing workflows, where the legacy code already fits.

The first version of this will be for text-delimited import/export; I assume 
other formats, like JSON, may be nice. Any guidance here would be appreciated.

> Create CLI Driver for Spark Cooccurrence Analysis
> -
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an 
> IndexedDataset with BiMap ID translation dictionaries, call the Spark 
> CooccurrenceAnalysis with the appropriate params, then write output with 
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy mr does but will 
> support reading externally defined IDs and flexible formats. Output will be 
> of the legacy format or text files of the user's specification with 
> reattached Item IDs. 
> Support for legacy formats is a question, users can always use the legacy 
> code if they want this. Internal to the IndexedDataset is a Spark DRM so 
> pipelining can be accomplished without any writing to an actual file so the 
> legacy sequence file output may not be needed.
> Opinions?





[jira] [Comment Edited] (MAHOUT-1490) Data frame R-like bindings

2014-05-03 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988750#comment-13988750
 ] 

Saikat Kanjilal edited comment on MAHOUT-1490 at 5/3/14 5:10 PM:
-

Added mltable operator functionality into integration API section


was (Author: kanjilal):
Added mtable operator functionality into integration API section

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark





[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-03 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988750#comment-13988750
 ] 

Saikat Kanjilal commented on MAHOUT-1490:
-

Added mtable operator functionality into integration API section

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark





RE: Helping out on spark efforts

2014-05-03 Thread Saikat Kanjilal
I've taken a stab at adding a subset of the functionality used by MLTable 
operators to the integration API section of the blog, on top of the R CRUD 
functionality I listed earlier. Please review and let me know your thoughts; I 
will be tackling the dplyr functionality next and adding that in. The blog is 
linked below; again, please see the integration API section for details:

http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html

I look forward to hearing comments either on the list or on the JIRA ticket itself:
https://issues.apache.org/jira/browse/MAHOUT-1490
Thanks in advance.

> Date: Wed, 30 Apr 2014 17:13:52 +0200
> From: s...@apache.org
> To: ted.dunn...@gmail.com; dev@mahout.apache.org
> Subject: Re: Helping out on spark efforts
> 
> I think getting the design right for MAHOUT-1490 is tough. Dmitriy 
> suggested to update the design example to Scala code and try to work in 
> things that fit from dplyr in R and MLTable. I'd love to see such a 
> design doc.
> 
> --sebastian
> 
> On 04/30/2014 05:02 PM, Ted Dunning wrote:
> > +1 for foundations first.
> >
> > There are bunches of algorithms just behind that.  K-means.  SGD+Adagrad
> > regression.  Autoencoders.  K-sparse encoding.  Lots of stuff.
> >
> >
> >
> > On Wed, Apr 30, 2014 at 4:52 PM, Sebastian Schelter  wrote:
> >
> >> I think you should concentrate on MAHOUT-1490, that is a highly important
> >> task that will be the foundation for a lot of stuff to be built on top.
> >> Let's focus on getting this thing right and then move on to other things.
> >>
> >> --sebastian
> >>
> >>
> >> On 04/30/2014 04:44 PM, Saikat Kanjilal wrote:
> >>
> >>> Sebastian/Dmitriy, in looking through the current list of issues I didn't
> >>> see other algorithms in Mahout being discussed for porting to Spark.
> >>> I was wondering if there's any interest/need in porting or writing things
> >>> like LR/KMeans/SVM to use Spark; I'd like to help out in this area while
> >>> working on 1490.  Also, are we planning to port the distributed versions of
> >>> Taste to use Spark as well at some point?
> >>> Thanks in advance.
> >>>
> >>>
> >>
> >
> 
  

[jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis

2014-05-03 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988743#comment-13988743
 ] 

Ted Dunning commented on MAHOUT-1541:
-

{quote}
BTW do we really need to support sequencefiles where the legacy code does?
{quote}

I sincerely hope not.

> Create CLI Driver for Spark Cooccurrence Analysis
> -
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an 
> IndexedDataset with BiMap ID translation dictionaries, call the Spark 
> CooccurrenceAnalysis with the appropriate params, then write output with 
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy mr does but will 
> support reading externally defined IDs and flexible formats. Output will be 
> of the legacy format or text files of the user's specification with 
> reattached Item IDs. 
> Support for legacy formats is a question, users can always use the legacy 
> code if they want this. Internal to the IndexedDataset is a Spark DRM so 
> pipelining can be accomplished without any writing to an actual file so the 
> legacy sequence file output may not be needed.
> Opinions?





[jira] [Comment Edited] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis

2014-05-03 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988735#comment-13988735
 ] 

Pat Ferrel edited comment on MAHOUT-1541 at 5/3/14 4:21 PM:


Agreed (mostly); only the CLI is custom for each algo.

The preprocessor was a remnant of your old example patch and isn't meant to be 
repeated. I'm not planning to have separate code for every algo at all; in 
fact, it should be quite the opposite. There will be a custom CLI for each algo 
and one of a couple of customizable but general-purpose importer/exporters 
(text delimited, sequence file?) with some method of specifying input and 
output schema.

The IndexedDataset would be identical in structure in all cases. I should have 
some of the IndexedDataset improvements (mostly BiMaps) done today, and I'm 
willing to merge them with some other dataframe in the future.

What I am doing is exactly what we agreed to in MAHOUT-1518. There is another 
JIRA about dataframes, but I wasn't aware of any progress made on it. I don't 
want to "wait"; I only have limited windows of time, and if I wait I may get 
nothing done. And I could use this right now to rebuild the Solr recommender 
and the other Mahout recommenders. This work seems at worst independent of some 
other R-like dataframe, or at best can be integrated as that solidifies.

In the meantime, any suggestions about using another effort, like some usable 
dataframe-ish object, are fine. I had thought we'd convinced ourselves that the 
needs of an R-like dataframe and an import/export IndexedDataset were too 
different; Dmitriy certainly made strong arguments to that effect.

I'm just using the cooccurrence analysis to have an end-to-end example.

BTW, do we really need to support sequence files where the legacy code does?



was (Author: pferrel):
Agreed (mostly); only the CLI is custom for each algo.

The preprocessor was a remnant of your old example patch. I'm not planning to 
have separate code for every algo at all; in fact, it should be quite the 
opposite. There will be a custom CLI for each algo and one of a couple of 
customizable but general-purpose importer/exporters (text delimited, sequence 
file?) with some method of specifying input and output schema.

The IndexedDataset would be identical in structure in all cases. I should have 
some of the IndexedDataset improvements (mostly BiMaps) done today, and I'm 
willing to merge them with some other dataframe in the future.

What I am doing is exactly what we agreed to in MAHOUT-1518. There is another 
JIRA about dataframes, but I wasn't aware of any progress made on it. I don't 
want to "wait"; I only have limited windows of time, and if I wait I may get 
nothing done. And I could use this right now to rebuild the Solr recommender 
and the other Mahout recommenders. This work seems at worst independent of some 
other R-like dataframe, or at best can be integrated as that solidifies.

In the meantime, any suggestions about using another effort, like some usable 
dataframe-ish object, are fine. I had thought we'd convinced ourselves that the 
needs of an R-like dataframe and an import/export IndexedDataset were too 
different; Dmitriy certainly made strong arguments to that effect.

I'm just using the cooccurrence analysis to have an end-to-end example.

BTW, do we really need to support sequence files where the legacy code does?


> Create CLI Driver for Spark Cooccurrence Analysis
> -
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an 
> IndexedDataset with BiMap ID translation dictionaries, call the Spark 
> CooccurrenceAnalysis with the appropriate params, then write output with 
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy mr does but will 
> support reading externally defined IDs and flexible formats. Output will be 
> of the legacy format or text files of the user's specification with 
> reattached Item IDs. 
> Support for legacy formats is a question, users can always use the legacy 
> code if they want this. Internal to the IndexedDataset is a Spark DRM so 
> pipelining can be accomplished without any writing to an actual file so the 
> legacy sequence file output may not be needed.
> Opinions?





[jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis

2014-05-03 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988735#comment-13988735
 ] 

Pat Ferrel commented on MAHOUT-1541:


Agreed (mostly); only the CLI is custom for each algo.

The preprocessor was a remnant of your old example patch. I'm not planning to 
have separate code for every algo at all; in fact, it should be quite the 
opposite. There will be a custom CLI for each algo and one of a couple of 
customizable but general-purpose importer/exporters (text delimited, sequence 
file?) with some method of specifying input and output schema.
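A schema spec for such a general-purpose importer might be sketched like this 
(purely hypothetical names; nothing here is an existing Mahout API — it only 
illustrates the "specify input schema plus BiMap ID dictionaries" idea):

```scala
// Hypothetical sketch of a delimited-text input schema for the generic
// importer discussed above; none of these names exist in Mahout.
case class Schema(
  delimiter: String = "\t",
  rowIdColumn: Int = 0,    // column holding the external row (user) ID
  columnIdColumn: Int = 1) // column holding the external column (item) ID

object ImportDemo {
  // BiMap-style dictionaries: external string ID <-> internal int index.
  def index(ids: Seq[String]): (Map[String, Int], Map[Int, String]) = {
    val toInt = ids.distinct.zipWithIndex.toMap
    (toInt, toInt.map(_.swap))
  }

  def main(args: Array[String]): Unit = {
    val schema = Schema(delimiter = ",")
    val rows = Seq("u1,i1", "u2,i1", "u1,i2").map(_.split(schema.delimiter).toSeq)
    val (userToInt, intToUser) = index(rows.map(_(schema.rowIdColumn)))
    println(userToInt) // Map(u1 -> 0, u2 -> 1)
    // Round-trip an external ID through the BiMap-style pair.
    assert(intToUser(userToInt("u2")) == "u2")
  }
}
```

The two maps together play the role of the BiMap translation dictionaries: one 
direction is used at import time to build the internal matrix, and the reverse 
direction reattaches external IDs at export time.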

The IndexedDataset would be identical in structure in all cases. Should have 
some of the IndexedDataset improvements (mostly BiMaps) today and I'm willing 
to merge them with some other dataframe in the future. 

What I am doing is exactly what we agreed to in MAHOUT-1518 There is another 
Jira about dataframes but I wasn't aware of any progress made on it. Don't want 
to "wait" I only have limited time in windows, if I wait I may get nothing 
done. And I could use this right now to rebuild the solr recommender and the 
other Mahout recommenders. This work seems at worse independant of some other 
r-like dataframe, or a best can be integrated as that solidifies.

In the meantime any suggestions about using another effort like some usable 
dataframe-ish object is fine. I had though we'd convinced ourselves that the 
needs of an r-like dataframe and an import/export IndexedDataset were too 
different. Dmitriy certainly made strong arguments to that effect.

Just using the cooccurrence analysis to have an end to end example.

BTW do we really need to support sequencefiles where the legacy code does?


> Create CLI Driver for Spark Cooccurrence Analysis
> -
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an 
> IndexedDataset with BiMap ID translation dictionaries, call the Spark 
> CooccurrenceAnalysis with the appropriate params, then write output with 
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy mr does but will 
> support reading externally defined IDs and flexible formats. Output will be 
> of the legacy format or text files of the user's specification with 
> reattached Item IDs. 
> Support for legacy formats is a question, users can always use the legacy 
> code if they want this. Internal to the IndexedDataset is a Spark DRM so 
> pipelining can be accomplished without any writing to an actual file so the 
> legacy sequence file output may not be needed.
> Opinions?





Re: [jira] [Updated] (MAHOUT-1428) Recommending already consumed items

2014-05-03 Thread dodi hakim
Sorry, where is it? Is it in the Submit Patch link? It's gone now.


On 4 May 2014 00:27, Sebastian Schelter  wrote:

> You have to click on More > Attach Files.
>
> Best,
> Sebastian
>
>
> On 05/03/2014 04:25 PM, dodi hakim wrote:
>
>> How to submit a patch for this?
>>
>>
>> On 4 May 2014 00:21, Anonymous (JIRA)  wrote:
>>
>>
>>>   [
>>> https://issues.apache.org/jira/browse/MAHOUT-1428?page=
>>> com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>>
>>> Anonymous updated MAHOUT-1428:
>>> --
>>>
>>> Labels: easyfix  (was: )
>>>  Affects Version/s: 1.0
>>> Status: Patch Available  (was: Reopened)
>>>
>>>  Recommending already consumed items
 ---

  Key: MAHOUT-1428
  URL: https://issues.apache.org/jira/browse/MAHOUT-1428
  Project: Mahout
   Issue Type: Bug
   Components: Collaborative Filtering
 Affects Versions: 1.0
 Reporter: Mario Levitin
   Labels: easyfix
  Fix For: 1.0


 Mahout does not recommend items which are already consumed by the user.
 For example,
 In the getAllOtherItems method of GenericUserBasedRecommender class

>>> there is the following line
>>>
 possibleItemIDs.removeAll(dataModel.getItemIDsFromUser(theUserID));
 which removes user's items from the possibleItemIDs to prevent these

>>> items from being recommended to the user. This is ok for many
>>> recommendation cases but for many other cases it is not.
>>>
 The Recommender classes  (I mean all of them, NN-based and SVD-based as

>>> well as hadoop and non-hadoop versions) might have a parameter for this
>>> for
>>> excluding or not excluding user items in the returned recommendations.
>>>
>>>
>>>
>>> --
>>> This message was sent by Atlassian JIRA
>>> (v6.2#6252)
>>>
>>>
>>
>>
>>
>


-- 
Best Regards,

Dodi Amar Hakim


Re: [jira] [Updated] (MAHOUT-1428) Recommending already consumed items

2014-05-03 Thread Sebastian Schelter

You have to click on More > Attach Files.

Best,
Sebastian

On 05/03/2014 04:25 PM, dodi hakim wrote:

How to submit a patch for this?


On 4 May 2014 00:21, Anonymous (JIRA)  wrote:



  [
https://issues.apache.org/jira/browse/MAHOUT-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]

Anonymous updated MAHOUT-1428:
--

Labels: easyfix  (was: )
 Affects Version/s: 1.0
Status: Patch Available  (was: Reopened)


Recommending already consumed items
---

 Key: MAHOUT-1428
 URL: https://issues.apache.org/jira/browse/MAHOUT-1428
 Project: Mahout
  Issue Type: Bug
  Components: Collaborative Filtering
Affects Versions: 1.0
Reporter: Mario Levitin
  Labels: easyfix
 Fix For: 1.0


Mahout does not recommend items which are already consumed by the user.
For example,
In the getAllOtherItems method of GenericUserBasedRecommender class

there is the following line

possibleItemIDs.removeAll(dataModel.getItemIDsFromUser(theUserID));
which removes user's items from the possibleItemIDs to prevent these

items from being recommended to the user. This is ok for many
recommendation cases but for many other cases it is not.

The Recommender classes  (I mean all of them, NN-based and SVD-based as

well as hadoop and non-hadoop versions) might have a parameter for this for
excluding or not excluding user items in the returned recommendations.












Re: [jira] [Updated] (MAHOUT-1428) Recommending already consumed items

2014-05-03 Thread dodi hakim
How to submit a patch for this?


On 4 May 2014 00:21, Anonymous (JIRA)  wrote:

>
>  [
> https://issues.apache.org/jira/browse/MAHOUT-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Anonymous updated MAHOUT-1428:
> --
>
>Labels: easyfix  (was: )
> Affects Version/s: 1.0
>Status: Patch Available  (was: Reopened)
>
> > Recommending already consumed items
> > ---
> >
> > Key: MAHOUT-1428
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1428
> > Project: Mahout
> >  Issue Type: Bug
> >  Components: Collaborative Filtering
> >Affects Versions: 1.0
> >Reporter: Mario Levitin
> >  Labels: easyfix
> > Fix For: 1.0
> >
> >
> > Mahout does not recommend items which are already consumed by the user.
> > For example,
> > In the getAllOtherItems method of GenericUserBasedRecommender class
> there is the following line
> > possibleItemIDs.removeAll(dataModel.getItemIDsFromUser(theUserID));
> > which removes user's items from the possibleItemIDs to prevent these
> items from being recommended to the user. This is ok for many
> recommendation cases but for many other cases it is not.
> > The Recommender classes  (I mean all of them, NN-based and SVD-based as
> well as hadoop and non-hadoop versions) might have a parameter for this for
> excluding or not excluding user items in the returned recommendations.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>



-- 
Best Regards,

Dodi Amar Hakim


[jira] [Updated] (MAHOUT-1428) Recommending already consumed items

2014-05-03 Thread Anonymous (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anonymous updated MAHOUT-1428:
--

   Labels: easyfix  (was: )
Affects Version/s: 1.0
   Status: Patch Available  (was: Reopened)

> Recommending already consumed items
> ---
>
> Key: MAHOUT-1428
> URL: https://issues.apache.org/jira/browse/MAHOUT-1428
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 1.0
>Reporter: Mario Levitin
>  Labels: easyfix
> Fix For: 1.0
>
>
> Mahout does not recommend items which are already consumed by the user.
> For example,
> In the getAllOtherItems method of GenericUserBasedRecommender class there is 
> the following line
> possibleItemIDs.removeAll(dataModel.getItemIDsFromUser(theUserID));  
> which removes user's items from the possibleItemIDs to prevent these items 
> from being recommended to the user. This is ok for many recommendation cases 
> but for many other cases it is not. 
> The Recommender classes  (I mean all of them, NN-based and SVD-based as well 
> as hadoop and non-hadoop versions) might have a parameter for this for 
> excluding or not excluding user items in the returned recommendations.
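The parameter proposed in the issue could be sketched as follows. Plain Scala 
collections stand in for Mahout's DataModel here, and the flag name is 
hypothetical; the first branch mirrors the existing `removeAll` behaviour 
quoted above:

```scala
// Hedged sketch of the proposed parameter: a flag controlling whether
// already-consumed items are removed from the candidate set. Plain Scala
// collections stand in for Mahout's DataModel; the flag name is hypothetical.
object RecommendDemo {
  def candidateItems(allItems: Set[String],
                     consumed: Set[String],
                     excludeConsumed: Boolean): Set[String] =
    if (excludeConsumed) allItems -- consumed // current behaviour (removeAll)
    else allItems                             // proposed opt-out

  def main(args: Array[String]): Unit = {
    val all = Set("i1", "i2", "i3")
    val seen = Set("i2")
    println(candidateItems(all, seen, excludeConsumed = true))  // Set(i1, i3)
    println(candidateItems(all, seen, excludeConsumed = false)) // Set(i1, i2, i3)
  }
}
```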





Re: [jira] [Commented] (MAHOUT-1446) Create an intro for matrix factorization

2014-05-03 Thread Sebastian Schelter
No, that list is complete.
On 03.05.2014 15:31, "jian wang (JIRA)" wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988686#comment-13988686]
>
> jian wang commented on MAHOUT-1446:
> ---
>
> The factorizers till now i have found out include:
>
> RatingSGDFactorizer, SVDPlusPlusFactorizer, ParallelSGDFactorizer and
> ALSWRFactorizer?
>
> Do I miss any?
>
> > Create an intro for matrix factorization
> > 
> >
> > Key: MAHOUT-1446
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1446
> > Project: Mahout
> >  Issue Type: New Feature
> >  Components: Documentation
> >Reporter: Maciej Mazur
> > Fix For: 1.0
> >
> >
>
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>


[jira] [Commented] (MAHOUT-1446) Create an intro for matrix factorization

2014-05-03 Thread jian wang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988686#comment-13988686
 ] 

jian wang commented on MAHOUT-1446:
---

The factorizers I have found so far include:

RatingSGDFactorizer, SVDPlusPlusFactorizer, ParallelSGDFactorizer and 
ALSWRFactorizer.

Am I missing any?

> Create an intro for matrix factorization
> 
>
> Key: MAHOUT-1446
> URL: https://issues.apache.org/jira/browse/MAHOUT-1446
> Project: Mahout
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Maciej Mazur
> Fix For: 1.0
>
>






[jira] [Commented] (MAHOUT-1441) Add documentation for Spectral KMeans to Mahout Website

2014-05-03 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988666#comment-13988666
 ] 

Shannon Quinn commented on MAHOUT-1441:
---

If no one has any objections in the next couple of days, I can close this 
ticket.

> Add documentation for Spectral KMeans to Mahout Website
> ---
>
> Key: MAHOUT-1441
> URL: https://issues.apache.org/jira/browse/MAHOUT-1441
> Project: Mahout
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.0
>Reporter: Suneel Marthi
>Assignee: Shannon Quinn
> Fix For: 1.0
>
> Attachments: MAHOUT-1441.diff
>
>
> Need to update the Website with Design, user guide and any relevant 
> documentation for Spectral KMeans clustering.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1537) minor fixes to spark-shell

2014-05-03 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988653#comment-13988653
 ] 

Sebastian Schelter commented on MAHOUT-1537:


Works for me now as well on Ubuntu, thanks Anand.

> minor fixes to spark-shell
> --
>
> Key: MAHOUT-1537
> URL: https://issues.apache.org/jira/browse/MAHOUT-1537
> Project: Mahout
>  Issue Type: Bug
>Reporter: Anand Avati
>Assignee: Dmitriy Lyubimov
>Priority: Minor
> Fix For: 1.0
>
> Attachments: 0001-spark-shell-unclutter-terminal-on-exit.patch, 
> 0001-spark-shell-unclutter-terminal-on-exit.patch, 
> 0001-spark-shell-unclutter-terminal-on-exit.patch, 
> 0002-spark-shell-aesthetic-fix.patch
>
>
> The terminal clutters up after exiting the Spark shell (because terminal 
> settings are changed within the Spark REPL).
> Save and restore the system terminal settings to avoid the clutter.
> Also a minor fix to the prompt styling.
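The save/restore idea the patch describes can be sketched in Python (illustrative only; the actual patch is against Mahout's Scala shell, and the helper name here is made up). On POSIX systems the terminal attributes are snapshotted with `termios` before entering the REPL and restored on exit:

```python
import sys
import termios

def with_restored_terminal(body):
    """Run body(), then restore the terminal to its saved state.

    Mirrors the idea in the patch: snapshot the tty attributes before
    entering a REPL that may change them, and restore them afterwards.
    Degrades to a no-op when stdin is not a real terminal.
    """
    saved = None
    fd = None
    try:
        fd = sys.stdin.fileno()
        saved = termios.tcgetattr(fd)        # snapshot current settings
    except (termios.error, OSError, ValueError, AttributeError):
        saved = None                         # not a tty: nothing to restore
    try:
        return body()                        # e.g. run the interactive shell
    finally:
        if saved is not None:
            termios.tcsetattr(fd, termios.TCSADRAIN, saved)

result = with_restored_terminal(lambda: "shell-exit")
```

The `finally` block is what matters: the settings come back even if the shell exits via an exception.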



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1542) Tutorial for playing with Mahout's Spark shell

2014-05-03 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988639#comment-13988639
 ] 

Sebastian Schelter commented on MAHOUT-1542:


Updated tutorial to also mention caching.

> Tutorial for playing with Mahout's Spark shell
> --
>
> Key: MAHOUT-1542
> URL: https://issues.apache.org/jira/browse/MAHOUT-1542
> Project: Mahout
>  Issue Type: Improvement
>  Components: Documentation, Math
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Fix For: 1.0
>
>
> I have created a tutorial for setting up the Spark shell and implementing a 
> simple linear regression algorithm. I'd love to make this part of the 
> website; could someone give it a review?
> https://github.com/sscdotopen/krams/blob/master/linear-regression-cereals.md
> PS: If you want to try out the code, you have to apply the patch from 
> MAHOUT-1532 to your sources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-03 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988626#comment-13988626
 ] 

Sebastian Schelter commented on MAHOUT-1529:


I just know that it was discussed during their graduation from the incubator. 
We could simply ask on their mailing list how they do it.

--sebastian




> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> (1) checkpoint() accepts Spark constant StorageLevel directly;
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.
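Point (1) above can be addressed with a small indirection layer: the algorithm-facing API takes a backend-neutral cache hint, and only the Spark binding translates it into a StorageLevel. A hedged Python sketch of that pattern (the names `CacheHint` and `checkpoint` are illustrative here, not a statement of Mahout's actual API, and a plain string stands in for Spark's StorageLevel):

```python
from enum import Enum

class CacheHint(Enum):
    """Backend-neutral caching levels exposed by the algebra API."""
    NONE = 0
    MEMORY_ONLY = 1
    MEMORY_AND_DISK = 2

# Only the Spark-specific binding knows about Spark's storage levels;
# strings stand in for org.apache.spark.storage.StorageLevel constants.
_SPARK_STORAGE_LEVELS = {
    CacheHint.NONE: "NONE",
    CacheHint.MEMORY_ONLY: "MEMORY_ONLY",
    CacheHint.MEMORY_AND_DISK: "MEMORY_AND_DISK",
}

def checkpoint(drm, hint=CacheHint.MEMORY_ONLY):
    """Algorithm-facing checkpoint(): no Spark types in the signature."""
    level = _SPARK_STORAGE_LEVELS[hint]   # translation lives in the backend
    return {"data": drm, "storage_level": level}

result = checkpoint([[1, 2], [3, 4]], CacheHint.MEMORY_AND_DISK)
```

The same indirection works for points (2) through (5): each backend package owns the mapping from neutral handles (checkpointed DRMs, broadcasts) to its engine-specific objects.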



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-03 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988621#comment-13988621
 ] 

Ted Dunning commented on MAHOUT-1529:
-

{quote}
BTW I truly envy the Spark process, which is 100% handled by GitHub pull 
requests. Does anybody know how they manage to push it back to Apache from 
there?
{quote}

It would require that we switch over to git for the main mahout repo.

If we want to discuss that, we should move to a separate thread.

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> (1) checkpoint() accepts Spark constant StorageLevel directly;
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAHOUT-1542) Tutorial for playing with Mahout's Spark shell

2014-05-03 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1542:
--

 Summary: Tutorial for playing with Mahout's Spark shell
 Key: MAHOUT-1542
 URL: https://issues.apache.org/jira/browse/MAHOUT-1542
 Project: Mahout
  Issue Type: Improvement
  Components: Documentation, Math
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
 Fix For: 1.0


I have created a tutorial for setting up the Spark shell and implementing a 
simple linear regression algorithm. I'd love to make this part of the website; 
could someone give it a review?

https://github.com/sscdotopen/krams/blob/master/linear-regression-cereals.md

PS: If you want to try out the code, you have to apply the patch from 
MAHOUT-1532 to your sources.
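The core computation of such a tutorial, fitting a linear model by solving the normal equations (XᵀX)β = Xᵀy, can be sketched in plain Python. The cereal data and the Scala DSL calls are in the linked tutorial; this is just the math on made-up numbers, with a hand-rolled Gaussian elimination standing in for the solve() of MAHOUT-1532:

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def gauss_solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back-substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

# Synthetic data from y = 2*x + 1; design matrix has a bias column [1, x].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

XtX = matmul(transpose(X), X)                          # 2x2 normal-equations matrix
Xty = [sum(X[i][j] * y[i] for i in range(len(y))) for j in range(2)]
beta = gauss_solve(XtX, Xty)                           # beta = [1.0, 2.0]
```

In the Scala DSL the same steps reduce to a transpose, two products, and one solve() call on the in-core result.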



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1532) Add solve() function to the Scala DSL

2014-05-03 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1532:
---

Attachment: MAHOUT-1532.patch

Updated patch according to Dmitriy's comments.

> Add solve() function to the Scala DSL 
> --
>
> Key: MAHOUT-1532
> URL: https://issues.apache.org/jira/browse/MAHOUT-1532
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1532.patch, MAHOUT-1532.patch
>
>
> We should add a solve() function to the Scala DSL that helps with solving Ax 
> = b for in-core matrices and vectors.



--
This message was sent by Atlassian JIRA
(v6.2#6252)