Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

2017-02-07 Thread Saikat Kanjilal
@Trevor Grant

The landscape in machine learning is getting more and more diluted with lots of 
tools. Here's a question: given that some folks are taking R and connecting it 
to Spark and MapReduce to make the R algorithms work at scale 
(https://msdn.microsoft.com/en-us/microsoft-r/scaler/scaler), what would be the 
additional value added in porting the R code using the algorithms/Samsara 
framework? To me the MRS efforts and the approach you are proposing are two 
parallel tracks.

As far as the barriers to entry to contributing, I think it's largely due to 
the complexity of the codebase and the lack of familiarity with Samsara. I'd 
love to help create some good docs/tutorials on both the algorithms framework 
and Samsara when and where it makes sense. However, I feel like it'd be useful 
to really identify the use cases where the algorithms/Samsara approach has 
clear wins versus MRS with Spark, Spark by itself, or Python/scikit-learn. 
I've found that in general people don't really need custom algorithms in data 
science; they typically are answering some very basic classification or 
clustering question and can use linear/logistic regression or a variant of 
k-means. I'd also like to help dig into some use cases with Samsara and maybe 
put those use cases in the examples section.


Thoughts?


From: Trevor Grant 
Sent: Tuesday, February 7, 2017 8:47 AM
To: user@mahout.apache.org; isa...@apache.org
Subject: Re: Mahout ML vs Spark MLlib vs Mahout-Spark integration

The idea that Andy briefly touched on is that the Algorithm Framework
(hopefully) paves the way for R/CRAN-like user contribution.

Increased contribution was a goal I had certainly hoped for. I have begun
promoting the idea at Meetups. There hasn't been a concerted effort to
push the idea yet; however, it is a tagline / call to action I am planning to
push at talks and conferences this spring. Thank you for raising the
issue on the mailing list as well.

Using the Samsara framework and the "Algorithms" framework, it is hoped that the
barrier to entry for new contributors will be very low, and that they can
introduce new algorithms or port them from R. Other 'Big Data' machine
learning frameworks suffer because they are not easily extensible.

The algorithms framework makes it (more) clear where a new algorithm should
go and, in general, how it should behave (e.g., this is a regressor, so it
probably goes in the regressor package; it needs a fit method that takes a
DrmX and a DrmY, and a predict method that takes a DrmX and returns a
DrmY_hat). The algorithms framework also provides a consistent interface
across algorithms and puts up "guard rails" to ensure common things are
done in an efficient manner (e.g., serializing just the model, not the
fitter and other unneeded things; thank you, Dmitriy). The Samsara
framework makes it easy to 'read' what the person is doing. This makes it
easier to review PRs, encourages community review, and, if someone makes a
so-called 'drive-by commit' (hopefully not, but in case it does happen),
that is, commits an algorithm and is never heard from again, others can
easily understand and maintain the algorithm in that person's absence.
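
To sketch that contract concretely (a rough illustration only; the trait and
method names here are hypothetical, not the actual Mahout API):

    import org.apache.mahout.math.drm.DrmLike

    // A fitter consumes a feature DRM and a target DRM and returns a model.
    trait RegressorFitter[K] {
      def fit(drmX: DrmLike[K], drmY: DrmLike[K]): RegressorModel[K]
    }

    // The model keeps only what prediction needs, so serialization covers
    // just the model, never the fitter state (the "guard rail" above).
    trait RegressorModel[K] {
      def predict(drmX: DrmLike[K]): DrmLike[K] // returns DrmY_hat
    }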

There are a number of issues labeled as beginner in JIRA now, especially
with respect to the Algorithms package.

It would probably be good to include a lot of this information in a web
page either here https://mahout.apache.org/developers/how-to-contribute.html
or on a page that is linked to by that.

Which leads me to the last 'piece of the puzzle' I would like to have in
place before aggressively advertising this as a "new-contributor friendly"
project: migrating the CMS to Jekyll
https://issues.apache.org/jira/browse/MAHOUT-1933

The rationale for that is that when new algorithms are submitted, the PR will
include relevant documentation (as a convention), and that documentation can
be corrected / expanded as needed in a manner friendlier to non-committers.






Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo

Re: New Mahout Recommender Service

2014-09-09 Thread Saikat Kanjilal
I have expressed interest in the past in building a recommender based on 
Elasticsearch; I can see building an API that performs targeted queries on top 
of recommendations as a service endpoint.

Sent from my iPhone

 On Sep 9, 2014, at 7:29 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Now that we have the basis of several significant improvements to Mahout’s 
 recommender, it seems like we need to go the last step and provide a service. 
 Without this, it is left to the user to do a lot of integration, making the 
 current next gen somewhat incomplete.
 
 Using the Hadoop mapreduce code you can get all recs for all people using 
 collaborative filtering data, or you can use the in-memory single-machine 
 recommender if you have a small dataset. 
 
 The next generation would require Solr or Elasticsearch, so why not go the 
 extra step and provide a recommender API on top? At the very least it would 
 give users a single-machine API they can call, analogous to the in-memory 
 recommender of Mahout 0.9. But it would also be indefinitely scalable.
 
 Is anyone interested in discussing this here?


Re: New Mahout Recommender Service

2014-09-09 Thread Saikat Kanjilal
Maybe we can use Spark to do the updates to the recs in real time.

Sent from my iPhone

 On Sep 9, 2014, at 7:37 AM, Peng Zhang pzhang.x...@gmail.com wrote:
 
 That would be a great feature. 
 
 Currently the offline batch job runs for hours to update the recs. Can 
 this API update the recs in real time? I.e., can we update the recs for a user 
 based on her last few behaviors from 5 minutes ago?
 
 
 


Re: New Mahout Recommender Service

2014-09-09 Thread Saikat Kanjilal
@Pat Any interest in using http://vertx.io instead of Play? I have heard some 
really good perf stats around it.

We should really start a JIRA with a list of use cases, then back into a 
tech stack and outline the design in the JIRA. Thoughts?

Sent from my iPhone

 On Sep 9, 2014, at 8:44 AM, Martin, Nick nimar...@pssd.com wrote:
 
 Would absolutely love an ES integration.
 


Re: Solr recommender

2014-04-26 Thread Saikat Kanjilal


Sent from my iPad

 On Apr 26, 2014, at 9:18 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:
 
 Is it worth it to add the elasticsearch piece into the demo and tie that 
 into a generic MVC framework like Spring? In fact, we could leverage Spring 
 Data's elasticsearch plugin.
 
 Sent from my iPad
 
 On Apr 26, 2014, at 9:08 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Yes, it already does. It’s not named well; all it really does is create an 
 indicator matrix (item-item similarity using LLR) in a form that is 
 digestible by a text indexer. You could use Solr or ElasticSearch to do the 
 indexing and queries.
 
 In the actual installation on the demo site https://guide.finderbots.com, the 
 indicator matrix is put into a DB and Solr is used to index the item 
 collection’s similarity data field. The queries are handled by the web app 
 framework. If I swapped out Solr for ElasticSearch for indexing the DB, it 
 would work just fine, and I looked into how to integrate it with my web app 
 framework (RoR). The integration methods were significantly different, 
 though, so I chose not to do both.
 
 The reason I chose to put the indicator matrix in the DB is because it makes 
 it very convenient to mix metadata into the recs queries. In the case of the 
 demo site where the items are videos I have a bunch of recommendation types:
 1) User-history based recs: the query is recent user “likes” history; the query 
 is on the videos collection, specifying the similar-items field, which is a 
 list of video id strings. This is most usually what people think a 
 recommender does, but it is only the start.
 2-9) These use various methods of biasing the results by genre metadata. Search 
 engines also allow filtering by fields, so you can specify videos filtered by 
 source. So you can get comedies based on your “likes” filtered by source = 
 Netflix. In fact, when you set the source filter to Netflix, every set of recs 
 will contain only those on Netflix.
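 To make that concrete, here is a rough sketch of the kind of query described 
 above (the field names, ids, and host are made up for illustration):
 
     // Recent "likes" queried against the similar-items field, filtered
     // to source=Netflix; the engine's relevance ranking orders the recs.
     val recentLikes = Seq("video123", "video456", "video789")
     val q = java.net.URLEncoder.encode(
       "similar_items:(" + recentLikes.mkString(" OR ") + ")", "UTF-8")
     val fq = java.net.URLEncoder.encode("source:Netflix", "UTF-8")
     val recsUrl = s"http://localhost:8983/solr/videos/select?q=$q&fq=$fq"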
 
 There are so many ways to combine bias with filter and vary what you use as 
 the query that putting the fields in a DB made the most sense. I am still 
 thinking of new ways to use this. For instance, item-set similarity, which is 
 used to give shopping-cart recs in some systems. On the demo site you could 
 do the same with the watchlist if there were enough watchlists. Use the 
 user’s watchlist as a query against all other watchlists, get back an 
 ordered set of watchlists most similar to yours, and take recs from there.
 
 Some day I’ll write some blog posts about it, but I’d encourage anyone with 
 data to try the DB route rather than raw indexing of the text files, just for 
 the amazing flexibility and convenience it brings.
 
 On Apr 26, 2014, at 8:25 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:
 
 Pat,
 I was wondering if you'd given any thought to genericizing the Solr 
 recommender to work with both Solr and elasticsearch; namely, are there 
 pieces of the recommender that could plug into, or be lifted above, a search 
 engine (or, in the case of elasticsearch, a set of REST APIs)? I would be 
 very interested in helping out with this.
 
 Thoughts?
 
 Sent from my iPad
 


Re: Solr recommender

2014-04-26 Thread Saikat Kanjilal
That shouldn't technically matter; my thought is to create a Spring-based 
elasticsearch recommender that leverages the Spark cooccurrence code underneath.

Sent from my iPad

 On Apr 26, 2014, at 10:07 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Oh, and the example is old Hadoop mapreduce; we’re redoing this with the new 
 Spark cooccurrence code, which will replace the ItemSimilarity job.
 
 On Apr 26, 2014, at 10:03 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 If you want, fork the github repo, do the integration and create a pull 
 request. If the pull is accepted it will automatically be included in the 
 Mahout build’s examples.
 
 Some things to consider:
 1) It is actually easier to use one of the Solr/Lucid/ElasticSearch web GUIs for 
 bare-bones illustration purposes. You’d have to enter the recs query by hand. 
 For demo purposes some example queries could be created ahead of time to 
 illustrate the recs-generating queries. I did this myself but didn’t include 
 it in the example. I’d actually recommend this as a simple illustration.
 2) I’d suspect the Solr+DB integration route would be the most common way 
 people would actually use this, but I could be wrong. This is what I did on 
 the demo site, but it is far beyond what you’d put in an example.
 3) What data to use? Unless the data has human-readable item ids, the demo is 
 not as compelling.
 
 I can’t give you the demo site’s data since I mined the web for it, which 
 allows me to use it but I don’t think I can republish it. Data actually 
 gathered on the site by users I could share but there isn’t enough to work 
 with. Maybe Ted has some from his demo.
 


RE: Welcome Andrew Musselman as new committer

2014-03-07 Thread Saikat Kanjilal
Congrats Andrew! I've taken the Coursera course; it was interesting, but I was 
hoping it would cover more in the area of deep learning.

 Date: Fri, 7 Mar 2014 12:19:52 -0600
 Subject: Re: Welcome Andrew Musselman as new committer
 From: scottcc...@gmail.com
 To: user@mahout.apache.org
 
 I personally am looking forward to the "advice" from the newest
 "recommended" committer to hadoop.
 
 Congratulations to Mahout team for increasing and growing  :)
 
 Now back to my using... (and hopefully creating something meaningful for
 you guys)
 
 
 Scott
 
 PS: I am bootstrapping my Machine Learning knowledge by taking the Coursera
 course offered by Andrew Ng, to correct my shaky knowledge of classifiers.
 Anyone else on this list taking or have taken this course? (Obviously,
 committers are probably not, but...)
 
 
 On 3/7/14, 11:36 AM, Andrew Musselman andrew.mussel...@gmail.com wrote:
 
 Thank you for the welcome!  Looking forward to it.
 
 I have a math background and got started with recommenders by building the
 first album recommender for Rhapsody ( http://rhapsody.com ) while I was
 doing web development and web services work for the service.  Since then I
 learned to love/hate Pig and Hadoop for a living, and now I do data
 engineering and analytics at Accenture.
 
 We've used Mahout on a few production projects, and we're looking forward
 to more.
 
 See you on the lists!
 
 Best
 Andrew
 
 
 On Fri, Mar 7, 2014 at 9:12 AM, Sebastian Schelter s...@apache.org wrote:
 
  Hi,
 
  this is to announce that the Project Management Committee (PMC) for
 Apache
  Mahout has asked Andrew Musselman to become committer and we are
 pleased to
  announce that he has accepted.
 
  Being a committer enables easier contribution to the project since in
  addition to posting patches on JIRA it also gives write access to the
 code
  repository. That also means that now we have yet another person who can
  commit patches submitted by others to our repo *wink*
 
  Andrew, we look forward to working with you in the future. Welcome! It
  would be great if you could introduce yourself with a few words :)
 
  Sebastian
 
 
 
  

Flying Wheels 2014, are you interested?

2014-02-28 Thread Saikat Kanjilal
http://www.cascade.org/flying-wheels-summer-century
Let me know. Thanks in advance.

Re: sql data model w/where clause

2013-03-24 Thread Saikat Kanjilal
Matt,
One idea might be to create a trigger off this table that fires to return a 
subset of the records based on the tag you describe below.  Would that work for 
you?
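
Alternatively, the view you mention is probably the simplest route. As a rough 
sketch (only the table and column names come from your description; the 
connection details are made up):

    import java.sql.DriverManager

    // Expose only the tag = 'A' rows through a view, then point the JDBC
    // data model at the view instead of the base table.
    val conn = DriverManager.getConnection(
      "jdbc:mysql://localhost/taste", "user", "pass")
    val stmt = conn.createStatement()
    stmt.execute(
      "CREATE OR REPLACE VIEW user_preferences_tag_a AS " +
      "SELECT user_id, item_id FROM user_preferences WHERE tag = 'A'")
    stmt.close()
    conn.close()

A JDBC-backed data model pointed at user_preferences_tag_a would then see only 
that subset.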

Sent from my iPhone

On Mar 24, 2013, at 5:11 PM, Matt Mitchell goodie...@gmail.com wrote:

 Hi,
 
 I have a table of user preferences with the following columns:
 
 user_id
 item_id
 tag
 
 I want to build a data model in mahout, but not use the entire table. I'd
 like to add a where clause like where tag = 'A' when building the model
 instance. Is this possible? If not, any way around this besides creating a
 view or new table?
 
 Thanks,
 Matt


RE: Reg Classification Problem..

2013-02-14 Thread Saikat Kanjilal

Hey Vignesh, are there specific things you need? I've built a classification 
implementation in the past with naive Bayes and a real-time service to serve up 
the results of this data. Let me know if you have specific questions. Regards

 Date: Thu, 14 Feb 2013 10:18:31 +0530
 Subject: Reg Classification Problem..
 From: vigneshkln...@gmail.com
 To: user@mahout.apache.org
 
 Hi,
 
 Can anyone help me with some real-world implementations of
 classification algorithms?
 
 -- 
 Thanks and Regards
 Vignesh Srinivasan
 9739135640
  

Re: can i run mahout algorithms on mobile device..

2013-01-30 Thread Saikat Kanjilal
Hi Vignesh,
Do you really need Mahout for this? You could just write the classification 
algorithm yourself and run it on the mobile device, with lightweight offline 
storage as needed. Ping me offline if you want to discuss in more detail.

Regards

Sent from my iPhone

On Jan 30, 2013, at 7:46 AM, Mahesh Balija balijamahesh@gmail.com wrote:

 AFAIK it is NOT possible, as Mahout runs on top of Hadoop.
 Also, Hadoop is a distributed computing framework; it will run on a cluster of
 machines.
 So ideally it may NOT be possible to run on a mobile.
 
 On Wed, Jan 30, 2013 at 8:46 PM, VIGNESH S vigneshkln...@gmail.com wrote:
 
 I am trying to implement some classification on an Android mobile device.
 
 Is it possible to use Mahout on a mobile device? Please kindly help me.
 
 --
 Thanks and Regards
 Vignesh Srinivasan
 9739135640
 


RE: evaluating distributed recommendation results

2012-09-07 Thread Saikat Kanjilal

You could do this several ways:
1) You could see whether or not users respond to one style of recommendations 
obtained through one type of similarity coefficient versus the others, meaning 
did they click on a particular recommendation obtained through Tanimoto versus 
log-likelihood.
2) You could also use something similar to DCG 
(http://en.wikipedia.org/wiki/Discounted_cumulative_gain) to figure out how 
good each algorithm is compared to another.
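
If it helps, DCG is simple enough to compute directly over a ranked list of 
relevance scores; a minimal sketch (the relevance values would come from 
whatever click or rating signal you have):

    // DCG: sum of rel_i / log2(i + 1), with ranks i starting at 1.
    def dcg(relevances: Seq[Double]): Double =
      relevances.zipWithIndex.map { case (rel, i) =>
        rel / (math.log(i + 2) / math.log(2))
      }.sum

    // Normalized DCG: divide by the DCG of the ideal (descending) ordering.
    def ndcg(relevances: Seq[Double]): Double =
      dcg(relevances) / dcg(relevances.sorted.reverse)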

 From: goodie...@gmail.com
 Date: Fri, 7 Sep 2012 18:22:47 -0400
 Subject: evaluating distributed recommendation results
 To: user@mahout.apache.org
 
 Hi,
 
 I'm generating item similarities and recommendations using the
 distributed jobs. Is there a way I can evaluate the results? The MIA
 book describes how to do this with the non-distributed recommenders,
 but I can't find anything on evaluating the distributed stuff. Any
 tips on doing this?
 
 Thanks,
 Matt
  

RE: question about distributed recommendations

2012-08-03 Thread Saikat Kanjilal

Matt, I'm also deep in the midst of building out such a system; basically I 
have most of a system in place that:
1) replenishes user ratings data directly into HDFS from analytics
2) performs Mahout item similarity computations on this data and stores the 
result back into HDFS
3) uses Hive to then transform the results of number 2 into a real-time 
low-latency key-value database, in this case Cassandra
4) leverages a REST-based web service to query that database and serve up 
results into a UI

I am currently working on the second piece, which includes 
clustering/classifying that data based on a set of dynamic features.

 Date: Fri, 3 Aug 2012 18:21:56 -0400
 Subject: Re: question about distributed recommendations
 From: sro...@gmail.com
 To: user@mahout.apache.org
 
 Good question. One straightforward way to approach things is to
 compute all recommendations offline, in batch, and publish them to some
 location, and then simply read them as needed. Yes your front-end would
 need to access HDFS if the data were on HDFS. The downside is that you
 can't update in real-time, and you spend CPU computing recs for people that
 may never be needed.
 
 The online implementations you've been playing with don't have those two
 problems, but they have scale issues at some point.
 
 But, I think one of these two approaches is probably 'just fine' for 80% of
 use cases.
 
 
 If not, the 'real' answer is a hybrid solution, using Hadoop to do periodic
 model recomputation, offline, and using front-ends to do (at least
 approximate) real-time updates and computation. This sort of system is what
 I'm trying to build with Myrrix (myrrix.com), which you may be interested
 in if you have this kind of problem.
 
 
 On Fri, Aug 3, 2012 at 6:16 PM, Matt Mitchell goodie...@gmail.com wrote:
 
  Thanks Sean, that makes sense. I'll look into the source and see if I
  can learn more.
 
  Another question. I understand how the recommendations are created.
  I'd like to wrap this all up as a web service, but I'm not sure I
  understand how one would go about doing that. How would one app fetch
  recommendations for a user? Does my app need access to the HDFS file
  system?
 
  Thanks again.
 
 
  

Re: Question about hadoop based mahout

2012-07-16 Thread Saikat Kanjilal
Hi Swapna,
I can't give you source code for legal reasons, but you should be able to set up 
Hadoop-based map reduce jobs that invoke Mahout item similarity and clustering 
algorithms in offline mode on a local or dev Hadoop cluster. I wrote 
some Scala code to invoke Mahout, stream out results to stdout, and then 
store the results inside HDFS. The Mahout documentation as well as the unit 
tests should help you get this configured and started. For the clustering 
algorithms you will need a process that takes a csv file and generates the 
sequence-file vectors that serve as the input.
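
As a rough sketch of that csv-to-sequence-vector step (the paths and the csv 
layout are made up for illustration):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.{SequenceFile, Text}
    import org.apache.mahout.math.{DenseVector, VectorWritable}
    import scala.io.Source

    // Write one named Mahout vector per csv row (id,feature1,feature2,...).
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    val writer = SequenceFile.createWriter(fs, conf,
      new Path("/input/vectors/part-00000"), classOf[Text], classOf[VectorWritable])
    for (line <- Source.fromFile("features.csv").getLines()) {
      val cols = line.split(",")
      writer.append(new Text(cols.head),
        new VectorWritable(new DenseVector(cols.tail.map(_.toDouble))))
    }
    writer.close()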

After the offline process completes you can then have a script or some other 
process that moves a subset of (or maybe all) the offline data computed by 
Mahout to a real-time low-latency database, to render out these results through 
a webapp. I have most of this infrastructure working at the moment. Also, 
please direct these questions to the user alias so everyone can benefit from 
the discussion.

Let me know if you have more specific questions.

Regards

Sent from my iPhone

On Jul 16, 2012, at 4:26 PM, Swapna Yeleswarapu yswa...@gmail.com wrote:

 Hi Saikat,
 
 I was reading your questions on 
 
 http://comments.gmane.org/gmane.comp.apache.mahout.user/13362
 
 And was wondering if you got a chance to implement what you were trying to 
 do. I am trying to do something similar for learning stuff.
 
 Can you tell me, for example, how you did the hybrid mode (offline 
 precomputation and online reco)?
 
 Would appreciate it if you have any readily available source code which could 
 help me set things up.
 
 Thanks
 Swapna


RE: ItemSimilarity algorithm

2012-07-05 Thread Saikat Kanjilal

Thanks for the input Sean. One other question: in the scenario where most of 
the recommendations are boolean-style recommendations (i.e. a csv file that 
just says that a user has some sort of association with an item), is it fair to 
say that the Tanimoto and log-likelihood coefficients perform better than the 
other coefficients? I wanted to get a deeper understanding of this as well; 
thanks for your insight.
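
For reference, the kind of boolean setup I mean looks roughly like this (the 
csv path and ids are illustrative; the classes are the Taste ones under 
org.apache.mahout.cf.taste):

    import java.io.File
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
    import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity

    // associations.csv holds userID,itemID rows with no preference values.
    val model = new FileDataModel(new File("associations.csv"))
    val similarity = new TanimotoCoefficientSimilarity(model)
    val recommender = new GenericBooleanPrefItemBasedRecommender(model, similarity)
    val recs = recommender.recommend(42L, 10) // top 10 items for user 42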

 Date: Tue, 3 Jul 2012 19:19:07 +0300
 Subject: Re: ItemSimilarity algorithm
 From: sro...@gmail.com
 To: user@mahout.apache.org
 
 Item-item similarity is a property of the information you have on two
 items and just those items. Whether there are just those 2 items over
 500K users, or 2M items over 500K users, makes no difference. So no I
 don't think that this skew implies you should use any particular
 algorithm, by itself.
 
 I think other considerations tend to dominate. For example very sparse
 data makes Pearson / cosine measure not work well. But with so
 relatively few items... I imagine it is not so sparse.
 
 On Tue, Jul 3, 2012 at 6:57 PM, Saikat Kanjilal sxk1...@hotmail.com wrote:
 
  Hello everyone, I was reading through the documentation on the different 
  ItemSimilarity algorithms in Mahout and had a question: if one has a 
  scenario where the number of items is significantly less than the number 
  of users (say 500,000 users to 1,000 items), are there particular item 
  similarity coefficients (namely log-likelihood or the Tanimoto coefficient) 
  that lend themselves to producing better recommendations? I've read through 
  Mahout in Action and the Javadocs and can't seem to find any clues on this. 
  Any insight based on your experience would be much appreciated.
  Regards
  

ItemSimilarity algorithm

2012-07-03 Thread Saikat Kanjilal

Hello everyone, I was reading through the documentation on the different 
ItemSimilarity algorithms in Mahout and had a question: if one has a scenario 
where the number of items is significantly less than the number of users (say 
500,000 users to 1,000 items), are there particular item similarity coefficients 
(namely log-likelihood or the Tanimoto coefficient) that lend themselves to 
producing better recommendations? I've read through Mahout in Action and the 
Javadocs and can't seem to find any clues on this. Any insight based on your 
experience would be much appreciated.
Regards

Re: Performance issue with Item-based Recommendation and User-based Recommendation

2012-06-21 Thread Saikat Kanjilal
I'm using the Hadoop-based ItemSimilarity but will be preloading the results 
into Cassandra and using that as the real-time data output; will let you know 
how it goes.

Sent from my iPhone

On Jun 21, 2012, at 2:16 PM, Way Cool way1.wayc...@gmail.com wrote:

 Hi, guys,
 
 For item-based recommendation, I pre-calculated the item similarities on
 Hadoop per algorithm, which generated 20m rows each. The problem now is I
 can't just load them into memory via MySQLJDBCInMemoryItemSimilarity with
 4GB of memory. I tried MySQLJDBCItemSimilarity; however, it's way too slow.
 What are the alternatives?
 
 For user-based recommendation, I can't load a 100m-line data model from
 FileDataModel into memory; it ran out of memory after 20m lines. JDBCDataModel
 has the same issue of being way too slow. Does anyone precalculate the user
 similarities beforehand and then recommend items to a user?
 
 Anyone had the similar issues before?
 
 Thanks,
 
 YG


RE: How to run a mahout clustering job through a web service

2012-05-10 Thread Saikat Kanjilal

Hi Anand, we're doing something similar. KMeans should in general run 
asynchronously and dump data into a low-latency database (something similar to 
Cassandra) that your web application can then query for results; so in a 
nutshell you will have a real-time component that serves up the results of 
clustering and an offline component that computes your kmeans clusters. Let 
me know if you want deeper details.
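
For the offline piece, the stock kmeans driver is a reasonable starting point; 
something like the following (the paths and parameter values are illustrative):

    bin/mahout kmeans \
      -i /input/vectors \
      -c /input/initial-clusters \
      -o /output/clusters \
      -k 20 -x 10 -ow -cl

Here -k samples the initial centroids and -cl also assigns each input vector to 
a cluster, which is the output you would load into the database.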
Regards

 Subject: How to run a mahout clustering job through a web service
 Date: Thu, 10 May 2012 12:03:12 +0530
 From: ananda.muru...@honeywell.com
 To: user@mahout.apache.org
 
 Hi, 
 
  
 
 I would like to run a KMeans clustering job from a web application, so I
 want the Mahout jobs to be exposed as a web service or at least an HTTP
 servlet. Is that possible? Any suggestions?
 
  
 
 Regards,
 
 Anand.C
 
  
 
  

Re: shortest-path maintenance

2012-04-20 Thread Saikat Kanjilal
Hi Mike,
Neo4j does this and is meant for this type of calculation (www.neo4j.org). Are 
you looking to solve this within a particular algorithm in Mahout?

Regards

Sent from my iPhone

On Apr 20, 2012, at 8:28 AM, Mike Spreitzer mspre...@us.ibm.com wrote:

 Is there something in Mahout that maintains shortest paths (or simply 
 distance) from a distinguished vertex in a graph?  That is, given a graph 
 in which this problem has been solved, and a small change in that graph, 
 something that will efficiently find the answers for the graph after the 
 small change?
 
 Thanks,
 Mike


Re: shortest-path maintenance

2012-04-20 Thread Saikat Kanjilal
Is it a requirement to use map reduce? Also, how does Mahout play into this? 
Potentially you could build mappers that reference an in-memory graph and 
have an API that pre-calculates Dijkstra or A* up front. You could then add 
or remove a node as part of the reduce process that references this graph and 
recalculates Dijkstra or A* in a closed feedback loop. However, it's not 
obvious to me that mapreduce is the appropriate tool for this.

Some more context on the problem and how Mahout fits in would be great.

Sent from my iPhone

On Apr 20, 2012, at 8:56 AM, Mike Spreitzer mspre...@us.ibm.com wrote:

 I am wondering what is the best way to do this using map-reduce.  Both for 
 the special case of only adding edges, and the general case where edges 
 may be removed.
 
 Thanks,
 Mike
 
 
 


Re: Evaluation of recommenders

2012-04-11 Thread Saikat Kanjilal
We'll be writing this ourselves :))) The goal is to build a best-of-breed 
recommendations engine for our division of the company and then generalize it 
for the larger company. Thanks for the heads up.

Sent from my iPhone

On Apr 11, 2012, at 12:31 AM, Manuel Blechschmidt manuel.blechschm...@gmx.de 
wrote:

 Hi Saikat,
 I wrote my master thesis about evaluating recommenders in real-world examples:
 
 https://source.apaxo.de/svn/semrecsys/trunk/doc/2010-Manuel-Blechschmidt-730786-EvalRecSys.pdf
 
 So what you are going to do is current research. This means that there is 
 currently not a lot of experience.
 
 In 2009 there was an online evaluation challenge which was part of ECML PKDD.
 
 2009 ECML PKDD Discovery Challenge: Online Tag Recommendations. 
 http://www.kde.cs.uni-kassel.de/ws/dc09/online. Version: 2009, Checked: 2011-04-23
 
 You will have to run all your recommenders in parallel to figure out which 
 one is the best for optimizing business goals. I founded a company which 
 is developing the described technology. I am currently searching for a 
 project starting in July 2012 where I can try this, so if you are interested 
 in hiring me feel free to send me a personal message.
 
 /Manuel
 
 On 10.04.2012, at 17:41, Saikat Kanjilal wrote:
 
 
 Hi everyone, we're looking at building out some clustering and classification 
 algorithms using Mahout, and one of the things we're also looking at doing is 
 building performance metrics around each of these algorithms as we go down 
 the path of choosing the best model in an iterative closed feedback loop 
 (i.e. our business users manipulate weights for each attribute of our 
 feature vectors; we use these changes to regenerate an asynchronous model 
 using the appropriate clustering/classification algorithms and then 
 replenish our online component with this newly recalculated data for fresh 
 recommendations). So our end goal is to have a basket of algorithms and use 
 a set of performance metrics to pick and choose the right algorithm on the 
 fly. I was wondering if anyone has done this type of analysis before and, if 
 so, whether there are approaches that have worked well and approaches that 
 haven't when it comes to measuring the quality of each of the recommendation 
 algorithms.
 Regards
 
 -- 
 Manuel Blechschmidt
 CTO - Apaxo GmbH
 blechschm...@apaxo.de
 http://www.apaxo.de
 
 Weinbergstr. 16
 14469 Potsdam
 
 Telefon +49 (0)6204 9180 593
 Fax +49 (0)6204 9180 594
 Mobil: +49 173/6322621
 
 Skype: Manuel_B86
 Twitter: http://twitter.com/Manuel_B
 
 Sitz der Gesellschaft: Viernheim
 Handelsregister HRB 87159
 Ust-IdNr. DE261368874
 Amtsgericht Darmstadt
 Geschäftsführer Friedhelm Scharhag
 
 


RE: Evaluation of recommenders

2012-04-10 Thread Saikat Kanjilal

I see the architecture as similar to the following:

Asynchronously: given a set of feature vectors, run clustering/classification 
algorithms for each of our feature vectors to create the appropriate buckets 
for the set of users, and feed the result of these computations into the 
synchronous database.

Synchronously: for each bucket, run item similarity recommendation algorithms 
to display a real-time set of recommendations for each user.

For the asynchronous computations we need the ability to tweak the weights 
associated with each feature of the feature vectors (typical features might 
include income/age/dining preferences etc.), and we need the business folks to 
adjust the weights associated with each of these to regenerate the async 
buckets.

So given the above architecture, we need the ability for the async computations 
to judge which algorithm to use based on a set of performance-measuring 
criteria. That was the heart of my initial question: whether folks have built 
this sort of framework, and what are some things to think about when building 
this.
Thanks for your feedback



 Date: Tue, 10 Apr 2012 14:33:56 -0500
 Subject: Re: Evaluation of recommenders
 From: sro...@gmail.com
 To: user@mahout.apache.org
 
 You're talking about recommendations now... are we talking about a
 clustering, classification or recommender system?
 
 In general I don't know if it makes sense for business users to be
 deciding aspects of the internal model. At most someone should input
 the tradeoffs -- how important is accuracy vs speed? those kinds of
 things. Then it's an optimization problem. But, understood, maybe you
 need to let people explore these things manually at first.
 
 On Tue, Apr 10, 2012 at 2:21 PM, Saikat Kanjilal sxk1...@hotmail.com wrote:
 
  The question really is what are some tried approaches to figure out how to 
  measure the quality of a set of algorithms currently being used for 
  clustering/classification?
 
  And in thinking about this some more we also need to be able to regenerate 
  models as soon as the business users tweak the weights associated with 
  features inside a feature vector, we need to figure out a way to 
  efficiently tie this into our online workflow which could show updated 
  recommendations every few hours?
 
  When I say picking an algorithm on the fly what I mean is that we need to 
  continuously test our basket of algorithms based on a new set of training 
  data and make the determination offline as to which of the algorithms to 
  use at that moment to regenerate our recommendations.
  Date: Tue, 10 Apr 2012 14:08:17 -0500
  Subject: Re: Evaluation of recommenders
  From: sro...@gmail.com
  To: user@mahout.apache.org
 
  Picking an algorithm 'on the fly' is almost surely not realistic --
  well, I am not sure what eval process you would run in milliseconds.
  But it's also unnecessary; you usually run evaluations offline on
  training/test data that reflects real input, and then, the resulting
  tuning should be fine for that real input that comes the next day.
 
  Is that really the question, or are you just asking about how you
  measure the quality of clustering or a classifier?
 

RE: Evaluation of recommenders

2012-04-10 Thread Saikat Kanjilal

Yes, we have business users who are putting measures on a real-world metric and 
in turn providing that level of feedback by putting some weighting on some 
algorithm parameters to tweak results; the results should be different and will 
be driven off from this.
Thanks again for your insight on recommender metrics; we will look at 
implementing these, and will post more as we get this off the ground and run 
into challenging scenarios.

 Date: Tue, 10 Apr 2012 16:34:33 -0500
 Subject: Re: Evaluation of recommenders
 From: sro...@gmail.com
 To: user@mahout.apache.org
 
 You are making recommendations, and you want to do this via
 clustering. OK, that's fine. How you implement it isn't so important
 -- it's that you have some parameters to change and want to know how
 any given process does.
 
 You just want to use some standard recommender metrics, to start, I'd
 imagine. If you're estimating ratings -- root mean squared error of
 the difference between estimate and actual on the training data. Or
 you can fall back to precision, recall, and nDCG as a form of score.
 So, yes, definitely well-established approaches here.
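 
 To spell out the RMSE option, a minimal sketch over (estimate, actual) pairs 
 from held-out data:
 
     // Root mean squared error between estimated and actual ratings.
     def rmse(pairs: Seq[(Double, Double)]): Double =
       math.sqrt(pairs.map { case (est, actual) =>
         val d = est - actual; d * d
       }.sum / pairs.size)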
 
 I have this sense that you are saying you have business users who are
 going to measure some real-world metric (conversion rate, uplift,
 clickthrough), and guess at some changes to algorithm parameters that
 might make them better. If you have *that* kind of feedback -- much
 better. That is a far more realistic metric. Of course, it's much
 harder to experiment when using that metric since you have to run the
 algo for a day or something to collect data.
 
 It's a separate question, but I don't know if in the end a business
 user can meaningfully decide weights on feature vectors. I mean, I
 couldn't eyeball those kinds of things. It may just be how you need to
 do things, but would double-check that everyone has a similar and
 reasonable expectation about what these inputs are and what they do.
 
 

RE: Successful Organization Meeting for Austin SIGKDD

2011-11-30 Thread Saikat Kanjilal

Hi everyone, I'd love to set up a hacker dojo, similar to what David is doing 
in Austin, in the Seattle area. Are there other folks interested in doing this 
with a similar theme? Please let me know. This is a great way to do deep dives 
on some of the algorithms in Mahout. Regards

 From: l...@semanticartifacts.com
 Subject: Successful Organization Meeting for Austin SIGKDD
 Date: Wed, 30 Nov 2011 00:35:05 -0600
 To: user@mahout.apache.org
 
 The organization meeting for Austin SIGKDD was an outstanding success. 
 Seventeen people attended the meeting. Everyone was very interested in 
 furthering their professional skills and starting a weekly hackers dojo. The 
 focus of the group will be on big data machine learning. For the initial 
 couple of months we will learn how to use Mahout with Hadoop, looking at the 
 math, algorithms, and software architecture along the way. In the future we 
 expect to have one or more topic tracks where people can work on projects to 
 improve their skills. 
 
 We decided to postpone voting to form a local chapter of ACM SIGKDD for a 
 month or two, to see how things work out. We did decide to call ourselves the 
 Austin SIGKDD for the interim. If we become an ACM chapter, we can change the 
 name to the Austin ACM SIGKDD. We discussed starting a virtual dojo using 
 openmeeting for those who cannot attend the dojo on a weekly basis. The 
 primary meeting places will be the Austin Northwest Recreation Center, or 
 Mangia Pizza as a backup. Both have free, easy access and free Wifi. We 
 discussed a format of trying to get the meeting place from 6 to 9, with a 
 structured activity, a presentation or lesson, from 7 to 8, and the time 
 before and after for hacking, depending on the members' scheduling needs. 
 The structured activity will be organized with slides and recorded audio so 
 we can capture the presentation for those who can not be there or for future 
 newcomers to the group.
 
 We decided to meet on Wednesday nights in the future. 
 
 The group now has its own web page and mailing list at Yahoo groups. The 
 group has open membership. Please join to continue to receive information on 
 the group. Membership in the group is required to post to the mailing list.
 
 Here are the details on austinsigkdd:
 Group home page: http://groups.yahoo.com/group/austinsigkdd 
 Group email address: austinsig...@yahoogroups.com 
 
 The next group meeting is Wednesday, December 7. I will send out a formal 
 announcement on the SIGKDD mailing list once I book the room.
 
 I am requesting one or two volunteers to also be the admins for the various 
 group web sites in case I die, get sick, get arrested, or move away :)
 
 The next meeting will focus on setting up a software development environment:
 1. Selecting and installing a hypervisor.
 2. Selecting and installing a Linux VM.
 3. Installation of Cloudera Hadoop and Cloudera Mahout.
 4. Installing Eclipse and plugins.
 5. Downloading source for Mahout.
 
 Sincerely,
 David G. Boney
 l...@semanticartifacts.com
  

Re: Could you improve the AbstractJDBCDataModel?

2010-08-07 Thread Saikat Kanjilal
In the JDBC world, if you want to change your data model frequently, you should 
write some SQL scripts that recreate the database from scratch and accept 
inputs that parameterize the new tables or fields to set up. If you want more 
details let me know, as I've done this many times.

Sent from my iPhone

On Aug 7, 2010, at 7:35 PM, Young woshidus...@126.com wrote:

 So if I want to change my datamodel frequently, I have to use the JDBC model?
 
 
 
 
 At 2010-08-07 20:44:20,Sean Owen sro...@gmail.com wrote:
 
 No, you just update the database tables. That's the point of using the
 JDBC model.
 
 2010/8/7 Young woshidus...@126.com:
 If I want to update the datamodel frequently or in real time, what should I 
 do? Or do I have to use two instances and load the database data into memory 
 alternately?