No, the weighting is more difficult with a search engine. It will create token 
counts and apply weightings such as TF-IDF to the token frequencies. You can 
control how much of that gets done. If you really think you need weights, you 
would create the doc like this:
neighborhoodID<tab>bookstore bookstore bookstore bookstore bookstore bookstore 
bookstore bookstore cafe cafe cafe cafe cafe cafe cafe cafe cafe cafe cafe cafe 
cafe cafe cafe cafe cafe pet-store pet-store pet-store pet-store pet-store 
pet-store pet-store pet-store pet-store pet-store pet-store pet-store pet-store 
pet-store pet-store pet-store pet-store pet-store pet-store pet-store
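
For illustration, a throwaway Java sketch of generating such a line (the ID 
and weights here are invented):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class WeightedDoc {
        // Repeat each amenity token once per unit of weight, so plain term
        // frequency carries the weighting.
        static String buildLine(String neighborhoodId, Map<String, Integer> weights) {
            StringBuilder sb = new StringBuilder(neighborhoodId).append('\t');
            for (Map.Entry<String, Integer> e : weights.entrySet()) {
                for (int i = 0; i < e.getValue(); i++) {
                    sb.append(e.getKey()).append(' ');
                }
            }
            return sb.toString().trim();
        }

        public static void main(String[] args) {
            Map<String, Integer> w = new LinkedHashMap<String, Integer>();
            w.put("bookstore", 8);
            w.put("cafe", 17);
            w.put("pet-store", 20);
            System.out.println(buildLine("n42", w));
        }
    }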

Or you could put each amenity in a separate Solr field so all cafe tokens would 
be in the cafe field.

You would turn off TF-IDF and “norms” in Solr when indexing the above. Another 
way is to access the Lucene vectors directly and attach weights there. You can 
only query with whatever is available through the Solr query interface, but 
that interface allows you to “boost” each field by a numeric value, so if pets 
are really important, give them a boost in the query.
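
As an illustration of what that could look like (field name and boost values 
invented), the schema.xml entry for a single amenities field might drop norms 
so that document length does not dilute the repeated tokens:

    <field name="amenities" type="text_general" indexed="true" stored="true"
           omitNorms="true"/>

and a runtime query could weight the pet terms with standard Lucene boost 
syntax:

    q=amenities:(bookstore cafe pet-store^5 pet-park^5)

With one field per amenity you would instead put the boost on the field in 
the query.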

Sorry, it’s a little bit of search engine abuse, but it works and is super 
fast. It allows you to put together complex queries at runtime without 
retraining on your data (reindexing is automatic) and lets you experiment 
very easily.

This site is a demo of using Solr with Mahout to make collaborative filtering 
type recommendations: https://guide.finderbots.com 

On Jul 11, 2014, at 8:11 AM, Edith Au <edith...@gmail.com> wrote:

Thanks for the suggestion, Pat.  With the search engine approach, I would
imagine I could do ...

neighborhoodID<tab>bookstore,30 cafe,50 pet-store,10 pet-park,3

and then I can do a "LIKE" query to pick up the right docID and parse the
doc for weights.  But once I get a collection of neighborhoodID and
weights, I still need some way to compare similarity (between the user's
existing neighborhood and the search results).  Now I am back to Mahout (or
some other math package I can use to find the strength)?

Thanks for the book recommendation.  I finished "Mahout in Action" last
week but I am sure the book is pretty out of date by now.


On Thu, Jul 10, 2014 at 8:01 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> You need Hadoop installed on the machine you run on, but you don’t need
> HDFS or a cluster. This is called local mode: you set MAHOUT_LOCAL=true and
> use the local file system.
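> 
> For example (paths invented), a local-mode run could look like:
> 
>     export MAHOUT_LOCAL=true
>     mahout rowsimilarity -i /tmp/neighborhoods.seq -o /tmp/similarities \
>         --similarityClassname SIMILARITY_COSINE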
> 
> If you want to customize the query at runtime I suggest a search engine.
> Using rowsimilarity you can only train in batch and can only pre-calculate
> recs. If you index the neighborhoods by feature you can construct a query
> at runtime and get fast results. So you can say that the user wants pets
> (even though their current property doesn’t allow them). This would be
> something new, not related to their neighborhood. It is easily added to a
> search query. Weights are not so easy using a search engine but may not
> matter. Imagine indexing
> 
> neighborhoodID<tab>bookstore cafe pet-store pet-park
> 
> neighborhoodID is the docID; the rest is the document (a space-delimited set
> of tokens). Index this with Solr or something else. Then at runtime do a
> fulltext query with “bookstore pet-store pet-park” or however you want to
> build the query. This is actually the way we are thinking about the next
> gen of recommender—using a search engine even for collaborative filtering
> data.
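> 
> As a sketch of the runtime side (URL, core name, and field name invented),
> the query could be issued with SolrJ, the Solr 4.x Java client:
> 
>     import org.apache.solr.client.solrj.SolrQuery;
>     import org.apache.solr.client.solrj.impl.HttpSolrServer;
>     import org.apache.solr.client.solrj.response.QueryResponse;
>     import org.apache.solr.common.SolrDocument;
>     
>     public class NeighborhoodQuery {
>         public static void main(String[] args) throws Exception {
>             HttpSolrServer solr =
>                 new HttpSolrServer("http://localhost:8983/solr/neighborhoods");
>             // Build the query string at runtime from whatever the user asks for.
>             SolrQuery query = new SolrQuery("amenities:(bookstore pet-store pet-park)");
>             query.setRows(10); // ten best-matching neighborhoods
>             QueryResponse rsp = solr.query(query);
>             for (SolrDocument doc : rsp.getResults()) {
>                 System.out.println(doc.getFieldValue("id"));
>             }
>         }
>     }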
> 
> Look for a book by Ted Dunning called “Practical Machine Learning”, which
> talks about this approach.
> 
> On Jul 10, 2014, at 5:07 PM, Edith Au <edith...@gmail.com> wrote:
> 
> I was under the impression that I can only run RowSimilarityJob with
> Hadoop.  I will take a look at that.  Thx!
> 
> I see your point about retraining the data over and over.  But there are a
> couple of other requirements I left out from my original post.
> 
> 1. There are amenities in a user's existing neighborhood that she does not
> care for.  For example, if she does not have any children, schools and
> day-care centers in her existing neighborhood (or in suggested neighborhoods)
> should not sway the final score.
> 
> 2. My next feature is to inject customization.  For example, she may want
> to move to a pet-friendly area because her current neighborhood does not
> have enough facilities (e.g. vets, pet stores, pet-friendly parks) for her
> pet.
> 
> If I pre-calculate the row matrix of similar neighborhoods, I am not sure
> how I can implement the customization (adding or removing amenity
> requirements at runtime).  Any thoughts on that?
> 
> Thanks for the reminder on Mahout FastID.  It could easily be a newbie
> mistake to use a regular int or long for the mapping.
> 
> Thanks again for your help.  Much appreciated!
> 
> 
> 
> 
> On Thu, Jul 10, 2014 at 1:40 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> 
>> Doing things this way you are using the neighborhoodID as a proxy for
>> userID/rowID in the recommender. I don’t see the benefit of the in-memory
>> version here since all output can easily be pre-calculated. Then it will
>> only be a lookup at runtime. You can use “rowsimilarity” on a single
>> machine without setting up a cluster, just use the local filesystem. This
>> is the way I’d do it.
>> 
>> You definitely don’t want to “only load the columns (amenities) correlate
>> to a selected user” with an in-memory recommender. This loading of data
>> will trigger a retraining of the recommender before you can ask it for
>> similar neighborhoods and that will take more time than you want. This
>> potentially will happen as each new user visits your app.
>> 
>> However if you train on all amenities for all neighborhoods, then the
>> in-memory recommender should work and would train only once. Your data
>> would look like (neighborhoodID, cafesID, numberOfCafes), and so on for
>> every non-zero cell in the table. And remember that ALL IDs must be Mahout
>> IDs; you can’t use your own IDs. Mahout IDs correspond to matrix
>> coordinates: they are ordinal ints. Think of them as the row and column
>> numbers of the table.
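>> 
>> A minimal sketch of that in-memory setup with the Taste API (file name and
>> IDs invented; each CSV line is neighborhoodID,amenityID,count, and Pearson
>> is just one reasonable similarity choice):
>> 
>>     import java.io.File;
>>     import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
>>     import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
>>     import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
>>     import org.apache.mahout.cf.taste.model.DataModel;
>>     import org.apache.mahout.cf.taste.similarity.UserSimilarity;
>>     
>>     public class SimilarNeighborhoods {
>>         public static void main(String[] args) throws Exception {
>>             DataModel model = new FileDataModel(new File("neighborhoods.csv"));
>>             UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
>>             // Built once over all the data; lookups afterwards are cheap.
>>             NearestNUserNeighborhood neighborhood =
>>                 new NearestNUserNeighborhood(10, similarity, model);
>>             for (long id : neighborhood.getUserNeighborhood(42L)) {
>>                 System.out.println(id); // the ten neighborhoods most like #42
>>             }
>>         }
>>     }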
>> 
>> On Jul 10, 2014, at 10:45 AM, Edith Au <edith...@gmail.com> wrote:
>> 
>> Thank you so much for the suggestions.  It took me some time to figure
>> things out, but I believe I have a pretty good grip on what needs to be
>> done now. My dataset is small enough to fit on a single machine, so I am
>> going to use an in-memory implementation rather than Hadoop.  As suggested
>> by both Pat and Manuel, I have a table (in the file system) with
>> neighborhoods as rows and amenities as columns.  At runtime, I will only
>> load the columns (amenities) that correlate to a selected user and do a
>> UserSimilarity operation between each neighborhood and the one the user
>> resides in.  After that, I can pick up the NearestNUserNeighborhoods for
>> results.
>> 
>> I gather UserSimilarity is the in-memory equivalent of RowSimilarity
>> (Hadoop)?  It would be great if someone could confirm that!
>> 
>> Thanks again Pat and Manuel!
>> Edith
>> 
>> 
>> On Wed, Jul 2, 2014 at 4:06 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
>> 
>>> If you are looking to recommend a similar neighborhood based on the
>>> characteristics of some other neighborhood (the user’s current one), you
>>> wouldn’t use collaborative filtering. This is a metadata recommender based
>>> on similarity of neighborhoods, not on a collection of user preferences.
>>> 
>>> The easiest and fastest would be to use a search engine, but I’ll leave
>>> that for now since it doesn’t account for feature weights as well.
>>> 
>>> Create a table like this:
>>> 
>>> Neighborhood    Gym    Cafe    Bookstore
>>> Downtown        15     50      0
>>> Midtown         30     100     10
>>> …
>>> 
>>> You will need to convert the row IDs into sequential ints, which Mahout
>>> uses for IDs. Then read them into a SequenceFile, creating a Distributed
>>> Row Matrix (DRM), which holds key-value pairs: the keys are the integer
>>> neighborhood IDs, and each value is a Vector (a sort of list) of column
>>> integer IDs with the counts.
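>>> 
>>> A rough sketch of that step (paths invented), writing IntWritable ->
>>> VectorWritable pairs with the Hadoop and Mahout math APIs:
>>> 
>>>     import org.apache.hadoop.conf.Configuration;
>>>     import org.apache.hadoop.fs.FileSystem;
>>>     import org.apache.hadoop.fs.Path;
>>>     import org.apache.hadoop.io.IntWritable;
>>>     import org.apache.hadoop.io.SequenceFile;
>>>     import org.apache.mahout.math.RandomAccessSparseVector;
>>>     import org.apache.mahout.math.Vector;
>>>     import org.apache.mahout.math.VectorWritable;
>>>     
>>>     public class WriteDrm {
>>>         public static void main(String[] args) throws Exception {
>>>             Configuration conf = new Configuration();
>>>             FileSystem fs = FileSystem.get(conf);
>>>             SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
>>>                 new Path("/tmp/neighborhoods.seq"),
>>>                 IntWritable.class, VectorWritable.class);
>>>             // Row 0 = Downtown: column 0 (Gym) = 15, column 1 (Cafe) = 50.
>>>             Vector row = new RandomAccessSparseVector(3);
>>>             row.set(0, 15);
>>>             row.set(1, 50);
>>>             writer.append(new IntWritable(0), new VectorWritable(row));
>>>             writer.close();
>>>         }
>>>     }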
>>> 
>>> Then run rowsimilarity on the DRM. This is the CLI but there is also a
>>> Driver you can call from your code.
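>>> 
>>> Calling the driver could look like this (class location is from Mahout
>>> 0.9; adjust for your version):
>>> 
>>>     import org.apache.hadoop.util.ToolRunner;
>>>     import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;
>>>     
>>>     public class RunRowSimilarity {
>>>         public static void main(String[] args) throws Exception {
>>>             // Same options the CLI takes; paths are invented.
>>>             ToolRunner.run(new RowSimilarityJob(), new String[] {
>>>                 "-i", "/tmp/neighborhoods.seq",
>>>                 "-o", "/tmp/similarities",
>>>                 "--similarityClassname", "SIMILARITY_COSINE"});
>>>         }
>>>     }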
>>> 
>>> There are some data prep issues you will have, since larger neighborhoods
>>> will have higher counts. An easy thing to do would be to normalize the
>>> counts by something like population or physical size, so you get cafes per
>>> resident, or per square mile, or some other ratio.
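>>> 
>>> For example, if Downtown has 50 cafes for 25,000 residents and Midtown has
>>> 100 cafes for 100,000 residents, normalizing gives 2.0 versus 1.0 cafes
>>> per thousand residents, so the smaller but denser neighborhood compares
>>> the way you would expect.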
>>> 
>>> The result of the rowsimilarity job will be another DRM: key =
>>> neighborhood ID, value = a Vector of similar neighborhoods (by integer ID)
>>> with a strength of similarity. Sort the vector by strength and you’ll have
>>> an ordered list of similar neighborhoods for each neighborhood.
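>>> 
>>> A sketch of that sort step (Mahout 0.9 Vector API; note the nonZeroes()
>>> iterator reuses its Element object, so index and value are copied out):
>>> 
>>>     import java.util.ArrayList;
>>>     import java.util.Collections;
>>>     import java.util.Comparator;
>>>     import java.util.List;
>>>     import org.apache.mahout.math.Vector;
>>>     
>>>     public class RankBySimilarity {
>>>         static final class Scored {
>>>             final int id;          // neighborhood column ID
>>>             final double strength; // similarity strength
>>>             Scored(int id, double strength) { this.id = id; this.strength = strength; }
>>>         }
>>>     
>>>         // One output row in, neighborhoods ordered strongest-first out.
>>>         static List<Scored> ranked(Vector row) {
>>>             List<Scored> out = new ArrayList<Scored>();
>>>             for (Vector.Element e : row.nonZeroes()) {
>>>                 out.add(new Scored(e.index(), e.get()));
>>>             }
>>>             Collections.sort(out, new Comparator<Scored>() {
>>>                 public int compare(Scored a, Scored b) {
>>>                     return Double.compare(b.strength, a.strength);
>>>                 }
>>>             });
>>>             return out;
>>>         }
>>>     }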
>>> 
>>> On Jun 30, 2014, at 12:48 PM, Edith Au <edith...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> 
>>> I am a newbie and am looking for some guidance to implement my
>>> recommender.  Any help would be greatly appreciated.  I have a small
>>> data set of location information with the following fields:
>>> neighborhood, amenities, and counts.  For example:
>>> 
>>> Downtown          Gym 15
>>> Downtown          Cafe 50
>>> …
>>> Midtown             Gym 30
>>> Midtown             Cafe 100
>>> Midtown             Bookstore 10
>>> ...
>>> Financial Dist
>>> …
>>> 
>>> 
>>> so on and so forth.  I want to recommend a neighborhood for a user to
>>> reside in based on the amenities (and some other metrics) in his/her
>>> current neighborhood.  My understanding is that a model-based
>>> recommendation would be a good fit for the job.  If I am on the right
>>> track, is there an experimental/beta recommender I can try?
>>> 
>>> 
>>> If there is no such recommender yet, can I still use Mahout for my
>>> project?  For example, can I implement my own Similarity which only
>>> computes the similarity between one user's preferences and a set of
>>> neighborhoods?  If I understand Mahout correctly, User/Item Similarity
>>> would do N x (N-1) pairs of comparisons as opposed to 1 x N comparisons.
>>> In my example, User/Item Similarity would compare Downtown, Midtown, and
>>> Fin Dist against each other -- which would be a waste of computational
>>> resources since those comparisons are not needed.
>>> 
>>> 
>>> Thanks in advance for your help.
>>> 
>>> Edith
>>> 
>>> 
>> 
>> 
> 
> 
