Thanks for the suggestion, Pat.  With the search engine approach, I would
imagine I could do ...

neighborhoodID<tab>bookstore,30 cafe,50 pet-store,10 pet-park,3

and then I can do a "LIKE" query to pick up the right docID and parse the
doc for weights.  But once I get a collection of neighborhoodIDs and
weights, I still need some way to compare similarity (between the user's
existing neighborhood and the search results).  So am I back to Mahout (or
some other math package I can use to find the strength)?
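
A minimal sketch of that comparison step, in plain Python rather than
Mahout.  It assumes each document has the "amenity,weight" token form shown
above and uses cosine similarity as the strength measure, which is only one
possible choice.  The function names (parse_doc, cosine, rank_neighborhoods)
are hypothetical, not part of Mahout or Solr.

```python
import math


def parse_doc(line):
    """Parse 'neighborhoodID<tab>bookstore,30 cafe,50' into (id, {amenity: weight})."""
    doc_id, _, body = line.partition("\t")
    weights = {}
    for token in body.split():
        # Split on the last comma so the amenity name stays intact.
        name, _, value = token.rpartition(",")
        weights[name] = float(value)
    return doc_id, weights


def cosine(a, b):
    """Cosine similarity between two sparse amenity-weight dicts."""
    dot = sum(w * b.get(name, 0.0) for name, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)


def rank_neighborhoods(current, candidates):
    """Sort (doc_id, weights) candidates by similarity to the current weights."""
    scored = [(doc_id, cosine(current, weights)) for doc_id, weights in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Here "current" would be the weights of the user's existing neighborhood and
"candidates" the parsed search results; only 1 x N comparisons are made.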

Thanks for the book recommendation.  I finished "Mahout in Action" last
week but I am sure the book is pretty out of date by now.


On Thu, Jul 10, 2014 at 8:01 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> You need hadoop installed on the machine you run on but don’t need HDFS or
> a cluster. This is called local mode where you set MAHOUT_LOCAL=true and
> use the local file system.
>
> If you want to customize the query at runtime I suggest a search engine.
> Using rowsimilarity you can only train in batch and can only pre-calculate
> recs. If you index the neighborhoods by feature you can construct a query
> at runtime and get fast results. So you can say that the user wants pets
> (even though their current property doesn’t allow them). This would be
> something new, not related to their neighborhood. It is easily added to a
> search query. Weights are not so easy using a search engine but may not
> matter. Imagine indexing
>
> neighborhoodID<tab>bookstore cafe pet-store pet-park
>
> neighborhoodID is the docID, the rest is the document (space delimited set
> of tokens). Index this with Solr or something else. Then at runtime do a
> fulltext query with “bookstore pet-store pet-park” or however you want to
> build the query. This is actually the way we are thinking about the next
> gen of recommender—using a search engine even for collaborative filtering
> data.
>
> Look for a book by Ted Dunning called “Practical Machine Learning”, which
> talks about this approach.
>
> On Jul 10, 2014, at 5:07 PM, Edith Au <edith...@gmail.com> wrote:
>
> I was under the impression that I can only run RowSimilarityJob with
> Hadoop.  I will take a look at that.  Thx!
>
> I see your point about retraining the data over and over.  But there are
> a couple of other requirements I left out of my original post.
>
> 1. There are amenities in a user's existing neighborhood that she does not
> care for.  For example, if she does not have any children, schools and
> day-care centers in her existing neighborhood (or suggested neighborhoods)
> should not sway the final score.
>
> 2. My next feature is to inject customization.  For example, she may want
> to move to a pet-friendly area because her current neighborhood does not
> have enough facilities (e.g. vets, pet stores, pet-friendly parks) for her
> pet.
>
> If I pre-calculate the row matrix of similar neighborhoods, I am not sure
> how I can implement the customization (by adding or removing amenities
> requirements at runtime).  Any thoughts on that?
>
> Thanks for the reminder on mahout FastID.  It could easily be a newbie
> mistake to use a regular int or long for mapping.
>
> Thanks again for your help.  Much appreciated!
>
>
>
>
> On Thu, Jul 10, 2014 at 1:40 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
>
> > Doing things this way you are using the neighborhoodID as a proxy for
> > userID/rowID in the recommender. I don’t see the benefit of the in-memory
> > version here since all output can easily be pre-calculated. Then it will
> > only be a lookup at runtime. You can use “rowsimilarity” on a single
> > machine without setting up a cluster, just use the local filesystem. This
> > is the way I’d do it.
> >
> > You definitely don’t want to “only load the columns (amenities) correlate
> > to a selected user” with an in-memory recommender. This loading of data
> > will trigger a retraining of the recommender before you can ask it for
> > similar neighborhoods and that will take more time than you want. This
> > potentially will happen as each new user visits your app.
> >
> > However if you train on all amenities for all neighborhoods then the
> > in-memory recommender should work and would train only once. Your data
> > would look like: (neighborhoodID, cafesID, numberOfCafes) and so on for
> > every non-zero cell in the table. And remember that ALL IDs must be
> > Mahout IDs; you can’t use your own IDs. Mahout IDs correspond to matrix
> > coordinates; they are ordinal ints. Think of them as the row and column
> > number of the table.
> >
> > On Jul 10, 2014, at 10:45 AM, Edith Au <edith...@gmail.com> wrote:
> >
> > Thank you so much for the suggestions.  It took me some time to figure
> > things out but I believe I have a pretty good grip on what needs to be
> > done now. My dataset is small enough to fit on a single machine so I am
> > going to use an in-memory implementation rather than Hadoop.  As
> > suggested by both Pat and Manuel, I have a table (in the file system)
> > with neighborhoods as rows and amenities as columns.  At runtime, I will
> > only load the columns (amenities) correlate to a selected user and do a
> > UserSimilarity operation between each neighborhood and the one the user
> > resides in.  After that, I can pick up the NearestNUserNeighborhoods for
> > results.
> >
> > I gather UserSimilarity is the in-memory equivalent of RowSimilarity
> > (Hadoop) ?  It would be great if someone can confirm it!
> >
> > Thanks again Pat and Manuel!
> > Edith
> >
> >
> > On Wed, Jul 2, 2014 at 4:06 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> >
> >> If you are looking to recommend a similar neighborhood based on the
> >> characteristics of some other neighborhood (the user’s current one) then
> >> you wouldn’t use collaborative filtering. This is a metadata recommender
> >> based on similarity of neighborhoods, not a collection of user
> >> preferences.
> >>
> >> The easiest and fastest would be to use a search engine but I’ll leave
> >> that for now since it doesn’t account for feature weights as well.
> >>
> >> create a table like this:
> >> Neighborhood    Gym    Cafe    Bookstore
> >> Downtown        15     50      0
> >> Midtown         30     100     10
> >> …
> >>
> >> You will need to convert the row IDs into sequential ints, which Mahout
> >> uses for IDs. Then read them into a sequenceFile creating a Distributed
> >> Row Matrix, which has Key-Value pairs. Keys = the integer neighborhood
> >> IDs, the Value is a Vector (a sort of list) of column integer IDs with
> >> the counts.
> >>
> >> Then run rowsimilarity on the DRM. This is the CLI but there is also a
> >> Driver you can call from your code.
> >>
> >> There are some data prep issues you will have since larger neighborhoods
> >> will have higher counts. An easy thing to do would be to normalize the
> >> counts by something like population or physical size so you get cafes
> >> per resident or per sq mile or some other ratio.
> >>
> >> The result of the rowsimilarity job will be another DRM of key =
> >> neighborhood ID, values = Vector of similar neighborhoods (by integer
> >> ID) with a strength of similarity. Sort the vector by strength and
> >> you’ll have an ordered list of similar neighborhoods for each
> >> neighborhood.
> >>
> >> On Jun 30, 2014, at 12:48 PM, Edith Au <edith...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >>
> >> I am a newbie and am looking for some guidance to implement my
> >> recommender.  Any help would be greatly appreciated.  I have a small
> >> data set of location information with the following fields:
> >> neighborhood, amenities, and counts.  For example:
> >>
> >> Downtown          Gym 15
> >> Downtown          Cafe 50
> >> …
> >> Midtown             Gym 30
> >> Midtown             Cafe 100
> >> Midtown             Bookstore 10
> >> ...
> >> Financial Dist
> >> …
> >>
> >>
> >> so on and so forth.  I want to recommend a neighborhood for a user to
> >> reside in, based on the amenities (and some other metrics) in his/her
> >> current neighborhood.  My understanding is that model-based
> >> recommendation would be a good fit for the job.  If I am on the right
> >> track, is there an experimental/beta recommender I can try?
> >>
> >>
> >> If there is no such recommender yet, can I still use Mahout for my
> >> project?  For example, can I implement my own Similarity which only
> >> computes the similarity between one user's preference and a set of
> >> neighborhoods?  If I understand Mahout correctly, User/Item Similarity
> >> would do N x (N-1) pairs of comparisons as opposed to 1 x N comparisons.
> >> In my example, User/Item Similarity would compare between Downtown,
> >> Midtown, Fin Dist -- which would be a waste of computation resources
> >> since the comparisons are not needed.
> >>
> >>
> >> Thanks in advance for your help.
> >>
> >> Edith
> >>
> >>
> >
> >
>
>
