Using a MySQL DataModel reduced both the time and memory footprint of the
example. There is no BooleanPrefUserMySQLJDBCDataModel, though, so I had to
make my own.

I made a subclass of MySQLJDBCDataModel and had to copy some of the code
from AbstractJDBCDataModel as well. I'm thinking there could be a better way
to do this, but it would require a bit of a refactor; I might give it a go
today if I have time. I'd also like to have the queries injected, so we
could make a pure JDBC model and not require a subclass for each SQL
database out there. I'll have a look at that too.

But anyway, back to the initial subject of this thread:

It's taking ages to import all my data into MySQL (I'm at 2.5 of 5 million
rows - a power cut over the weekend meant it hadn't finished before I
arrived at work, grrr). Still, on 2.5 million rows with no index on the
user_id column, it's taking about a minute to get a recommendation. I'm
guessing that with an index on the user_id column it will be even quicker
(especially if I'm not also importing millions of rows in the background).

The only issue I have with recommendations from
BooleanTanimotoCoefficientSimilarity is that there is no way to order the
results, as they all come out with a value of 1, so the least relevant item
may end up at the top. So instead of using a recommender, what I do is get
the items from the 20 nearest neighbours, remove from that list the items my
user already has, and then tally up the remainder, so that the items which
appear most often go higher up the list. Like so:

        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(20, userSimilarity, model);

        User user = model.getUser("16501");

        // Items the user already has
        List<Item> userItems = new ArrayList<Item>();
        for (Preference pref : user.getPreferences()) {
            userItems.add(pref.getItem());
        }

        // SortedCountSet's iterator is ordered by how many times a given
        // item has been added (highest frequency first)
        Set<Item> recommendations = new SortedCountSet<Item>();

        Collection<User> userNeighborhood =
            neighborhood.getUserNeighborhood(user.getID());

        // Tally the neighbours' items, skipping those the user already has
        for (User neighbour : userNeighborhood) {
            for (Preference pref : neighbour.getPreferences()) {
                Item item = pref.getItem();
                if (userItems.contains(item)) {
                    continue;
                }
                recommendations.add(item);
            }
        }
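
In case it helps anyone, SortedCountSet is just a little helper class of my
own, not part of Taste: essentially a multiset that hands items back in
order of how often they were added. A simplified sketch of the idea (class
and method names here are mine, and I've dropped the Set interface to keep
it short):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: counts how many times each item is added, and returns
// items ordered from most-frequently to least-frequently added.
public class FrequencyCounter<T> {

    private final Map<T, Integer> counts = new HashMap<T, Integer>();

    public void add(T item) {
        Integer current = counts.get(item);
        counts.put(item, current == null ? 1 : current + 1);
    }

    // Items sorted by descending add-count.
    public List<T> byFrequency() {
        List<Map.Entry<T, Integer>> entries =
            new ArrayList<Map.Entry<T, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<T, Integer>>() {
            public int compare(Map.Entry<T, Integer> a, Map.Entry<T, Integer> b) {
                return b.getValue() - a.getValue();
            }
        });
        List<T> result = new ArrayList<T>();
        for (Map.Entry<T, Integer> e : entries) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        FrequencyCounter<String> c = new FrequencyCounter<String>();
        c.add("beer"); c.add("wine"); c.add("beer");
        c.add("beer"); c.add("wine"); c.add("cider");
        System.out.println(c.byFrequency()); // prints [beer, wine, cider]
    }
}
```

The real class wraps this in the Set interface so it drops straight into
the loop above.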

As a n00b to Mahout (Taste) I may have missed a recommender which does this
for me?

On Fri, May 1, 2009 at 7:10 PM, Sean Owen <[email protected]> wrote:

> As a small follow-up on this, here's a small result that should hold --
>
> Setting the sampling rate to, say, 1/X (i.e. if you set it to 20%,
> X=5), should reduce the time spent in finding a neighborhood by a
> factor of X. Of course. Assuming users are pretty evenly scattered
> around your rating-space, the average distance to users in your
> computed neighborhood also increases by a factor of X.
>
> So you get results X times faster, but the results you get are X times
> 'worse'. This sounds bad but consider that users 5 times farther away
> in your rating-space may still be suitable neighbors and yield the
> same recommendations.
>
> On Fri, May 1, 2009 at 8:32 AM, Sean Owen <[email protected]> wrote:
> > It really depends on the nature of the data and what tradeoff you want
> > to make. I have not studied this in detail. Anecdotally, on a
> > large-ish data set you can ignore most users and still end up with an
> > OK neighborhood.
> >
> > Actually I should do a bit of math to get an analytical result on
> > this, let me do that.
>



-- 
---------------------------------------------
Paul Loy
[email protected]
http://www.keteracel.com/paul
