Richard Jones wrote:
Hi,
I'm using Lucene (which rocks, btw ;) behind the scenes at www.last.fm for various things, and I've run into a situation that seems somewhat inelegant: populating fields for which I already know the term vector.

I'm creating a document for each user (last.fm tracks music taste for people), with a field that holds a user's favourite 500 artists. Each artist is represented by an integer. Here's a simple example with 3 artists:

If I've listened to Radiohead (id 1) 10 times, Coldplay (id 2) 5 times and Beck (id 3) 2 times, the field would look like this: "1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 3 3"

I use this index for quickly finding "top fans" of an artist or combination of artists, comparing people's music taste and other things on the fly. The issue is that I already have the term vector (radiohead=10, coldplay=5, beck=2) handy as a hashtable, and I've found myself building up a string of numbers separated by spaces as shown above, then feeding this into Lucene (I store the term vector of the field in Lucene). Is there a way I could pass a term vector directly to Lucene to cut out the ugly "turn it into a string and let Lucene parse it" step? Basically, I want to provide the term vector for a field when inserting a new document, rather than let Lucene build it by analyzing a string.
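
For reference, the string-building step described above might look roughly like this. It is only a sketch in Python rather than the poster's actual code, and the function name termvector_to_field_text is made up for illustration.

```python
def termvector_to_field_text(term_counts):
    """Expand {artist_id: play_count} into a space-separated string
    in which each id is repeated play_count times."""
    tokens = []
    for artist_id, count in term_counts.items():
        # Repeat the id once per play so an analyzer would count it `count` times.
        tokens.extend([str(artist_id)] * count)
    return " ".join(tokens)


# Radiohead (1) x10, Coldplay (2) x5, Beck (3) x2
field_text = termvector_to_field_text({1: 10, 2: 5, 3: 2})
```

The resulting string is what would then be handed to the analyzer, which has to count the repeated tokens all over again.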

I can think of a few ways. If elegance is your goal, then a little relational database theory might help. Specifically, instead of having one record per listener, have one record per listener-artist combination, with three fields: listenerid, artistid, and count. Your example above would then look like
listenerid  artistid  count
----------  --------  -----
         X         1  00010
         X         2  00005
         X         3  00002
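
As a rough illustration of the layout in the table, the per-listener hashtable could be flattened into one record per listener-artist pair like this. This is a Python sketch with hypothetical names (listener_rows, width), not Lucene code. The counts are zero-padded because a term range in Lucene compares terms as strings, so "5" would otherwise sort after "10".

```python
def listener_rows(listener_id, term_counts, width=5):
    """Turn {artist_id: play_count} into one record per listener-artist pair.

    Counts are zero-padded to `width` digits so that comparing them as
    strings gives the same order as comparing them as numbers.
    """
    rows = []
    for artist_id, count in term_counts.items():
        rows.append({
            "listenerid": str(listener_id),
            "artistid": str(artist_id),
            "count": str(count).zfill(width),
        })
    return rows


rows = listener_rows("X", {1: 10, 2: 5, 3: 2})
```

Each dictionary here would become one small document with three keyword-style fields.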

You could compose queries to get all artists somebody ever listened to (listenerid:X), all Radiohead listeners (artistid:1), anybody who listened to Coldplay 5 or more times (artistid:2 AND count:[00005 TO 99999]) or what-have-you. This approach would require two-stage processing for queries of the form "find everybody who listened to Radiohead three times and Coldplay twice", though, because each condition matches a different record and the matching listener ids then have to be intersected.
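
The two-stage processing could be sketched as follows. This is plain Python over an in-memory list standing in for the index. The listeners Y and Z and their counts are invented sample data, and the sketch assumes "three times" means "at least three times". Stage one runs one query per artist condition, and stage two intersects the listener ids that come back.

```python
# (listenerid, artistid, count) records; Y and Z are made-up sample data.
ROWS = [
    ("X", 1, 10), ("X", 2, 5), ("X", 3, 2),
    ("Y", 1, 3), ("Y", 2, 1),
    ("Z", 1, 4), ("Z", 2, 2),
]


def listeners_matching(rows, artist_id, min_count):
    """Stage 1: listeners with at least min_count plays of one artist."""
    return {listener for (listener, artist, count) in rows
            if artist == artist_id and count >= min_count}


def listeners_matching_all(rows, conditions):
    """Stage 2: intersect the stage-1 results for every condition."""
    result = None
    for artist_id, min_count in conditions:
        matched = listeners_matching(rows, artist_id, min_count)
        result = matched if result is None else result & matched
    return result if result is not None else set()


# Radiohead (1) at least 3 times and Coldplay (2) at least twice
fans = listeners_matching_all(ROWS, [(1, 3), (2, 2)])
```

With the sample data, fans contains X and Z. Y drops out in the second stage because of the single Coldplay play.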

Really, though, your problem sounds more like a relational db problem than a text search problem. A simple MySQL database with a few tables might be a better fit ...
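
To give an idea of what the relational version might look like, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The table name listens, the column name play_count (the "count" field above) and the sample listeners Y and Z are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listens ("
    " listenerid TEXT, artistid INTEGER, play_count INTEGER,"
    " PRIMARY KEY (listenerid, artistid))"
)
conn.executemany(
    "INSERT INTO listens VALUES (?, ?, ?)",
    [("X", 1, 10), ("X", 2, 5), ("X", 3, 2),
     ("Y", 1, 3), ("Y", 2, 1),
     ("Z", 1, 4), ("Z", 2, 2)],
)


def top_fans(conn, artist_id, limit=10):
    """Listeners of one artist, most plays first."""
    cur = conn.execute(
        "SELECT listenerid, play_count FROM listens"
        " WHERE artistid = ? ORDER BY play_count DESC LIMIT ?",
        (artist_id, limit),
    )
    return cur.fetchall()


def fans_of_both(conn, artist_a, min_a, artist_b, min_b):
    """Listeners meeting a minimum play count for two artists (self-join)."""
    cur = conn.execute(
        "SELECT a.listenerid FROM listens a"
        " JOIN listens b ON b.listenerid = a.listenerid"
        " WHERE a.artistid = ? AND a.play_count >= ?"
        " AND b.artistid = ? AND b.play_count >= ?"
        " ORDER BY a.listenerid",
        (artist_a, min_a, artist_b, min_b),
    )
    return [row[0] for row in cur.fetchall()]


radiohead_fans = top_fans(conn, 1)
both = fans_of_both(conn, 1, 3, 2, 2)
```

Here the "Radiohead and Coldplay" question becomes a single self-join instead of two separate queries whose results are merged by hand.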

--MDC
