Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2009-04-21 Thread Thomas Pönitz
Hi, I have the same problem as discussed here: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200511.mbox/%3c200511021310.18686...@last.fm%3e I want to specify termvectors directly instead of constructing a dummy string like "a a a b b c" that will be transformed to a[3] b[2] c[1].
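
As an illustration of the workaround described above, here is a minimal sketch in plain Python (not Lucene code) that expands a term-to-frequency mapping into the whitespace-separated dummy string, then counts the tokens again to show that the frequencies round-trip. The function names `expand_term_vector` and `count_terms` are made up for this example.

```python
from collections import Counter


def expand_term_vector(term_freqs):
    """Repeat each term `freq` times and join with spaces (the dummy string)."""
    return " ".join(
        term for term, freq in term_freqs.items() for _ in range(freq)
    )


def count_terms(text):
    """Stand-in for whitespace tokenization followed by frequency counting."""
    return Counter(text.split())


# Hypothetical input matching the example in the message above.
dummy = expand_term_vector({"a": 3, "b": 2, "c": 1})
recovered = count_terms(dummy)
```

The round trip shows why the workaround feels wasteful: the string grows with the sum of all frequencies only so that the indexer can count the tokens back down again.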

Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Richard Jones
Hi, I'm using lucene (which rocks, btw ;) behind the scenes at www.last.fm for various things, and i've run into a situation that seems somewhat inelegant regarding populating fields which i already know the termvector for. I'm creating a document for each user (last.fm tracks music taste for pe

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2009-04-24 Thread Michael McCandless
I don't think there's an easy way to jump straight from term + freq per doc to a Lucene index. Mike On Tue, Apr 21, 2009 at 7:14 AM, Thomas Pönitz wrote: > Hi, > > I have the same problem as discussed here: > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200511.mbox/%3c200511021310.1

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Michael D. Curtin
Richard Jones wrote: Hi, I'm using lucene (which rocks, btw ;) behind the scenes at www.last.fm for various things, and i've run into a situation that seems somewhat inelegant regarding populating fields which i already know the termvector for. I'm creating a document for each user (last.fm t

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Erik Hatcher
On 2 Nov 2005, at 08:10, Richard Jones wrote: If i've listened to Radiohead (id 1) 10 times, Coldplay (id 2) 5 times and Beck (id 3) 2 times, the field would look like this "1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 3 3" I use this index for quickly finding "top fans" of an artist or combination of

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Richard Jones
> I can think of a few ways. If elegance is your goal, then a little > relational database theory might help. Specifically, instead of having > one record per listener, have one record per listener-artist > combination, with three fields: listenerid, artistid, and count. Your > example above wo
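
A rough sketch of the one-record-per-listener-artist layout quoted above, using an in-memory SQLite table instead of Lucene purely for illustration. The table name `listens`, the helper `top_fans`, and the listener ids 42 and 7 are assumptions made for this example; only the three field names come from the quoted suggestion.

```python
import sqlite3

# In-memory database; one row per (listener, artist) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE listens (
        listenerid INTEGER NOT NULL,
        artistid   INTEGER NOT NULL,
        "count"    INTEGER NOT NULL,
        PRIMARY KEY (listenerid, artistid)
    )
    """
)

# Hypothetical data: listener 42 has the play counts from the thread's
# example (artist 1 x10, artist 2 x5, artist 3 x2); listener 7 is invented.
rows = [(42, 1, 10), (42, 2, 5), (42, 3, 2), (7, 1, 3)]
conn.executemany("INSERT INTO listens VALUES (?, ?, ?)", rows)


def top_fans(conn, artistid, limit=10):
    """Return (listenerid, count) pairs for one artist, highest count first."""
    cur = conn.execute(
        'SELECT listenerid, "count" FROM listens '
        'WHERE artistid = ? ORDER BY "count" DESC LIMIT ?',
        (artistid, limit),
    )
    return cur.fetchall()


fans_of_artist_1 = top_fans(conn, 1)
```

With this layout the play count is stored as a number, so no repeated-token string is needed; the trade-off is many more, much smaller records.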

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Richard Jones
Hi Erik Our lucene-powered music search went live this week, so your search should work now: http://www.last.fm/explore/search.php?q=Michael+Hedges Before we discovered lucene our search sucked *really* badly ;) Adding multiple fields like this is similar to what i'm doing now (i am using whites

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Grant Ingersoll
Not sure if this is feasible, but is there some way you could use a "fake" analyzer that you constructed using your hashtable/termvector and then have it output the tokens directly from the hashtable via the TokenStream? Maybe you would have to pass in an empty/dummy string to the field constru
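
The idea of emitting tokens straight from the hashtable can be sketched with a plain Python generator. This is only a stand-in for the concept and does not use or mirror Lucene's actual TokenStream API; the name `term_vector_tokens` is hypothetical.

```python
from collections import Counter


def term_vector_tokens(term_freqs):
    """Yield each term `freq` times, one token at a time.

    No intermediate string is built; a consumer that counts the tokens
    it receives ends up with the original frequencies.
    """
    for term, freq in term_freqs.items():
        for _ in range(freq):
            yield term


# Hypothetical term vector: artist ids as terms, play counts as frequencies.
plays = {"1": 10, "2": 5, "3": 2}
seen = Counter(term_vector_tokens(plays))
```

Because the tokens are produced lazily, memory use stays proportional to the number of distinct terms, not to the total number of plays.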

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Michael D. Curtin
Richard Jones wrote: The data i'm dealing with is stored over a few mysql dbs on different machines, horizontally partitioned so each user is assigned to a single db. The queries i'm doing can be done in SQL in parallel over all machines then combined, which i've tested - it's unacceptably slo

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Richard Jones
> If you're willing to continue subsetting / summarizing the data out into > Lucene, how about subsetting it out into a dedicated MySQL instance for > this purpose? 100 artists * 1M profiles * 2 ints * 4 bytes/int = > roughly 1 GB of data, which would easily fit into RAM. Queries should > be pret
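
The size estimate quoted above can be checked with a few lines of arithmetic. Reading "100 artists" as 100 artists per profile, and "2 ints" as an artist id plus a play count, are assumptions about what the quoted figures mean.

```python
# Figures taken from the quoted estimate.
artists_per_profile = 100   # assumed to mean per profile
profiles = 1_000_000
ints_per_entry = 2          # assumed: artist id and play count
bytes_per_int = 4

total_bytes = artists_per_profile * profiles * ints_per_entry * bytes_per_int

# Express the total in decimal gigabytes and in binary gibibytes.
total_gb = total_bytes / 10**9
total_gib = total_bytes / 2**30
```

The product is 800,000,000 bytes, so "roughly 1 GB" is a slightly generous rounding of about 0.8 GB, before any per-row or index overhead in the database.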

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Michael D. Curtin
Richard Jones wrote: If you're willing to continue subsetting / summarizing the data out into Lucene, how about subsetting it out into a dedicated MySQL instance for this purpose? 100 artists * 1M profiles * 2 ints * 4 bytes/int = roughly 1 GB of data, which would easily fit into RAM. Queries

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Richard Jones
> Ah, so the fact that "1" actually appears many times in the string you > give Lucene is important. Neat application! > > Sounds like the custom Analyzer (really a custom TokenStream) approach > suggested by others may be the way for you to go. If the information > you get from the MySQL profile