new to lucene, non standard index

2011-05-05 Thread Chris Schilling
Hi, I am trying to figure out how to solve this problem: I have about 500,000 files that I would like to index, but the files are structured. So, each file has the following layout:

doc1
token1, weight11, frequency1, weight21
token2, weight12, frequency2, weight22
. . .

etc. for 500,000 docs.

Re: new to lucene, non standard index

2011-05-05 Thread Mike Sokolov
Are the tokens unique within a document? If so, why not store a document for every doc/token pair with fields: id (doc#/token#), doc-id (doc#), token, weight1, weight2, frequency. Then search for token, sort by weight1, weight2 or frequency. If the token matches are unique within a document you will
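A minimal sketch of the layout suggested above, written in plain Python rather than against the Lucene API: each doc/token pair becomes its own flat record carrying the listed fields. The `Record` type, the `flatten` helper, the input shape, and the sample values are all hypothetical and exist only for illustration.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One flat record per doc/token pair (hypothetical type, not a Lucene class)."""
    id: str          # "doc#/token#", unique across the whole index
    doc_id: str      # the source document
    token: str
    weight1: float
    weight2: float
    frequency: int


def flatten(doc_id, entries):
    """Turn one structured file's entries into one record per token.

    `entries` maps token -> (weight1, weight2, frequency). This input shape
    is an assumption; the thread does not give the exact file format.
    """
    return [
        Record(f"{doc_id}/{token}", doc_id, token, w1, w2, freq)
        for token, (w1, w2, freq) in entries.items()
    ]


# Made-up sample data, for illustration only.
records = flatten("doc1", {"alpha": (0.10, 0.20, 3), "beta": (0.30, 0.40, 7)})
```

Because every record has the same small set of fields, a search on `token` followed by a sort on one of the numeric fields needs no per-document field lookup.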

Re: new to lucene, non standard index

2011-05-05 Thread Chris Schilling
Hey Mike, Let me clarify: The tokens are not unique. Let's say doc1 contains the token foo and has the properties weight1 = 0.75, weight2 = 0.90, frequency = 10. Now, let's say doc2 also contains the token foo with properties weight1 = 0.8, weight2 = 0.75, frequency = 5. Now, I want to search
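To make the example concrete, here is a small sketch in plain Python (not Lucene) that stores the two foo entries above as one row per doc/token pair and ranks the matching docs by a chosen field. The tuple layout, the `search` helper, and the choice of descending order are assumptions made for illustration; the message itself is cut off before it states the desired query.

```python
# Values taken from the message above; the tuple layout is an assumption.
rows = [
    # (doc_id, token, weight1, weight2, frequency)
    ("doc1", "foo", 0.75, 0.90, 10),
    ("doc2", "foo", 0.80, 0.75, 5),
]

# Column position of each sortable field within a row.
FIELDS = {"weight1": 2, "weight2": 3, "frequency": 4}


def search(rows, token, sort_by):
    """Return ids of docs containing `token`, highest `sort_by` value first."""
    col = FIELDS[sort_by]
    hits = [r for r in rows if r[1] == token]
    return [r[0] for r in sorted(hits, key=lambda r: r[col], reverse=True)]


by_weight1 = search(rows, "foo", "weight1")      # ['doc2', 'doc1']
by_weight2 = search(rows, "foo", "weight2")      # ['doc1', 'doc2']
by_frequency = search(rows, "foo", "frequency")  # ['doc1', 'doc2']
```

The same token can appear in many docs, but each doc/token pair appears once, so every hit maps to exactly one doc.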

Re: new to lucene, non standard index

2011-05-05 Thread Chris Schilling
Oh, yes, they are unique within a document. I was also thinking about something like this. But I would be replacing a large number of fields within a document by a large number of documents. Let me see if I can work that out. On May 5, 2011, at 3:01 PM, Mike Sokolov wrote: > Are the tokens

Re: new to lucene, non standard index

2011-05-05 Thread Mike Sokolov
I think the solution I gave you will work. The only problem is if a token appears twice in the same doc: doc1 has foo with two different sets of weights and frequencies... but I think you're saying that doesn't happen. On 05/05/2011 06:09 PM, Chris Schilling wrote: Hey Mike, Let me clarify:

Re: new to lucene, non standard index

2011-05-05 Thread Chris Schilling
Hey Mike, My only concern is that I am replacing a large number of fields inside of a Document with a (very large ~50e6) number of Documents. Will I not run into the same memory issues? Or do I create only one doc object and reuse it? With so many Doc/Token pairs, won't searching the index t
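As a rough illustration of the producer side of this concern, the sketch below streams doc/token pairs one at a time instead of building all of them in memory first. It is plain Python, not Lucene: `iter_pairs`, `index_all`, and the `add_record` callback (a stand-in for whatever writes to the index) are hypothetical names, and the figure of roughly 100 tokens per file is only inferred from 50e6 pairs over 500,000 files. It says nothing about how Lucene itself behaves at search time.

```python
def iter_pairs(files):
    """Yield one (doc_id, token, weight1, weight2, frequency) tuple at a time.

    `files` is a hypothetical iterable of (doc_id, entries) pairs, where
    `entries` maps token -> (weight1, weight2, frequency).
    """
    for doc_id, entries in files:
        for token, (w1, w2, freq) in entries.items():
            yield (doc_id, token, w1, w2, freq)


def index_all(files, add_record):
    """Pass each pair to `add_record` as it is produced and return the count."""
    count = 0
    for pair in iter_pairs(files):
        add_record(pair)  # only the current pair is held by this loop
        count += 1
    return count


# 500,000 files at roughly 100 tokens each gives the ~50e6 pairs mentioned above.
estimated_pairs = 500_000 * 100
```

With this shape, peak memory on the producer side depends on one file's entries rather than on the total number of pairs.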

Re: new to lucene, non standard index

2011-05-06 Thread Michael Sokolov
I believe creating a large number of fields is not a good match w/the underlying architecture, and you'd be better off w/a large number of documents/small number of fields, where the same field occurs in every document. There is some discussion here: http://markmail.org/message/hcmt5syca7zdeac