On 30 Nov 2003, at 19:00, Pier Fumagalli wrote:

On 29 Nov 2003, at 22:05, Stefano Mazzocchi wrote:

I really don't care how this is implemented. Lucene, SQL-translation, flat files, hashtables... I really don't care, and it doesn't really have to be fast... I'm way more concerned about scalability than speed. Scalability in terms of number of documents in the repository.

On a side note in terms of scalability... Currently our Lucene is indexing 95569 documents, and it's a real breeze... This year we corrupted the index twice, in both cases not due to Lucene, but due to the fact that the VM crashed while Lucene was optimizing the indexes.

FWIW, Lucene scales perfectly, and we never had a single problem with it...

I perfectly understand that Lucene is a breeze for you, but you are using it "the right way", the way it was designed to be used: as an inverted token index.

My concerns (but I admit my ignorance of Lucene internals, so these are just loud concerns, not facts) are only about using it in a hybrid way.

Lucene's scalability is not impaired by the number of documents. You basically create a document/token matrix, then build a hashtable of the tokens, and a lookup gives you the documents (modulo how ranking is performed: through, I believe, sorting by Euclidean distance in the document vector space between the query and the documents found).
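To make the hashtable-of-tokens idea concrete, here is a minimal sketch in Python (a toy model, not Lucene's actual data structures; the corpus and doc ids are invented for illustration). Each token maps to the set of documents containing it, so a single-token lookup is a dictionary access, independent of how many documents are indexed:

```python
from collections import defaultdict

# Hypothetical toy corpus: doc id -> text.
docs = {
    1: "lucene is an inverted index",
    2: "an index maps tokens to documents",
    3: "ranking uses the vector space model",
}

# Build the token -> set-of-doc-ids hashtable (the document/token
# "matrix" above, stored sparsely).
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)

def search(query):
    """Return ids of documents containing every query token."""
    postings = [index.get(tok, set()) for tok in query.split()]
    return set.intersection(*postings) if postings else set()

print(sorted(search("index")))  # -> [1, 2]
```

The per-token lookup cost here depends on the number of distinct tokens, not the number of documents, which is why plain full-text queries scale so well.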

That's nice; it has been used for decades in all full-text search engines and can be optimized a lot (and Lucene is a nice implementation of those algorithms).

But how do I use this for something that looks a lot like a relational query?

My biggest fear is hitting O(n) complexity: it might still run like a breeze with 100 documents, but it could crawl on its knees by the time you reach 10000... and by the time you realize this, it's exactly when you need the repository the most, because your data has grown big and unmanageable without a repository!

Eric suggests that there could be ways to index documents and their properties into Lucene and then use DASL on it. What I want to understand is the algorithmic complexity of such an approach.

If it can be made O(1) or even O(log(n)), I'm sold. But if it is O(f(n)), where n is the number of documents and f(n) grows faster than log(n), well, we have a problem.
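To illustrate the difference these complexity classes make for a property query (again a toy Python sketch, not Lucene or DASL; the numeric "property" values are invented): a range query answered by scanning every document costs O(n), while keeping the property values sorted once lets binary search answer the same query in O(log n) plus the size of the result.

```python
import bisect

# Hypothetical per-document property (e.g. a numeric last-modified
# timestamp), keyed by doc id.
props = {doc_id: doc_id * 10 for doc_id in range(1, 10001)}

# O(n): scan every document for "property > threshold" -- the behaviour
# to avoid, since cost grows linearly with repository size.
def scan_greater(threshold):
    return [d for d, v in props.items() if v > threshold]

# O(log n) per query: sort the (value, doc) pairs once, then use binary
# search to find the cut point.
sorted_pairs = sorted((v, d) for d, v in props.items())
values = [v for v, _ in sorted_pairs]

def bisect_greater(threshold):
    i = bisect.bisect_right(values, threshold)
    return [d for _, d in sorted_pairs[i:]]

# Both strategies return the same documents; only the cost differs.
assert sorted(scan_greater(99950)) == sorted(bisect_greater(99950))
```

The worry in the paragraph above is exactly whether a DASL query pushed through a full-text index behaves like `scan_greater` or like `bisect_greater` as the repository grows.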

--
Stefano.



