On 29 Nov 2003, at 18:15, Erik Hatcher wrote:
On Saturday, November 29, 2003, at 11:47 AM, Stefano Mazzocchi wrote:
I would AND together a PrefixQuery for URI "/files/whatever" (allowing it to search a sub-tree rooted there) with a RangeQuery on field "contentlength" for values of 40000 and greater.
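Erik's combined query could be sketched, in toy form, like this (documents and field values are invented for illustration; this is a naive scan that mirrors the query's *semantics*, not Lucene's inverted-index mechanics):

```python
# Toy sketch of ANDing a prefix match on "path" with a range match on
# "contentlength".  In Lucene both clauses run against the index's
# sorted term dictionary; here plain dicts stand in for clarity.
docs = {
    1: {"path": "/files/whatever/a.xml", "contentlength": 120000},
    2: {"path": "/files/whatever/b.jpg", "contentlength": 9000},
    3: {"path": "/files/other/c.xml",    "contentlength": 50000},
}

def prefix_query(field, prefix):
    """Doc ids whose field value starts with the given prefix."""
    return {d for d, f in docs.items() if f[field].startswith(prefix)}

def range_query(field, lower):
    """Doc ids whose field value is >= the lower bound."""
    return {d for d, f in docs.items() if f[field] >= lower}

# AND the two clauses together by intersecting the result sets
hits = prefix_query("path", "/files/whatever") & range_query("contentlength", 40000)
print(sorted(hits))  # → [1]
```

Only document 1 both lives under "/files/whatever" and is 40000 bytes or larger, so it is the sole hit.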
Hmmm. This seems to have O(n) complexity on scoping. I was aiming for O(1) complexity on scoping and O(n) on the rest (WHERE and ORDER). O(n) on scoping is not going to perform well enough on very large collections of documents.
I'm not sure what you mean by scoping here (the URI path?)
yes, "scoping a DASL query" means "restricting the URL space where it applies". This is, IMO, a great benefit for many types of operation.
Please note that, unlike the normal use of text-based engines like Lucene, DASL can be used as a high-level programmatic way to discover documents. For example, in my current project I would like to have a publishing system capable of processing XML documents stored in a repository, but they should follow a little workflow, so I would like to say:
give me all the documents that reached the final stage of approval in this particular folder
to me, the *killer* feature of WebDAV is the concept of a "metadata-extensible file system". All POSIX file systems fail to provide this (only ReiserFS 4 implements what they call "pseudo-files", but that's a Linux-only thing). And DASL is the way I can do "ls/dir" while matching my own properties for the values I want.
Of course, this is *NOT* the only use of DASL, but I think it's the most important one for content repositories (and, well, I would *love* to have an xquery-capable DASL compatibility on an XML repository, but that's a different story... hopefully JSR-170 will open up the gates for all that).
I really don't care how this is implemented. Lucene, SQL-translation, flat files, hashtables... I really don't care, and it doesn't really have to be fast... I'm way more concerned about scalability than speed. Scalability in terms of number of documents in the repository.
I need the system to be able to scale to millions of documents, with a few thousand in each directory (all under version control, each file with potentially 100 versions!), and a DASL search on a folder for all the files that have a particular property with a particular value should be subsecond on a normal machine.
After that, I'm happy for a few years :-)
But Lucene can very quickly narrow the range of considered documents down based on the lexicographic range of the terms being queried. For example, if contentlength is a term, only the documents that have a value of 40000 or greater would be considered, and getting to that list of documents is quite rapid.
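That term-range enumeration could be sketched like this (a hypothetical toy index; note that for lexicographic order to agree with numeric order, numeric terms need fixed-width zero-padding, which is how classic Lucene indexes stored numbers):

```python
import bisect

def pad(n, width=12):
    """Zero-pad a number so lexicographic order matches numeric order
    (otherwise "9000" > "40000" as strings)."""
    return str(n).zfill(width)

# term -> posting list (doc ids); the term dictionary is kept sorted,
# as in a real inverted index.  Values are invented for illustration.
postings = {
    pad(9000):   [2],
    pad(40000):  [5],
    pad(120000): [1, 7],
}
terms = sorted(postings)

# binary-search to the first term >= 40000, then take everything after it
start = bisect.bisect_left(terms, pad(40000))
hits = sorted({d for t in terms[start:] for d in postings[t]})
print(hits)  # → [1, 5, 7]
```

The documents with contentlength 9000 are never visited at all; the search jumps straight to the first qualifying term.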
suppose I have a repository like this
/n/m/document.xml
/n/m/image_1.jpg
...
/n/m/image_q.jpg
where n = {1...1000}, m = {1...1000}, q = {0...10}
[consider it an asset repository]
the query that I will run the most on this repository (SQL-ized from DASL) is
SELECT displayname,lastmodifiedtime FROM /n WHERE blah:published = "true"
how do you envision lucene implementing this?
[I'm not being critical or caustic, I'm just very curious, really!]
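One way a Lucene-backed implementation *might* evaluate that query (a hypothetical sketch; the folder "/42", the paths, and the "published" postings below are invented) is two index lookups followed by a set intersection, with no per-document scan of the whole repository:

```python
# Hypothetical plan for: SELECT ... FROM /42 WHERE published = "true"

# posting list for the term published:true (doc ids, invented)
published_true = {10, 11, 12, 99}

# path term -> doc id, as if each document's URL were indexed verbatim
path_index = {
    "/42/7/document.xml":  10,
    "/42/7/image_1.jpg":   11,
    "/900/3/document.xml": 12,
}

# scope: documents whose path term *begins* with "/42/" -- a substring
# match would wrongly pull in paths that merely contain "42"
in_scope = {d for p, d in path_index.items() if p.startswith("/42/")}

hits = sorted(published_true & in_scope)
print(hits)  # → [10, 11]
```

Document 12 is published but lives under "/900", so the prefix test correctly drops it from the scoped result.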
If you're doing full-text searching combined with these types of conditions and want the order to be by how well the documents match your query then Lucene will shine.
yes, but in that case, I think Lucene should handle its own query language through a specific DASL implementation... using a text-oriented search engine for relational stuff is, IMO, a little abusive.
Definitely understood. Keep in mind that Lucene is my hammer at the moment and the world is my nail :) So I'll put forth a very Lucene-centric take on things, since I'm immersed in it at the moment.
This is definitely appreciated: the more approaches we propose, the better the results. As I said, I'm not a database expert and I'm not a lover of any particular database or searching technology... I just want stuff to work the way I need it, that's all.
Traditional relational database type of queries with ORDER BY clauses don't map as well. Ordering, though, can be applied after the query results are returned in this case as you will want to collect all documents that match the query anyway. I'd almost be willing to bet that Lucene will beat most, if not all, relational databases here especially in this case where the hierarchy is being recursively traversed.
Not sure about that.
Lucene is not relational, so it would have to scan the entire list of documents to see whether each belongs to a particular scope or not.
Again, I'm not sure what you mean by scope, but unless you're doing something like a WildcardQuery or FuzzyQuery, it will not have to scan all documents. Lucene "scans" by term, not by document. It is an inverted index, and walking documents is not something done when searching, generally speaking - it walks the terms requested and then gives back the document IDs that match a query.
Yes, I understood this, but the only recurring term in the above query is "published" = "true". Lucene could easily find the list of published documents in the repository, but this could add up to several hundred thousand... then it would have to narrow down the scope by finding out which of those documents have a URL that "begins" with "/n". Note "begins": finding out that the URL *contains* "/n" is *NOT* enough, as it might lead to false positives.
If you index into a Lucene document a field called "path" that looks like filesystem paths: "/files", "/files/whatever", "/files/whatever/..." and then use a PrefixQuery, only the terms that begin with the path specified are enumerated - making it essentially a recursive query, but via a very rapid term-range enumeration under the covers.
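The enumeration Erik describes could be sketched like this (toy paths, invented for illustration): jump via binary search to the first term at or after the prefix in the sorted term dictionary, then walk forward only while terms still start with it.

```python
import bisect

# Sorted term dictionary for a hypothetical "path" field
terms = sorted([
    "/files/a.txt",
    "/files/whatever/x.xml",
    "/files/whatever/y.xml",
    "/other/z.xml",
])

prefix = "/files/whatever"

# binary-search to the first term >= the prefix...
i = bisect.bisect_left(terms, prefix)

# ...then enumerate forward only while terms still match the prefix;
# everything outside the prefix range is never touched
matched = []
while i < len(terms) and terms[i].startswith(prefix):
    matched.append(terms[i])
    i += 1

print(matched)  # → ['/files/whatever/x.xml', '/files/whatever/y.xml']
```

Because the walk stops at the first non-matching term, the cost is proportional to the number of matching terms plus one binary search, not to the size of the whole term dictionary.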
I'm not sure I understood this, can you please elaborate more?
note that since DASL queries will be more or less the same all the time, it is possible to think of a relational model that would optimize them greatly.
Sure. Again, I agree that Lucene may be overkill. But I enjoy this discourse to see where Lucene may fall short. Thus far, I still think Lucene can do the job although certainly not as straightforwardly as a SQL-like query on a relational model.
I really don't care about how straightforward it is as long as it does the job and does it at scale. :-)
-- Stefano.