On Saturday, November 29, 2003, at 05:05 PM, Stefano Mazzocchi wrote:
I'm not sure what you mean by scoping here (the URI path?)
yes, "scoping a DASL query" means "restricting the URL space where it applies". This is, IMO, a great benefit for many types of operation.
And this can be a trivial operation with Lucene. PrefixQuery essentially walks all the terms (aka URI's in this case) that start with a specified string - in other words, it only does as much work as it needs to do, but no more.
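To make the "only as much work as needed" point concrete, here is a toy sketch in plain Java (a stand-in for Lucene's sorted term dictionary, not Lucene's actual API; the URIs and doc ids are made up). A prefix scan over a sorted map touches only the contiguous range of terms that start with the prefix:

```java
import java.util.*;

public class PrefixWalk {
    // Collect the doc ids of every term that starts with prefix, touching
    // only the contiguous range of matching terms in the sorted dictionary.
    static List<Integer> prefixHits(TreeMap<String, List<Integer>> terms, String prefix) {
        List<Integer> hits = new ArrayList<>();
        for (List<Integer> postings : terms.subMap(prefix, prefix + Character.MAX_VALUE).values())
            hits.addAll(postings);
        return hits;
    }

    public static void main(String[] args) {
        // Term dictionary: URI keyword term -> posting list of doc ids.
        TreeMap<String, List<Integer>> terms = new TreeMap<>();
        terms.put("/1/1/document.xml", Arrays.asList(0));
        terms.put("/1/1/image_1.jpg",  Arrays.asList(1));
        terms.put("/2/1/document.xml", Arrays.asList(2));

        System.out.println(prefixHits(terms, "/1")); // only the docs under /1
    }
}
```

Terms outside the prefix range are never visited, which is why scoping a query to a URL subtree costs no more than the subtree itself.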
give me all the documents that reached the final stage of approval in this particular folder
If that flag is in Lucene somehow, it would be, again, a quick (TermQuery in this case) operation. The drawback to Lucene in this case, though, is that updating a document requires removing and re-indexing - so if flags are changing all the time, but the content is not then I would recommend against Lucene for these types of flags and queries.
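The "remove and re-index" cost is easy to see in a toy model (again plain Java, hypothetical terms and doc ids, not Lucene's API): because a document's id is scattered across the posting list of every term it contains, flipping one flag means scrubbing the id from all of them and re-adding everything:

```java
import java.util.*;

public class ReindexCost {
    // Toy inverted index: term -> set of doc ids.
    static Map<String, Set<Integer>> postings = new HashMap<>();

    static void index(int docId, String... terms) {
        for (String t : terms)
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
    }

    static void delete(int docId) {
        // A document's id lives in every term it was indexed under,
        // so an "update" must first scrub it from all of them...
        for (Set<Integer> docs : postings.values())
            docs.remove(docId);
    }

    public static void main(String[] args) {
        index(7, "published:false", "body:cocoon", "body:slide");
        // Flipping one flag means delete + full re-index of the document:
        delete(7);
        index(7, "published:true", "body:cocoon", "body:slide");
        System.out.println(postings.get("published:true")); // contains 7
    }
}
```

This is why frequently-changing flags on otherwise-stable content are a poor fit: the whole document pays the re-indexing cost for a one-term change.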
This is still a fun discourse and allowing me to mentally experiment with Lucene's limitations. I definitely agree that a hybrid approach will be best, where traditional queries go against some type of relational model, but full-text ones go against Lucene indexes. How queries that possibly cross both types are handled is the tricky question now.
I too am enjoying this thread and have just a tidbit to add - I am not yet a Lucene expert and likely never will be to Erik's degree. At Open Text (Livelink) we had the model of SQL queries for metadata and the Open Text Index for full-text search, and we even combined them for permissions filtering. However, we eventually dropped the SQL queries for the metadata in favor of a unified index. Whenever the metadata on a document changed, an event-driven process converted the metadata to XML for indexing, then discarded it. In Livelink there are many document types, including many that are essentially metadata/properties alone. This unified index performed better, had a better hit ratio, and was simpler.
We (no longer Open Text) are taking the same approach with our extensions to Slide where we are treating the XML Descriptor as the primary Store and have a Secondary Store for JDBC or whatever is needed.
In the Livelink design the index was capable of defining "regions" which in effect were the XML elements for the metadata and roughly equated to the SQL database columns. I assume that Lucene has a similar capability?
to me, the *killing* feature of WebDAV is the concept of a "metadata-extensible file system". All POSIX file systems fail to provide this (only ReiserFS4 implements what it calls "pseudo-files", but that's a Linux-only thing). And DASL is the way I can do "ls/dir" while matching my own properties for the values I want.
Yeehaw! Isn't the BeOS filesystem supposed to be kickass in this respect? Isn't this coming to Mac OS X in the future?
I need a system able to scale to millions of documents with a few thousand in each directory (all under version control, each file with potentially 100 versions!), and the DASL search on a folder for all the files that have a particular property with a particular value should be subsecond on a normal machine.
After that, I'm happy for a few years :-)
You're easily pleased, huh?! :)) Ya definitely got grand ideals - but they seem achievable.
suppose I have a repository like this
/n/m/document.xml
/n/m/image_1.jpg
...
/n/m/image_q.jpg
where n = {1...1000}, m = {1...1000}, q = {0...10}
[consider it an asset repository]
the query that I will run the most on this repository (SQL-ized from DASL) is
SELECT displayname,lastmodifiedtime FROM /n WHERE blah:published = "true"
how do you envision lucene implementing this?
If there are queries that run commonly, there are ways to optimize things with Lucene using a Filter. QueryFilter can filter one query based on the results of a previous query (search within search type of thing). Doing a TermQuery on the published flag and then creating a filter is one approach.
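The search-within-search idea reduces to intersecting a cached document-id set with a fresh query's results. A minimal sketch of that mechanism (plain-Java bitsets standing in for Lucene's QueryFilter machinery; the doc ids are made up):

```java
import java.util.BitSet;

public class FilterSketch {
    // Filtered search = intersect the query's hits with a cached filter bitset.
    static BitSet filtered(BitSet queryHits, BitSet filter) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filter);
        return result;
    }

    public static void main(String[] args) {
        // Cached result of the common query (e.g. TermQuery published=true).
        BitSet published = new BitSet();
        published.set(1); published.set(3); published.set(5);

        // Fresh result of the scoping query (e.g. a PrefixQuery on the folder).
        BitSet inFolder = new BitSet();
        inFolder.set(3); inFolder.set(4); inFolder.set(5);

        System.out.println(filtered(inFolder, published)); // {3, 5}
    }
}
```

The payoff is that the expensive common query runs once and its bitset is reused across many scoped searches; each subsequent search pays only the cheap intersection.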
With URI stored as a keyword field, the "/n" recursive descent is a PrefixQuery (aka "starts with") and is very rapid within Lucene.
The SELECT list of fields is not Lucene-like, though - when you retrieve a document from Lucene, you get all of its fields.
Like I said earlier though - with a lot of metadata changes happening without content changes, Lucene is not necessarily ideal. Although there are other tricks like using multiple Lucene indexes, one for metadata and one for content, or perhaps even more indexes based on URI or store or some other granularity like an index per property type?
[I'm not critic or caustic, I'm just very curious, really!]
Me too!
Again, I'm not sure what you mean by scope, but unless you're doing something like a WildcardQuery or FuzzyQuery, it will not have to scan all documents. Lucene "scans" by term, not by document. It is an inverted index and walking documents is not something done when searching, generally speaking - it walks the terms requested and then gives back the document id's that match a query.
Yes, I understood this, but the only recurring term in the above query is "published" = "true". Lucene could easily find the list of published documents in the repository, but this could add up to several hundred thousand... then it would have to nail down the scope by finding which of those documents have a URL that "begins" with "/n"... note "begins": finding that the URL merely contains "/n" is *NOT* enough, as it might lead to false positives.
Note "prefix" in PrefixQuery :) There are two types of indexed fields in Lucene, keyword and text. Keyword fields (which is what would be used for URI) are indexed as a single token, not analyzed and split up into tokens the way content text would be. The best operations on that type of field are "equals" (TermQuery) or "begins with" (PrefixQuery) - both of which do no more work than needed to retrieve the documents requested. I would never propose a "contains" query on URI (although the inefficient WildcardQuery in Lucene could do it).
If you index into a Lucene document a field called "path" that looks like filesystem paths: "/files", "/files/whatever", "/files/whatever/..." and then use a PrefixQuery, only the terms that begin with the path specified are enumerated - making it a recursive query essentially, but in a very rapid term range enumeration under the covers.
I'm not sure I understood this, can you please elaborate more?
Perhaps my description above clarified this more?
For example, in my blog, I have entries within categories. Each blog entry is indexed as a single Lucene Document, with a "path" field like "/Computers/Programming". Someone can navigate to my blog at either the "/", "/Computers", or "/Computers/Programming" level and see all the entries that fall below the requested category (and subcategories). A PrefixQuery is used based on the servlet request path. (on a side note, I also put /YYYY/MM/DD of the blog entry creation date into the path field as another term, making category or date browsing work identically).
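The same toy model shows why putting several path-like terms into one field makes category and date browsing identical (plain Java, invented entries and doc ids - not Lucene's API, just the shape of the idea):

```java
import java.util.*;

public class PathTerms {
    // Toy index: path term -> doc ids. Each blog entry contributes both a
    // category term and a /YYYY/MM/DD term to the same "path" field.
    static TreeMap<String, Set<Integer>> path = new TreeMap<>();

    static void add(int docId, String... terms) {
        for (String t : terms)
            path.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
    }

    // Prefix browse: enumerate only the terms starting with the request path.
    static Set<Integer> browse(String prefix) {
        Set<Integer> hits = new TreeSet<>();
        for (Set<Integer> docs : path.subMap(prefix, prefix + Character.MAX_VALUE).values())
            hits.addAll(docs);
        return hits;
    }

    public static void main(String[] args) {
        add(0, "/Computers/Programming", "/2003/11/29");
        add(1, "/Computers/Hardware",    "/2003/11/28");

        System.out.println(browse("/Computers"));  // both entries
        System.out.println(browse("/2003/11/29")); // entry 0 only
    }
}
```

One PrefixQuery-style lookup on the servlet request path serves "/", "/Computers", "/Computers/Programming", and "/2003/11" alike - the field doesn't care which hierarchy a term belongs to.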
Erik
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
