On Jan 14, 2006, at 7:26 PM, Robert Kirchgessner wrote:

>> Lucene allows the user to change field definitions on the fly.
>> That's like an SQL database which auto-adapts the table definition
>> with each INSERT.  It's impressive that Lucene can do that, but look
>> under the hood and you'll see that it ain't easy, or cheap.
>
> Could you explain this or give me a hint where to look for why it isn't cheap?

1) If the definitions are fixed, it's not necessary to check every single Field object that gets added to see whether or not the field definition has changed.

2) Field object templates can be cached, and clones spawned via a factory method. These clones can already know their FieldInfo data (including their field number), so they wouldn't have to check with a FieldInfos object against their name to retrieve it.

3) Every time you add a document, Lucene constructs a FieldInfos, a DocumentWriter, a FieldsWriter, a FieldInfosWriter, and a TermInfosWriter. If the field definitions are specified in advance, a single instance of each can be cached in IndexWriter (or elsewhere) and reused. However, this is less useful in Lucene than in KinoSearch: Lucene would need to keep opening and closing new IndexOutput streams, whereas KinoSearch, which only writes one segment per IndexWriter, can keep the same streams open.

4) Field numbers can be pre-assigned according to lexically sorted field name. That's a big deal at index-time in KinoSearch because the merge model depends on it; it wouldn't be important right away in Lucene, but there may be some opportunities for optimization.
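
A minimal sketch of points 2 and 4 together, assuming the field definitions are handed over once and never change afterwards. Every class name here (FieldTemplate, FieldValue, FixedSchema) is hypothetical and made up for illustration; none of it is Lucene or KinoSearch API. Field numbers are assigned by sorting the names, and each field spawned from a template already carries its number, so no lookup by name is needed per field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration; not Lucene or KinoSearch API.
final class FieldTemplate {
    final String name;
    final int number;      // assigned once, when the schema is built
    final boolean indexed;
    final boolean stored;

    FieldTemplate(String name, int number, boolean indexed, boolean stored) {
        this.name = name;
        this.number = number;
        this.indexed = indexed;
        this.stored = stored;
    }

    // Factory method: the new field shares this template, so it
    // already knows its field number without consulting a name table.
    FieldValue newField(String value) {
        return new FieldValue(this, value);
    }
}

final class FieldValue {
    final FieldTemplate template;
    final String value;

    FieldValue(FieldTemplate template, String value) {
        this.template = template;
        this.value = value;
    }
}

final class FixedSchema {
    private final Map<String, FieldTemplate> templates = new HashMap<>();

    // definitions maps field name -> {indexed, stored}
    FixedSchema(Map<String, boolean[]> definitions) {
        List<String> names = new ArrayList<>(definitions.keySet());
        Collections.sort(names); // sorted order decides the field numbers
        for (int i = 0; i < names.size(); i++) {
            String name = names.get(i);
            boolean[] flags = definitions.get(name);
            templates.put(name, new FieldTemplate(name, i, flags[0], flags[1]));
        }
    }

    // Looked up once per field name by the caller, then reused for every document.
    FieldTemplate template(String name) {
        FieldTemplate t = templates.get(name);
        if (t == null) {
            throw new IllegalArgumentException("undefined field: " + name);
        }
        return t;
    }
}
```

With "title", "body" and "url" defined, "body" gets number 0, "title" 1 and "url" 2, whatever order they were declared in. Two writers given the same definitions therefore agree on field numbers without coordinating.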

Those are the easy ones. There's more, but it would require a major rewrite. If you're interested, do a web search for '"KinoSearch Merge Model"' to find the previous post I sent to this list on the subject.

> I think it's OK to add fields to documents.
> It enables creating additional fields at any time. Adding
> fields should happen in a consistent way though. Maybe
> throw an exception on adding a field with an inconsistent definition.

IMO, there's no reason to allow field definitions to be spec'd more often than once per IndexWriter. Need to add a new field for docs 501-1000 of a 1000-doc indexing pass? No problem: create a new IndexWriter, define new fields, and you're off and running.

Even that example seems esoteric to me. Is it really necessary to be able to define new fields "at any time"?
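
The once-per-writer idea, combined with the suggestion to throw on an unexpected field, could be sketched roughly as follows. FixedFieldWriter is a hypothetical class, not Lucene's IndexWriter, and the actual indexing work is left out; the point is only that the set of fields is frozen at construction.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical class for illustration; not Lucene's IndexWriter.
final class FixedFieldWriter {
    private final Set<String> fieldNames;
    private int docCount = 0;

    // Field definitions are supplied once, at construction.
    FixedFieldWriter(Set<String> fieldNames) {
        this.fieldNames = Collections.unmodifiableSet(new HashSet<>(fieldNames));
    }

    // Rejects any document that uses a field not declared up front.
    void addDocument(Map<String, String> doc) {
        for (String name : doc.keySet()) {
            if (!fieldNames.contains(name)) {
                throw new IllegalArgumentException(
                    "field not defined for this writer: " + name);
            }
        }
        docCount++; // real indexing omitted
    }

    int docCount() {
        return docCount;
    }
}
```

In the 1000-doc example, docs 1-500 would go through one writer built with the original field names, and docs 501-1000 through a second writer built with the extra field added.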

>>     * Store field definitions in a single per-index,
>>       human-readable file.
>
> I like the idea of a per-index field definitions file, be it
> human-readable or not.

The human-readable part isn't terribly important. I just found it difficult and frustrating to troubleshoot problems with the file format when I was trying to write compliant code, since nothing is human-readable and there are no fixed block sizes for anything. I had to resort to binary comparisons of Lucene-generated indexes vs. KinoSearch-generated indexes from identical data. Out of curiosity, does PHPLucene write Lucene-compatible indexes? KinoSearch does so only when the source data is pure ASCII with no null bytes, since it defines a String as arbitrary data preceded by a VInt byte-count.
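
To make the String definition concrete, here is a small sketch of a byte-counted string: a VInt holding the number of bytes, followed by the bytes themselves. The class and method names are hypothetical, and encoding the text as UTF-8 is an assumption made for the example (the definition above only says "arbitrary data"). The VInt layout is the usual one: seven data bits per byte, low-order group first, with the high bit set on every byte except the last.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative helper; class and method names are hypothetical.
final class VIntStrings {
    // VInt: seven data bits per byte, low-order group first,
    // high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // A VInt byte count, then the raw bytes.
    // UTF-8 is an assumption for this example.
    static byte[] writeByteCountedString(String s) {
        byte[] data = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, data.length);
        out.write(data, 0, data.length);
        return out.toByteArray();
    }
}
```

For "abc" the prefix is 3, which is both the byte count and the character count. For a single "é" the prefix is 2, because the character takes two bytes in UTF-8, so a reader expecting a character count would misparse the record. That is why the output only lines up for pure ASCII input.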

If the field definitions aren't going to be fixed per-index, then they ought to stay with the segment. Otherwise, you have to store per-segment data in a central location and update it every merge.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

