Re: Question about FieldInfos

Marvin Humphrey Sat, 14 Jan 2006 23:42:41 -0800


On Jan 14, 2006, at 7:26 PM, Robert Kirchgessner wrote:

Lucene allows the user to change field definitions on the fly.
That's like an SQL database which auto-adapts the table definition
with each INSERT.  It's impressive that Lucene can do that, but look
under the hood and you'll see that it ain't easy, or cheap.

Could you explain this or give me a hint where to look for why it isn't cheap?

1) If the definitions are fixed, it's not necessary to check every single Field object that gets added to see whether or not the field definition has changed.

2) Field object templates can be cached, and clones spawned via a factory method. These clones can already know their FieldInfo data (including their field number), so they wouldn't have to check with a FieldInfos object against their name to retrieve it.

3) Every time you add a document, Lucene constructs a FieldInfos, a DocumentWriter, a FieldsWriter, a FieldInfosWriter, and a TermInfosWriter. If the field definitions are specified in advance, a single instance of each can be cached in IndexWriter (or elsewhere) and reused. However, this is less useful in Lucene than in KinoSearch: Lucene would need to keep opening and closing new IndexOutput streams, whereas KinoSearch, which only writes one segment per IndexWriter, can keep the same streams open.

4) Field numbers can be pre-assigned according to lexically sorted field name. That's a big deal at index-time in KinoSearch because the merge model depends on it; it wouldn't be important right away in Lucene, but there may be some opportunities for optimization.

Those are the easy ones. There's more, but it would require a major rewrite. If you're interested, perform a web search for '"KinoSearch Merge Model"' to find the previous post I sent to this list on the subject.
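
Points 1, 2 and 4 above can be sketched roughly as follows. This is a minimal illustration in Java, assuming field names are declared once up front; FixedFieldDefs, Template and FieldValue are hypothetical names invented for this sketch and are not classes from Lucene or KinoSearch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these classes exist in Lucene or KinoSearch.
final class FixedFieldDefs {
    /** A cached per-field template that already knows its field number. */
    static final class Template {
        final String name;
        final int number;

        Template(String name, int number) {
            this.name = name;
            this.number = number;
        }

        /** Factory method: spawn a field value without consulting any name table. */
        FieldValue spawn(String value) {
            return new FieldValue(this, value);
        }
    }

    /** A field instance that carries its definition with it. */
    static final class FieldValue {
        final Template template;
        final String value;

        FieldValue(Template template, String value) {
            this.template = template;
            this.value = value;
        }

        int number() {
            return template.number;
        }
    }

    private final Map<String, Template> templates = new HashMap<>();

    /** Definitions are fixed here; numbers follow lexically sorted field names. */
    FixedFieldDefs(String... fieldNames) {
        String[] sorted = fieldNames.clone();
        Arrays.sort(sorted);
        for (int i = 0; i < sorted.length; i++) {
            if (templates.put(sorted[i], new Template(sorted[i], i)) != null) {
                throw new IllegalArgumentException("duplicate field: " + sorted[i]);
            }
        }
    }

    /** Looked up once per field, not once per document. */
    Template template(String name) {
        Template t = templates.get(name);
        if (t == null) {
            throw new IllegalArgumentException("undefined field: " + name);
        }
        return t;
    }

    public static void main(String[] args) {
        FixedFieldDefs defs = new FixedFieldDefs("title", "body", "author");
        Template title = defs.template("title");
        for (String text : new String[] {"first doc", "second doc"}) {
            FieldValue f = title.spawn(text);
            System.out.println(f.number() + " " + f.template.name + " = " + f.value);
        }
    }
}
```

The name lookup happens once, when the template is fetched; every value spawned afterwards inherits the number, which is the saving described in point 2.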

I think it's OK to add fields to documents.
It enables creating additional fields at any time. Adding
fields should happen in a consistent way though. Maybe
throw an exception on adding a field with inconsistent definition.

IMO, there's no reason to allow field definitions to be spec'd more often than once per IndexWriter. Need to add a new field for docs 501-1000 of a 1000-doc indexing pass? No problem: create a new IndexWriter, define new fields, and you're off and running.

Even that example seems esoteric to me. Is it really necessary to be able to define new fields "at any time"?
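
For comparison, the alternative suggested in the quoted text (allow new fields at any time, but throw an exception when a field arrives with an inconsistent definition) might look something like this in Java. ConsistentFields and Spec are hypothetical names, and the three flags are only an illustrative subset of what a field definition could contain.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "throw on inconsistent definition"; not a Lucene API.
final class ConsistentFields {
    /** Illustrative subset of a field definition. */
    static final class Spec {
        final boolean stored;
        final boolean indexed;
        final boolean tokenized;

        Spec(boolean stored, boolean indexed, boolean tokenized) {
            this.stored = stored;
            this.indexed = indexed;
            this.tokenized = tokenized;
        }

        boolean sameAs(Spec other) {
            return stored == other.stored
                && indexed == other.indexed
                && tokenized == other.tokenized;
        }
    }

    private final Map<String, Spec> specs = new HashMap<>();

    /** First sighting of a name defines it; later sightings must match. */
    void check(String name, Spec spec) {
        Spec known = specs.putIfAbsent(name, spec);
        if (known != null && !known.sameAs(spec)) {
            throw new IllegalStateException("inconsistent definition for field: " + name);
        }
    }
}
```

Note that this still performs a lookup and comparison for every field of every document, which is the per-Field check from point 1 that fixed definitions would avoid.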

     * Store field definitions in a single per-index,
       human-readable file.


I like the idea of a per-index field definitions file, be it
human-readable or not.

The human-readable part isn't terribly important. I just found it difficult and frustrating to troubleshoot problems with the file format when I was trying to write compliant code, since nothing is human-readable and there are no fixed block sizes for anything. I had to resort to binary comparisons of Lucene-generated indexes vs. KinoSearch-generated indexes from identical data. Out of curiosity, does PHPLucene write Lucene-compatible indexes? KinoSearch will, but only when the source data is pure ASCII with no null bytes, since it defines a String as arbitrary data preceded by a VInt byte-count.
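
The String layout described above (a VInt byte-count followed by arbitrary data) can be sketched in Java as below. VIntString is a hypothetical name; the VInt encoding shown is the usual one of 7 data bits per byte, low-order group first, with the high bit set on every byte except the last. As far as I understand the Lucene format of this era, its String prefix counts characters rather than bytes and the character data is Java-style modified UTF-8, which would explain why the two layouts coincide only for pure ASCII without null bytes; treat that as an assumption rather than a statement of the spec.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a byte-counted string; not code from KinoSearch or Lucene.
final class VIntString {
    /** Write a non-negative int as a VInt: 7 bits per byte, low-order first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit: more bytes follow
            value >>>= 7;
        }
        out.write(value);                     // final byte, high bit clear
    }

    /** Encode arbitrary data as a VInt byte-count followed by the bytes. */
    static byte[] encode(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, data.length);
        out.write(data, 0, data.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String accented = "\u00e9"; // one char, two bytes in UTF-8
        System.out.println(ascii.length() + " chars, "
            + ascii.getBytes(StandardCharsets.UTF_8).length + " bytes");
        System.out.println(accented.length() + " chars, "
            + accented.getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}
```

For ASCII input the character count and byte count are equal, so a byte-counted and a character-counted prefix produce the same bytes; once a multi-byte character appears, the two counts diverge.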

If the field definitions aren't going to be fixed per-index, then they ought to stay with the segment. Otherwise, you have to store per-segment data in a central location and update it every merge.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

