On Jan 14, 2006, at 7:26 PM, Robert Kirchgessner wrote:
>> Lucene allows the user to change field definitions on the fly.
>> That's like an SQL database which auto-adapts the table definition
>> with each INSERT. It's impressive that Lucene can do that, but look
>> under the hood and you'll see that it ain't easy, or cheap.
>
> Could you explain this or give me a hint where to look for why it
> isn't cheap?

1) If the definitions are fixed, it's not necessary to check every
single Field object that gets added to see whether or not the field
definition has changed.
2) Field object templates can be cached, and clones spawned via a
factory method. These clones can already know their FieldInfo data
(including their field number), so they wouldn't have to check with a
FieldInfos object against their name to retrieve it.
3) Every time you add a document, Lucene constructs a FieldInfos, a
DocumentWriter, a FieldsWriter, a FieldInfosWriter,
and a TermInfosWriter. If the field definitions are specified in
advance, a single instance of each can be cached in IndexWriter (or
elsewhere) and reused. However, this is less useful in Lucene than
in KinoSearch: Lucene would need to keep opening and closing new
IndexOutput streams, whereas KinoSearch, which only writes one
segment per IndexWriter, can keep the same streams open.
4) Field numbers can be pre-assigned according to lexically sorted
field name. That's a big deal at index-time in KinoSearch because
the merge model depends on it; it wouldn't be important right away in
Lucene, but there may be some opportunities for optimization.
Those are the easy ones. There's more, but it would require a major
rewrite. If you're interested, perform a web search for '"KinoSearch
Merge Model"' to find the previous post I sent to this list on the
subject.
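
Points 1, 2 and 4 above can be sketched roughly as follows. This is a language-neutral illustration in Python, not Lucene or KinoSearch code: FieldSpec, Field, Schema, template() and spawn() are hypothetical names invented for the example, and the only behavior taken from the discussion is that definitions are fixed up front, field numbers follow the lexically sorted names, and clones of a cached template already know their number.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical fixed definition of one field."""
    name: str
    stored: bool = True
    indexed: bool = True
    number: int = -1  # filled in by Schema


@dataclass(frozen=True)
class Field:
    """A field value that carries its definition with it."""
    spec: FieldSpec
    value: str = ""

    def spawn(self, value):
        # Clone the template; the clone shares the same FieldSpec object,
        # so it already knows its field number.
        return replace(self, value=value)


class Schema:
    """Field definitions declared once, before any document is added."""

    def __init__(self, specs):
        specs = list(specs)
        names = [s.name for s in specs]
        if len(set(names)) != len(names):
            raise ValueError("duplicate field name")
        # Number the fields by lexically sorted name.
        ordered = sorted(specs, key=lambda s: s.name)
        self._templates = {
            s.name: Field(replace(s, number=i)) for i, s in enumerate(ordered)
        }

    def template(self, name):
        # One lookup per field name; raises KeyError for undefined fields.
        return self._templates[name]


schema = Schema([FieldSpec("title"), FieldSpec("body", stored=False)])
title = schema.template("title")  # fetched once, outside the document loop
docs = [title.spawn(t) for t in ("first doc", "second doc")]
print([(f.spec.name, f.spec.number, f.value) for f in docs])
```

The template is looked up by name once; every clone after that reuses the same definition object, so there is no per-Field comparison against the known definitions.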
> I think it's OK to add fields to documents.
> It enables creating additional fields at any time. Adding
> fields should happen in a consistent way though. Maybe
> throw an exception on adding a field with inconsistent definition.

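
The "throw an exception" idea could look something like the sketch below. It is only an illustration in Python: FieldRegistry, InconsistentFieldError and the three flags are made-up names, not part of any Lucene API. Note that it performs exactly the per-field check that a fixed set of definitions would make unnecessary.

```python
class InconsistentFieldError(ValueError):
    """Raised when a field is re-added with a different definition."""


class FieldRegistry:
    """Remembers the first definition seen for each field name."""

    def __init__(self):
        self._defs = {}

    def check(self, name, stored, indexed, tokenized):
        definition = (stored, indexed, tokenized)
        # The first definition seen for a name becomes the reference.
        known = self._defs.setdefault(name, definition)
        if known != definition:
            raise InconsistentFieldError(
                f"field {name!r} was defined as {known}, got {definition}"
            )


registry = FieldRegistry()
registry.check("title", stored=True, indexed=True, tokenized=True)
registry.check("title", stored=True, indexed=True, tokenized=True)  # same: fine
try:
    registry.check("title", stored=False, indexed=True, tokenized=True)
except InconsistentFieldError as err:
    print("rejected:", err)
```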
IMO, there's no reason to allow field definitions to be spec'd more
often than once per IndexWriter. Need to add a new field for docs
501-1000 of a 1000-doc indexing pass? No problem: create a new
IndexWriter, define new fields, and you're off and running.
Even that example seems esoteric to me. Is it really necessary to be
able to define new fields "at any time"?
>> * Store field definitions in a single per-index,
>> human-readable file.
>
> I like the idea of a per-index field definitions file, be it
> human-readable or not.

The human-readable part isn't terribly important. I just found it
difficult and frustrating to troubleshoot problems with the file
format when I was trying to write compliant code, since nothing is
human-readable and there are no fixed block sizes for anything. I
had to resort to binary comparisons of Lucene-generated indexes vs.
KinoSearch-generated indexes from identical data. Out of curiosity,
does PHPLucene write Lucene-compatible indexes? KinoSearch does so
only when the source data is pure ASCII with no null bytes, since it
defines a String as arbitrary data preceded by a VInt byte count.
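
To make the String difference concrete, here is a small Python sketch. The VInt encoding (seven bits per byte, low-order bits first, high bit set when more bytes follow) is as documented in the Lucene file format. The char-counted variant rests on an assumption: that Lucene's String is a VInt character count followed by the encoded characters. It uses plain UTF-8 for the payload, which is a simplification of what Java actually writes, so treat it as an illustration of the length prefix only. The function names are hypothetical.

```python
def write_vint(n):
    """Encode a non-negative integer as a VInt."""
    if n < 0:
        raise ValueError("VInt cannot encode negative numbers")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)  # final byte, continuation bit clear
    return bytes(out)


def byte_counted_string(data):
    """String as described above: VInt byte count, then arbitrary bytes."""
    return write_vint(len(data)) + data


def char_counted_string(text):
    """Assumed Lucene-style String: VInt character count, then the chars.

    Simplification: the payload is plain UTF-8 here.
    """
    return write_vint(len(text)) + text.encode("utf-8")


# Pure ASCII: one byte per character, so both layouts are identical.
print(byte_counted_string(b"abc") == char_counted_string("abc"))

# Non-ASCII: the byte count and the character count differ.
accented = "caf\u00e9"
print(byte_counted_string(accented.encode("utf-8")))
print(char_counted_string(accented))
```

For ASCII input the two prefixes agree, which is why the indexes match in that case; as soon as a character needs more than one byte, the length prefixes diverge.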
If the field definitions aren't going to be fixed per-index, then
they ought to stay with the segment. Otherwise, you have to store
per-segment data in a central location and update it every merge.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/