Hey Erik,
The original code used ints to store the term vector info. When I merged it into the latest
code base, it no longer worked because Lucene now uses longs for term info (if I recall
correctly), and this caused a problem on merging/optimization (at least with the old
way of doing things). After thinking about alternatives, I chose the String
implementation, as it allows term vectors to be used in non-optimized indexes, and the
String is guaranteed to be unique within the index, which makes optimization very easy.
Of course, there may be alternatives that would allow one to use longs; I just didn't
see one that was efficient, so I made the assumption that disk space is cheap and went
with it.
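
To illustrate why a numeric reference is awkward across merges, here is a rough sketch. It is not the actual Lucene code: the class TermOrdinalSketch and its methods are made up for illustration, and it assumes the number stored is simply the term's position in a segment's sorted term list. Under that assumption, the position of the same term changes once segments are merged, while the term text itself does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Lucene.
public class TermOrdinalSketch {

    /** Position of a term in a sorted term list, or -1 if absent. */
    static int ordinal(List<String> sortedTerms, String term) {
        return sortedTerms.indexOf(term);
    }

    /** Merges two sorted term lists into one sorted, de-duplicated list. */
    static List<String> merge(List<String> a, List<String> b) {
        TreeSet<String> merged = new TreeSet<>(a);
        merged.addAll(b);
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<String> segmentA = Arrays.asList("index", "lucene", "vector");
        List<String> segmentB = Arrays.asList("apache", "term");
        List<String> merged = merge(segmentA, segmentB);

        // The numeric position of "lucene" differs before and after the merge,
        // so any stored position would have to be rewritten at merge time.
        System.out.println("before merge: " + ordinal(segmentA, "lucene"));
        System.out.println("after merge:  " + ordinal(merged, "lucene"));
    }
}
```

Storing the String avoids the rewrite entirely, at the cost of the extra disk space mentioned above.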
-Grant
>>> [EMAIL PROTECTED] 06/12/04 12:54PM >>>
I'm digging deeper into the Lucene index format to develop some higher
level diagrams of its structure. One thing that is curious to me is
the term text being stored in the .tvf file. Why not point to the term
dictionary by position somehow and avoid duplicating this string,
saving possibly substantial index size? I'm assuming this is for
performance reasons.
Note, the Lucene index file formats documentation needs to be updated -
TermText is no longer just a String, it is a <PrefixLength,Suffix>
similar to how terms in the .tis are stored. I've updated
fileformats.xml/.html - if I've gotten this wrong, let me know.
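
For anyone unfamiliar with the <PrefixLength,Suffix> idea, a minimal sketch follows. It only shows the general technique of storing each term as the number of leading characters shared with the previous term plus the remaining suffix. It does not reproduce Lucene's actual byte layout, and the class PrefixCodingSketch and its Entry type are hypothetical names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of prefix/suffix term encoding; not Lucene code.
public class PrefixCodingSketch {

    /** One <PrefixLength,Suffix> pair. */
    static final class Entry {
        final int prefixLength;
        final String suffix;

        Entry(int prefixLength, String suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }

        @Override
        public String toString() {
            return "<" + prefixLength + "," + suffix + ">";
        }
    }

    /** Encodes sorted terms relative to the previous term in the list. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String previous = "";
        for (String term : sortedTerms) {
            int limit = Math.min(previous.length(), term.length());
            int shared = 0;
            // Count the leading characters this term shares with the previous one.
            while (shared < limit && previous.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(new Entry(shared, term.substring(shared)));
            previous = term;
        }
        return out;
    }

    /** Rebuilds the full terms by reusing the prefix of the previously decoded term. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String previous = "";
        for (Entry e : entries) {
            String term = previous.substring(0, e.prefixLength) + e.suffix;
            out.add(term);
            previous = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("index", "indexer", "indexing", "lucene");
        List<Entry> encoded = encode(terms);
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```

The savings depend on the terms being written in sorted order, since that is what makes neighbouring terms likely to share a prefix.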
Just out of curiosity - are there any other known inconsistencies with
the file formats documentation? I'd be happy to fix them up if there
are any other out of sync issues. I just happened to spot the one just
mentioned because I looked in the code to see how term vectors were
written when I saw that the term text is duplicated.
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]