On 4/28/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
I have a few things I'd like to check with the Luke handler.  If you
could check some of the assumptions, that would be great.

* I want to print out the document frequency for a term in a given
document.  Since that term shows up in the given document, I would think
the document frequency must be >= 1.  I am using: reader.docFreq( t ) [line
236].  The results seem reasonable, but *sometimes* it returns zero... is
that possible?

Is the field indexed?
Did you run the field through the analyzer to get the terms (to match
what's in the index)?
If both of those are true, it seems like the docFreq should always be
greater than 0.
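The analyzer-mismatch case above can be illustrated with a toy inverted index built from plain Java maps (no Lucene involved; the class, the lowercasing "analyzer", and the sample data are all hypothetical stand-ins for the real `IndexReader.docFreq(Term)` path):

```java
import java.util.List;
import java.util.Map;

public class DocFreqSketch {
    // Toy inverted index: analyzed term -> doc ids containing it.
    // Terms are stored in analyzed (here: lowercased) form, as in a real index.
    static final Map<String, List<Integer>> INDEX = Map.of(
        "hello", List.of(1, 3),
        "world", List.of(3)
    );

    // Rough equivalent of IndexReader.docFreq(t): how many docs contain the term.
    static int docFreq(String term) {
        return INDEX.getOrDefault(term, List.of()).size();
    }

    // Stand-in for running query text through the same analyzer that
    // built the index (here just lowercasing).
    static String analyze(String raw) {
        return raw.toLowerCase();
    }

    public static void main(String[] args) {
        // Raw (un-analyzed) term misses the index entry entirely: df == 0.
        System.out.println(docFreq("Hello"));           // 0
        // Analyzed term matches what's actually indexed: df >= 1.
        System.out.println(docFreq(analyze("Hello")));  // 2
    }
}
```

The point is just that a df of 0 for a term you "know" is in the document usually means the term string you looked up is not the analyzed form stored in the index.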

* I want to return the Lucene field flags for each field.  I run through
all the field names with:
reader.getFieldNames(IndexReader.FieldOption.ALL).  Is there a way to
get any Fieldable for a given name?  IIUC, all fields with the same name
will have the same flags.  I tried searching for a document with that
field; it works, but only for stored fields.

* I just realized that I am only returning stored fields from
getDocumentFieldsInfo() (it uses Document.getFields()).  How can I
find *all* Fieldables for a given document?  I have tried following the
Luke source, but get a bit lost ;)

LOL... if it's an inverted index, it's difficult and time consuming to
try and reconstruct what a non-stored field value was.

In an inverted index, terms point to documents.   So you have to
traverse *all* of the terms of a field across all documents, and keep
track of when you run across the document you are interested in.  When
you do, then get the positions that the term appeared at, and keep
track of them.  After you have covered all the terms, you can put
everything in order.  There could be gaps (positionIncrement, stop
word removal, etc) and it's also possible for multiple tokens to
appear at the same position.

For a full-text field with many terms, and a large index, this could
take a *long* time.
It's probably very useful for debugging though.
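The traversal described above can be sketched with toy postings (term -> doc -> positions), again with no Lucene dependency; class and data are hypothetical, and a real version would walk TermEnum/TermPositions instead and also cope with multiple tokens at one position:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FieldReconstructSketch {
    // Toy postings for one field: term -> (docId -> positions of that term).
    static final Map<String, Map<Integer, List<Integer>>> POSTINGS = Map.of(
        "quick", Map.of(7, List.of(1)),
        "fox",   Map.of(7, List.of(3), 8, List.of(0)),
        "brown", Map.of(7, List.of(2))
    );

    // Walk *every* term of the field; whenever the target doc appears,
    // record (position -> term).  A TreeMap puts everything back in order.
    static List<String> reconstruct(int targetDoc) {
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, List<Integer>>> e : POSTINGS.entrySet()) {
            List<Integer> positions = e.getValue().get(targetDoc);
            if (positions != null) {
                for (int pos : positions) {
                    // Gaps (stop words, positionIncrement) simply stay missing.
                    byPosition.put(pos, e.getKey());
                }
            }
        }
        return new ArrayList<>(byPosition.values());
    }

    public static void main(String[] args) {
        // Position 0 was a gap (e.g. a removed stop word).
        System.out.println(reconstruct(7)); // [quick, brown, fox]
    }
}
```

Even in this tiny sketch, reconstructing one document means visiting every term of the field, which is why it gets expensive on a large index.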

* Each field gets a boolean attribute "cacheableFaceting" -- this is true
if the number of distinct terms is smaller than the filterCacheSize.  I
get the filterCacheSize from: solrconfig.xml:"query/filterCache/@size"
and get the distinct term count by counting up the termEnum.  Is this
logic solid?  I know the cacheability changes if you are faceting
multiple fields at once, but it's still nice to have a ballpark estimate
without needing to know the internals.

It could get trickier... I'm about to hack up a quick patch now that
will reduce memory usage by only using the filterCache  above a
certain df threshold.  It may increase or
decrease the faceting speed - TBD.

Also, other alternate faceting schemes are in the works (a month or two out).
I'd leave this attribute out and just report on the number of unique terms.
Some kind of histogram might be really nice though (how many terms
under varying df values):
 1=>412  (412 terms have a df of 1)
 2=>516  (516 terms have a df of 2)
 4=>600
 8=>650
16=>670
32=>680
64=>683
128=>685
256=>686
11325=>690  (the maxDf found)
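The doubling-bucket histogram above could be built in one pass over the dfs collected from a term enum; a map-based sketch (class name and input shape are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DfHistogramSketch {
    // Cumulative histogram over doubling df thresholds:
    // threshold -> number of terms whose df is <= threshold,
    // plus a final bucket at the maximum df observed.
    static TreeMap<Long, Integer> histogram(List<Integer> dfs) {
        long maxDf = 0;
        for (int df : dfs) maxDf = Math.max(maxDf, df);
        TreeMap<Long, Integer> hist = new TreeMap<>();
        for (long t = 1; t < maxDf; t *= 2) hist.put(t, 0);
        hist.put(maxDf, 0);
        for (int df : dfs) {
            // Bump every bucket whose threshold covers this df (cumulative counts).
            for (Map.Entry<Long, Integer> e : hist.entrySet()) {
                if (df <= e.getKey()) e.setValue(e.getValue() + 1);
            }
        }
        return hist;
    }

    public static void main(String[] args) {
        // dfs as they might come from walking a TermEnum.
        System.out.println(histogram(List.of(1, 1, 2, 2, 3, 5, 40)));
        // {1=2, 2=4, 4=5, 8=6, 16=6, 32=6, 40=7}
    }
}
```

The last key is the maxDf found and its count is the total number of distinct terms, matching the shape of the example above.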

Remember that df is not updated when a document is marked for deletion
in Lucene.
So you can have a df of 2, do a search, and only come up with one document.

-Yonik
