Hi Erick,
Erick Erickson wrote:
Ah, I may have misunderstood, I somehow got it in my mind
you were talking about the length of each term (as in string length).
But if you're looking at the field length as the count of terms, that's
another question, sorry for the confusion...
I have to ask, though, why you want to sort this way? The relevance
calculations already factor in both term frequency and field length. What's
the use-case for sorting by field length given the above?
It's not a real-world use case -- I just want to get a better
understanding of the data I'm indexing (so performance is not a
concern). In my current use case, you can think of the field length
as an indicator of data quality (i.e., the longer the field content, the
worse the quality is). Being able to sort the field data in order of
decreasing length would allow me to investigate "exceptional" data items
that are not appropriately handled by my curation process.
Best,
Sascha
Best
Erick
On Tue, May 25, 2010 at 3:40 AM, Sascha Szott<sz...@zib.de> wrote:
Hi Erick,
Erick Erickson wrote:
Are you sure you want to recompute the length when sorting?
It's the classic time/space tradeoff, but I'd suggest that when
your index is big enough to make taking up some more space
a problem, it's far too big to spend the cycles calculating each
term length for sorting purposes, considering that in the worst
case you may be sorting all the terms in your index.
Good point, thank you for the clarification. I "thought" that Lucene
internally stores the field length (e.g., in order to compute the relevance)
and getting this information at query time requires only a simple lookup.
-Sascha
But you could consider payloads for storing the length, although
that would still be redundant...
Best
Erick
On Mon, May 24, 2010 at 8:30 AM, Sascha Szott<sz...@zib.de> wrote:
Hi folks,
is it possible to sort by field length without having to (redundantly) save
the length information in a separate index field? At first, I thought I could
accomplish this using a function query, but I couldn't find an appropriate
one.
Thanks in advance,
Sascha
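
For the one-off data inspection described in this thread, a rough offline sketch may be enough. It is not a Lucene or Solr API: the names field_length and sort_by_field_length are hypothetical, the documents are assumed to be available as a plain mapping of id to field text, and whitespace splitting only approximates whatever analyzer the index actually uses.

```python
def field_length(text):
    """Approximate field length as the number of whitespace-separated terms.

    Assumption: a real analyzer may tokenize differently, so this count is
    only an estimate of the indexed term count.
    """
    return len(text.split())


def sort_by_field_length(docs):
    """Return (doc_id, text) pairs ordered by decreasing field length."""
    return sorted(
        docs.items(),
        key=lambda item: field_length(item[1]),
        reverse=True,
    )


# Hypothetical sample data standing in for the field contents being curated.
docs = {
    "a": "one two three",
    "b": "one",
    "c": "one two",
}

# Longest fields come first, so the "exceptional" items surface at the top.
ranked = sort_by_field_length(docs)
for doc_id, text in ranked:
    print(doc_id, field_length(text))
```

The same count could instead be computed once at indexing time and written to a separate numeric field, which is the redundant-storage option discussed above.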