Sigh. Yeah, I agree that a simple big-O won't work for Lucene. But
nonetheless, we really should have some sort of performance
characterization. When people ask me about how to characterize Lucene/Solr
performance I always tell them that it is highly non-linear, with lots of
optimizations and options (tokenizers, stemming, case, n-grams, numeric
fields) and highly sensitive to the specifics of the data, so that
estimating performance or memory requirements is impractical. Most people
don't have a handle on cardinality, actual data size, actual per-document
term counts, or data distribution, so even if we had an accurate
performance model, most people wouldn't have accurate numbers to feed into
it, especially since a lot of use cases involve future data that nobody
has seen yet. The average manager thinks they are on top of performance
and memory requirements when they can tell you how many raw files and how
many gigabytes or terabytes of data they have, which clearly won't feed
into any sane model of Lucene performance.
Ultimately the best we can do is fall back on doing a proof-of-concept
implementation, actually measuring performance and memory for a
significant sample of realistic data, and then empirically deducing what
the big-O function is for your particular application data and data
model.
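To make that concrete, here is a toy sketch of "empirically deducing the
big-O function": time a workload at several input sizes and fit the slope
of log(time) versus log(n), which estimates the polynomial exponent. The
workload below is a deliberately quadratic placeholder of my own, not a
Lucene query; in practice you would substitute your own indexing or search
step against realistic sample data.

```python
# Toy illustration of deducing a big-O exponent empirically: time the
# workload at several sizes, then fit the slope of log(time) vs log(n).
import math
import time

def workload(n):
    # Hypothetical stand-in for "run the operation over n documents";
    # intentionally O(n^2), so the fitted exponent should land near 2.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i ^ j
    return total

def fitted_exponent(sizes):
    # Least-squares slope of log(elapsed time) against log(n).
    points = []
    for n in sizes:
        start = time.perf_counter()
        workload(n)
        points.append((math.log(n), math.log(time.perf_counter() - start)))
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den

if __name__ == "__main__":
    # Should print a value close to 2.0 for the quadratic workload.
    print(round(fitted_exponent([200, 400, 800, 1600]), 1))
```

The same fitting trick works on real measurements, though for something
like Lucene the "size" axis you sweep (document count, term cardinality,
query length) matters as much as the fit itself.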
-- Jack Krupansky
On Fri, Nov 20, 2015 at 4:38 AM, Adrien Grand wrote:
> I don't think the big-O notation is appropriate to measure the cost of
> Lucene queries.
>
> On Wed, Nov 11, 2015 at 8:31 PM, search engine
> wrote:
>
> > Hi,
> >
> > I've been thinking about how to use big-O notation to show complexity for
> > different types of queries, like term query, prefix query, phrase query,
> > wild card and fuzzy query. Any ideas?
> >
> > thanks,
> > Zong
> >
>