If you have tens of millions of documents, and the fields you're sorting on
hold values that are almost all unique, you'll chew through memory like
there's no tomorrow.
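
As a rough back-of-the-envelope sketch of why: assume sorting caches one
array slot per document for each sort field, plus one String per unique term
for string fields. The 50M document count, the 48 bytes of per-String
overhead, and the class and method names are illustrative assumptions, not
measurements from your index.

```java
public class SortMemoryEstimate {
    // Numeric sort field: assumed one fixed-width slot per document.
    public static long numericBytes(long maxDoc, int bytesPerValue) {
        return maxDoc * bytesPerValue;
    }

    // String sort field: assumed one int ordinal per document
    // plus one String object per unique term.
    public static long stringBytes(long maxDoc, long uniqueTerms, int avgChars) {
        long ordinals = maxDoc * 4L;
        long perString = 48L + 2L * avgChars; // assumed object overhead + 2 bytes per char
        return ordinals + uniqueTerms * perString;
    }

    public static void main(String[] args) {
        long docs = 50_000_000L; // hypothetical index size
        System.out.println("long field:   " + numericBytes(docs, 8) + " bytes");
        System.out.println("string field: " + stringBytes(docs, docs, 16) + " bytes");
    }
}
```

With these assumed figures, a single long field comes to about 400 MB and a
fully unique 16-character string field to about 4.2 GB, which is already more
than a 2GB heap.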

Have you looked at trie fields? See:
http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/

I'm a little concerned that the user can sort on Data. Any field used for
sorting should NOT be analyzed, so unless you are indexing "Data"
unanalyzed, that's a problem. And if you are sorting on strings unique to
each document, that's also a memory hog. You also need to decide whether
capitalization counts in the sort order.
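
One way to take capitalization out of the picture, sketched here in plain
Java: build a lowercased key and index it as a separate, unanalyzed field
used only for sorting. The `sortKey` helper and the idea of a separate sort
field are illustrative assumptions, not something taken from your setup.

```java
import java.util.Locale;

public class SortKeys {
    // Hypothetical helper: normalize a value so that sorting ignores
    // case and leading/trailing whitespace.
    public static String sortKey(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The result would be indexed unanalyzed, as a single term per document.
        System.out.println(sortKey("  This is the data "));
    }
}
```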

You might enumerate the terms in your index for each of the sortable fields
to find out how many unique terms each one has, and use that as a basis
for reducing their count....
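
To illustrate what reducing the count can buy you, here is a minimal sketch
in plain Java. It does not use Lucene's term enumeration; the array of
timestamps is a stand-in for the terms of a date field, and the class and
method names are made up for the example.

```java
import java.util.HashSet;
import java.util.Set;

public class UniqueTermCount {
    public static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Count distinct values after rounding each timestamp down
    // to a multiple of resolutionMillis.
    public static int distinctAtResolution(long[] timestampsMillis, long resolutionMillis) {
        Set<Long> seen = new HashSet<>();
        for (long t : timestampsMillis) {
            seen.add(t - Math.floorMod(t, resolutionMillis));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Stand-in data: three timestamps on day 0 and one on day 1.
        long[] ts = {1_000L, 2_000L, 3_000L, MILLIS_PER_DAY + 500L};
        System.out.println(distinctAtResolution(ts, 1L));             // millisecond resolution
        System.out.println(distinctAtResolution(ts, MILLIS_PER_DAY)); // day resolution
    }
}
```

Whether day resolution is acceptable for DateCreated and DateModified
depends on how precisely your users need the sort order.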

HTH
Erick

On Tue, Aug 17, 2010 at 3:05 PM, Michel Nadeau <aka...@gmail.com> wrote:

> Hi Erick,
>
> Here are some more details about our structure. First, here's an example
> of a document in our index:
>
>     PrimaryKey        = SJAsfsf353JHGada66GH6 (it's a hash)
>     DocType           = X
>     Data              = This is the data
>     SearchableContent = This is the data
>     DateCreated       = <timestamp>
>     DateModified      = <timestamp>
>     Counter1          = 17
>     Counter2          = 3
>     Average           = 0.17
>     Cost              = 200
>
> The users are able to sort on almost all fields: Data, DateCreated,
> DateModified, Counter1, Counter2, Average, Cost.
>
> When we search, we always search on the 'SearchableContent' field and we
> have at least one filter on the DocType (because we have many document
> types in the same index). So a common search that would find the document
> above is "data *AND DocType:X*" (we automatically add the
> "*AND DocType:X*" part using Lucene Filters).
>
> I would say that the number of unique terms in the field being sorted on is
> very big - for example timestamps, almost all unique, counters, average,
> cost, data... so if a query finds 10M results, it's almost 10M different
> values to sort. About cache and warm-up queries : we don't use warm-up
> queries -at all- because we have absolutely no idea of what users are going
> to search for (they can search for absolutely anything). About "returning
> 10M" documents, right, we don't actually return the 10M documents, we use
> pagination to return documents X to Y of the 10M (and the 10M was only an
> example, it can be anywhere between 1K and 100M results). The pagination
> usually works fine and fast, our problem is really sorting.
>
> Our "Lucene Reader" process has 2GB of ram allowed, here's how I start it -
>
>     java -Xmx2048m -jar LuceneReader.jar
>
> The problem really seems to be a ram problem, but I can't be 100% sure (any
> help about how to be sure is welcome).
>
> Our current idea for a solution would be to fetch at most 250 results (the
> most relevant ones; more results than that is totally useless in our
> system) so the sort should work fine on a small data set like that, but we
> want to make sure we're doing everything right before doing that so we
> don't run into the same problems again.
>
> Thank you very much; let me know if you need any more details!
>
> - Mike
> aka...@gmail.com
>
>
> On Mon, Aug 16, 2010 at 4:01 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Let's back up a minute. The number of matched records is not
> > important when sorting, what's important is the number of unique
> > terms in the field being sorted. Do you have any figures on that? One
> > very common sorting issue is sorting on very fine date time resolutions,
> > although your examples don't include that...
> >
> > Now, cache loading is an issue. The very first time you sort on a field,
> > all the values are read into a cache. Is it possible this is the source
> > of your problems? You can cure this with warmup queries. The take-away
> > is that measuring the response time for the first sorted query is
> > very misleading.
> >
> > Although if you're sorting on many, many, many email addresses,
> > that could be "interesting".
> >
> > The comment "returning 10,000,000 documents" is, I hope, a
> > misstatement. If you're trying to *return* 10M docs sorting
> > is irrelevant compared to assembling that many documents. If
> > you're trying to return the first 100 of 10M documents, it should
> > work.
> >
> > Overall, we need more details on what you're sorting and what
> > you're trying to return as well as how you're measuring before
> > we can say much....
> >
> > Along with how much memory you're giving your JVM to work with,
> > what "exploding" means. Are you CPU bound? IO bound? Swapping?
> > You need to characterize what is going wrong before worrying about
> > solutions......
> >
> > Best
> > Erick
> >
> > On Mon, Aug 16, 2010 at 3:08 PM, Michel Nadeau <aka...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > we are building an application using Lucene and we have HUGE data sets
> > > (our index contains millions and millions and millions of documents),
> > > which obviously cause us very serious problems when sorting. In fact, we
> > > disabled sorting completely because the servers were just exploding when
> > > trying to sort results in RAM. The users of the system can search for
> > > whatever they want, so we never know how many results will be returned -
> > > a search can return 10 documents (no problem with sorting) or 10,000,000
> > > (huge sorting problems).
> > >
> > > I woke up this morning and had a flash: is it possible with Lucene to
> > > have a "natural sorting" of documents? For example, let's say I have 3
> > > columns I want to be able to sort by: first name, last name, email; I
> > > would have 3 different indexes with the very same data but with a
> > > different primary key for sorting. I know it's far-fetched, and I have
> > > never seen anything like that since I started using Lucene, but we're
> > > just desperate... how do people manage to have huge data sets, a lot of
> > > users, and sorting!?
> > >
> > > Thanks,
> > >
> > > Mike
> > >
> >
>
