Re: "Natural sorting" of documents in a Lucene index - possible?

Ian Lea Wed, 18 Aug 2010 07:38:32 -0700

> But - to come back to my original question... is there any way to have a
> "natural order" of documents other than the DocId in Lucene?


No.


--
Ian.
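
For readers following the thread: the fallback suggested in the quoted
messages below (take only the top n hits, then sort those yourself) could
look roughly like the following. This is only a sketch in plain Java with no
Lucene calls. The Hit class, its fields, and sortTopHits are hypothetical
names made up for illustration, and the input list is assumed to already be
in relevance order, as search results normally are.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: none of these names come from Lucene.
class TopHitsSketch {

    // Minimal stand-in for a search hit: a document id plus the value to sort on.
    static final class Hit {
        final int docId;
        final long count;

        Hit(int docId, long count) {
            this.docId = docId;
            this.count = count;
        }
    }

    // Keeps the first n hits (assumed to be the most relevant ones) and
    // sorts only that small slice by count, ascending. The cost of the sort
    // depends on n, not on how many documents matched the query.
    static List<Hit> sortTopHits(List<Hit> hitsByRelevance, int n) {
        int limit = Math.min(Math.max(n, 0), hitsByRelevance.size());
        List<Hit> top = new ArrayList<>(hitsByRelevance.subList(0, limit));
        top.sort(Comparator.comparingLong((Hit h) -> h.count));
        return top;
    }

    public static void main(String[] args) {
        // Made-up relevance order for four documents from the example table.
        List<Hit> ranked = Arrays.asList(
                new Hit(4, 120), new Hit(5, 1), new Hit(3, 46), new Hit(8, 4));
        for (Hit h : sortTopHits(ranked, 3)) {
            System.out.println(h.docId + " " + h.count);
        }
    }
}
```

With n fixed at something like 250, the in-memory sort stays small no matter
how many documents the query matched, which is the property the thread is
after.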


On Wed, Aug 18, 2010 at 3:21 PM, Michel Nadeau <[email protected]> wrote:
> Cool, so I'll try these things -
>
> * Replace timestamps with YYYYMMDD - will reduce the unique term count;
> * Use NumericFields for dates and numbers - will remove all string sorting.
> Thanks guys!
>
> --
>
> But - to come back to my original question... is there any way to have a
> "natural order" of documents other than the DocId in Lucene? For example, is
> there any way to have an index automatically sorted on a specific field,
> like :
>
> DocId   Count   Data
> ----------------------------
>     5       1   First test
>     1       3   Otter
>     8       4   Test
>     2       8   Aloha
>    10      11   Zulu
>     9      17   Bingo
>     3      46   Alpha test
>     6     112   Tango
>     4     120   Charlie test
>     7     200   Kiwi
>
> Notice that DocId and Data are in random order, but Count is sorted. That
> would be the 'natural order' in the index, and searching for 'test' would
> return (in that order) :
>
> DocId   Count   Data
> ----------------------------
>     5       1   First test
>     3      46   Alpha test
>     4     120   Charlie test
>
> Already sorted on the Count.
>
> Thanks!
>
> - Mike
> [email protected]
>
>
> On Tue, Aug 17, 2010 at 4:08 PM, Ian Lea <[email protected]> wrote:
>
>> Using NumericField for dates and other numbers is likely to help a
>> lot, and removes padding problems.  I'd try that first, or just sort
>> the top n hits yourself.
>>
>>
>> --
>> Ian.
>>
>>
>> On Tue, Aug 17, 2010 at 8:46 PM, Michel Nadeau <[email protected]> wrote:
>> > I could at least drop hours/mins/sec, we don't need them, so my timestamp
>> > could become 'YYYYMMDD', that would cut the number of unique terms at
>> > least for dates.
>> >
>> > What about my other question about numbers: *"We do pad our numbers with
>> > zeros though (for example: 10 becomes 00000010, etc.) because we had
>> > trouble with sorting (100 was smaller than 2); is that considered
>> > "string sorting"? This might explain a part of the problem."* Thanks.
>> >
>> > - Mike
>> > [email protected]
>> >
>> >
>> > On Tue, Aug 17, 2010 at 3:40 PM, Erick Erickson <[email protected]> wrote:
>> >
>> >> Hmmm, I glossed over your comment about sorting the top 250. There's
>> >> no reason that wouldn't work.
>> >>
>> >> Well, one way for, say, dates is to store separate fields. YYYY, MM, DD,
>> >> HH, MM, SS, MS. That gives you say, 100 year terms, + 12 month
>> >> +31 days + .... for a very small total. You pay the price though by
>> >> having to change your queries and sorts to respect all 7 fields...
>> >>
>> >> But I'd only really go there after seeing if other options don't work.
>> >>
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Tue, Aug 17, 2010 at 3:35 PM, Michel Nadeau <[email protected]> wrote:
>> >>
>> >> > Would our approach of limiting the search to the top 250 documents (and
>> >> > then sorting these 250 documents) work fine? Or are even 250 unique
>> >> > terms, with a lot of users, bad for memory when sorting?
>> >> >
>> >> > We didn't look at trie fields - I will do though, thanks for the tip !
>> >> >
>> >> > We do store the original 'Data' field (only the 'SearchableContent'
>> >> > field is analyzed; all other fields are not analyzed). The users mainly
>> >> > sort on numeric values, not much on string values (in fact I could
>> >> > completely drop the sort-by-string feature). We do pad our numbers with
>> >> > zeros though (for example: 10 becomes 00000010, etc.) because we had
>> >> > trouble with sorting (100 was smaller than 2); is that considered
>> >> > "string sorting"? This might explain a part of the problem.
>> >> >
>> >> > Why/how would I reduce the count of unique terms?
>> >> >
>> >> >
>> >> > - Mike
>> >> > [email protected]
>> >> >
>> >> >
>> >> > On Tue, Aug 17, 2010 at 3:28 PM, Erick Erickson <[email protected]> wrote:
>> >> >
>> >> > > If you have tens of millions of documents, almost all with unique
>> >> > > values in the fields that you're sorting on, you'll chew through
>> >> > > memory like there's no tomorrow.
>> >> > >
>> >> > > Have you looked at trie fields? See:
>> >> > >
>> >> > > http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
>> >> > >
>> >> > > I'm a little concerned that the user can sort on Data. Any field
>> >> > > used for sorting should NOT be analyzed, so unless you are indexing
>> >> > > "Data" unanalyzed, that's a problem. And if you are sorting on
>> >> > > strings unique to each document, that's also a memory hog. Not to
>> >> > > mention whether capitalization counts.
>> >> > >
>> >> > > You might enumerate the terms in your index for each of the sortable
>> >> > > fields to figure out what the total number of unique terms in each
>> >> > > is, and use that as a basis for reducing their count....
>> >> > >
>> >> > > HTH
>> >> > > Erick
>> >> > >
>> >> > > On Tue, Aug 17, 2010 at 3:05 PM, Michel Nadeau <[email protected]> wrote:
>> >> > >
>> >> > > > Hi Erick,
>> >> > > >
>> >> > > > Here are some more details about our structure. First, here's an
>> >> > > > example of a document in our index:
>> >> > > >
>> >> > > >     PrimaryKey        = SJAsfsf353JHGada66GH6 (it's a hash)
>> >> > > >     DocType           = X
>> >> > > >     Data              = This is the data
>> >> > > >     SearchableContent = This is the data
>> >> > > >     DateCreated       = <timestamp>
>> >> > > >     DateModified      = <timestamp>
>> >> > > >     Counter1          = 17
>> >> > > >     Counter2          = 3
>> >> > > >     Average           = 0.17
>> >> > > >     Cost              = 200
>> >> > > >
>> >> > > > The users are able to sort on almost all fields: Data, DateCreated,
>> >> > > > DateModified, Counter1, Counter2, Average, Cost.
>> >> > > >
>> >> > > > When we search, we always search on the 'SearchableContent' field
>> >> > > > and we have at least one filter on the DocType (because we have
>> >> > > > many document types in the same index). So a common search that
>> >> > > > would find the document above is "data *AND DocType:X*" (we
>> >> > > > automatically add the "*AND DocType:X*" part using Lucene Filters).
>> >> > > >
>> >> > > > I would say that the number of unique terms in the field being
>> >> > > > sorted on is very big - for example timestamps (almost all unique),
>> >> > > > counters, average, cost, data... so if a query finds 10M results,
>> >> > > > there are almost 10M different values to sort. About cache and
>> >> > > > warm-up queries: we don't use warm-up queries -at all- because we
>> >> > > > have absolutely no idea what users are going to search for (they
>> >> > > > can search for absolutely anything). About "returning 10M"
>> >> > > > documents: right, we don't actually return the 10M documents, we
>> >> > > > use pagination to return documents X to Y of the 10M (and the 10M
>> >> > > > was only an example, it can be anywhere between 1K and 100M
>> >> > > > results). The pagination usually works fine and fast; our problem
>> >> > > > is really sorting.
>> >> > > >
>> >> > > > Our "Lucene Reader" process is allowed 2GB of RAM; here's how I
>> >> > > > start it -
>> >> > > >
>> >> > > >     java -Xmx2048m -jar LuceneReader.jar
>> >> > > >
>> >> > > > The problem really seems to be a RAM problem, but I can't be 100%
>> >> > > > sure (any help about how to be sure is welcome).
>> >> > > >
>> >> > > > Our current idea for a solution would be to get a maximum of 250
>> >> > > > results (the most relevant ones; more results than that is totally
>> >> > > > useless in our system) so the sort should work fine on a small data
>> >> > > > set like that, but we want to make sure we're doing everything
>> >> > > > right before doing that so we don't run into the same problems
>> >> > > > again.
>> >> > > >
>> >> > > > Thank you very much; let me know if you need any more details!
>> >> > > >
>> >> > > > - Mike
>> >> > > > [email protected]
>> >> > > >
>> >> > > >
>> >> > > > On Mon, Aug 16, 2010 at 4:01 PM, Erick Erickson <[email protected]> wrote:
>> >> > > >
>> >> > > > > Let's back up a minute. The number of matched records is not
>> >> > > > > important when sorting, what's important is the number of unique
>> >> > > > > terms in the field being sorted. Do you have any figures on
>> >> > > > > that? One very common sorting issue is sorting on very fine
>> >> > > > > date time resolutions, although your examples don't include
>> >> > > > > that...
>> >> > > > >
>> >> > > > > Now, cache loading is an issue. The very first time you sort on
>> >> > > > > a field, all the values are read into a cache. Is it possible
>> >> > > > > this is the source of your problems? You can cure this with
>> >> > > > > warmup queries. The take-away is that measuring the response
>> >> > > > > time for the first sorted query is very misleading.
>> >> > > > >
>> >> > > > > Although if you're sorting on many, many, many email addresses,
>> >> > > > > that could be "interesting".
>> >> > > > >
>> >> > > > > The comment "returning 10,000,000 documents" is, I hope, a
>> >> > > > > misstatement. If you're trying to *return* 10M docs sorting
>> >> > > > > is irrelevant compared to assembling that many documents. If
>> >> > > > > you're trying to return the first 100 of 10M documents, it
>> >> > > > > should work.
>> >> > > > >
>> >> > > > > Overall, we need more details on what you're sorting and what
>> >> > > > > you're trying to return as well as how you're measuring before
>> >> > > > > we can say much....
>> >> > > > >
>> >> > > > > Along with how much memory you're giving your JVM to work with,
>> >> > > > > what "exploding" means. Are you CPU bound? IO bound? Swapping?
>> >> > > > > You need to characterize what is going wrong before worrying
>> >> > > > > about solutions......
>> >> > > > >
>> >> > > > > Best
>> >> > > > > Erick
>> >> > > > >
>> >> > > > > On Mon, Aug 16, 2010 at 3:08 PM, Michel Nadeau <[email protected]> wrote:
>> >> > > > >
>> >> > > > > > Hi,
>> >> > > > > >
>> >> > > > > > we are building an application using Lucene and we have HUGE
>> >> > > > > > data sets (our index contains millions and millions and
>> >> > > > > > millions of documents), which obviously cause us very serious
>> >> > > > > > problems when sorting. In fact, we disabled sorting completely
>> >> > > > > > because the servers were just exploding when trying to sort
>> >> > > > > > results in RAM. The users of the system can search for
>> >> > > > > > whatever they want, so we never know how many results will be
>> >> > > > > > returned - a search can return 10 documents (no problem with
>> >> > > > > > sorting) or 10,000,000 (huge sorting problems).
>> >> > > > > >
>> >> > > > > > I woke up this morning and had a flash: is it possible with
>> >> > > > > > Lucene to have a "natural sorting" of documents? For example,
>> >> > > > > > let's say I have 3 columns I want to be able to sort by: first
>> >> > > > > > name, last name, email; I would have 3 different indexes with
>> >> > > > > > the very same data but with a different primary key for
>> >> > > > > > sorting. I know it's far fetched, and I have never seen
>> >> > > > > > anything like that since I started using Lucene, but we're
>> >> > > > > > just desperate... how do people manage to have huge data
>> >> > > > > > sets, a lot of users, and sort!?
>> >> > > > > >
>> >> > > > > > Thanks,
>> >> > > > > >
>> >> > > > > > Mike
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
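
To make two points from the quoted thread concrete, here is a small sketch
in plain Java (no Lucene calls) showing why unpadded numbers sort wrongly as
strings, why zero-padding fixes the order while still being a string sort,
and how truncating timestamps to YYYYMMDD cuts the number of unique terms.
All class and method names here (SortKeySketch, pad8, toDay, and so on) are
hypothetical, the 8-digit width is taken from the example in the thread, and
UTC is an assumption. It does not show NumericField itself.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: none of these names come from Lucene.
class SortKeySketch {

    // Left-pads a non-negative number to 8 digits, e.g. 10 -> 00000010.
    static String pad8(long value) {
        return String.format(Locale.ROOT, "%08d", value);
    }

    // Sorts the plain decimal strings lexicographically (the broken case).
    static List<String> sortAsStrings(long... values) {
        List<String> terms = new ArrayList<>();
        for (long v : values) {
            terms.add(Long.toString(v));
        }
        Collections.sort(terms);
        return terms;
    }

    // Sorts the zero-padded strings lexicographically. The order now matches
    // numeric order, but it is still a string comparison.
    static List<String> sortPadded(long... values) {
        List<String> terms = new ArrayList<>();
        for (long v : values) {
            terms.add(pad8(v));
        }
        Collections.sort(terms);
        return terms;
    }

    // Truncates an epoch-millisecond timestamp to a YYYYMMDD term (UTC assumed).
    static String toDay(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                .atZone(ZoneOffset.UTC)
                .toLocalDate()
                .format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Counts how many distinct day terms a set of timestamps collapses to.
    static int uniqueDays(long[] timestamps) {
        Set<String> days = new HashSet<>();
        for (long t : timestamps) {
            days.add(toDay(t));
        }
        return days.size();
    }

    public static void main(String[] args) {
        System.out.println(sortAsStrings(100, 2, 10));
        System.out.println(sortPadded(100, 2, 10));
        long[] stamps = {
            Instant.parse("2010-08-18T07:38:32Z").toEpochMilli(),
            Instant.parse("2010-08-18T15:21:00Z").toEpochMilli(),
            Instant.parse("2010-08-17T16:08:00Z").toEpochMilli()
        };
        System.out.println(uniqueDays(stamps) + " day terms for "
                + stamps.length + " distinct timestamps");
    }
}
```

Three distinct timestamps collapse to two day terms here; on an index where
nearly every document has its own timestamp, the same truncation bounds the
term count by the number of days covered rather than the number of documents.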
