Re: Large scale sorting

Paul Smith Mon, 09 Apr 2007 15:16:06 -0700


On 10/04/2007, at 4:18 AM, Doug Cutting wrote:

Paul Smith wrote:
Disadvantages to this approach:
* It's a lot more I/O intensive
I think this would be prohibitive. Queries matching more than a few hundred documents will take several seconds to sort, since random disk accesses are required per matching document. Such an approach is only practical if you can guarantee that queries match fewer than a hundred documents, which is not generally the case, especially with large collections.

I don't disagree with the premise that it involves substantial I/O and would increase the time taken to sort, or that this approach shouldn't be the default mechanism, but it's not too difficult to build a disk I/O subsystem that can allocate many spindles to service this and to allow the underlying OS to use its buffer cache (yes, this is sounding like a database server now, isn't it).

I'm working on the basis that it's a LOT harder/more expensive to simply allocate more heap size to cover the current sorting infrastructure. One hits memory limits faster. Not everyone can afford 64-bit hardware with many GB of RAM to allocate to a heap. It _is_ cheaper/easier to build a disk subsystem to tune this I/O approach, and one can still use any RAM as buffer cache for the memory-mapped file anyway.
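
A minimal Java sketch of the memory-mapped idea described above. It assumes a hypothetical side file holding one fixed-width, space-padded ASCII sort key per document, addressed by document number. The class name MappedSortKeys, the 20-byte key width and the file layout are illustrative assumptions and not part of Lucene.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class MappedSortKeys {
    // Hypothetical fixed key width in bytes; shorter keys are space-padded.
    static final int KEY_WIDTH = 20;

    private final MappedByteBuffer buffer;

    MappedSortKeys(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed; the OS
            // pages key data in and out through its buffer cache.
            buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    // Reads the key for one document using absolute (position-independent) gets.
    String keyFor(int docId) {
        byte[] key = new byte[KEY_WIDTH];
        int base = docId * KEY_WIDTH;
        for (int i = 0; i < KEY_WIDTH; i++) {
            key[i] = buffer.get(base + i);
        }
        return new String(key, StandardCharsets.US_ASCII).trim();
    }

    // Sorts matching document numbers in place by their mapped keys.
    // Each comparison may touch the disk if the pages are not cached.
    static void sortByKey(MappedSortKeys keys, Integer[] matches) {
        Arrays.sort(matches, (a, b) -> keys.keyFor(a).compareTo(keys.keyFor(b)));
    }

    // Helper for the demo: writes the keys as fixed-width records.
    static void writeKeys(Path file, String[] keys) throws IOException {
        byte[] all = new byte[keys.length * KEY_WIDTH];
        Arrays.fill(all, (byte) ' ');
        for (int d = 0; d < keys.length; d++) {
            byte[] k = keys[d].getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(k, 0, all, d * KEY_WIDTH, Math.min(k.length, KEY_WIDTH));
        }
        Files.write(file, all);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sortkeys", ".bin");
        writeKeys(file, new String[] {"pear", "apple", "mango"});
        Integer[] matches = {0, 1, 2};
        sortByKey(new MappedSortKeys(file), matches);
        System.out.println(Arrays.toString(matches));
    }
}
```

A single mapping in Java is limited to 2GB, so a real implementation over a large index would probably need several mappings, and would want to read each key once per matching document rather than once per comparison.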

In my experience, raw search time starts to climb towards one second per query as collections grow to around 10M documents (in round figures and with lots of assumptions). Thus, searching on a single CPU is less practical as collections grow substantially larger than 10M documents, and distributed solutions are required. So it would be convenient if sorting is also practical for ~10M document collections on standard hardware. If 10M strings with 20 characters are required in memory for efficient search, this requires 400MB. This is a lot, but not an unusual amount on today's machines. However, if you have a large number of fields, then this approach may be problematic and force you to consider a distributed solution earlier than you might otherwise.
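
One way to arrive at the 400MB figure is sketched below in Java. It assumes 2 bytes per character (a Java char) and ignores per-object overhead, so real heap usage would be higher. The class name and the field and locale counts in main are hypothetical.

```java
public class SortCacheEstimate {
    // Assumption: 2 bytes per character, no String or array header overhead.
    static final int BYTES_PER_CHAR = 2;

    static long estimateBytes(long docCount, int charsPerValue) {
        return docCount * charsPerValue * BYTES_PER_CHAR;
    }

    public static void main(String[] args) {
        long oneField = estimateBytes(10_000_000L, 20);
        System.out.println("one sort field: " + oneField / 1_000_000 + " MB");

        // Hypothetical host with several sort fields, each cached per locale.
        int fields = 5;
        int locales = 3;
        long total = oneField * fields * locales;
        System.out.println("all fields and locales: " + total / 1_000_000 + " MB");
    }
}
```

The second figure shows how quickly the cost multiplies once several fields and locales share one heap.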

400MB is not a lot in and of itself, but when one has many of these types of indexes, with many sorting fields and many locales on the same host, it becomes problematic. I'm sure there's a point where distributing doesn't work over really large collections, because even if one partitioned an index across many hosts, one still needs to merge sort the results together.
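
The merge step mentioned above can be sketched in Java as a k-way merge over per-partition result lists that are already sorted. This is an illustration only: the class ShardMerge and the method mergeTopN are hypothetical names, and this is not how Lucene itself merges results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class ShardMerge {
    // Merges already-sorted per-partition lists and returns the first n values.
    // Queue entries are {partitionIndex, positionInPartition}.
    static <T extends Comparable<T>> List<T> mergeTopN(List<List<T>> shards, int n) {
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> shards.get(a[0]).get(a[1]).compareTo(shards.get(b[0]).get(b[1])));
        for (int s = 0; s < shards.size(); s++) {
            if (!shards.get(s).isEmpty()) {
                heads.add(new int[] {s, 0});
            }
        }
        List<T> merged = new ArrayList<>();
        while (!heads.isEmpty() && merged.size() < n) {
            int[] head = heads.poll();
            List<T> shard = shards.get(head[0]);
            merged.add(shard.get(head[1]));
            // Advance within the partition that supplied the smallest value.
            if (head[1] + 1 < shard.size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> shards = Arrays.asList(
            Arrays.asList("apple", "pear"),
            Arrays.asList("banana", "zebra"),
            Arrays.asList("cherry"));
        System.out.println(mergeTopN(shards, 4));
    }
}
```

Each partition only needs to return its own top n sort keys, but those keys still have to travel to the merging host and be compared there, which is the cost being pointed at.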

It would be disappointing if Lucene's innate design limited it to 10M document collections before needing to consider distributed solutions. 10M is not that many. It would be better if the sorting mechanism in Lucene was a little more decoupled, such that more customised designs could be utilised for specific scenarios. Right now it's a one-size-fits-all approach unless one substantially guts the code.


cheers,

Paul


