Yep. An external sort should probably be used when flushing to disk. I have written such code, so that part is a no-brainer; the problem is making it speedy :)
<http://dev.tailsweep.com/projects/utils/apidocs/org/tailsweep/utils/sort/TupleSorter.html>
http://dev.tailsweep.com/projects/utils/apidocs/com/tailsweep/utils/sort/TupleSorter.html
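For readers who have not seen the technique, here is a minimal sketch of an external merge sort in Java. It is not the TupleSorter code linked above; the class name `ExternalSortSketch`, the `maxInMemory` parameter, and the `sink` callback are all hypothetical, and it assumes every value is a single-line string. It sorts fixed-size chunks in memory, spills each chunk to a temporary run file, and then merges the runs with a priority queue.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

// Hypothetical sketch, not the TupleSorter API referenced in the mail.
final class ExternalSortSketch {

    /** One open run file plus the line currently at its head. */
    private static final class Run {
        final BufferedReader reader;
        String head;

        Run(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }
    }

    /** Sorts values while holding at most maxInMemory of them at once; output goes to sink in order. */
    public static void sort(Iterator<String> values, int maxInMemory, Consumer<String> sink) {
        if (maxInMemory < 1) throw new IllegalArgumentException("maxInMemory must be >= 1");
        try {
            List<Path> runs = new ArrayList<>();
            List<String> buffer = new ArrayList<>();
            while (values.hasNext()) {
                buffer.add(values.next());
                if (buffer.size() >= maxInMemory) {
                    runs.add(writeRun(buffer));   // spill a sorted run to disk
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) runs.add(writeRun(buffer));
            merge(runs, sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sorts the buffer in memory and writes it to a temporary file, one value per line. */
    private static Path writeRun(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path run = Files.createTempFile("run", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(run)) {
            for (String s : buffer) {
                w.write(s);
                w.newLine();
            }
        }
        return run;
    }

    /** K-way merge: repeatedly emit the smallest head among all runs. */
    private static void merge(List<Path> runs, Consumer<String> sink) throws IOException {
        PriorityQueue<Run> heap = new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        try {
            for (Path p : runs) {
                Run r = new Run(Files.newBufferedReader(p));
                if (r.head != null) heap.add(r); else r.reader.close();
            }
            while (!heap.isEmpty()) {
                Run r = heap.poll();
                sink.accept(r.head);
                r.head = r.reader.readLine();     // advance this run
                if (r.head != null) heap.add(r); else r.reader.close();
            }
        } finally {
            for (Run r : heap) r.reader.close();  // only non-empty if merging failed midway
            for (Path p : runs) Files.deleteIfExists(p);
        }
    }
}
```

Calling `ExternalSortSketch.sort(values.iterator(), 100000, writer::println)` would stream the sorted values to a writer without ever holding more than one chunk in memory. Most of the speed problem mentioned above lives in the run size and the I/O buffering, which this sketch leaves at defaults.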
Another way could be to use HDFS and MapFiles/SequenceFiles. Not speedy at all, but scalable.

I am thinking of writing my own inverted index, specialized for this kind of operation. Any pointers on where to start looking for material on that?

/Marcus

On Wed, Jan 28, 2009 at 5:02 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Group-by in Lucene/Solr has not been solved in a great general way yet to
> my knowledge.
>
> Ideally, we would want a solution that does not need to fit into memory.
> However, you need the value of the field for each document to do the
> grouping. As you are finding, this is not cheap to get. Currently, the
> efficient way to get it is to use a FieldCache. This, however, requires that
> every distinct value can fit into memory.
>
> Once you have efficient access to the values, you need to be able to
> efficiently group the results, again not bounded by memory (which we already
> are with the FieldCache).
>
> There are quite a few ways to do this. The simplest is to group until you
> have used all the memory you want. Then, for everything left: if a value
> doesn't match an existing group, write it to a file; if it does, increment
> the group count. Use the overflow file as the input for the next run, and
> repeat until there is no overflow. You can improve on that by partitioning
> the overflow file.
>
> And then there are a dozen other methods.
>
> Solr has a patch in JIRA that uses a sorting method. First the results are
> sorted on the group-by field, then scanned through for grouping - all field
> values that are the same will be next to each other. Finally, if you really
> wanted to sort on a different field, another sort is applied. That's not
> ideal IMO, but it's a start.
>
> - Mark
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Marcus Herou
CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
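The overflow-file approach Mark describes in the quoted message could be sketched roughly as below. This is only an illustration under stated assumptions, not code from Lucene, Solr, or the JIRA patch: the class name `OverflowGroupBySketch`, the `maxGroups` memory budget, and the `sink` callback are hypothetical, values are assumed to be single-line strings, and "grouping" is reduced to counting documents per value. The partitioning improvement he mentions is left out.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch of group counting with an overflow file.
final class OverflowGroupBySketch {

    /** Counts occurrences per value while keeping at most maxGroups groups in memory per pass. */
    public static void countGroups(Iterator<String> values, int maxGroups,
                                   BiConsumer<String, Long> sink) {
        if (maxGroups < 1) throw new IllegalArgumentException("maxGroups must be >= 1");
        try {
            Path overflow = onePass(values, maxGroups, sink);
            // Re-run over the overflow file until a pass spills nothing.
            while (overflow != null) {
                Path input = overflow;
                try (BufferedReader in = Files.newBufferedReader(input)) {
                    overflow = onePass(in.lines().iterator(), maxGroups, sink);
                } finally {
                    Files.deleteIfExists(input);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs one pass; returns the overflow file, or null if everything fit in memory. */
    private static Path onePass(Iterator<String> values, int maxGroups,
                                BiConsumer<String, Long> sink) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        Path overflow = Files.createTempFile("overflow", ".txt");
        boolean spilled = false;
        try (BufferedWriter out = Files.newBufferedWriter(overflow)) {
            while (values.hasNext()) {
                String v = values.next();
                if (counts.containsKey(v) || counts.size() < maxGroups) {
                    counts.merge(v, 1L, Long::sum);   // existing group, or room for a new one
                } else {
                    out.write(v);                     // no room: defer to the next pass
                    out.newLine();
                    spilled = true;
                }
            }
        }
        // Groups held in this pass are complete: a value is either always counted
        // or always spilled within one pass, because the map never shrinks.
        counts.forEach(sink);
        if (!spilled) {
            Files.delete(overflow);
            return null;
        }
        return overflow;
    }
}
```

Each pass finishes at least one group, so the overflow file shrinks every round and the loop terminates. With many distinct values and a small budget the number of passes grows quickly, which is where partitioning the overflow file would help.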