Hi Dave, facets: in you case a solution with one int[IndexReader.maxDoc()] fits. For each document number you can store an integer which represents the facet value. This is what org.apache.solr.request.UnInvertedField will store in your case. (*John* : is there something similar in com.browseengine ? )
But UnInvertedField is designed for fields with more then one value per doc. Possible to implement directly the solution with int[IndexReader.maxDoc()] is more easy. The implementation with int[IndexReader.maxDoc()] should be 150 times faster then your current solution (and use only 16:300 of your main memory). But I still wonder that your solution is slow, did you ever use a profiler? Enough Xmx? Swapping? Possible your implementation of htFacetResults.get is slow? Possible same waiting because of synchronized code? But btw.: Your implementation is not thread save: Think about two htFacetResults.get before one htFacetResults.put INDEXORDER: For INDEXORDER the MultiSearcher and ParallelMultiSearcher use the docNumber for each index as score. So the result is 1. Doc with docNum 1 from first Index 2. Doc with docNum 1 from second Index .. n. Doc with docNum m from first Index n+1. Doc with docNum m from second Index .. I did not know this before. And I was surprised, because the docNum use the "starts"-Array but the score does not. So you can use a BitSet to collect the hits. The bits itself are in correct order. (and you could index and search without frequencies). Best regrads Karsten David Seltzer wrote: > > Karsten, > > You're right, 300 facets would be a lot. Hehe. I have one facet with > about three hundred potential values. What I've done is create an > FacetManager who, in another thread, sets up an map of ~300 OpenBitSets. > One bitset for each possible value of the facet. > > Then, rather than using an iterative cardinality comparison, I use a > HitCollector to create an set of counters. > > public void collect(int doc, float score) { > //we don't care about score, all we care about is docID; > //we need to find out if this document is in any of our facets... if > it is, increment a counter > for(SearchFacet sfTemp : arrayOfSearchFacetsValues) { > if(sfTemp.getBitSet().fastGet(doc)) { > //this is a hit! > long lCount = htFacetResults.get(sfTemp.getTerm().text()); > htFacetResults.put(sfTemp.getTerm().text(), lCount+1); > > //this code is designed for mutually exclusive > //facet values... in that scenario, a hit here means > //that we can't have a hit anywhere else, so we should > //break. > break; > } > } > } > > Here I seem to be running into a performance issue. It seems that when a > resultset is small (~10,000) this method greatly outperforms the > iterating cardinality check. However, when the resultset is large > (300,000) the HitCollector takes twice as long to process the resultset > as the other solution. > > Our total index typically contains about 100M documents. This is broken > up into four monthly indexes each containing 250K documents. And a > typical search returns < 120,000 results. Lousy searches return more > results (IE "obama" returns nearly 800,000 documents). > > At the moment we're using ParalellMultiSearcher. When I do a search, > across four montly indexes, ordered by INDEXORDER what I get is all of > the hits that happened on the first of any month, then all the hits that > happened on the second of any month. Does 'starts' behave the same way > in ParallelMultiSearcher? > > Thanks for all your input! > > -Dave > > -- View this message in context: http://www.nabble.com/Faceting%2C-Sort-and-DocIDSet-tp23099854p23173324.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org