Yep, you are correct: this is a lousy implementation, which I knew when I wrote it.
I'm not interested in the entire document, just the grouping term and the docId it is connected to. So how do I get hold of the TermDocs for the grouping field?

I mean, I probably first need to perform the query, searcher.search(...), which would give me a set of doc ids. Then I need to group them all by, for instance, "ip-address", save each ip-address in another set, and in the end calculate the size of that set,

i.e. the equivalent of:
select count(distinct(ipAddress)) from AccessLog where date='2009-01-25'
(optionally group by ipAddress?)

//Marcus

On Wed, Jan 28, 2009 at 3:02 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> At a quick glance, this line is really suspicious:
>
> Document document = this.indexReader.document(doc)
>
> From the Javadoc for HitCollector.collect:
>
> Note: This is called in an inner search loop. For good search performance,
> implementations of this method should not call Searcher.doc(int) or
> IndexReader.document(int) on every document number encountered. Doing so
> can slow searches by an order of magnitude or more.
>
> You're loading the document each time through the loop. I think you'd get
> much better performance by making sure that your groupField is indexed,
> then use TermDocs (TermEnum?) to get the value of the field.
>
> Best
> Erick
>
> On Wed, Jan 28, 2009 at 6:43 AM, Marcus Herou <marcus.he...@tailsweep.com> wrote:
>
> > Hi.
> >
> > This is way too slow I think since what you are explaining is something I
> > already tested. However I might be using the HitCollector badly.
> >
> > Please prove me wrong. Supplying some code which I tested this with.
> > It stores a hash of the value of the term in a TIntHashSet and just
> > calculates the size of that set.
> > This one takes approx 3 sec on about 0.5M rows = way too slow.
> >
> > main test class:
> > public class GroupingTest
> > {
> >     protected static final Log log = LogFactory.getLog(GroupingTest.class.getName());
> >     static DateFormat df = new SimpleDateFormat("yyyy-MM-dd");
> >     public static void main(String[] args)
> >     {
> >         Utils.initLogger();
> >         String[] fields = {"uid","ip","date","siteId","visits","countryCode"};
> >         try
> >         {
> >             IndexFactory fact = new IndexFactory();
> >             String d = "/tmp/csvtest";
> >             fact.initDir(d);
> >             IndexReader reader = fact.getReader(d);
> >             IndexSearcher searcher = fact.getSearcher(d, reader);
> >             QueryParser parser = new MultiFieldQueryParser(fields, fact.getAnalyzer());
> >             Query q = parser.parse("date:20090125");
> >
> >             GroupingHitCollector coll = new GroupingHitCollector();
> >             coll.setDistinct(true);
> >             coll.setGroupField("uid");
> >             coll.setIndexReader(reader);
> >             long start = System.currentTimeMillis();
> >             searcher.search(q, coll);
> >             long stop = System.currentTimeMillis();
> >             System.out.println("Time: " + (stop-start) + ", distinct count(uid):"+coll.getDistinctCount() + ", count(uid): "+coll.getCount());
> >         }
> >         catch (Exception e)
> >         {
> >             log.error(e.toString(), e);
> >         }
> >     }
> > }
> >
> >
> > public class GroupingHitCollector extends HitCollector
> > {
> >     protected IndexReader indexReader;
> >     protected String groupField;
> >     protected boolean distinct;
> >     //protected TLongHashSet set;
> >     protected TIntHashSet set;
> >     protected int distinctSize;
> >
> >     int count = 0;
> >     int sum = 0;
> >
> >     public GroupingHitCollector()
> >     {
> >         set = new TIntHashSet();
> >     }
> >
> >     public String getGroupField()
> >     {
> >         return groupField;
> >     }
> >
> >     public void setGroupField(String groupField)
> >     {
> >         this.groupField = groupField;
> >     }
> >
> >     public IndexReader getIndexReader()
> >     {
> >         return indexReader;
> >     }
> >
> >     public void setIndexReader(IndexReader indexReader)
> >     {
> >         this.indexReader = indexReader;
> >     }
> >
> >     public boolean isDistinct()
> >     {
> >         return distinct;
> >     }
> >
> >     public void setDistinct(boolean distinct)
> >     {
> >         this.distinct = distinct;
> >     }
> >
> >     public void collect(int doc, float score)
> >     {
> >         if(distinct)
> >         {
> >             try
> >             {
> >                 Document document = this.indexReader.document(doc);
> >                 if(document != null)
> >                 {
> >                     String s = document.get(groupField);
> >                     if(s != null)
> >                     {
> >                         set.add(s.hashCode());
> >                         //set.add(Crc64.generate(s));
> >                     }
> >                 }
> >             }
> >             catch (IOException e)
> >             {
> >                 e.printStackTrace();
> >             }
> >         }
> >         count++;
> >         sum += doc; // use it to avoid any possibility of being optimized away
> >     }
> >
> >     public int getCount() { return count; }
> >     public int getSum() { return sum; }
> >
> >     public int getDistinctCount()
> >     {
> >         distinctSize = set.size();
> >         return distinctSize;
> >     }
> > }
> >
> >
> > On Wed, Jan 28, 2009 at 10:51 AM, ninaS <nina...@gmx.de> wrote:
> >
> > > By the way: if you only need to count documents (count groups),
> > > HitCollector is a good choice. If you only count, you don't need to
> > > sort anything.
> > >
> > > ninaS wrote:
> > > >
> > > > Hello,
> > > >
> > > > yes I tried HitCollector but I am not satisfied with it because you
> > > > cannot use sorting with HitCollector unless you implement a way to
> > > > use TopFieldDocCollector. I did not manage to do that in a
> > > > performant way.
> > > >
> > > > It is easier to first do a normal search and "group by" afterwards:
> > > >
> > > > Iterate through the result documents and take one of each group.
> > > > Each document has a groupingKey. I remember which groupingKey is
> > > > already used and don't take another document of this group into the
> > > > result list.
> > > >
> > > > Regards,
> > > > Nina
> > > >
> > >
> > > --
> > > View this message in context:
> > > http://www.nabble.com/Group-by-in-Lucene---tp13581760p21702742.html
> > > Sent from the Lucene - Java Users mailing list archive at Nabble.com.

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
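
For reference, here is a minimal sketch of the TermDocs idea raised in this thread: first collect the matching doc ids, then walk the terms of the grouping field and count every term that has at least one posting among the hits. This is not working Lucene code. The in-memory map is a hypothetical stand-in for the index, and the class and method names (DistinctTermCountSketch, countDistinct) are made up for illustration. In Lucene 2.x the corresponding calls would presumably be IndexReader.terms(Term) to enumerate the field's terms and IndexReader.termDocs(Term) to walk each term's doc ids, with the hits gathered into a BitSet by a HitCollector that only sets a bit per doc.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

/**
 * Sketch only: counts the distinct values of one field among the matching
 * docs by walking the field's terms and their postings, without loading
 * any stored document. The map below is a stand-in for the real index.
 */
final class DistinctTermCountSketch {

    /**
     * @param hits     doc ids matched by the query (one bit per matching doc)
     * @param postings term value of the grouping field -> doc ids containing it
     * @return number of terms that occur in at least one matching doc,
     *         i.e. the equivalent of count(distinct field) over the hits
     */
    static int countDistinct(BitSet hits, Map<String, int[]> postings) {
        int distinct = 0;
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                if (hits.get(doc)) {
                    distinct++;
                    break; // one matching doc is enough for this term
                }
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        // Hypothetical postings for an "ip" field: term -> doc ids.
        Map<String, int[]> ipPostings = new TreeMap<>();
        ipPostings.put("10.0.0.1", new int[] {0, 3});
        ipPostings.put("10.0.0.2", new int[] {1});
        ipPostings.put("10.0.0.3", new int[] {2, 4});

        // Pretend the date query matched docs 0, 1 and 3.
        BitSet hits = new BitSet();
        hits.set(0);
        hits.set(1);
        hits.set(3);

        System.out.println("count(distinct ip) = " + countDistinct(hits, ipPostings));
    }
}
```

Note that this walks every posting of the grouping field regardless of how many docs matched, so whether it beats the per-document load depends on how many unique terms the field has. It also counts exact term values rather than hash codes, so hash collisions cannot undercount.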