hi peter,

two quick questions:

1. could you let me know what kind of response time you were getting with 
solr (as well as the size of the data and the result sizes)?

2. i took a really quick look at DocSetHitCollector and saw the dreaded

  if (bits == null) bits = new BitSet(maxDoc);

line of code.

since i rewrote some lucene code to support 64-bit search instances, i have 
indexes that may reach quite a few GBs. allocating BitSets (arrays of longs) 
is quite expensive memory-wise, and i am still a little skeptical about 
performance with large result sets.
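to make the memory concern concrete, here is a back-of-the-envelope sketch (the 100M maxDoc figure below is hypothetical, not a number from this thread):

```java
import java.util.BitSet;

// Back-of-the-envelope sketch; the 100M maxDoc below is a made-up
// figure for illustration, not one taken from this thread.
public class BitSetCost {
    // java.util.BitSet stores its bits in a long[], so a set sized to
    // maxDoc costs roughly maxDoc / 8 bytes however few hits it holds.
    static long bytesFor(long maxDoc) {
        return maxDoc / 8;
    }

    public static void main(String[] args) {
        System.out.println(bytesFor(100_000_000L) + " bytes"); // ~12.5 MB per set
        // and the constructor allocates the whole long[] up front:
        BitSet bits = new BitSet(1 << 20); // ~128 KB, immediately
    }
}
```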
   
i did some testing of my facet impl, and after an overnight webload session 
saw an average response time of about 500 ms for full faceting (with result 
sets from a few thousand to over 100,000 docs).

would really like to hear your views.

thanks,
   
  

Peter Keegan <[EMAIL PROTECTED]> wrote:
I compared Solr's DocSetHitCollector and counting bitset intersections to
get facet counts with a different approach that uses a custom hit collector
that tests each docid hit (bit) against each facet's bitset and increments a
count in a histogram. My assumption was that for queries with few hits, this
would be much faster than always computing bitset intersections/cardinality
for every facet.

However, my throughput testing shows that the Solr method is at least 50%
faster than mine. I'm seeing a big win from the use of HashDocSet for
lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K seems
to provide optimal performance. I'm looking forward to trying this with
OpenBitSet.

Peter
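For reference, the two strategies compared above could be sketched roughly as follows (this is an illustrative sketch, not Solr's actual code; names are made up):

```java
import java.util.BitSet;

// Sketch of the two facet-counting strategies discussed above.
// Not Solr's actual implementation; class and method names are made up.
public class FacetCounting {
    // Solr-style: intersect the query's doc set with each facet's bitset
    // and take the cardinality of the intersection.
    static int[] byIntersection(BitSet queryDocs, BitSet[] facetSets) {
        int[] counts = new int[facetSets.length];
        for (int f = 0; f < facetSets.length; f++) {
            BitSet tmp = (BitSet) queryDocs.clone();
            tmp.and(facetSets[f]);          // intersection
            counts[f] = tmp.cardinality();  // popcount
        }
        return counts;
    }

    // Custom-collector style: test each hit docid against every facet
    // bitset and increment a histogram of counts.
    static int[] byHistogram(int[] hitDocIds, BitSet[] facetSets) {
        int[] counts = new int[facetSets.length];
        for (int doc : hitDocIds) {
            for (int f = 0; f < facetSets.length; f++) {
                if (facetSets[f].get(doc)) counts[f]++;
            }
        }
        return counts;
    }
}
```

The per-hit histogram does work proportional to hits × facets, while the intersection approach pays for the full bitset length per facet, which is why one would expect the histogram to win on small result sets.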




On 5/29/06, zzzzz shalev wrote:
>
> i know i'm a little late replying to this thread, but, in my humble opinion,
> the best way to aggregate values (not necessarily terms, but whole values in
> fields) is as follows:
>
> startup stage:
>
> for each field you would like to aggregate create a hashmap
>
> open an index reader and run through all the docs
>
> get the values to be aggregated from the fields of each doc
>
> create a hashcode for each value from each field collected; the hashcode
> should have some sort of prefix indicating which field it's from (for example:
> 1 = author, 2 = ...) and hence which hash it is stored in (at retrieval
> time, this prefix can be used to easily retrieve the value from the correct
> hash)
>
> place the hashcode/value in the appropriate hash
>
> create an arraylist
>
> at index X in the arraylist place an int array of all the hashcodes
> associated with doc id X
>
> so for example: if doc id 0 contains the values william
> shakespeare and 1797, the array list at index 0 will have an int
> array containing 2 values (the 2 hashcodes of shakespeare and 1797)
>
> run time:
>
> at run time, receive the hits and iterate through the doc ids, aggregating
> the values with direct access into the arraylist (for doc id 10, go to index
> 10 in the arraylist to retrieve the array of hashcodes) and lookups into the
> hashmaps
>
> i tested this today on a small index (approx 400,000 docs, 1GB of data),
> but i ran queries returning over 100,000 results
>
> my response time was about 550 milliseconds on large (over 100,000)
> result sets
>
> another point: this method should scale to much larger indexes as
> well, since it is linear in the result set size and not the index size (which
> is a HUGE bonus)
>
> if anyone wants the code let me know,
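The startup/runtime scheme described above could be sketched as follows. This is not the poster's actual code: it collapses the per-field hashmaps into one code-to-value map for brevity, ignores hash collisions, and all names are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Condensed sketch of the quoted scheme (not the poster's code).
// Simplifications: a single code->value map instead of one hashmap per
// field, and hashcode collisions are ignored.
public class ValueAggregator {
    // startup stage: remember each value under a prefixed hashcode
    Map<Integer, String> values = new HashMap<>();  // code -> field value
    List<int[]> codesByDoc = new ArrayList<>();     // index = doc id

    int code(int fieldPrefix, String value) {
        // prefix in the high bits marks which field the value came from
        int c = (fieldPrefix << 24) | (value.hashCode() & 0xFFFFFF);
        values.put(c, value);
        return c;
    }

    // run time: walk the hit doc ids, direct-index into the arraylist,
    // and count each value via map lookups
    Map<String, Integer> aggregate(int[] hitDocIds) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : hitDocIds) {
            for (int c : codesByDoc.get(doc)) {
                counts.merge(values.get(c), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

As the quoted message notes, the runtime cost here depends only on the number of hits (and codes per doc), not on the index size.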
>
>
>
>
> Marvin Humphrey wrote:
>
> Thanks, all.
>
> The field cache and the bitsets both seem like good options until the
> collection grows too large, provided that the index does not need to
> be updated very frequently. Then for large collections, there's
> statistical sampling. Any of those options seems preferable to
> retrieving all docs all the time.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>

