Mike Klaas wrote:
> 
> On 28-Jul-08, at 11:16 PM, Britske wrote:
> 
>>
>> That sounds interesting. Let me explain my situation, which may be a
>> variant of what you are proposing. My documents contain more than 10,000
>> fields, but these fields are divided like this:
>>
>> 1. about 20 general-purpose fields, of which more than one can be selected
>> in a query.
>> 2. about 10,000 fields, of which each query selects exactly one based on
>> some criteria.
>>
>> Obviously 2. is killing me here, but given the above it would perhaps be
>> possible to make 10,000 vertical slices/indices and, based on the field to
>> be selected (from point 2), select the slice/index to search in.
>> The 10,000 indices would run on the same box, and the 20 general-purpose
>> fields would have to be copied to all slices (which means some increase in
>> overall index size, but manageable). This would give me far more
>> reasonably sized and compact documents, meaning documents are far more
>> likely to end up in the same cache slot and be accessed in the same disk
>> seek.
> 
> Are all 10k values equally likely to be retrieved?
> 
> 
Well, not exactly, but let's say the access probabilities of the most and
least probable choices differ by roughly a factor of 100. Of course this also
gives me massive room for optimization (different indices on different boxes,
with memory tuned for each) if I did indeed have 10k separate indices. (For
simplicity I'm ignoring the remaining 20 fields here.)



>> Does this make sense?
> 
> Well, I would probably split into two indices, one containing the 20 fields
> and one containing the 10k. However, if the 10k fields are equally likely
> to be chosen, this will not help in the long term, since the working set of
> disk blocks is still going to be all of them.
> 

I figured that having 10k separate indices (if that's at all feasible) would
pack the values for the same field more closely together on disk, resulting
in fewer disk seeks. Since any given query selects only 1 out of the 10k
fields, I could delegate each query to exactly 1 of the 10k indices.
Wouldn't this limit the working set of disk blocks?
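
To make the delegation concrete, the routing I have in mind would look
roughly like the sketch below; the per-field core naming and the base URL are
purely made up, since nothing like this exists in Solr out of the box:

    from urllib.parse import urlencode

    SOLR_BASE = "http://localhost:8983/solr"   # hypothetical; one core per selectable field

    def route_query(selected_field, category_filters, rows=10):
        """Pick the per-field slice/core and build the query URL against it."""
        core = "slice_" + selected_field                 # made-up core naming scheme
        params = [("q", "*:*"),
                  ("rows", str(rows)),
                  ("sort", selected_field + " asc")]     # sort by the one int column
        params += [("fq", f) for f in category_filters]  # the ~20 general-purpose fields
        return "%s/%s/select?%s" % (SOLR_BASE, core, urlencode(params))

    print(route_query("p_m_3_7", ["region:EU", "active:true"]))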



> 
>> Am I correct that this has nothing to do with distributed search, since
>> that really is all about horizontal splitting / sharding of the index, and
>> what I'm suggesting is splitting vertically? Is there some other part of
>> Solr that I can use for this, or would it all be home-grown?
> 
> There is some stuff that is coming down the pipeline in Lucene, but nothing
> is currently there. Honestly, it sounds like these extra fields should just
> be stored in a separate file/database. I also wonder if solving the
> underlying problem really requires storing 10k values per doc (you haven't
> given us many clues in this regard)?
> 
> -Mike
> 
> 

Well, without being able to go into what the data represents (hopefully I can
later), I've found that the following analogy works well:

- Rows in Solr represent product categories; I will have up to 100k of them.

- Each product category can contain up to 10k products, encoded as the 10k
columns/fields (all 10k fields hold int values).
 
- At any given time, at most 1 product per product category is returned
(analogous to selecting 1 out of the 10k columns). This is the requirement
that makes this scheme possible.

- Products in the same column have certain characteristics in common, which
are encoded in the column name (using dynamic fields), so the combination of
these characteristics uniquely determines 1 out of the 10k columns. When the
user hasn't supplied all characteristics, good defaults can be chosen, so a
column can always be determined (see the sketch just after this list).

- On top of that, each row has 20 product-category fields (which all 10k
possible products of that category share).

- The row × column matrix is essentially sparse (between 10 and 50% is
filled).
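
To illustrate the encoding, here is a rough sketch of how a row could be
assembled; the characteristic names and values are made up, since I can't
share the real ones yet:

    # Rough sketch of how a row/document is assembled (made-up characteristics).
    # The column/field name is derived from the product characteristics, so a
    # query that fixes those characteristics lands on exactly one dynamic field.

    def column_name(characteristics):
        """Map a full set of characteristics to one of the ~10k dynamic int fields."""
        # e.g. {"size": "m", "color": 3, "variant": 7} -> "p_m_3_7"
        return "p_%s_%s_%s" % (characteristics["size"],
                               characteristics["color"],
                               characteristics["variant"])

    doc = {
        "id": "cat-123",               # one row per product category
        "category_name": "example",    # two of the ~20 general-purpose fields
        "region": "EU",
        # sparse set of product columns; each value is the int used to filter/sort:
        column_name({"size": "m", "color": 3, "variant": 7}): 1499,
        column_name({"size": "l", "color": 1, "variant": 2}): 1799,
    }
    print(doc)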

The queries performed are a combination of product-category filters and
filters that together uniquely determine 1 out of the 10k columns.

Returned results are those products that:
- satisfy the product-category filters
- are contained in the selected (1 out of 10k) column

The default sort order is by the int values in the selected column.
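
Spelled out as Solr request parameters, a query then comes down to something
like this (again with made-up field names):

    from urllib.parse import urlencode

    # Sketch of the query side: the product-category filters plus the one
    # selected column, which both restricts the results and serves as the sort key.
    selected = "p_m_3_7"   # derived from the user's (or default) characteristics
    params = [
        ("q", "*:*"),
        ("fq", "region:EU"),                      # product-category filters
        ("fq", "category_name:example"),
        ("fq", selected + ":[* TO *]"),           # only rows with a product in this column
        ("sort", selected + " asc"),              # default sort: the int value in the column
        ("fl", "id,category_name," + selected),   # return only what is needed for display
    ]
    print("/select?" + urlencode(params))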

Since the column values are used to both filter and sort the results, it
should be clear that I can't externalise the 10k fields to a database.
I have, however, looked at the possibility of only indexing (not storing)
these columns, but that would mean a post-query trip to a database to fetch
the values (since I need to display them). With the new knowledge that this
much time is spent on disk seeks because of the stored fields, the db trip
might even be quicker in some circumstances.

All in all I'm pretty happy with the setup, since it effectively enables me
to search across 100-500 million products really fast.

However, RAM is a concern: since any of the 10k fields can be sorted on, the
potential amount of RAM consumed by the Lucene FieldCache is pretty big
(about 10 GB if I remember correctly). Hearing that a large chunk of the
index also 'needs' to be in the OS cache therefore kind of freaked me out, as
you can imagine.
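
For what it's worth, here is the back-of-the-envelope calculation behind that
worry, assuming the FieldCache keeps roughly one 4-byte int per document for
every field that actually gets sorted on (raw values only, so the real
footprint would be higher):

    # Back-of-the-envelope FieldCache estimate: ~4 bytes per document, per
    # int field that gets sorted on, ignoring per-entry overhead.
    max_docs = 100000          # rows, i.e. product categories
    sortable_fields = 10000    # the 1-of-10k int columns
    bytes_per_int = 4

    worst_case = max_docs * sortable_fields * bytes_per_int
    print("%.1f GB" % (worst_case / 2.0**30))   # ~3.7 GB of raw values alone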

Hope this gives a clearer picture of what I'm trying to achieve.

Britske


