On 28-Jul-08, at 11:16 PM, Britske wrote:


That sounds interesting. Let me explain my situation, which may be a variant of what you are proposing. My documents contain more than 10,000 fields, but these fields are divided as follows:

1. about 20 general-purpose fields, of which more than one can be selected in
a query.
2. about 10,000 fields, of which each query selects exactly one, based on
some criteria.

Obviously 2. is killing me here, but given the above, perhaps it would be possible to make 10,000 vertical slices/indices and, based on the field to
be selected (from point 2), choose the slice/index to search in.
The 10,000 indices would run on the same box, and the 20 general-purpose fields would have to be copied to all slices (which means some increase in
overall index size, but manageable). This would give me far more
reasonably sized and compact documents, which would mean documents are far more likely to be in the same cache slot and to be read in the same disk
seek.
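
To make that concrete, the routing I have in mind would look something like this. This is just a sketch: the per-slice directory layout (one Lucene index per field under indexRoot/<fieldName>/) and the SliceRouter name are made up, not anything Solr provides out of the box:

    // Sketch only: route each query to the slice/index that holds the one
    // selected field. Assumes one Lucene index per "point 2" field, with
    // the 20 general-purpose fields copied into every slice.
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Path;

    public class SliceRouter {
        private final Path indexRoot;

        public SliceRouter(Path indexRoot) {
            this.indexRoot = indexRoot;
        }

        public TopDocs search(String selectedField, Query query, int n) throws Exception {
            // Pick the slice by the one field this query selects (point 2 above).
            Path sliceDir = indexRoot.resolve(selectedField);
            // Opening a reader per request is for brevity only; with 10,000
            // slices the readers would have to be cached and shared.
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(sliceDir))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                return searcher.search(query, n);
            }
        }
    }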

Are all 10k values equally likely to be retrieved?

Does this make sense?

Well, I would probably split into two indices, one containing the 20 fields and one containing the 10k. However, if the 10k fields are equally likely to be chosen, this will not help in the long term, since the working set of disk blocks is still going to be all of them.
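
Roughly, assuming both indexes share a unique "id" field to join on, the lookup would be something like this (a sketch only; the class and field names are illustrative):

    // Sketch only: the two-index split, joined by a shared unique "id" field.
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class TwoIndexLookup {
        private final IndexSearcher smallSearcher; // the 20 general-purpose fields
        private final IndexSearcher wideSearcher;  // the 10k fields

        public TwoIndexLookup(DirectoryReader small, DirectoryReader wide) {
            this.smallSearcher = new IndexSearcher(small);
            this.wideSearcher = new IndexSearcher(wide);
        }

        // Query the small index, then join each hit to the wide index by id
        // and pull just the one selected field.
        public String[] searchAndFetch(Query q, String selectedField, int n) throws Exception {
            TopDocs hits = smallSearcher.search(q, n);
            String[] values = new String[hits.scoreDocs.length];
            for (int i = 0; i < hits.scoreDocs.length; i++) {
                Document d = smallSearcher.doc(hits.scoreDocs[i].doc);
                String id = d.get("id");
                TopDocs match = wideSearcher.search(new TermQuery(new Term("id", id)), 1);
                if (match.scoreDocs.length > 0) {
                    values[i] = wideSearcher.doc(match.scoreDocs[0].doc).get(selectedField);
                }
            }
            return values;
        }
    }

Note that the per-hit fetch still reads a stored document out of the wide index, which is exactly why the working set does not shrink if every field is equally likely to be asked for.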

Am I correct that this has nothing to do with
distributed search, since that is really all about horizontal splitting/sharding of the index, whereas what I'm suggesting is splitting vertically? Is there some other part of Solr that I could use for this, or would it be all
home-grown?

There is some stuff coming down the pipeline in Lucene, but nothing is currently there. Honestly, it sounds like these extra fields should just be stored in a separate file/database. I also wonder whether solving the underlying problem really requires storing 10k values per doc? (You haven't given us many clues in this regard.)
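
As a rough sketch of the separate-database idea: keep only the 20 general fields in Lucene, store the 10k values externally keyed by (doc id, field name), and fetch the single selected value after the search. The table layout and JDBC usage here are made-up examples, not anything Solr ships with:

    // Sketch only: push the 10k values out to a database keyed by
    // (doc_id, field_name); Lucene holds just the general-purpose fields.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class ExternalValueStore {
        private final Connection conn;

        public ExternalValueStore(String jdbcUrl) throws Exception {
            this.conn = DriverManager.getConnection(jdbcUrl);
        }

        // Fetch the single selected value for a document that the Lucene
        // search over the general-purpose fields returned.
        public String fetch(String docId, String fieldName) throws Exception {
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT value FROM doc_values WHERE doc_id = ? AND field_name = ?")) {
                ps.setString(1, docId);
                ps.setString(2, fieldName);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString(1) : null;
                }
            }
        }
    }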

-Mike
