On 28-Jul-08, at 11:16 PM, Britske wrote:
> That sounds interesting. Let me explain my situation, which may be a
> variant of what you are proposing. My documents contain more than
> 10,000 fields, but these fields are divided like this:
>
> 1. About 20 general-purpose fields, of which more than one can be
> selected in a query.
> 2. About 10,000 fields, of which each query selects exactly one,
> based on some criteria.
>
> Obviously 2. is killing me here, but given the above, perhaps it
> would be possible to make 10,000 vertical slices/indices and, based
> on the field to be selected (from point 2), select the slice/index to
> search in. The 10,000 indices would run on the same box, and the 20
> general-purpose fields would have to be copied to all slices (which
> means some increase in overall index size, but manageable). This
> would give me far more reasonably sized and compact documents, which
> would mean documents are far more likely to be in the same cache slot
> and be accessed in the same disk seek.
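Britske's scheme could be sketched roughly as follows. This is a hypothetical in-memory illustration of the routing idea only, not a Solr feature or API; all names (`SliceRegistry`, `field_00042`, the stand-in general fields) are invented for the example.

```python
# Hypothetical sketch of the vertical-slice idea: each of the ~10,000
# sparse fields gets its own slice, each slice also carrying copies of
# the ~20 general-purpose fields. A query that selects exactly one
# sparse field is routed to the single slice that holds it.

class SliceRegistry:
    """Maps each sparse field name to the slice that stores it."""

    def __init__(self):
        self.slices = {}  # field_name -> {doc_id: {field: value, ...}}

    def add_document(self, doc_id, general, sparse):
        # Copy the general-purpose fields into every slice the doc touches.
        for field, value in sparse.items():
            slice_docs = self.slices.setdefault(field, {})
            doc = dict(general)
            doc[field] = value
            slice_docs[doc_id] = doc

    def search(self, sparse_field, predicate):
        # Route the query to the one slice holding the requested field,
        # so only small, compact documents are read.
        slice_docs = self.slices.get(sparse_field, {})
        return [doc for doc in slice_docs.values() if predicate(doc)]

registry = SliceRegistry()
registry.add_document(
    "doc1",
    general={"id": "doc1", "title": "widget", "price": 9.95},
    sparse={"field_00042": 7, "field_00099": 3},
)
hits = registry.search("field_00042", lambda d: d["field_00042"] > 5)
```

Each document lands in only as many slices as it has sparse fields set, which is what keeps the per-slice documents small.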
Are all 10k values equally likely to be retrieved?
> Does this make sense?
Well, I would probably split into two indices, one containing the 20
fields and one containing the 10k. However, if the 10k fields are
equally likely to be chosen, this will not help in the long term,
since the working set of disk blocks is still going to be all of them.
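Mike's working-set point can be checked with a back-of-envelope calculation: if each query picks one of the 10,000 fields uniformly at random, the probability a given slice stays untouched after Q queries is (1 - 1/F)^Q, so after a modest number of queries essentially every slice (and its disk blocks) has been hit. The numbers below are illustrative only.

```python
# If queries choose one of F = 10,000 sparse fields uniformly at random,
# P(a given slice is never touched after Q queries) = (1 - 1/F) ** Q.
# Expected number of cold slices = F * that probability.
F = 10_000
for Q in (10_000, 50_000, 100_000):
    p_cold = (1 - 1 / F) ** Q
    expected_cold = F * p_cold
    print(f"{Q:>7} queries -> ~{expected_cold:.0f} slices never touched")
```

By 100,000 queries fewer than one slice is expected to remain cold, so under uniform access the working set really is the whole index, regardless of how it is partitioned.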
> Am I correct that this has nothing to do with distributed search,
> since that really is all about horizontal splitting/sharding of the
> index, and what I'm suggesting is splitting vertically? Is there some
> other part of Solr that I can use for this, or would it be all
> home-grown?
There is some stuff coming down the pipeline in Lucene, but nothing is
there currently. Honestly, it sounds like these extra fields should
just be stored in a separate file/database. I also wonder whether
solving the underlying problem really requires storing 10k values per
doc (you haven't given us many clues in this regard).
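The separate-file/database suggestion could look something like this minimal sketch: keep only the ~20 searchable fields in the index and fetch the one needed sparse value by (doc_id, field) after the search. SQLite stands in for "file/database" here; the schema, table name, and `fetch_value` helper are all assumptions for illustration, not anything Solr or Lucene provides.

```python
# Side store for the 10,000 per-document values, keyed by (doc_id, field).
# The search index would return matching doc ids; the one sparse value a
# query needs is then a single point lookup here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sparse (doc_id TEXT, field TEXT, value REAL, "
    "PRIMARY KEY (doc_id, field))"
)
conn.executemany(
    "INSERT INTO sparse VALUES (?, ?, ?)",
    [("doc1", "field_00042", 7.0), ("doc1", "field_00099", 3.0)],
)

def fetch_value(doc_id, field):
    # Point lookup replacing a 10,000-field stored document in the index.
    row = conn.execute(
        "SELECT value FROM sparse WHERE doc_id = ? AND field = ?",
        (doc_id, field),
    ).fetchone()
    return row[0] if row else None
```

The trade-off is one extra lookup per hit at retrieval time, in exchange for an index whose stored documents stay small and cache-friendly.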
-Mike