Hi David,

I see your point. I am not saying such big low-level changes are badly needed today for most production scenarios; I am just observing that this might become a useful extension. For example, word / document embeddings are being used more and more (mostly in research), so retrieving / scoring docs that belong to the same cluster (or that have near/similar embeddings, regardless of the metric) is a significant part of the query (retrieving/ranking) phase.

However, I think your suggestion to look at easier solutions first, like MultiReader, is a good one; e.g. in "my" use case, if each doc belongs to a single cluster, it might be good to create an index per cluster.

Thanks and regards,
Tommaso

On Mon, 16 Oct 2017 at 21:28, David Smiley <[email protected]> wrote:

> Hi Tommaso,
>
> It's definitely something I've pondered on occasion but I'm left wondering
> (a) is it worth it (experimentation will tell), and (b) perhaps Lucene
> doesn't need anything new here: see MultiReader. Arguably this can be
> handled at the search server layer by constructing multiple IndexWriters
> and then a MultiReader over their collective indexes. Perhaps a special
> IndexSearcher QueryCache could be developed to partition itself on the
> separate underlying readers. Of course it would probably take a lot of
> work to retrofit, say Solr, to do this but I'm dubious Lucene should be
> saddled with unneeded complexity for this.
>
> On Thu, Oct 12, 2017 at 9:55 AM Tommaso Teofili <[email protected]>
> wrote:
>
>> Hi all,
>>
>> having been involved in this kind of challenge and having seen a few more
>> similar enquiries on the dev list, I was wondering if it may be time to
>> think about making it possible to have an explicit (customizable and
>> therefore pluggable) policy which allows people to have a say in where
>> documents and / or segments get written (on write or on merge).
>> Recently there was someone asking about possibly having segments sorted
>> by a field using SortingMergePolicy, but as Uwe noted it's currently an
>> implementation detail. Personally I have tried (and failed because it was
>> too costly) to make sure docs belonging to certain clusters (identified by
>> a field) are written within the same segments (for data locality / memory
>> footprint concerns when "loading" docs from a certain cluster).
>>
>> As of today that'd be *really* hard, but I just wanted to share my
>> feeling that such a topic might be something to keep an eye on.
>>
>> My 2 cents,
>> Tommaso
>>
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
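
For what it's worth, the index-per-cluster idea from this thread can be sketched in plain Java. This is only a rough model under stated assumptions, not Lucene code: the class name ClusterPartitionedIndex and its methods are made up for illustration, each in-memory list stands in for an index written by its own IndexWriter, and searching a chosen subset of lists stands in for opening a MultiReader over just those indexes. Substring matching replaces real query scoring.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java model of "one index per cluster" (not Lucene API).
// Each per-cluster list stands in for an index written by its own IndexWriter.
class ClusterPartitionedIndex {
    // Cluster id -> documents stored in that cluster's partition.
    private final Map<String, List<String>> partitions = new LinkedHashMap<>();

    // Route a document to the partition of the single cluster it belongs to.
    void add(String clusterId, String doc) {
        partitions.computeIfAbsent(clusterId, k -> new ArrayList<>()).add(doc);
    }

    // Scan only the requested partitions, the way a reader opened over
    // just those per-cluster indexes would. Unknown clusters are skipped.
    List<String> search(List<String> clusterIds, String term) {
        List<String> hits = new ArrayList<>();
        for (String id : clusterIds) {
            List<String> docs = partitions.getOrDefault(id, Collections.<String>emptyList());
            for (String doc : docs) {
                if (doc.contains(term)) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }

    // Search every partition, the equivalent of a reader over all indexes.
    List<String> searchAll(String term) {
        return search(new ArrayList<>(partitions.keySet()), term);
    }
}
```

With this layout, a query restricted to one cluster never touches the other partitions, which is the data-locality benefit discussed in this thread; the trade-off is that a cross-cluster query has to visit every partition.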
