I'm pulling my hair out on this one - and there's not much of that to begin with. The problem I have is that updating 10M denormalized docs in a collection takes about 5 hours. Soon there will be collections with 100M docs, and a 50-hour update cycle will not be acceptable. The current process is: clean (delete) the marker fields, query the collection with the user-defined saved searches, then update the marker fields in every matched doc.

If I can normalize based on the searches, the processing time should go way down. The process becomes: delete the marker docs, query the collection with the user-defined saved searches, then insert new marker docs. The time savings come from two things: 1) deleting and inserting docs is faster than updating docs, and 2) the number of saved searches is at least 1000x smaller than the number of docs.
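To make the normalized cycle concrete, here is a rough sketch in plain Java. It is only an illustration: an in-memory map stands in for the collection, a saved search is reduced to a predicate on the name field, and the class and method names (MarkerRebuild, rebuild) are made up. None of it is Solr API, and it does nothing about overlap between searches.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for the rebuild cycle; not Solr API.
final class MarkerRebuild {

    // Runs every saved search over the docs and returns one marker entry
    // per search: searchid -> list of matching doc ids.
    static Map<String, List<String>> rebuild(Map<String, String> nameByDocId,
                                             Map<String, Predicate<String>> savedSearches) {
        // Starting from an empty map plays the role of "delete marker docs".
        Map<String, List<String>> markers = new LinkedHashMap<>();
        for (Map.Entry<String, Predicate<String>> search : savedSearches.entrySet()) {
            List<String> doclist = new ArrayList<>();
            for (Map.Entry<String, String> doc : nameByDocId.entrySet()) {
                if (search.getValue().test(doc.getValue())) {
                    doclist.add(doc.getKey());
                }
            }
            // One write per saved search plays the role of "insert marker doc".
            markers.put(search.getKey(), doclist);
        }
        return markers;
    }
}
```

The point of the sketch is the write count: one marker entry is written per saved search, instead of one update per matched doc.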
A doc may have a couple hundred fields, but looks sorta like this:

{"id":"123_5677899","searchid":"34","name":"Johnson", ...}

To normalize, I would move the searchid out into a new doc:

{"id":"S234","searchid":"34","doclist":["123_5677899","123_5677898",...]}

The "link" is established by the doclist field, which is multivalued and contains the ids of the real docs.

All this is doable. The problem is that when users create saved searches, they must only match docs that have not already been matched by another search. That's why there's only one doc "type" now - every matched doc has a marker (searchid), which makes the Solr search work.

Since it's not possible to do an RDBMS-like search joining the 2 doc types, I need to run the saved search (find docs where name=Johnson), then drop the docs that are not in a doclist. So maybe, if I manage a custom cache of matched doc ids, I can check each returned id against the cache and drop the docs that are not in it. I think this could be done in a post filter.

There will be a big memory hit to maintain this cache, but does this seem like a performant solution to my problem?

Thanks!

v5.2.1
All collections are one shard with replication factor 2
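P.S. To make the cache check concrete, here is a minimal sketch of the logic in plain Java. The class and method names (MatchedIdCache, addDoclist, keepMatched) are hypothetical, and this is not a Solr plugin. As I understand it, the real check would sit in the collect(int) method of a DelegatingCollector returned from a PostFilter, and would have to map Lucene doc numbers to my id field first; the sketch skips all of that and works on id strings directly.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory cache of every doc id that appears in some doclist.
final class MatchedIdCache {
    private final Set<String> matchedIds = new HashSet<>();

    // Load the ids from one marker doc's doclist into the cache.
    void addDoclist(List<String> doclist) {
        matchedIds.addAll(doclist);
    }

    // Keep only the returned ids that are in the cache, i.e. drop the docs
    // that are not in any doclist. Order of the input is preserved.
    List<String> keepMatched(List<String> returnedIds) {
        List<String> kept = new ArrayList<>();
        for (String id : returnedIds) {
            if (matchedIds.contains(id)) {
                kept.add(id);
            }
        }
        return kept;
    }
}
```

Each check is a hash lookup, so the per-doc cost should be small; the memory cost is one set entry per matched doc id, which is the hit I'm worried about at 100M docs.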