I'm pulling my hair out on this one - and there's not much of that to begin
with. The problem I have is that updating 10M denormalized docs in a
collection takes about 5 hours. Soon there will be collections with 100M
docs, and a 50-hour update cycle will not be acceptable. The process involves
cleaning (deleting) the marker fields, querying the collection with
user-defined saved searches, then updating the marker fields in every matched
doc. If I can normalize based on the searches, the processing time should go
way down: delete marker docs, query the collection with user-defined saved
searches, then insert marker docs. The time savings come from two things:
1) deleting and inserting docs is faster than updating docs, and 2) there are
at least 1000X fewer saved searches than docs.

A doc may have a couple hundred fields, but looks sorta like this:
{"id":"123_5677899","searchid":"34","name":"Johnson", ...}

To normalize, I would move the searchid out into a new doc:
{"id":"S234","searchid":"34","doclist":["123_5677899","123_5677898",...]}
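
A minimal sketch of that step in plain Python (no Solr client involved), assuming the current docs are available as dicts with the "id" and "searchid" fields shown above. The function name build_marker_docs and the "S" + searchid id scheme are made up for illustration:

```python
from collections import defaultdict


def build_marker_docs(docs):
    """Group real doc ids by searchid into one marker doc per saved search."""
    doclists = defaultdict(list)
    for doc in docs:
        searchid = doc.get("searchid")
        if searchid is not None:  # docs not matched by any search carry no marker
            doclists[searchid].append(doc["id"])
    # "S" + searchid is a hypothetical id scheme for the marker docs
    return [
        {"id": "S" + sid, "searchid": sid, "doclist": ids}
        for sid, ids in sorted(doclists.items())
    ]


docs = [
    {"id": "123_5677899", "searchid": "34", "name": "Johnson"},
    {"id": "123_5677898", "searchid": "34", "name": "Johnson"},
    {"id": "123_5677897", "name": "Smith"},  # not matched by any saved search
]
marker_docs = build_marker_docs(docs)
```

The real docs would then no longer need the searchid field at all, so the update cycle only touches the marker docs.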

The "link" is established by the doclist field, which is multivalued and
contains the ids of the real docs. All this is doable; the problem is that
when users create saved searches, they must only match docs that have not
already been matched by another search. That's why there's only one doc
"type" now - every matched doc has a marker (searchid), which makes the Solr
search work. Since it's not possible to do an RDBMS-like search joining the 2
doc types, I need to run the saved search: find docs where name=Johnson,
then drop the docs that are already in a doclist.

So, maybe if I manage a custom cache of matched doc ids, I can check each
returned id against the cache and drop the docs that are already in it. I
think this could be done in a post filter. There will be a big memory hit to
maintain this cache, but does this seem like a performant solution to my
problem?
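
To make the idea concrete, here is a rough sketch of the cache check in plain Python. It is not an actual Solr post filter (that would be Java running inside Solr), only the logic I have in mind. It assumes the requirement stated above: a saved search may only claim docs that no other search has claimed yet. The names run_saved_searches and matched_ids, the "S" + searchid id scheme, and the hard-coded result lists are hypothetical stand-ins for the real queries:

```python
def run_saved_searches(search_results):
    """Build marker docs so that each real doc id lands in at most one doclist.

    search_results: list of (searchid, ids returned by that search), in run order.
    Returns (marker_docs, matched_ids).
    """
    matched_ids = set()  # the "cache" of ids already claimed by some search
    marker_docs = []
    for searchid, returned_ids in search_results:
        fresh = []
        for doc_id in returned_ids:
            # the post-filter step: skip ids another search already claimed
            if doc_id not in matched_ids:
                matched_ids.add(doc_id)
                fresh.append(doc_id)
        if fresh:
            # "S" + searchid is a hypothetical id scheme for marker docs
            marker_docs.append(
                {"id": "S" + searchid, "searchid": searchid, "doclist": fresh}
            )
    return marker_docs, matched_ids


results = [
    ("34", ["123_5677899", "123_5677898"]),
    ("35", ["123_5677898", "123_5677897"]),  # overlaps with search 34
]
marker_docs, matched_ids = run_saved_searches(results)
```

The set grows with the number of matched ids, which is the memory hit I'm worried about.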

Thanks!
Solr v5.2.1
All collections have one shard with replication factor 2



