The problem description is really long, so I'll attack this statement:

> Since it's not possible to do a RDBMS like search joining the 2
> doc types, I need to run the saved search: find docs where name=Johnson,
> then drop the docs that are not in a doclist.
>

Also, if you remove all markers (or start from an empty collection) and do
a softCommit after every add, you can use /terms (TermsComponent) as a
"cache" of the inserted doclist ids.

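For illustration, here's a rough, untested Python sketch of that idea. It
assumes the /terms handler is enabled, and the collection and field names
are just placeholders:

# Pull every term indexed in the doclist field via TermsComponent and keep
# them as an in-memory set of "already linked" ids.
import requests

SOLR = "http://localhost:8983/solr/mycollection"  # hypothetical collection

def fetch_linked_ids(field="doclist"):
    params = {
        "terms.fl": field,
        "terms.limit": -1,   # all terms; heavy with many millions of ids
        "wt": "json",
        "json.nl": "map",    # {"term": freq} instead of a flat list
    }
    resp = requests.get(SOLR + "/terms", params=params).json()
    return set(resp["terms"][field].keys())

already_linked = fetch_linked_ids()
candidates = [{"id": "123_5677899"}, {"id": "123_0000001"}]  # example docs
# keep only docs that no earlier search has claimed yet
fresh = [d for d in candidates if d["id"] not in already_linked]

Keep in mind /terms only sees what has been (soft)committed, hence the
softCommit after every add.
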
To me this looks more like a transient cache for an ETL process: that state
only makes sense for a single load operation, and it isn't really a search
engine concern.

You can also think about it from the opposite side: after you run the first
request, q=name:Johnson, and add its results to the markers, the second
request can be q=name:Jacobson -name:Johnson, and so on, until you hit the
maxBooleanClauses limit; at that point you can switch to other means.

Also, every request can append the ids it returned to a growing exclusion
list that is passed to a negative terms query:
q=name:Jacobson -{!terms f=ids v=$alreadyseen}&alreadyseen=2,4,6,8,...
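
A minimal client-side sketch of that loop (untested; the saved searches,
the id field name and the collection URL are placeholders of mine):

# Run each saved search in turn, excluding everything that earlier
# searches in this load already claimed, via a negative {!terms} filter.
import requests

SELECT = "http://localhost:8983/solr/mycollection/select"  # hypothetical
saved_searches = ["name:Johnson", "name:Jacobson"]         # user-defined

already_seen = []
for search in saved_searches:
    params = {"q": search, "fl": "id", "rows": 10000, "wt": "json"}
    # (use cursorMark for deep paging instead of a big rows value)
    if already_seen:
        params["fq"] = "-{!terms f=id v=$alreadyseen}"
        params["alreadyseen"] = ",".join(already_seen)
    docs = requests.get(SELECT, params=params).json()["response"]["docs"]
    ids = [d["id"] for d in docs]
    already_seen.extend(ids)
    # ...insert the marker doc {"searchid": ..., "doclist": ids} here...

Once alreadyseen grows into the millions you'd want to POST instead of
GET, and at some point it stops being practical, which is where the
/terms or join variants come in.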

Or the already-seen ids can be pulled in with a join against the marker
docs, if you can afford frequent softCommits. There are plenty of
approaches here to keep your hair.
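
A sketch of that join variant (again untested; it assumes marker docs and
content docs live in the same collection and that only marker docs carry
the doclist field):

# Exclude docs whose id is already referenced by some marker's doclist.
# Needs a softCommit after inserting markers so the join can see them.
import requests

SELECT = "http://localhost:8983/solr/mycollection/select"  # hypothetical
params = {
    "q": "name:Johnson",
    "fq": "-{!join from=doclist to=id}*:*",
    "fl": "id",
    "wt": "json",
}
docs = requests.get(SELECT, params=params).json()["response"]["docs"]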


On Tue, May 10, 2016 at 6:44 PM, tedsolr <tsm...@sciquest.com> wrote:

> I'm pulling my hair out on this one - and there's not much of that to begin
> with. The problem I have is that updating 10M denormalized docs in a
> collection takes about 5 hours. Soon there will be collections with 100M
> docs and a 50 hour update cycle will not be acceptable. The process
> involves
> cleaning (deleting) the marker fields, querying the collection with user
> defined saved searches, then updating the marker fields in every matched
> doc. If I can normalize based on the searches the processing time should go
> way down: delete marker docs, query the collection with user defined saved
> searches, then insert marker docs. The time savings comes from 1) deleting
> and inserting docs is faster than updating docs, 2) the number of saved
> searches is at least 1000X less than the number of docs.
>
> A doc may have a couple hundred fields, but looks sorta like this:
> {"id":123_5677899","searchid":"34","name":"Johnson", ...}
>
> To normalize I would remove the searchid into a new doc:
> {"id":"S234","searchid":"34","doclist":["123_5677899","123_5677898",...]}
>
> The "link" is established by the doclist field which is multivalued and
> contains the ids from the real docs. All this is doable, the problem is
> that
> when users create saved searches they must only match docs that have not
> already been matched by another search. That's why there's only one doc
> "type" now - every matched doc has a marker (searchid) which makes the Solr
> search work. Since it's not possible to do a RDBMS like search joining the
> 2
> doc types, I need to run the saved search: find docs where name=Johnson,
> then drop the docs that are not in a doclist.
>
> So, maybe if I manage a custom cache of matched doc ids, I can check each
> returned id against the cache and drop the docs that are not in it. I think
> this could be done in a post filter. There will be a big memory hit to
> maintain this cache, but does this seem like a performant solution to my
> problem?
>
> Thanks!
> v5.2.1
> All collections are one shard with replication factor 2
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
