Hey guys,

I asked a question on the forum a couple of weeks ago regarding cursorMark 
duplicates.  I initially thought it might be due to HDFS caching because I was 
unable to replicate the issue on local indexes, but unfortunately the dreaded 
duplicates have returned!! As a refresher, I was seeing what I thought were 
duplicate documents appearing randomly on the last page of one cursor and the 
first page of the next.  So if rows=50, the duplicates are document 50 on page 1 
and document 1 on page 2.

After further investigation I don't believe these documents are duplicates; 
rather, the same document is being returned from a different replica on each 
page.  After running a diff on the two documents, the only difference is the 
field "Solr_Update_Date", which I set on each document as it is inserted into 
the corpus.

This is how the managed-schema mapping for this field looks:

<field name="Solr_Update_Date" type="rdate" indexed="true" stored="true" default="NOW" />



The only sort parameter is the id field:

"sort":"id desc"
rows=50

Here are the results.

Document 50 on page 1 is



{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":8,
    "params":{
      "q":"id:\"2019-10-29 15:15:36.748052\"",
      "fl":"id,_version_,[shard],Solr_Update_Date",
      "_":"1574900506126"}},
  "response":{"numFound":1,"start":0,"maxScore":7.312953,"docs":[
      {
        "id":"2019-10-29 15:15:36.748052",
        "Solr_Update_Date":"2019-11-01T00:15:07.811Z",
        "_version_":1648956337338449920,
        "[shard]":"https://solrHost:9021/solr/my_collection_shard4_replica_n14/|https://solrHost:9022/solr/my_collection_shard4_replica_n12/"}]
  }}



Document 1 on page 2 is


{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":7,
    "params":{
      "q":"id:\"2019-10-29 15:15:36.748052\"",
      "fl":"id,_version_,[shard],Solr_Update_Date",
      "_":"1574900506126"}},
  "response":{"numFound":1,"start":0,"maxScore":7.822712,"docs":[
      {
        "id":"2019-10-29 15:15:36.748052",
        "Solr_Update_Date":"2019-11-01T00:15:07.794Z",
        "_version_":1648956337338449920,
        "[shard]":"https://solrHost:9022/solr/my_collection_shard4_replica_n12/|https://solrHost:9021/solr/my_collection_shard4_replica_n14/"}]
  }}


As you can see, both documents have the same version number but different 
maxScore and Solr_Update_Date values.  My understanding is that the cursorMark 
should only be generated from the id field, so I can't see why I would get a 
different copy of the document from a different replica at the end of one 
page and the beginning of the next.  Would anyone have any insight into this 
behaviour?  It happens randomly on page boundaries when paging through results.
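
For anyone wanting to reproduce the page-boundary check, here is a minimal 
Python sketch of a cursor walk that flags an id appearing as the last document 
of one page and the first of the next.  `fetch_page` is a hypothetical callable 
(not part of Solr or any client library) that takes a cursorMark and returns 
the parsed JSON response.  The only Solr behaviour assumed is the documented 
one: paging starts at cursorMark=* and ends when nextCursorMark equals the 
cursorMark that was sent.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: cursorMark in, parsed /select JSON body out.
FetchPage = Callable[[str], Dict]


def find_boundary_repeats(fetch_page: FetchPage) -> List[Tuple[int, str]]:
    """Walk a cursor and report ids that end one page and start the next.

    Returns (page_number, id) pairs, where page_number is the 1-based page
    on which the repeated id showed up as the first document.
    """
    repeats = []
    cursor = "*"      # starting cursorMark
    last_id = None    # id of the final document on the previous page
    page = 0
    while True:
        body = fetch_page(cursor)
        docs = body["response"]["docs"]
        page += 1
        if docs:
            if last_id is not None and docs[0]["id"] == last_id:
                repeats.append((page, last_id))
            last_id = docs[-1]["id"]
        next_cursor = body["nextCursorMark"]
        if next_cursor == cursor:   # cursor did not advance: end of results
            break
        cursor = next_cursor
    return repeats
```

In practice `fetch_page` would issue the real request with the same q, 
sort=id desc and rows=50 on every call, changing only the cursorMark.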

Thanks for your time

Dwane


________________________________
From: Dwane Hall <dwaneh...@hotmail.com>
Sent: Monday, 11 November 2019 10:10 PM
To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
Subject: Re: Cursor mark page duplicates

Thanks Erick/Hossman,

I appreciate your input; it's always an interesting read seeing Solr legends 
like yourselves work through a problem!  I certainly learn a lot from following 
your responses in this user group.

As you recommended, I ran the distrib=false query on each shard and the results 
were identical in both instances.  Below is a snapshot from the Admin UI 
showing the details of each shard, which all look in order to me (other than 
our large number of deletes in the corpus ... we have quite a dynamic 
environment when the index is live).


Last Modified: 23 days ago
Num Docs: 47247895
Max Doc: 68108804
Heap Memory Usage: -1
Deleted Docs: 20860909
Version: 528038
Segment Count: 41
Master (Searching) Version: 1571148411550 Gen: 25528 Size: 42.56 GB
Master (Replicable) Version: 1571153302013 Gen: 25529

Last Modified: 23 days ago
Num Docs: 47247895
Max Doc: 68223647
Heap Memory Usage: -1
Deleted Docs: 20975752
Version: 526613
Segment Count: 43
Master (Searching) Version: 1571148411615 Gen: 25527 Size: 42.63 GB
Master (Replicable) Version: 1571153302076 Gen: 25528

I was however able to replicate the issue, but under unusual circumstances with 
some crude in-browser testing.  If I use a cursorMark other than "*" and 
constantly re-run the query (just resubmitting the URL in a browser with the 
same cursor and query), the first result on the page toggles between the 
expected value and the last item from the previous page.  So if rows=50, page 
2 toggles between result 51 (expected) and result 50 (the last item from the 
previous page).  It doesn't happen every time, but roughly one in five 
refreshes reproduces it, and it does so on every subsequent cursor.
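
The browser refresh test could be scripted roughly as follows.  This is only a 
sketch: `fetch_page` is again a hypothetical callable that sends the same 
query with the given cursorMark and returns the parsed JSON response, and the 
default of 20 runs is arbitrary.

```python
from collections import Counter
from typing import Callable, Dict


def tally_first_ids(fetch_page: Callable[[str], Dict],
                    cursor: str,
                    runs: int = 20) -> Counter:
    """Re-issue the same cursor request and count which id leads the page."""
    counts = Counter()
    for _ in range(runs):
        docs = fetch_page(cursor)["response"]["docs"]
        if docs:
            counts[docs[0]["id"]] += 1
    return counts
```

If the cursor were behaving as expected, the returned Counter would hold a 
single id; two ids would match the toggling described above.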

I failed to mention in my original email that we use the HdfsDirectoryFactory 
to store our indexes in HDFS.  This configuration uses an off-heap block cache 
to cache HDFS blocks in memory, as it is unable to take advantage of the OS 
disk cache.  I mention this as we're currently in the process of switching to 
local disk, and I've been unable to replicate the issue when using the local 
storage configuration of the same index.  This may be completely unrelated, 
and additionally the local storage index is freshly loaded, so it has not 
experienced the same number of deletes or updates that our HDFS indexes have.

I think my best bet is to monitor our new index configuration and if I notice 
any similar behaviour I'll make the community aware of my findings.

Once again,

Thanks for your input

Dwane

________________________________
From: Chris Hostetter <hossman_luc...@fucit.org>
Sent: Friday, 8 November 2019 9:58 AM
To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
Subject: Re: Cursor mark page duplicates


: I'm using Solr's cursor mark feature and noticing duplicates when paging
: through results.  The duplicate records happen intermittently and appear
: at the end of one page, and the beginning of the next (but not on all
: pages through the results). So if rows=20 the duplicate records would be
: document 20 on page1, and document 21 on page 2.  The document's id come

Can you try to reproduce and show us the specifics of this including:

1) The sort param you're using
2) An 'fl' list that includes every field in the sort param
3) The returned values of every 'fl' field for the "duplicate" document
you are seeing as it appears in *BOTH* pages of results -- along with the
cursorMark value in use on both of those pages.
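
As a small illustration of item 2, the 'fl' list can be derived from the sort 
param so that no sort field is left out.  This is a sketch only: the helper 
name `fl_for_sort` is made up here, and it assumes a simple sort of the form 
"field dir, field dir" (function-query sorts containing commas or spaces are 
not handled).

```python
from typing import Iterable


def fl_for_sort(sort: str, extra: Iterable[str] = ()) -> str:
    """Build an fl value containing every field named in a simple sort param."""
    fields = []
    for clause in sort.split(","):
        clause = clause.strip()
        if not clause:
            continue
        name = clause.rsplit(None, 1)[0]   # drop the trailing asc/desc
        if name not in fields:
            fields.append(name)
    for name in extra:                     # e.g. _version_ or [shard]
        if name not in fields:
            fields.append(name)
    return ",".join(fields)
```

For example, fl_for_sort("id desc", extra=["_version_", "[shard]"]) gives 
"id,_version_,[shard]".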


: (YYYY-MM-DD HH:MM.SSSSSS)), score. In this Solr community post
: (https://lucene.472066.n3.nabble.com/Solr-document-duplicated-during-pagination-td4269176.html)
: Shawn Heisey suggests:

...that post was *NOT* about using cursorMark -- it was plain old regular
pagination, where even on a single core/replica you can see a document
X get "pushed" from page#1 to page#2 by updates/additions of some other
document Z that causes Z to sort "before" X.

With cursors this kind of "pushing other docs back" or "pushing other docs
forward" doesn't exist because of the cursorMark.  The only way a doc
*should* move is if its OWN sort values are updated, causing it to
reposition itself.

But, if you have a static index, then it's *possible* that the last time
your document X was updated, there was a "glitch" somewhere in the
distributed update process, and the update didn't succeed in some
replicas -- so the same document may have different sort values
on different replicas.

: In the Solr query below for one of the example duplicates in question I
: can see a search by the id returns only a single document. The
: replication factor for the collection is 2 so the id will also appear in
: this shards replica.  Taking into consideration Shawn's advice above, my

If you've already identified a particular document where this has
happened, then you can also verify/disprove my hypothesis by hitting each
of the replicas that hosts this document with a request that looks like...

/solr/MyCollection_shard4_replica_n12/select?q=id:FOO&distrib=false
/solr/MyCollection_shard4_replica_n35/select?q=id:FOO&distrib=false

...and compare the results to see if all field values match
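
A rough Python sketch of that per-replica comparison, using only the standard
library.  The helper names are invented for illustration, the core URLs passed
in are whatever your replicas are actually called, and the id is assumed not to
contain double quotes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def replica_query_url(core_url: str, doc_id: str) -> str:
    """Build a non-distributed id lookup against a single replica core."""
    params = {"q": 'id:"%s"' % doc_id, "distrib": "false", "wt": "json"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def fetch_doc(core_url: str, doc_id: str) -> dict:
    """Fetch the document from one replica; empty dict if it is not found."""
    with urlopen(replica_query_url(core_url, doc_id)) as resp:
        docs = json.load(resp)["response"]["docs"]
    return docs[0] if docs else {}


def diff_docs(a: dict, b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

Calling fetch_doc once per replica and passing the two results to diff_docs
returns an empty dict when all returned field values match.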


-Hoss
http://www.lucidworks.com/
