[
https://issues.apache.org/jira/browse/SOLR-8709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225481#comment-15225481
]
Joel Bernstein commented on SOLR-8709:
--------------------------------------
I wanted to give an update on this ticket as Solr 6.0 is here and the
TopicStream is part of the release.
I made a pretty serious attempt to devise a stress test that would cause the
TopicStream to miss documents. In the test that I devised the TopicStream never
missed documents.
Here is the outline of the test:
1) Start a multi-threaded client to index documents to Solr. I tested with 5,
8, 12, 16 and 20 indexing threads. Indexing rate was about 22,000 docs per
second with this setup.
2) At the same time start a TopicStream and have it run a *:* query, pulling
all new documents, writing the version numbers to a file.
3) Compare the number of version numbers in the file to the number of docs in
the index. Before comparing, I piped the file through sort | uniq to ensure
that no version numbers were pulled twice.
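The check in step 3 can be sketched roughly as follows. This is only an
illustration, not the script used in the test: it assumes the file holds one
version number per line, and the function name check_versions and the
expected_count parameter (the doc count reported by the collection) are
hypothetical.

```python
from collections import Counter


def check_versions(lines, expected_count):
    """Summarize a dump of version numbers, one per line (assumed format)."""
    versions = [int(s) for s in (line.strip() for line in lines) if s]
    counts = Counter(versions)
    # Versions that appear more than once were pulled twice.
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    unique = len(counts)
    return {
        "unique": unique,
        "duplicates": duplicates,
        # Positive means the file has fewer distinct versions than the index.
        "missing": expected_count - unique,
    }


# In-memory sample; an open file object works the same way.
report = check_versions(["101\n", "102\n", "104\n", "102\n"], expected_count=4)
print(report)
```

A clean run would show an empty duplicates list and a missing count of zero.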
The outcome of this test was that the number of version numbers in the file
*always* matched the record count in the Solr collection. The TopicStream never
missed documents due to out of order version numbers.
I ran these tests over and over again for several hours. Each time the record
counts matched up.
I'm still confused by this outcome because I expected to be able to cause the
issue. In an offline chat with [[email protected]], he assured me that out of
order version numbers could occur. A review of the code seems to show that it
is possible for out of order version numbers to be added to the index.
But the fact remains that I was not able to break the TopicStream under a
fairly rigorous test scenario.
It is possible that, because of the way flushes and commits are processed,
out-of-order version numbers never span commit boundaries. In order for the
TopicStream to lose documents, the out-of-order version numbers must span a
commit boundary. But a review of the code did not make clear whether that can
happen.
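To make the commit-boundary condition concrete, here is a toy simulation. It
is a simplified model and not the TopicStream code: it assumes a reader that,
after each commit, fetches every visible version above its checkpoint in
sorted order and then advances the checkpoint. The name run_topic is
hypothetical.

```python
def run_topic(commits):
    """Simulate a checkpoint-based reader over a sequence of commits.

    Each commit is the list of version numbers that become visible at that
    commit. After every commit the reader fetches all visible versions
    greater than its checkpoint, sorted, then moves the checkpoint to the
    highest version it saw.
    """
    visible, delivered, checkpoint = set(), [], 0
    for commit in commits:
        visible.update(commit)
        batch = sorted(v for v in visible if v > checkpoint)
        delivered.extend(batch)
        if batch:
            checkpoint = batch[-1]
    return delivered


# Out of order inside one commit: sorting hides the problem.
same_commit = run_topic([[1, 3, 2], [4, 5]])
# Version 2 becomes visible only after version 3 was already read.
spans_commit = run_topic([[1, 3], [2, 4, 5]])
print(same_commit, spans_commit)
```

In this model the first run delivers every version, while the second never
delivers version 2, because the checkpoint had already moved past it.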
So until we're able to clear this up I'll consider this an open issue and I'll
mention it in the TopicStream documentation.
If it does turn out that the TopicStream can lose documents due to out-of-order
version numbers, the "retentionWindow" described in the comment above will
eliminate the issue.
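As a rough illustration of the idea in the issue description (remember the
last N version numbers sent and rescan that range for anything not yet sent),
here is a small sketch. It is not the "retentionWindow" implementation, whose
details are not shown in this message; the class name WindowedTopic and its
window parameter are hypothetical.

```python
class WindowedTopic:
    """Toy reader that remembers the last `window` version numbers it sent."""

    def __init__(self, window):
        self.window = window
        self.sent = set()

    def poll(self, visible):
        # While fewer than `window` versions are remembered, nothing has been
        # forgotten yet, so every visible version can be checked.
        floor = min(self.sent) if len(self.sent) >= self.window else 0
        # Send anything at or above the floor that is not in the sent set.
        batch = sorted(v for v in visible if v >= floor and v not in self.sent)
        # Remember only the `window` highest versions sent so far.
        self.sent = set(sorted(self.sent | set(batch))[-self.window:])
        return batch


topic = WindowedTopic(window=3)
first = topic.poll({1, 3})
second = topic.poll({1, 2, 3, 4, 5})     # version 2 arrived late
third = topic.poll({1, 2, 3, 4, 5, 6})   # older versions are not resent
print(first, second, third)
```

In this sketch the late version 2 is picked up on the second poll. A version
that arrives below the lowest remembered version would still be missed, so the
window has to be large enough to cover the expected reordering.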
> Account for out-of-order version numbers in the TopicStream
> -----------------------------------------------------------
>
> Key: SOLR-8709
> URL: https://issues.apache.org/jira/browse/SOLR-8709
> Project: Solr
> Issue Type: Bug
> Reporter: Joel Bernstein
>
> Currently the TopicStream can miss documents if version numbers are received
> out-of-order. The TopicStream sorts on version number so it will only miss
> out-of-order versions that span commit boundaries.
> In order to resolve this issue we can adopt an approach that keeps a set of
> the last N version numbers sent for each Topic. As the documents are scanned
> we can check for documents within this time window that do not appear in the
> sent set. These documents can then be sent.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)