nickva commented on PR #4958:
URL: https://github.com/apache/couchdb/pull/4958#issuecomment-1919694608

   Thanks for running the performance check, @pgj!
   
   > I do not think it should produce repeatable results. If a query happened 
to cause the scanning of 2,000 documents once and then 1,500 for another 
instance is acceptable — that is how the system works. But returning zero in 
either case is definitely a miss on the side of accounting.
   
   It will depend not just on how many documents were scanned but also on the 
order in which they arrived and how they were merged at the coordinator. The 
idea is that there is already some fudge factor there. I updated the query 
script to query the API endpoint multiple times in a row for the same docs:
   
   ```
   % ./findlimit.py
    * deleting http://127.0.0.1:15984/db
    * creating http://127.0.0.1:15984/db {'q': '16'}
    * creating 10000 docs with val range 100
    * created docs in 6.0 seconds
    * _find in 0.1 seconds 5 docs {'total_keys_examined': 3382, 
'total_docs_examined': 3382, 'total_quorum_docs_examined': 0, 
'results_returned': 5, 'execution_time_ms': 50.381}
    * _find in 0.0 seconds 5 docs {'total_keys_examined': 3323, 
'total_docs_examined': 3323, 'total_quorum_docs_examined': 0, 
'results_returned': 5, 'execution_time_ms': 37.183}
    * _find in 0.0 seconds 5 docs {'total_keys_examined': 3359, 
'total_docs_examined': 3359, 'total_quorum_docs_examined': 0, 
'results_returned': 5, 'execution_time_ms': 32.422}
    * _find in 0.0 seconds 5 docs {'total_keys_examined': 3347, 
'total_docs_examined': 3347, 'total_quorum_docs_examined': 0, 
'results_returned': 5, 'execution_time_ms': 31.574}
    * _find in 0.0 seconds 5 docs {'total_keys_examined': 3282, 
'total_docs_examined': 3282, 'total_quorum_docs_examined': 0, 
'results_returned': 5, 'execution_time_ms': 31.719}
    * _find in 0.0 seconds 5 docs {'total_keys_examined': 3325, 
'total_docs_examined': 3325, 'total_quorum_docs_examined': 0, 
'results_returned': 5, 'execution_time_ms': 30.074}
    * _find in 0.0 seconds 5 docs {'total_keys_examined': 3276, 
'total_docs_examined': 3276, 'total_quorum_docs_examined': 0, 
'results_returned': 5, 'execution_time_ms': 29.015}
    * _find in 0.0 seconds 5 docs {'total_keys_examined': 3274, 
'total_docs_examined': 3274, 'total_quorum_docs_examined': 0, 
'results_returned': 5, 'execution_time_ms': 30.524}
    * _find in 0.0 seconds 5 docs {'total_keys_examined': 3120, 
'total_docs_examined': 3120, 'total_quorum_docs_examined': 0, 
'results_returned': 5, 'execution_time_ms': 29.053}
    * _find in 0.0 seconds 5 docs {'total_keys_examined': 3368, 
'total_docs_examined': 3368, 'total_quorum_docs_examined': 0, 
'results_returned': 5, 'execution_time_ms': 31.881}
    * docs: 3382 - 3120 = 262
   ```
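   For reference, the run-to-run spread in the output above can be computed directly from the reported `total_docs_examined` values:

   ```python
   # total_docs_examined values reported by the ten _find runs above
   examined = [3382, 3323, 3359, 3347, 3282, 3325, 3276, 3274, 3120, 3368]

   spread = max(examined) - min(examined)
   print(spread)  # 3382 - 3120 = 262, matching the script's summary line
   ```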
   
   I like @rnewson's idea to add stats to the view row. The `view_row` is a 
record, so we'd have to do a bit of tuple wrangling across multiple commits 
(PRs) to avoid breaking online cluster upgrades, but that's the cleanest 
solution overall.
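   The upgrade concern comes from Erlang records being tuples tagged with the record name, so adding a field changes the tuple arity. A rough sketch of the compatibility shim idea (in Python, with a hypothetical field layout, purely for illustration):

   ```python
   # Hypothetical shapes for illustration only: an Erlang record such as
   # #view_row{id, key, value} is really the tuple {view_row, Id, Key, Value},
   # so adding a stats field yields {view_row, Id, Key, Value, Stats} -- a
   # different arity that old nodes' pattern matches would not accept.

   def row_stats(row):
       # Accept both the old 4-tuple and the new 5-tuple while a rolling
       # upgrade has mixed-version nodes in the cluster.
       if row[0] != "view_row":
           raise ValueError("not a view_row")
       if len(row) == 4:   # old shape: no stats field
           return None
       if len(row) == 5:   # new shape: stats travel with the row
           return row[4]
       raise ValueError("unknown view_row arity")

   print(row_stats(("view_row", "doc1", "k", "v")))                       # None
   print(row_stats(("view_row", "doc1", "k", "v", {"docs_examined": 3})))
   ```

   The intermediate release would need to read both shapes; only once every node understands the wider tuple can workers start emitting it.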
   
   As a bonus, we'd then also avoid the odd out-of-order processing of stats we 
have here: 
https://github.com/apache/couchdb/blob/2501fe69b06dd88611b4a3e290f080823476af70/src/fabric/src/fabric_view_all_docs.erl#L260-L274
 where we may accumulate a worker's stats even though we might not have 
actually processed the row it emitted alongside them.
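   A toy sketch of the difference (names and numbers hypothetical): counting stats as messages arrive books work for rows that never make it into the response, whereas counting them only when a row is consumed books exactly the work behind the returned rows:

   ```python
   # Toy coordinator loop, for illustration only: each worker message carries
   # a row plus the stats for producing it.

   messages = [
       {"row": "r1", "docs_examined": 100},
       {"row": "r2", "docs_examined": 120},
       {"row": "r3", "docs_examined": 90},  # arrives after the limit is hit
   ]
   limit = 2  # only two rows end up in the response

   # Accumulate on arrival: all three messages counted.
   on_arrival = sum(m["docs_examined"] for m in messages)

   # Accumulate on consumption (per-row stats): only consumed rows counted.
   on_consumption = sum(m["docs_examined"] for m in messages[:limit])

   print(on_arrival, on_consumption)  # 310 220
   ```

   Neither number is wrong as such; they answer different questions, which is the discrepancy described below.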
   
   With per-row stats we may miss some stats emitted at the end by workers 
which had already sent their rows, but whose rows just didn't go towards 
producing the response. So there is some discrepancy between the total work 
induced in the cluster by the API request and the work that went into 
producing the rows actually included in the response.
   
   > For the chttpd_stats_reporter case, we could stream stats directly to 
chttpd_stats_reporter instead of going through the same framework that 
execution stats uses, and would be free to complete sending statistics after 
the response has been returned to the user.
   
   Ah, chttpd_stats is interesting to think about, @willholley. That lives in 
chttpd and takes in request/response objects. But if we wanted to emit online 
stats for each request as it's being processed (a long-running request), we 
could alter the API there such that each worker gets some set of request bits 
(path, principal, nonce, request id?...) passed to it (in #mrargs extras), and 
then the worker can report stats independently without having to shuffle them 
back to the coordinator. It's a larger change, but that way it could account 
for all the work generated as a side effect of an API call, even if that work 
didn't make it into the response.
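   One way to picture that idea (every name here is hypothetical, not an existing CouchDB API): the coordinator stamps each worker job with request context, akin to stuffing bits into #mrargs extras, and workers report to a stats sink keyed by that context instead of routing stats back through the coordinator:

   ```python
   # Hypothetical sketch: request context travels with the job, and workers
   # report stats directly, keyed by the request nonce, so the work is
   # accounted for even when its rows never reach the response.
   from collections import defaultdict

   stats_sink = defaultdict(lambda: {"docs_examined": 0})

   def worker_scan(ctx, docs_examined):
       # Worker-side reporting: no round trip through the coordinator.
       stats_sink[ctx["nonce"]]["docs_examined"] += docs_examined

   ctx = {"path": "/db/_find", "principal": "adm", "nonce": "abc123"}
   for shard_scan in (3382, 3323):  # e.g. per-shard scan counts
       worker_scan(ctx, shard_scan)

   print(stats_sink["abc123"])  # {'docs_examined': 6705}
   ```

   The sink could then flush per-nonce totals on whatever schedule suits reporting, independent of when (or whether) the HTTP response completes.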
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
