nickva commented on PR #4958: URL: https://github.com/apache/couchdb/pull/4958#issuecomment-1919694608
Thanks for running the performance check, @pgj!

> I do not think it should produce repeatable results.

If a query happens to scan 2,000 documents once and 1,500 another time, that is acceptable; that is how the system works. But returning zero in either case is definitely a miss on the accounting side. The totals depend not just on how many documents were scanned, but also on the order in which they arrived and how they were merged at the coordinator, so there is already some fudge factor built in.

I updated the query script to query the API endpoint multiple times in a row for the same docs:

```
% ./findlimit.py
 * deleting http://127.0.0.1:15984/db
 * creating http://127.0.0.1:15984/db {'q': '16'}
 * creating 10000 docs with val range 100
 * created docs in 6.0 seconds
 * _find in 0.1 seconds 5 docs {'total_keys_examined': 3382, 'total_docs_examined': 3382, 'total_quorum_docs_examined': 0, 'results_returned': 5, 'execution_time_ms': 50.381}
 * _find in 0.0 seconds 5 docs {'total_keys_examined': 3323, 'total_docs_examined': 3323, 'total_quorum_docs_examined': 0, 'results_returned': 5, 'execution_time_ms': 37.183}
 * _find in 0.0 seconds 5 docs {'total_keys_examined': 3359, 'total_docs_examined': 3359, 'total_quorum_docs_examined': 0, 'results_returned': 5, 'execution_time_ms': 32.422}
 * _find in 0.0 seconds 5 docs {'total_keys_examined': 3347, 'total_docs_examined': 3347, 'total_quorum_docs_examined': 0, 'results_returned': 5, 'execution_time_ms': 31.574}
 * _find in 0.0 seconds 5 docs {'total_keys_examined': 3282, 'total_docs_examined': 3282, 'total_quorum_docs_examined': 0, 'results_returned': 5, 'execution_time_ms': 31.719}
 * _find in 0.0 seconds 5 docs {'total_keys_examined': 3325, 'total_docs_examined': 3325, 'total_quorum_docs_examined': 0, 'results_returned': 5, 'execution_time_ms': 30.074}
 * _find in 0.0 seconds 5 docs {'total_keys_examined': 3276, 'total_docs_examined': 3276, 'total_quorum_docs_examined': 0, 'results_returned': 5, 'execution_time_ms': 29.015}
 * _find in 0.0 seconds 5 docs {'total_keys_examined': 3274, 'total_docs_examined': 3274, 'total_quorum_docs_examined': 0, 'results_returned': 5, 'execution_time_ms': 30.524}
 * _find in 0.0 seconds 5 docs {'total_keys_examined': 3120, 'total_docs_examined': 3120, 'total_quorum_docs_examined': 0, 'results_returned': 5, 'execution_time_ms': 29.053}
 * _find in 0.0 seconds 5 docs {'total_keys_examined': 3368, 'total_docs_examined': 3368, 'total_quorum_docs_examined': 0, 'results_returned': 5, 'execution_time_ms': 31.881}
 * docs: 3382 - 3120 = 262
```

I like @rnewson's idea of adding stats to the view row. The `view_row` is a record, so we'd have to do a bit of tuple wrangling across multiple commits (PRs) to avoid breaking online cluster upgrades, but it's the cleanest solution overall. As a bonus, we'd also avoid the odd out-of-order processing of stats we have here:

https://github.com/apache/couchdb/blob/2501fe69b06dd88611b4a3e290f080823476af70/src/fabric/src/fabric_view_all_docs.erl#L260-L274

where we may accumulate worker stats even though we might not have actually processed the row emitted before them.

With per-row stats we may miss some stats emitted at the end by workers which had already sent their rows, if those rows just didn't go towards producing the response. So there would be some discrepancy between the total work induced in the cluster as a result of the API request and the work the workers did to produce the rows that were actually included in the response.

> For the chttpd_stats_reporter case, we could stream stats directly to chttpd_stats_reporter instead of going through the same framework that execution stats uses, and would be free to complete sending statistics after the response has been returned to the user.

Ah, that's interesting to think about chttpd_stats, @willholley. That lives in chttpd and takes in request/response objects.
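To make the trade-off concrete, here is a hypothetical, heavily simplified Python model (not the actual CouchDB code) of the two accounting schemes discussed above: stats attached to each row versus stats sent as a separate trailing message that the coordinator may no longer be listening for once it has enough rows. The function name and the "coordinator stops listening after the limit" behavior are illustrative assumptions, not a description of the real fabric implementation.

```python
# Hypothetical sketch: why per-row stats guarantee that consumed rows are
# accounted for, while trailing stats messages can be lost entirely.
#
# Two schemes are modeled:
#  - per_row:  every row the coordinator consumes carries its own
#              docs-examined count, so that work is always counted.
#  - trailing: each worker sends one stats message after its rows; we
#              assume the coordinator stops listening once the row limit
#              is reached, so late messages are dropped (total becomes 0).

def run_workers(rows_per_worker, docs_per_row, limit):
    """Return (per_row_total, trailing_total) of docs examined, as seen
    by a coordinator that stops consuming after `limit` rows."""
    per_row_total = 0
    trailing_totals = []  # one trailing stats message per worker
    consumed = 0
    for rows in rows_per_worker:
        worker_examined = 0
        for _ in range(rows):
            worker_examined += docs_per_row
            if consumed < limit:
                consumed += 1
                per_row_total += docs_per_row  # stats ride on the row
        trailing_totals.append(worker_examined)
    # Trailing scheme: if the limit was hit, the coordinator already
    # stopped listening and the trailing messages are lost.
    trailing_total = 0 if consumed >= limit else sum(trailing_totals)
    return per_row_total, trailing_total

per_row, trailing = run_workers(rows_per_worker=[4, 4, 4],
                                docs_per_row=100, limit=5)
print(per_row, trailing)  # -> 500 0
```

Note the per-row scheme still under-counts (it credits only the 5 consumed rows, not the 1,200 docs actually examined cluster-wide), which is exactly the discrepancy described above; the trailing scheme, in this pessimistic model, reports zero.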
But if we wanted to emit online stats for each request as it's being processed (say, a long-running request), we could alter the API there so that each worker gets some set of request bits (path, principal, nonce, request id?...) passed to it (in #mrargs extras), and then each worker could report stats independently without having to shuffle them back to the coordinator. It's a larger change, but that way it could account for all the work generated as a side effect of an API call, even work that didn't make it into the response.
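A rough Python sketch of that last idea, under stated assumptions: the `StatsSink` class, the `worker` function, and the shape of the request context are all made up for illustration (the comment above only suggests smuggling request bits to workers via `#mrargs` extras; nothing here is CouchDB's actual API). The point is that each worker reports its work under the request id directly, so all examined docs are counted even when the coordinator trims or drops rows.

```python
# Hypothetical sketch: workers report stats to a shared sink keyed by a
# request id carried in the per-request context, instead of routing stats
# back through the coordinator. All names here are illustrative.
from collections import defaultdict

class StatsSink:
    """Accumulates per-request stats reported directly by workers."""
    def __init__(self):
        self.by_request = defaultdict(lambda: {"docs_examined": 0})

    def report(self, request_id, docs_examined):
        self.by_request[request_id]["docs_examined"] += docs_examined

def worker(sink, request_ctx, docs_scanned, rows_to_emit):
    # The worker reports all of its work under the request id, regardless
    # of whether its rows end up in the coordinator's response.
    sink.report(request_ctx["request_id"], docs_scanned)
    return rows_to_emit  # rows still flow to the coordinator as before

sink = StatsSink()
ctx = {"request_id": "req-123", "path": "/db/_find", "nonce": "abc"}
rows = []
# Per-worker (docs scanned, rows emitted), loosely echoing the numbers
# from the findlimit.py run above.
for scanned, emitted in [(3382, 2), (3323, 2), (3120, 1)]:
    rows += worker(sink, ctx, scanned, [0] * emitted)

# Every scanned doc is accounted for, even if rows get trimmed later.
print(sink.by_request["req-123"]["docs_examined"])  # -> 9825
```

The design choice this illustrates: accounting becomes a side channel keyed by request identity, so it no longer depends on which rows survive merging at the coordinator, and workers are free to keep reporting after the response has been returned.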