Github user davisp commented on the issue: https://github.com/apache/couchdb-couch/pull/185

I like the idea here, but I think we need to rethink the implementation a bit.

First, the timing numbers reported are in microseconds, so I'm not super convinced by them. What size of file are you testing against, and how far back is the header? I'd almost wager it's only one or two blocks back and the timing is fairly random. However, I know you had a big file we found in production where the adaptive algorithm probably fared much better. Obviously we can't give that out, but we can fake the data fairly easily by creating files with successively more data after the last header. I'd like to see a test/benchmark of that nature before we get too lost in the weeds (there's a rough sketch of what I mean at the bottom of this comment).

On the implementation side, this is a fine proof of concept, but it needs a bit of cleaning up. First, you've added two exported functions for testing, which really shouldn't be in the final PR. Second, the way the recursion works, with the first iteration in find_last_header before moving to find_header_in_chunks, seems wrong. That recursion should just be a single loop that expands its read-ahead, starting with the initial values in find_last_header. Third, the default configuration appears to devolve into nearly exactly the same behavior as the current implementation, which makes the timing numbers surprising. It seems a bit odd to write an improvement that has to be configured to be effective. People are going to get hit by this unexpectedly, and knowing when and how to change the config once a database or view times out while being opened is a fairly obscure, expert-level thing to ask of them.

Reading this I also realized that it might be beneficial to look at using file:pread/2, which does vectored reads. Using that, we could read five bytes at each of the last N candidate header offsets (block boundaries) and then only read the exact header size at each spot that has the header flag set. pread/2 would allow us to scan larger chunks of the file at the cost of a second read for every header flag that's set. Assuming that headers are relatively rare, preferring to seek backwards without reading the entire file into RAM seems like it would be faster than trying to avoid any follow-on reads (given that we stop at the first valid header and don't expect many invalid headers). That's obviously something that needs to be tested before we can be sure, though. A rough sketch of what I'm picturing is below.
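To make the pread/2 idea a bit more concrete, here's an untested sketch. It assumes the raw file descriptor used inside couch_file (opened in binary mode), 4096-byte blocks, and a five-byte prefix of header flag plus 32-bit header size at each block boundary; the module and function names are placeholders, try_deserialize/1 stands in for the existing checksum + binary_to_term validation, and it glosses over the fact that large header bodies span blocks and have block prefixes interleaved:

```erlang
%% Untested sketch of a vectored backwards header scan. Assumes the raw
%% fd used inside couch_file (binary mode), 4096-byte blocks, and a
%% 5-byte prefix of <<HeaderFlag, HeaderSize:32>> at each block start.
-module(find_header_sketch).
-export([find_last_header_vectored/3]).

-define(SIZE_BLOCK, 4096).

find_last_header_vectored(Fd, FileSize, ChunkBlocks) ->
    LastBlock = (FileSize div ?SIZE_BLOCK) * ?SIZE_BLOCK,
    scan_blocks(Fd, LastBlock, ChunkBlocks).

scan_blocks(_Fd, Pos, _ChunkBlocks) when Pos < 0 ->
    no_valid_header;
scan_blocks(Fd, Pos, ChunkBlocks) ->
    %% One vectored read of 5 bytes at each of the next ChunkBlocks
    %% block boundaries, newest first.
    Offsets = [Pos - N * ?SIZE_BLOCK || N <- lists:seq(0, ChunkBlocks - 1),
               Pos - N * ?SIZE_BLOCK >= 0],
    {ok, Prefixes} = file:pread(Fd, [{Offset, 5} || Offset <- Offsets]),
    case check_prefixes(Fd, lists:zip(Offsets, Prefixes)) of
        {ok, Header} -> {ok, Header};
        not_found -> scan_blocks(Fd, Pos - ChunkBlocks * ?SIZE_BLOCK, ChunkBlocks)
    end.

check_prefixes(_Fd, []) ->
    not_found;
check_prefixes(Fd, [{Offset, <<1, HeaderSize:32/integer>>} | Rest]) ->
    %% Header flag is set: follow-on read of just the header bytes.
    case file:pread(Fd, Offset + 5, HeaderSize) of
        {ok, HeaderBin} when byte_size(HeaderBin) =:= HeaderSize ->
            case try_deserialize(HeaderBin) of
                {ok, Header} -> {ok, Header};
                _Corrupt -> check_prefixes(Fd, Rest)
            end;
        _EofOrError ->
            check_prefixes(Fd, Rest)
    end;
check_prefixes(Fd, [_NotAHeader | Rest]) ->
    check_prefixes(Fd, Rest).

%% Placeholder for the real checksum + binary_to_term validation.
try_deserialize(Bin) ->
    try {ok, binary_to_term(Bin)} catch _:_ -> corrupt end.
```

The property I like is that each pass costs one pread/2 call for ChunkBlocks candidate offsets plus one small follow-on read per set header flag, rather than pulling whole chunks of the file into RAM.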
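And to be explicit about the benchmark I'm asking for above, something along these lines would do it: write a single header, append successively more data after it, and time how long couch_file:read_header/1 takes to walk back to it. Untested sketch; the module name, sizes, and file path are made up:

```erlang
%% Untested sketch: write one header, then append successively more data
%% after it and time how long it takes to find that header again.
%% Module name, sizes, and file path are made up for illustration.
-module(find_header_bench).
-export([run/0]).

run() ->
    Path = "/tmp/find_header_bench.couch",
    {ok, Fd} = couch_file:open(Path, [create, overwrite]),
    _ = couch_file:write_header(Fd, {bench_header, 1}),
    Block = crypto:strong_rand_bytes(4096),
    %% Grow the gap between the header and EOF to roughly 1, 10, 100, 1000 MB.
    lists:foldl(fun(TargetMB, SoFarMB) ->
        append_mb(Fd, Block, TargetMB - SoFarMB),
        ok = couch_file:sync(Fd),
        {Micros, {ok, _Header}} = timer:tc(couch_file, read_header, [Fd]),
        io:format("~b MB after header: ~b us~n", [TargetMB, Micros]),
        TargetMB
    end, 0, [1, 10, 100, 1000]),
    couch_file:close(Fd).

%% Append roughly MB megabytes of data (256 x 4 KiB appends per MB,
%% ignoring couch_file's per-term overhead).
append_mb(_Fd, _Block, 0) ->
    ok;
append_mb(Fd, Block, MB) ->
    _ = [couch_file:append_binary(Fd, Block) || _ <- lists:seq(1, 256)],
    append_mb(Fd, Block, MB - 1).
```

Running that against current master and against this PR (with both default and tuned settings) would tell us a lot more than microsecond-level timings on a small file.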