GitHub user davisp commented on the issue:

    https://github.com/apache/couchdb-couch/pull/185
  
    I like the idea here but I think we need to rethink the implementation a bit. First, the timing numbers reported are in microseconds, so I'm not super convinced by them. What size of file are you testing against, and how far back is the header? I'd almost wager it's only one or two blocks back and the timing is mostly noise.
    
    However, I know you had a big file we found in production where the adaptive algorithm probably fared much better. Obviously we can't share that file, but we can fake the data fairly easily by creating files with successively more data after the last header. I'd like to see a test/benchmark of that nature before we get too lost in the weeds.
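    
    For instance, here's a rough sketch of generating those fixtures with couch_file's public API (the header term and the padding sizes are made up, and the exact return shape of append_binary may differ between versions):
    
        %% Write one header, then pad with PaddingBytes of junk so the
        %% header sits progressively further from EOF as we grow it.
        make_fixture(Path, PaddingBytes) ->
            {ok, Fd} = couch_file:open(Path, [create, overwrite]),
            ok = couch_file:write_header(Fd, {benchmark_header, PaddingBytes}),
            Junk = crypto:strong_rand_bytes(PaddingBytes),
            {ok, _Pos, _Len} = couch_file:append_binary(Fd, Junk),
            ok = couch_file:sync(Fd),
            couch_file:close(Fd).
    
    Generating a series of these with PaddingBytes doubling from a few KiB up into GiB territory would show how each algorithm scales as the header recedes from EOF.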
    
    On the implementation side, this is a fine proof of concept but it needs a bit of cleaning up. First, you've added two exported functions for testing which really shouldn't be in the final PR. Also, having the first iteration of the recursion in find_last_header before moving to find_header_in_chunks seems wrong; that recursion should just be a single loop that expands its read-ahead, starting from the initial values in find_last_header.
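    
    Roughly the shape I have in mind, with illustrative names (?INITIAL_CHUNK_SIZE and scan_chunk_for_header are placeholders for whatever does the per-chunk block scanning):
    
        find_last_header(Fd, FileSize) ->
            find_header_loop(Fd, FileSize, ?INITIAL_CHUNK_SIZE).
    
        %% A single loop: scan [Start, Pos), and if no header turns up
        %% keep walking backwards with a doubled read-ahead.
        find_header_loop(_Fd, Pos, _ChunkSize) when Pos =< 0 ->
            no_valid_header;
        find_header_loop(Fd, Pos, ChunkSize) ->
            Start = max(0, Pos - ChunkSize),
            case scan_chunk_for_header(Fd, Start, Pos) of
                {ok, Header} -> {ok, Header};
                not_found -> find_header_loop(Fd, Start, ChunkSize * 2)
            end.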
    
    Second, the default configuration appears to devolve into nearly exactly the same behavior as the current implementation, which makes the timing numbers surprising. It seems a bit odd to write an improvement that needs to be configured to be effective. Given that people are going to get hit by this unexpectedly, knowing when and how to change the config once a database or view starts timing out on open is a fairly obscure bit of expert knowledge to rely on.
    
    Reading this I also realized that it might be beneficial to look at using file:pread/2, which does vectored reads. With that we could read five bytes at each of the last N possible header offsets and then read the exact header size only at the spots that have the header flag set. Using pread/2 here would let us scan larger chunks of the file at the cost of a second read for every header flag that's set. Assuming headers are relatively rare, seeking backwards without reading the entire file into RAM seems like it would be faster than trying to avoid any follow-on reads (given that we stop at the first valid header and don't expect many invalid headers). However, this is obviously something that needs to be tested before we can be sure.
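    
    As a sketch of what I mean (assuming our usual 4096-byte blocks where the first byte is the header flag followed by a 4-byte size prefix; read_and_validate_header is a hypothetical helper for the follow-up read):
    
        -define(BLOCK_SIZE, 4096).
    
        find_header_vectored(Fd, EofBlock, NumBlocks) ->
            FirstBlock = max(0, EofBlock - NumBlocks + 1),
            %% One vectored read: five bytes (flag + size) at each of
            %% the last N block boundaries, newest first.
            LocNums = [{B * ?BLOCK_SIZE, 5}
                       || B <- lists:seq(EofBlock, FirstBlock, -1)],
            {ok, Chunks} = file:pread(Fd, LocNums),
            check_flags(Fd, lists:zip(LocNums, Chunks)).
    
        check_flags(_Fd, []) ->
            not_found;
        check_flags(Fd, [{{Pos, _}, <<1, Size:32/integer>>} | Rest]) ->
            %% Flag byte is set: do the second, exact-size read and
            %% validate it; skip past corrupt candidates.
            case read_and_validate_header(Fd, Pos, Size) of
                {ok, Header} -> {ok, Header};
                _Error -> check_flags(Fd, Rest)
            end;
        check_flags(Fd, [_NoFlag | Rest]) ->
            check_flags(Fd, Rest).
    
    The nice property is that we only pay one pread per scan window plus one follow-up read per flagged block, no matter how large we make the window.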

