[GitHub] couchdb-couch issue #185: 3061 adaptive header search

2016-09-22 Thread theburge
Github user theburge commented on the issue:

https://github.com/apache/couchdb-couch/pull/185
  
Thanks for doing the additional investigation @jaydoane. I'm glad the 
performance remains consistently improved.

From memory, I've observed deeply buried headers frequently in MT, although 
admittedly not lots at once (usually a DB or two). However, they're not only on 
.meta files if memory serves, but also on view files (and I'm fairly sure on 
primary database files, although IIRC usually in correlation with some other 
odd behaviour -- so perhaps it shouldn't be considered _normal_ but it's 
certainly possible).

In any case, it's reassuring that the vectored approach is only tried if 
the basic approach fails, given the striking difference in page-in rate.




[GitHub] couchdb-couch issue #185: 3061 adaptive header search

2016-09-16 Thread theburge
Github user theburge commented on the issue:

https://github.com/apache/couchdb-couch/pull/185
  
@jaydoane Thanks for running and documenting those experiments, I think 
those results look very promising. Reducing time taken and user-space memory 
usage can only be good. :)

FWIW, I'm not convinced major faults are the metric of interest for 
identifying file reads on a newly opened file in an empty-cache scenario. (My 
thinking is that they show faults for explicitly mapped ranges that are only 
available on-disk, whereas page-ins show pages being pulled in via the VM 
itself for whatever reason. For laughs and demonstration purposes, run `sar -B 1` 
and try some experiments, like dropping the cache and loading a `man` page. You'll 
typically see the page-in rate exceed the major-fault rate, once you do the 
conversion from pages/s to kB/s. In truth, it's probably most comprehensive 
to look at both fault types and page-in/page-out rates, but that gets trickier.)
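
For concreteness, the experiment I have in mind looks something like this 
on Linux (assuming sysstat's `sar` is available and you have root for the 
cache drop; the exact commands are illustrative):

    # terminal 1: report paging statistics every second (watch pgpgin/s and majflt/s)
    sar -B 1

    # terminal 2: drop the page cache, then fault a man page back in
    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
    man 2 posix_fadvise > /dev/null

Note that `sar -B` reports pgpgin/s in kB/s but majflt/s in faults/s, hence 
the pages-to-kB conversion when comparing the two.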

The big surprise here for me is the difference in physical disk operations, 
since my understanding is that the VM fills complete OS pages to satisfy 
reads, so I can't see why reading N bytes on page boundaries would be much 
different (from an OS perspective) from reading 4kB. I'd love an explanation 
for that (although the user-space benefits are clear to see and very positive), 
just to be sure it's not some other effect.

My final suggestion would be to evaluate a more serious MT-esque scenario: 
how does the performance improvement of the new algorithm vary with lots 
(10s to 100s) of files opening concurrently? (And maybe with higher background 
CPU/disk utilisation, if you can set up the experiment.) Remember that that's 
the scenario where we see this as most relevant, and timeouts there are a big 
issue (file opening times out, compaction keels over, the file handle remains, 
etc.), so if there's little or no wall-clock time improvement in a highly 
concurrent setting, that'd be a shame and would undermine some of the practical 
benefit. Obviously, I'd expect only positive things compared to the status quo 
based on the evaluation above, but performance expectations in varying 
scenarios don't always comply with common sense... :) 
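
To be clear about the shape of the experiment, something like the following 
might serve as a starting point. This is an untested sketch: 
`concurrent_open_test/1` is a hypothetical helper, and it assumes a list of 
paths to .couch files with deeply buried headers plus couch_file's usual 
open/read_header/close API:

    %% Open all files concurrently and measure the wall-clock time until
    %% every header search has completed.
    concurrent_open_test(Paths) ->
        Parent = self(),
        {Micros, _} = timer:tc(fun() ->
            Pids = [spawn_link(fun() ->
                        {ok, Fd} = couch_file:open(Path, []),
                        {ok, _Header} = couch_file:read_header(Fd),
                        couch_file:close(Fd),
                        Parent ! {done, self()}
                    end) || Path <- Paths],
            [receive {done, Pid} -> ok end || Pid <- Pids]
        end),
        Micros / 1000000.  %% seconds

Running it against 10, 100, etc. files (ideally with background load) would 
show whether the wall-clock win survives contention.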




[GitHub] couchdb-couch issue #185: 3061 adaptive header search

2016-07-20 Thread theburge
Github user theburge commented on the issue:

https://github.com/apache/couchdb-couch/pull/185
  
@jaydoane: One additional thought I had was that it's taking good care of 
the page cache with `posix_fadvise`, but perhaps not quite such good care of 
the BEAM's memory usage. Have you measured system memory usage when there's a 
lot of trawling for headers?

I think it's worth considering triggering an explicit garbage collection when 
the code finishes backtracking for headers, since it might have read a number 
of large, off-heap binaries.
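
For illustration, the kind of thing I mean (a sketch only, not the PR's 
code; it assumes the search runs in the couch_file process that read the 
binaries):

    %% After backtracking completes, force a collection so the references to
    %% large off-heap (refc) binaries picked up during the search are dropped
    %% promptly, rather than lingering until the process's next natural GC.
    Result = find_header_in_chunks(Fd, Pos, 1),
    true = erlang:garbage_collect(),
    Result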




[GitHub] couchdb-couch pull request #185: 3061 adaptive header search

2016-07-20 Thread theburge
Github user theburge commented on a diff in the pull request:

https://github.com/apache/couchdb-couch/pull/185#discussion_r71482562
  
--- Diff: src/couch_file.erl ---
@@ -524,6 +525,99 @@ handle_info({'EXIT', _, Reason}, Fd) ->
     {stop, Reason, Fd}.
 
 
+%% first try to read header at end, falling back to chunked reads on failure
+find_last_header(Fd, FileSize) ->
+    ok = file:advise(Fd, 0, FileSize, dont_need), % minimize disk cache pollution
+    Pos = (FileSize div ?SIZE_BLOCK) * ?SIZE_BLOCK,
+    case file:pread(Fd, Pos, FileSize - Pos) of
+        eof ->
+            no_valid_header;
+        {ok, Data} ->
+            try match_header(Fd, Pos, Data, 0) of
+                {ok, HeaderBin} ->
+                    {ok, HeaderBin}
+            catch
+                error:{badmatch, _} ->
+                    find_header_in_chunks(Fd, Pos, 1)
+            end
+    end.
+
+-define(DEFAULT_CHUNK_MAX_SIZE, ?SIZE_BLOCK).
+-define(DEFAULT_CHUNK_EXPONENT_BASE, 1).
+
+chunk_max_size() ->
+    config:get_integer("couchdb", "chunk_max_size", ?DEFAULT_CHUNK_MAX_SIZE).
+
+%% chunk size exponentially increases by iteration, up to some maximum,
+%% and should never exceed the current position in the file
+chunk_size(Pos, Multiplier) ->
+    Size = min(Multiplier * ?SIZE_BLOCK, chunk_max_size()),
+    min(Pos, Size).
+
+multiplier(Size, Multiplier) ->
+    case Size < chunk_max_size() of
+        true ->
+            Multiplier * config:get_integer(
+                "couchdb", "chunk_exponent_base", ?DEFAULT_CHUNK_EXPONENT_BASE);
+        false ->
+            Multiplier
+    end.
+
+find_header_in_chunks(_Fd, Pos, _Multiplier) when Pos < 0 ->
+    no_valid_header;
+find_header_in_chunks(Fd, Pos, Multiplier) ->
+    Size = chunk_size(Pos, Multiplier),
+    case Size > 0 of
+        false ->
+            no_valid_header;
+        true ->
+            {ok, Chunk} = file:pread(Fd, Pos - Size, Size),
+            case find_header_in_chunk(Fd, Pos, Chunk, Size) of
+                {ok, HeaderBin, _Offset} ->
+                    %% io:format("found header at ~p multiplier ~p chunksize ~p~n",
+                    %%           [Pos - Size + _Offset, Multiplier, Size]),
+                    {ok, HeaderBin};
+                not_found ->
+                    NewMultiplier = multiplier(Size, Multiplier),
+                    find_header_in_chunks(Fd, Pos - Size, NewMultiplier)
+            end
+    end.
+
+find_header_in_chunk(_Fd, _Pos, _Chunk, Offset) when Offset < 0 ->
--- End diff --

In that case, I was going to point out that the [call site (L575)](https://github.com/apache/couchdb-couch/pull/185#discussion-diff-71323695R575) was a problem, since it identifies `Offset` with `Size`, but [you fixed it later](https://github.com/apache/couchdb-couch/pull/185/commits/20a2a1c6e09a70b9be83d960d82d519817f3d052). :)




[GitHub] couchdb-couch pull request #185: 3061 adaptive header search

2016-07-19 Thread theburge
Github user theburge commented on a diff in the pull request:

https://github.com/apache/couchdb-couch/pull/185#discussion_r71323695
  
--- Diff: src/couch_file.erl ---
@@ -524,6 +525,99 @@ handle_info({'EXIT', _, Reason}, Fd) ->
     {stop, Reason, Fd}.
 
 
+%% first try to read header at end, falling back to chunked reads on failure
+find_last_header(Fd, FileSize) ->
+    ok = file:advise(Fd, 0, FileSize, dont_need), % minimize disk cache pollution
+    Pos = (FileSize div ?SIZE_BLOCK) * ?SIZE_BLOCK,
+    case file:pread(Fd, Pos, FileSize - Pos) of
+        eof ->
+            no_valid_header;
+        {ok, Data} ->
+            try match_header(Fd, Pos, Data, 0) of
+                {ok, HeaderBin} ->
+                    {ok, HeaderBin}
+            catch
+                error:{badmatch, _} ->
+                    find_header_in_chunks(Fd, Pos, 1)
+            end
+    end.
+
+-define(DEFAULT_CHUNK_MAX_SIZE, ?SIZE_BLOCK).
+-define(DEFAULT_CHUNK_EXPONENT_BASE, 1).
+
+chunk_max_size() ->
+    config:get_integer("couchdb", "chunk_max_size", ?DEFAULT_CHUNK_MAX_SIZE).
+
+%% chunk size exponentially increases by iteration, up to some maximum,
+%% and should never exceed the current position in the file
+chunk_size(Pos, Multiplier) ->
+    Size = min(Multiplier * ?SIZE_BLOCK, chunk_max_size()),
+    min(Pos, Size).
+
+multiplier(Size, Multiplier) ->
+    case Size < chunk_max_size() of
+        true ->
+            Multiplier * config:get_integer(
+                "couchdb", "chunk_exponent_base", ?DEFAULT_CHUNK_EXPONENT_BASE);
+        false ->
+            Multiplier
+    end.
+
+find_header_in_chunks(_Fd, Pos, _Multiplier) when Pos < 0 ->
+    no_valid_header;
+find_header_in_chunks(Fd, Pos, Multiplier) ->
+    Size = chunk_size(Pos, Multiplier),
+    case Size > 0 of
+        false ->
+            no_valid_header;
+        true ->
+            {ok, Chunk} = file:pread(Fd, Pos - Size, Size),
+            case find_header_in_chunk(Fd, Pos, Chunk, Size) of
+                {ok, HeaderBin, _Offset} ->
+                    %% io:format("found header at ~p multiplier ~p chunksize ~p~n",
+                    %%           [Pos - Size + _Offset, Multiplier, Size]),
+                    {ok, HeaderBin};
+                not_found ->
+                    NewMultiplier = multiplier(Size, Multiplier),
+                    find_header_in_chunks(Fd, Pos - Size, NewMultiplier)
+            end
+    end.
+
+find_header_in_chunk(_Fd, _Pos, _Chunk, Offset) when Offset < 0 ->
--- End diff --

Is that guard correct? Or should it be `=<`?

If `Offset = ?SIZE_BLOCK`, it looks to me as though it would attempt the second 
clause of `find_header_in_chunk` twice, and presumably lead to a badmatch in 
`match_header`.




[GitHub] couchdb-couch pull request #185: 3061 adaptive header search

2016-07-19 Thread theburge
Github user theburge commented on a diff in the pull request:

https://github.com/apache/couchdb-couch/pull/185#discussion_r71321315
  
--- Diff: src/couch_file.erl ---
@@ -524,6 +525,99 @@ handle_info({'EXIT', _, Reason}, Fd) ->
     {stop, Reason, Fd}.
 
 
+%% first try to read header at end, falling back to chunked reads on failure
+find_last_header(Fd, FileSize) ->
+    ok = file:advise(Fd, 0, FileSize, dont_need), % minimize disk cache pollution
--- End diff --

I agree that calling it after use on a single chunk makes sense. FTR, 
http://man7.org/linux/man-pages/man2/posix_fadvise.2.html adds an additional 
note:

> Requests to discard partial pages are ignored.  It is preferable to
> preserve needed data than discard unneeded data.  If the application
> requires that data be considered for discarding, then offset and len
> must be page-aligned.

I don't think the notes on dirty pages are significant to this case; I expect 
it's just pointing out the advisory nature of the call, and the fact that 
unbacked storage won't be flushed by this call (vs. the reasonable expectation 
that it _does_ flush and obeys the advice).
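
Given that note, if the advise calls ever cover sub-page ranges, it may be 
worth shrinking them to whole pages first. A hedged sketch (`Pos` and `Len` 
are hypothetical names for the range just read; 4096 is assumed as the page 
size):

    %% Only advise on pages the range fully covers: round the start up and
    %% the end down to page boundaries, since the kernel ignores requests
    %% to discard partial pages.
    PageSize = 4096,
    From = ((Pos + PageSize - 1) div PageSize) * PageSize,
    To = ((Pos + Len) div PageSize) * PageSize,
    if
        To > From -> ok = file:advise(Fd, From, To - From, dont_need);
        true -> ok  %% no complete page in range; nothing to discard
    end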

