Re: ways to improve compaction

2010-05-14 Thread Adam Kocoloski
On May 14, 2010, at 11:09 AM, Wout Mertens wrote:

 Old thread I know, but I was wondering about a way to make compaction more 
 fluid:
 
 On Dec 21, 2009, at 23:20 , Damien Katz wrote:
 
 I recently saw some issues people were having with compaction, and I 
 thought I'd get some thoughts down about ways to improve the compaction 
 code/experience.
 
 1. Multi-process pipeline processing. Similar to the enhancements to the 
 view indexing, there are opportunities for pipelining operations instead of 
 the batched read/write passes it currently does. This can reduce memory 
 usage and make compaction faster (a rough sketch of this appears below).
 2. Multiple disks/mount points. CouchDB could easily have 2 or more database 
 dirs, and each time it compacts, it copies the new database file to another 
 dir/disk/mountpoint. For servers with multiple disks this will greatly 
 smooth the copying as the disk heads won't need to seek between reads and 
 writes.
 3. Better compaction algorithms. There are all sorts of clever things that 
 could be done to make the compaction faster. Right now it rebuilds the 
 database in much the same manner as it would if clients were bulk updating 
 it. This was the simplest way to do it, but certainly not the fastest. There 
 are a lot of ways to make this much more efficient; they just take more work.
 4. Tracking wasted space. This can be used to determine a threshold for 
 compaction. We don't need to track with 100% accuracy how much disk space 
 is being wasted, but it would be a big improvement to at least know how much 
 disk space the raw docs take, and maybe calculate an estimate of the indexes 
 necessary to support them in a freshly compacted database.
 5. Better low-level file driver support. Because we are using the Erlang 
 built-in file system drivers, we don't have access to a lot of flags. If we 
 had our own drivers, one option we'd like to use is to bypass the OS cache for 
 reads and writes during compaction; caching is unnecessary for compaction, and 
 it could completely consume the cache with rarely accessed data, evicting 
 lots of recently used live data and greatly hurting the performance of other 
 databases.
 
 Anyway, just getting these thoughts out. More ideas and especially code 
 welcome.
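
A rough sketch of what the pipelining in idea (1) might look like. The names
are invented for illustration (read_batch/3 and write_batch/2 are placeholders,
not real CouchDB functions): one process walks the old file and streams
document batches to a second process that writes the new file, so the next
read overlaps with the previous write.

    %% Hypothetical two-stage compaction pipeline. read_batch/3 and
    %% write_batch/2 stand in for the real copy work.
    pipeline_compact(OldDb, NewDb) ->
        Reader = self(),
        Writer = spawn_link(fun() ->
            Reader ! {ack, self()},               %% prime the pipeline
            writer_loop(NewDb)
        end),
        reader_loop(OldDb, Writer, 0).

    reader_loop(OldDb, Writer, Seq) ->
        case read_batch(OldDb, Seq, 1000) of      %% next batch of doc infos
            {ok, Batch, NextSeq} ->
                %% Wait until the previous batch is written (one batch in
                %% flight), then hand this one over and go read the next.
                receive {ack, Writer} -> ok end,
                Writer ! {batch, self(), Batch},
                reader_loop(OldDb, Writer, NextSeq);
            done ->
                receive {ack, Writer} -> ok end,  %% last batch finished
                Writer ! {done, self()},
                receive {finished, Writer} -> ok end
        end.

    writer_loop(NewDb) ->
        receive
            {batch, From, Batch} ->
                ok = write_batch(NewDb, Batch),
                From ! {ack, self()},
                writer_loop(NewDb);
            {done, From} ->
                From ! {finished, self()}
        end.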
 
 
 How about
 
 6. Store the databases in multiple files. Instead of one really big file, use 
 several big chunk-files of fixed maximum length. One chunk-file is active 
 and receives writes. Once that chunk-file grows past a certain size, for 
 example 25MB, start a new file. Then, at compaction time, you can do the 
 compaction one chunk-file at a time.
 Possible optimization: If a certain chunk-file has no outdated documents (or 
 only a small %), leave it alone.
 
 I'm armchair-programming here; I have only a vague idea of what the on-disk 
 format looks like. But this could allow continuous compaction, by (slowly) 
 compacting only the completed chunk-files. Furthermore, it would allow 
 spreading the database across multiple disks (since there are now multiple 
 files per db), although one disk would still be receiving all the writes. A 
 smart write scheduler could make sure different databases have different 
 active disks. Possibly, multiple chunk-files could be active at the same 
 time, providing all sorts of interesting failure scenarios ;-)
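
For concreteness, a sketch of how such chunk-files might be addressed, and how
a per-chunk waste ratio (in the spirit of idea 4 above) could drive the "leave
it alone" optimization. All names here are invented, not actual CouchDB code:

    -define(CHUNK_SIZE, 25 * 1024 * 1024).     %% 25MB per chunk, as suggested

    %% Map a logical byte offset in the database to {ChunkIndex, OffsetInChunk}.
    chunk_for_offset(LogicalPos) ->
        {LogicalPos div ?CHUNK_SIZE, LogicalPos rem ?CHUNK_SIZE}.

    %% Hypothetical naming scheme for the chunk files of a database.
    chunk_filename(DbName, ChunkIndex) ->
        lists:concat([DbName, ".", ChunkIndex, ".couch-chunk"]).

    %% Leave a completed chunk alone unless enough of it is outdated.
    needs_compaction(LiveBytes, ChunkBytes, Threshold) ->
        ChunkBytes > 0 andalso
            (ChunkBytes - LiveBytes) / ChunkBytes > Threshold.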
 
 Thoughts?
 
 Wout.

Hi Wout, Robert Newson suggested the very same thing in the original thread.  It's a 
solid idea, to be sure.

In related work, there's COUCHDB-738

https://issues.apache.org/jira/browse/COUCHDB-738

I wrote a patch to change the internal database format that allows compaction 
to skip an extra lookup in the by_id tree.  It's a huge win for write-once DBs 
with random docids -- something like a 6x improvement in compaction speed in 
one test.  However, DBs with frequently edited documents become 35-40% larger 
pre- and post-compaction.

Damien has proposed a better alternative in that thread, which is a much bigger 
rewrite of the compaction algorithm.  Best,

Adam





[jira] Commented: (COUCHDB-753) Add config option for view compact dir

2010-05-14 Thread Till Klampaeckel (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867547#action_12867547
 ] 

Till Klampaeckel commented on COUCHDB-753:
--

I admit, I haven't really thought this through. My issue is that sometimes 
people run out of disk space with compaction.

You (not necessarily you or CouchDB) could do something like blocking writes 
when a compaction is about to replace the database dir. Maybe expose something 
from the server via JSON?
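
The underlying concern is that compaction temporarily needs roughly as much
free space as the live data it copies. A hedged sketch of a pre-flight check
(not existing CouchDB behaviour), using the stock os_mon/disksup application
to compare the current database file against free space on the target mount
point:

    %% Requires os_mon to be running: application:start(os_mon).
    %% Conservative: a freshly compacted file is usually smaller than the original.
    enough_space_for_compaction(DbFilePath, MountPoint) ->
        DbBytes = filelib:file_size(DbFilePath),
        {MountPoint, TotalKBytes, UsedPct} =
            lists:keyfind(MountPoint, 1, disksup:get_disk_data()),
        FreeBytes = TotalKBytes * 1024 * (100 - UsedPct) div 100,
        FreeBytes > DbBytes.

A compaction (or view compaction) request could refuse to start, or report an
error over the HTTP API, when such a check fails.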

 Add config option for view compact dir
 --

 Key: COUCHDB-753
 URL: https://issues.apache.org/jira/browse/COUCHDB-753
 Project: CouchDB
  Issue Type: Improvement
  Components: Database Core
Reporter: Till Klampaeckel

 CouchDB creates a foo.view.compact file in the view directory 
 (view_index_dir) when you run compact against a view.
 I'd really like to be able to specify another directory where this .compact 
 file is created and worked on. This is especially helpful when it's difficult 
 to run compaction because you run out of disk space on the same device.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (COUCHDB-762) Faster implementation of couch_file:pread_iolist

2010-05-14 Thread Adam Kocoloski (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kocoloski updated COUCHDB-762:
---

Attachment: 762-pread_iolist-v2.patch

An even better patch, which does exactly 2 pread() calls in all cases, even for 
MD5-prefixed terms.  Here are updated timings, with this approach termed 
'pread_iolist3':

4> pread_iolist_bench:go(5000, 1, 1, pread_iolist).
Median  96
90%    103
95%    109
99%    153
ok

5> pread_iolist_bench:go(5000, 1, 1, pread_iolist2).
Median  82
90%     90
95%     94
99%    107
ok

6> pread_iolist_bench:go(5000, 1, 1, pread_iolist3).
Median  71
90%     78
95%     81
99%     93
ok
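
The pread_iolist_bench module itself is attached to the ticket rather than
quoted here. For reference, percentiles like the ones above can be computed
from a list of per-read timings (collected with timer:tc/3) along these lines;
this is a sketch, not necessarily what the attached benchmark does:

    %% TimesUs: per-read latencies in microseconds, e.g. collected with
    %% {T, _} = timer:tc(couch_file, pread_iolist, [Fd, Pos]).
    percentiles(TimesUs) ->
        Sorted = lists:sort(TimesUs),
        N = length(Sorted),
        Nth = fun(P) -> lists:nth(max(1, round(P * N / 100)), Sorted) end,
        [{median, Nth(50)}, {p90, Nth(90)}, {p95, Nth(95)}, {p99, Nth(99)}].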


 Faster implementation of couch_file:pread_iolist
 

 Key: COUCHDB-762
 URL: https://issues.apache.org/jira/browse/COUCHDB-762
 Project: CouchDB
  Issue Type: Improvement
  Components: Database Core
Affects Versions: 0.11
 Environment: any
Reporter: Adam Kocoloski
Priority: Minor
 Fix For: 1.1

 Attachments: 762-pread_iolist-v2.patch, 762-pread_iolist.patch, 
 patch-to-reproduce-benchmarks.txt, pread_iolist_bench.erl, 
 pread_iolist_results.txt


 couch_file's pread_iolist function is used every time we read anything from 
 disk.  It makes 2-3 gen_server calls to the couch_file process to do its work.
 This patch moves the work done by the read_raw_iolist function into the 
 gen_server itself and adds a pread_iolist handler.  This means that one 
 gen_server call is sufficient in every case.
 Here are some benchmarks comparing the current method with the patch that 
 reduces everything to one call.  I write a number of 10k binaries to a file, 
 then read them back in a random order from 1/5/10/20 concurrent reader 
 processes.  I report the median/90/95/99 percentile response times in 
 microseconds.  In almost every case the patch is an improvement.
 The data was fully cached for these tests; I think that in a real-world 
 concurrent reader scenario the performance improvement may be greater.  The 
 patch ensures that the 2-3 pread calls reading sequential bits of data (term 
 length, MD5, and term) are always submitted without interruption.  
 Previously, two concurrent readers could race to read different terms and 
 cause some extra disk head movement.
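
A heavily simplified sketch of the idea (illustrative only; the state record,
MD5-flag handling, and return shape in the real patch differ, see the
attachments): the length prefix and the term body are both read inside a
single handle_call in the couch_file gen_server, so callers need exactly one
gen_server call.

    %% Assumes Fd was opened with [binary, raw, read]; #file{} is a stand-in
    %% for couch_file's real state record.
    handle_call({pread_iolist, Pos}, _From, #file{fd = Fd} = State) ->
        {ok, <<Len:32/integer>>} = file:pread(Fd, Pos, 4),  %% length prefix
        {ok, Body} = file:pread(Fd, Pos + 4, Len),          %% term (and MD5, if present)
        {reply, {ok, Body}, State}.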

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (COUCHDB-762) Faster implementation of couch_file:pread_iolist

2010-05-14 Thread Adam Kocoloski (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kocoloski closed COUCHDB-762.
--

Resolution: Fixed

applied v2 of the patch

 Faster implementation of couch_file:pread_iolist
 

 Key: COUCHDB-762
 URL: https://issues.apache.org/jira/browse/COUCHDB-762
 Project: CouchDB
  Issue Type: Improvement
  Components: Database Core
Affects Versions: 0.11
 Environment: any
Reporter: Adam Kocoloski
Priority: Minor
 Fix For: 1.1

 Attachments: 762-pread_iolist-v2.patch, 762-pread_iolist.patch, 
 patch-to-reproduce-benchmarks.txt, pread_iolist_bench.erl, 
 pread_iolist_results.txt


 couch_file's pread_iolist function is used every time we read anything from 
 disk.  It makes 2-3 gen_server calls to the couch_file process to do its work.
 This patch moves the work done by the read_raw_iolist function into the 
 gen_server itself and adds a pread_iolist handler.  This means that one 
 gen_server call is sufficient in every case.
 Here are some benchmarks comparing the current method with the patch that 
 reduces everything to one call.  I write a number of 10k binaries to a file, 
 then read them back in a random order from 1/5/10/20 concurrent reader 
 processes.  I report the median/90/95/99 percentile response times in 
 microseconds.  In almost every case the patch is an improvement.
 The data was fully cached for these tests; I think that in a real-world 
 concurrent reader scenario the performance improvement may be greater.  The 
 patch ensures that the 2-3 pread calls reading sequential bits of data (term 
 length, MD5, and term) are always submitted without interruption.  
 Previously, two concurrent readers could race to read different terms and 
 cause some extra disk head movement.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-762) Faster implementation of couch_file:pread_iolist

2010-05-14 Thread Adam Kocoloski (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867602#action_12867602
 ] 

Adam Kocoloski commented on COUCHDB-762:


Just a comment about the breakdown of time spent reading a term.  We're looking 
at median response times of 70 µs to read a 10K binary.  I think this is 
roughly distributed as:

crypto:md5 - 30 µs
pread()*2 - 20 µs
everything else - the remaining ~20 µs
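
For reference, the crypto share of that budget is easy to spot-check in an
Erlang shell (assuming the crypto application is started; on newer OTP
releases the equivalent calls are crypto:strong_rand_bytes/1 and
crypto:hash(md5, Bin)):

    1> Bin = crypto:rand_bytes(10240).
    2> {MicroSecs, _Digest} = timer:tc(crypto, md5, [Bin]).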

 Faster implementation of couch_file:pread_iolist
 

 Key: COUCHDB-762
 URL: https://issues.apache.org/jira/browse/COUCHDB-762
 Project: CouchDB
  Issue Type: Improvement
  Components: Database Core
Affects Versions: 0.11
 Environment: any
Reporter: Adam Kocoloski
Priority: Minor
 Fix For: 1.1

 Attachments: 762-pread_iolist-v2.patch, 762-pread_iolist.patch, 
 patch-to-reproduce-benchmarks.txt, pread_iolist_bench.erl, 
 pread_iolist_results.txt


 couch_file's pread_iolist function is used every time we read anything from 
 disk.  It makes 2-3 gen_server calls to the couch_file process to do its work.
 This patch moves the work done by the read_raw_iolist function into the 
 gen_server itself and adds a pread_iolist handler.  This means that one 
 gen_server call is sufficient in every case.
 Here are some benchmarks comparing the current method with the patch that 
 reduces everything to one call.  I write a number of 10k binaries to a file, 
 then read them back in a random order from 1/5/10/20 concurrent reader 
 processes.  I report the median/90/95/99 percentile response times in 
 microseconds.  In almost every case the patch is an improvement.
 The data was fully cached for these tests; I think that in a real-world 
 concurrent reader scenario the performance improvement may be greater.  The 
 patch ensures that the 2-3 pread calls reading sequential bits of data (term 
 length, MD5, and term) are always submitted without interruption.  
 Previously, two concurrent readers could race to read different terms and 
 cause some extra disk head movement.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (COUCHDB-704) Replication can lose checkpoints

2010-05-14 Thread Randall Leeds (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Leeds updated COUCHDB-704:
--

Attachment: (was: rep-history-update-per-checkpoint.patch)

 Replication can lose checkpoints
 

 Key: COUCHDB-704
 URL: https://issues.apache.org/jira/browse/COUCHDB-704
 Project: CouchDB
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.12
Reporter: Randall Leeds
Priority: Minor
 Attachments: save-all-rep-checkpoints.patch

   Original Estimate: 0h
  Remaining Estimate: 0h

 When saving replication checkpoints in the _local/repid document, the new 
 entry is always pushed onto the _original_ history list property that 
 existed at the start of the replication. When any number of things causes the 
 checkpoint to be written to only one of the databases, the head of the history 
 list gets out of sync. Subsequent attempts to start this replication must then 
 start from the latest common replication log entry in the _original_ history, 
 as though this replication never occurred.
 A better idea is to push every checkpoint onto the history instead of 
 replacing the head on each save.
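
A hedged illustration of the proposed behaviour (not the actual couch_rep
code; the <<"history">> property name is assumed from the replication log
document layout): prepend the new checkpoint to the history as it exists in
the document now, rather than to the list captured when the replication
started.

    %% RepLogProps: the current body of the _local replication log doc, as an
    %% EJSON {proplist}. NewEntry: the checkpoint record for this session.
    record_checkpoint(NewEntry, {RepLogProps}) ->
        History = proplists:get_value(<<"history">>, RepLogProps, []),
        {lists:keystore(<<"history">>, 1, RepLogProps,
                        {<<"history">>, [NewEntry | History]})}.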

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (COUCHDB-704) Replication can lose checkpoints

2010-05-14 Thread Randall Leeds (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867724#action_12867724
 ] 

Randall Leeds commented on COUCHDB-704:
---

Looking over this once more.

If we only append to the replication log, would it be more accurate to clear 
the stats after each checkpoint?
The last log entry in the reply to a client request for non-continuous 
replication won't show the total number of documents replicated, but only the 
number since the last checkpoint. I don't know the best way to address this.

 Replication can lose checkpoints
 

 Key: COUCHDB-704
 URL: https://issues.apache.org/jira/browse/COUCHDB-704
 Project: CouchDB
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.12
Reporter: Randall Leeds
Priority: Minor
 Attachments: save-all-rep-checkpoints.patch, whitespace.patch

   Original Estimate: 0h
  Remaining Estimate: 0h

 When saving replication checkpoints in the _local/repid document, the new 
 entry is always pushed onto the _original_ history list property that 
 existed at the start of the replication. When any number of things causes the 
 checkpoint to be written to only one of the databases, the head of the history 
 list gets out of sync. Subsequent attempts to start this replication must then 
 start from the latest common replication log entry in the _original_ history, 
 as though this replication never occurred.
 A better idea is to push every checkpoint onto the history instead of 
 replacing the head on each save.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers

2010-05-14 Thread Randall Leeds (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Leeds updated COUCHDB-761:
--

Attachment: improved-sync-logging.patch

Here's my first run at a patch.

 Timeouts in couch_log are masked, crashes callers
 -

 Key: COUCHDB-761
 URL: https://issues.apache.org/jira/browse/COUCHDB-761
 Project: CouchDB
  Issue Type: Bug
  Components: Database Core
Affects Versions: 0.10.1, 0.10.2, 0.11
Reporter: Randall Leeds
Priority: Blocker
 Fix For: 0.10.3, 0.11.1, 1.0

 Attachments: improved-sync-logging.patch


 Several users have reported seeing crash reports stemming from a 
 function_clause error in handle_info in various gen_servers. The offending 
 message looks like {#Ref, integer}.
 After months of banter and sleuthing, I determined that the likely cause was 
 a late reply to a gen_server:call that timed out, with the #Ref being the tag 
 on the response. After it came up again today in IRC, kocolosk quickly 
 discovered that the problem appears to be in couch_log.erl.
 The logging macros (?LOG_*) call couch_log:*_on/0, which calls 
 get_level_integer/0. When this call times out, the timeout is swallowed and a 
 late reply arrives at the calling process later, triggering the crash.
 Suggestions on how to fix this are welcome. Ideas so far are async logging or 
 an infinite timeout.
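
A minimal, self-contained reproduction of the failure mode with generic
gen_servers (not the real couch_log; on OTP 24+ the alias mechanism drops such
late replies, so this mirrors the behaviour of 2010-era OTP):

    -module(late_reply_demo).
    -behaviour(gen_server).
    -export([run/0]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    %% A gen_server:call/3 that times out leaves the eventual {Ref, Reply} in
    %% the caller's mailbox. Inside another gen_server that stray message later
    %% surfaces in handle_info and, without a catch-all clause, crashes it with
    %% a function_clause error.
    run() ->
        {ok, Pid} = gen_server:start(?MODULE, [], []),
        {'EXIT', {timeout, _}} = (catch gen_server:call(Pid, ping, 10)),
        timer:sleep(200),
        receive
            {Ref, pong} when is_reference(Ref) -> late_reply_in_mailbox
        after 0 -> no_late_reply
        end.

    init([]) -> {ok, nil}.
    handle_call(ping, _From, State) ->
        timer:sleep(100),                  %% stand-in for a slow couch_log call
        {reply, pong, State}.
    handle_cast(_Msg, State) -> {noreply, State}.
    handle_info(_Msg, State) -> {noreply, State}.
    terminate(_Reason, _State) -> ok.
    code_change(_OldVsn, State, _Extra) -> {ok, State}.

Either of the suggested fixes avoids leaving that stray reply behind: an async
(cast-based) log call never expects a reply, and an infinity timeout never
abandons one.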

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.