Re: Experiences: Why BTRFS had to yield for ZFS
Thanks for taking the time to write this up and follow through the thread. It's always interesting to hear situations where btrfs doesn't work well.

There are three basic problems with database workloads on btrfs. First, we have higher latencies on writes because we feed everything through helper threads for crcs. Usually the extra latencies don't show up because we have enough work in the pipeline to keep the drive busy. I don't believe the UEK kernels have the recent changes to do some of the crc work inline (without handing off) for smaller synchronous IOs.

Second, on O_SYNC writes btrfs will write both the file metadata and data into a special tree so we can be crash safe. For big files this tends to spend a lot of time looking for the extents in the file that have changed. Josef fixed that up and it is queued for the next merge window.

The third problem is that lots of random writes tend to make lots of metadata. If this doesn't fit in RAM, we can end up doing many reads that slow things down. We're working on this now as well, but recent kernels change how we cache things and should improve the results.

I feel I should update my previous thread about performance issues using btrfs in light of recent findings. We have discovered that, in all likelihood, what we experienced and described was not a problem with btrfs per se, but the result of a more general issue which btrfs was just really good at exposing (using threads more aggressively than zfs?!). Various benchmarks in Java (thread-pool setup/shutdown) and C (pthread creation and joining) have shown that our Xeon E5-2620 server with the latest Oracle Unbreakable Linux is very slow at serving up new threads (benchmarks available upon request).

Java threading benchmark on Xeon E5-2620 @ 2.0GHz:
- Oracle Unbreakable Linux: 1m49s realtime, 3m17s sys-time
- Ubuntu: 5s realtime, 3.9s sys-time
We are not sure how to continue investigating why the Oracle Linux kernel performs so poorly (scheduler, kernel config, etc.?), but it seems pretty obvious that this issue should be raised with Oracle rather than with the btrfs developers - though we'll probably look into using another OS entirely. As such, apologies for creating the noise; btrfs was not to blame! If you do have a suspicion or insight on the matter (perhaps you work for Oracle, or know the UEK?), of course we'd love a follow-up off-list. Kind regards, Casper -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Experiences: Why BTRFS had to yield for ZFS
Anand Jain Anand.Jain at oracle.com writes:
> archive-log-apply script - if you could, can you share the script itself? Or provide more details about the script. (It will help to understand the work-load in question.)

Our setup entails a whole bunch of scripts, but the apply script looks like this (orion is the production environment, pandium is the shadow): http://pastebin.com/k4T7deap

The script invokes rman, passing rman_recover_database.rcs:

    connect target /
    run {
      crosscheck archivelog all;
      delete noprompt expired archivelog all;
      catalog start with '/backup/oracle/flash_recovery_area/FROM_PROD/archivelog' noprompt;
      recover database;
    }

We receive a 1GB archivelog roughly every 20th minute, depending on the workload of the production environment. The apply rate starts out ranked btrfs, ext4, zfs (fastest first), but ends up ranked zfs, ext4, btrfs.

The following numbers are from our consumer spinning-platter disk test, but they are representative of the SSD numbers we got. Ext4 starts out with a realtime-to-SCN ratio of about 3.4 and ends down around a factor 2.2. ZFS starts out with a realtime-to-SCN ratio of about 7.5 and ends down around a factor 4.4. Btrfs starts out with a realtime-to-SCN ratio of about 2.2 and ends down around a factor 0.8. This of course means we will never be able to catch up with production, as btrfs can't apply the logs as fast as they're created. It was even worse with btrfs on our 10xSSD server, where 20 min. of realtime work would end up taking some 5h to apply (factor 0.06), which is obviously useless to us.

I should point out that during this process we also had to move some large backup sets around, and we several times saw btrfs eating massive IO, never to finish a simple mv command. I'm inclined to believe we've found some weak corner, perhaps in combination with SSDs - but it led us to compare with ext4 and ZFS, and to dismiss btrfs in favour of ZFS, as it solves our problem.
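For context, the overall shape of the apply loop is roughly the following. This is an illustrative sketch only, not the actual production script (see the pastebin link above for that); the paths, log file and interval here are made up:

```shell
#!/bin/sh
# Hypothetical sketch of the archivelog apply loop. Paths and the
# 10-minute interval are illustrative, not our real configuration.
RCS=/home/oracle/rman_recover_database.rcs
LOG=/var/log/archive_apply.log

while true; do
    # New archivelogs appear under the FROM_PROD directory via the NFS
    # mount from production; rman catalogs and applies whatever is there.
    rman target / cmdfile="$RCS" >> "$LOG" 2>&1
    sleep 600   # re-run every 10 minutes
done
```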
Re: Experiences: Why BTRFS had to yield for ZFS
> IIRC there were some patches post-3.0 which relate to sync. If the Oracle db uses sync writes (or calls sync somewhere, which it should), it might help to re-run the test with a more recent kernel. The kernel-ml repository might help.

Yeah, there doesn't seem to be a shortage of patches coming into btrfs (just looking around the mailing-list), so that doesn't surprise me. Indeed, reading about race conditions, deadlocks and locks being held too long does not serve to promote btrfs as particularly production-ready.

> > Ext4 starts out with a realtime-to-SCN ratio of about 3.4 and ends down around a factor 2.2. ZFS starts out with a realtime-to-SCN ratio of about 7.5 and ends down around a factor 4.4.
>
> So zfsonlinux is actually faster than ext4 for that purpose? Cool!

Yes, rather amazingly fast - again, it seems to us ZFS is optimized for write while btrfs is optimized for read.

> Just wondering, did you use the discard option by any chance? In my experience it makes btrfs MUCH slower.

I actually don't remember when we added this (we started out without it), but I don't recall seeing a major difference. We should disable it however, since the stupid fancy HP RAID controller refuses to pass on TRIM and SMART commands anyway (and the proprietary HP SSD tools refuse to access non-enterprise HP SSDs).
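For reference, disabling discard is just a matter of dropping the mount option; TRIM can instead be issued in batches. The device and mount point below are placeholders, not our actual layout (and, as noted, TRIM is moot behind a RAID controller that won't pass it through):

```shell
# Placeholders: /dev/sdb and /oradata are illustrative.
# Inline discard (in our experience this can make btrfs much slower):
mount -o ssd,noatime,discard /dev/sdb /oradata

# Alternative: mount without discard and batch TRIM periodically,
# e.g. from cron:
mount -o ssd,noatime /dev/sdb /oradata
fstrim -v /oradata
```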
Re: Experiences: Why BTRFS had to yield for ZFS
Chris Mason chris.mason at fusionio.com writes:
> There are three basic problems with database workloads on btrfs. First, we have higher latencies on writes because we feed everything through helper threads for crcs. Usually the extra latencies don't show up because we have enough work in the pipeline to keep the drive busy. I don't believe the UEK kernels have the recent changes to do some of the crc work inline (without handing off) for smaller synchronous IOs.
>
> Second, on O_SYNC writes btrfs will write both the file metadata and data into a special tree so we can be crash safe. For big files this tends to spend a lot of time looking for the extents in the file that have changed. Josef fixed that up and it is queued for the next merge window.
>
> The third problem is that lots of random writes tend to make lots of metadata. If this doesn't fit in RAM, we can end up doing many reads that slow things down. We're working on this now as well, but recent kernels change how we cache things and should improve the results.

That's good to hear - personally I'd rather use btrfs than ZFS, but it seems we were a tad bit early to the party with this kind of workload. Interestingly, nobody commented on block size; I kind of expected that when writing my initial post (the database uses 8KB blocks, tweakable in ZFS but apparently not in btrfs).

/Casper
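On the block-size point: ZFS exposes the record size per dataset, so it can be matched to Oracle's 8 KB block, while btrfs has no equivalent per-file record-size knob (only the metadata node/leaf size is selectable at mkfs time). Dataset and device names below are illustrative:

```shell
# ZFS: match the dataset record size to the database block size (8 KB).
zfs create -o recordsize=8k tank/oradata

# btrfs: data is written in extents; there is no per-file record size.
# Only the metadata leaf/node size can be chosen at mkfs time, e.g.:
mkfs.btrfs -l 16k -n 16k /dev/sdb
```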
Experiences: Why BTRFS had to yield for ZFS
Abstract

For database testing purposes, a COW filesystem was needed in order to facilitate snapshotting and rollback, so as to provide mirrors of our production database at fixed intervals (every night and on demand).

Platform

An HP ProLiant 380p (2x Intel Xeon E5-2620 with 12 cores for a total of 24 threads) with built-in Smart Array SAS/SATA (Gen8) controllers was combined with 10x consumer Samsung 830 512GB SSDs (SATA III, 6Gb/s). Oracle (Unbreakable) Linux x64 2.6.39-200.29.3.el6uek.x86_64 #1 SMP Tue Aug 28 13:03:31 EDT 2012 and Oracle Database Standard Edition 10.2.0.4 64-bit.

Setup

The OS was installed on the first disk (sda) and the remaining 9 (sdb - sdj) were pooled into some 4.4TB for containing Oracle datafiles. An initial backup of the 1.5TB production database would get restored as a (shut down) sync instance on the test server on the COW filesystem. A script on the test server would then apply Oracle archive files from the production environment to this Oracle sync database every 10th minute, effectively keeping it near up-to-date with production. The most reliable way to do this was with a simple NFS mount (rather than rsync or samba). The idea then was that it would be very fast and easy to make a new snapshot of the sync database, start it up, and voila, you'd have a new instance ready to play with. A desktop machine with ext4 partitions established a lower boundary for applying archivelog data at around 1200 kb/s - we expected an order of magnitude higher performance on the server.

BTRFS experiences

We used native BTRFS from the kernel, mounted with atime off and ssd mode. BTRFS proved to be very fast at reading for a large RDBMS (a 2x speedup compared to a SAN). However, applying archivelog on a BTRFS filesystem proved to scale poorly: starting out with a decent apply rate, it would eventually end down around 400-500 kb/s. BTRFS had to be abandoned due to this, since the script would never be able to finish applying archivelogs as new ones arrived.
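The btrfs side of this setup can be sketched as follows, assuming the sync instance lives in its own subvolume so it can be snapshotted. Device names and paths are illustrative, not our actual layout:

```shell
# Pool the nine data SSDs into one btrfs filesystem (illustrative devices).
mkfs.btrfs /dev/sd[b-j]
mount -o ssd,noatime /dev/sdb /oradata

# Keep the sync instance in its own subvolume...
btrfs subvolume create /oradata/sync

# ...so a writable snapshot yields a new playable instance in seconds:
btrfs subvolume snapshot /oradata/sync /oradata/test1
```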
The desktop machine with traditional spinning drives formatted for BTRFS showed a similar scenario, so hardware (server, controller and disks) was excluded as a cause.

ZFS experiences

We then tried ZFS via custom-built SPL/ZFS 0.6.0-rc10 modules, with recordsize equal to that of the Oracle database (8K); compression off, quota off, dedup off, checksum on and atime on. ZFS proved to be on par with a SAN when it comes to reading for a large RDBMS. Thankfully, ZFS did not degrade much in archivelog apply performance, and proved to have a lower boundary of 15MB/s.

Conclusion

We had hoped to be able to utilize BTRFS, due to its license and inclusion in the mainline Linux kernel. However, for practical purposes, we're not able to make use of BTRFS due to its performance when writing - especially considering this is even without mixing in snapshotting. While ZFS doesn't give us quite the boost in read performance we had expected from SSDs, it seems more optimized for writing and will allow us to complete our project of getting clones of a production database environment up and running in a snap.

Take it for what it's worth: a couple of developers' experiences with BTRFS. We are not likely to go back and change things now that it works, but we are curious as to why we see such big differences between the two filesystems. Any comments and/or feedback appreciated.

Regards, Jesper and Casper