Re: Experiences: Why BTRFS had to yield for ZFS

2012-10-08 Thread Casper Bang
 Thanks for taking the time to write this up and follow through on the thread.
 It's always interesting to hear situations where btrfs doesn't work
 well.
 
 There are three basic problems with the database workloads on btrfs.
 First is that we have higher latencies on writes because we are feeding
 everything through helper threads for crcs.  Usually the extra latencies
 don't show up because we have enough work in the pipeline to keep the
 drive busy.
 
 I don't believe the UEK kernels have the recent changes to do some of
 the crc work inline (without handing off) for smaller synchronous IOs.
 
 Second, on O_SYNC writes btrfs will write both the file metadata and
 data into a special tree so we can be crash safe.  For big files this
 tends to spend a lot of time looking for the extents in the file that
 have changed.
 
 Josef fixed that up and it is queued for the next merge window.
 
 The third problem is that lots of random writes tend to make lots of
 metadata.  If this doesn't fit in ram, we can end up doing many reads
 that slow things down.  We're working on this now as well, but recent
 kernels change how we cache things and should improve the results.

I feel I should update my previous thread about performance issues using btrfs
in light of recent findings. We have discovered that, in all likelihood, what we
experienced and described was not a problem with btrfs per se, but the result of
a more general issue which btrfs was just really good at exposing (perhaps
because it uses threads more aggressively than ZFS?).

Various benchmarks in Java (thread-pool setup/shutdown) and C (pthread creation
and joining) have shown that our Xeon/E5-2620 server running the latest Oracle
Unbreakable Linux is very slow at serving up new threads (benchmarks available
upon request).

Java threading benchmark on Xeon/E5-2620 @ 2.0GHz:
Oracle Unbreakable Linux: 1m49s realtime, 3m17s sys-time
Ubuntu:   5s realtime, 3.9s sys-time.
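
For illustration, here is a minimal sketch of the kind of pthread creation/join
microbenchmark we used on the C side (illustrative code, not our exact
benchmark; the numbers above are from the Java variant):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS 100000

/* Trivial worker: the benchmark measures thread setup/teardown, not work. */
static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int i = 0; i < NTHREADS; i++) {
        pthread_t t;
        if (pthread_create(&t, NULL, worker, NULL) != 0) {
            perror("pthread_create");
            exit(1);
        }
        pthread_join(t, NULL);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    double secs = (end.tv_sec - start.tv_sec) +
                  (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%d create/join cycles in %.2f s\n", NTHREADS, secs);
    return 0;
}

Compile with gcc -O2 -pthread; as noted above, the C variant showed the same
pattern as the Java numbers.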

We are not sure how to continue investigating why the Oracle Linux kernel
performs so poorly here (scheduler, kernel config, etc.?), but it seems pretty
obvious that this issue should be raised with Oracle rather than with the btrfs
developers - though we'll probably look into using another OS entirely. As such,
apologies for the noise; btrfs was not to blame!

If you do have a suspicion or insight on the matter (perhaps you work for Oracle,
or know the UEK kernel?), we'd of course love a follow-up off-list.

Kind regards,
Casper



Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-19 Thread Casper Bang
 Anand Jain Anand.Jain at oracle.com writes:
   archive-log-apply script - if you could, can you share the
   script itself ? or provide more details about the script.
   (It will help to understand the work-load in question).

Our setup entails a whole bunch of scripts, but the apply script looks like this
(orion is the production environment, pandium is the shadow):
http://pastebin.com/k4T7deap

The script invokes rman passing rman_recover_database.rcs:

connect target /
run {
crosscheck archivelog all;
delete noprompt expired archivelog all;
catalog start with '/backup/oracle/flash_recovery_area/FROM_PROD/archivelog' noprompt;
recover database;
}

We receive a 1GB archivelog roughly every 20 minutes, depending on the workload
of the production environment. The apply rate starts out fine, with btrfs > ext4
> zfs, but ends up with ZFS > ext4 > btrfs. The following numbers are from our
consumer spinning-platter disk test, but they are equally representative of the
SSD numbers we got.

Ext4 starts out with a realtime to SCN ratio of about 3.4 and ends down around a
factor 2.2.

ZFS starts out with a realtime to SCN ratio of about 7.5 and ends down around a 
factor 4.4.

Btrfs starts out with a realtime to SCN ratio of about 2.2 and ends down around 
a factor 0.8. This of course means we will never be able to catch up with 
production, as btrfs can't apply these as fast as they're created.

It was even worse with btrfs on our 10xSSD server, where 20 min. of realtime
work would end up taking some 5h to get applied (factor 0.06), obviously useless
to us.

I should point out that during this process we also had to move some large
backup sets around, and several times we saw btrfs eating massive amounts of IO
without ever finishing a simple mv command.

I'm inclined to believe we've found some weak corner case, perhaps in combination
with SSDs - but it led us to compare with ext4 and ZFS, and to dismiss btrfs in
favour of ZFS, as ZFS solves our problem.



Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-19 Thread Casper Bang
 IIRC there were some patches post-3.0 which relate to sync. If the Oracle
 db uses sync writes (or calls sync somewhere, which it should), it
 might help to re-run the test with a more recent kernel. The kernel-ml
 repository might help.

Yeah, there doesn't seem to be any shortage of patches coming into btrfs
(just looking around the mailing list), so that doesn't surprise me.
Indeed, reading about race conditions, deadlocks and locks being held too
long does not exactly promote btrfs as particularly production ready.

  Ext4 starts out with a realtime to SCN ratio of about 3.4 and ends down around
  a factor 2.2.
 
  ZFS starts out with a realtime to SCN ratio of about 7.5 and ends down around
  a factor 4.4.
 
 So zfsonlinux is actually faster than ext4 for that purpose? coool!

Yes, rather amazingly fast - again, it seems to us that ZFS is optimized for
writes while btrfs is optimized for reads.

 Just wondering, did you use discard option by any chance? In my
 experience it makes btrfs MUCH slower.

I actually don't remember when we added this (we started out without it),
but I don't recall seeing a major difference. We should disable it, however,
since the stupid fancy HP RAID controller refuses to pass on TRIM and SMART
commands anyway (and the proprietary HP SSD tools refuse to access
non-enterprise HP SSDs).



Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-19 Thread Casper Bang
Chris Mason chris.mason at fusionio.com writes:
 There are three basic problems with the database workloads on btrfs.
 First is that we have higher latencies on writes because we are feeding
 everything through helper threads for crcs.  Usually the extra latencies
 don't show up because we have enough work in the pipeline to keep the
 drive busy.
 
 I don't believe the UEK kernels have the recent changes to do some of
 the crc work inline (without handing off) for smaller synchronous IOs.
 
 Second, on O_SYNC writes btrfs will write both the file metadata and
 data into a special tree so we can be crash safe.  For big files this
 tends to spend a lot of time looking for the extents in the file that
 have changed.
 
 Josef fixed that up and it is queued for the next merge window.
 
 The third problem is that lots of random writes tend to make lots of
 metadata.  If this doesn't fit in ram, we can end up doing many reads
 that slow things down.  We're working on this now as well, but recent
 kernels change how we cache things and should improve the results.

That's good to hear - personally I'd rather use btrfs than ZFS, but it seems we
were a tad early to the party with this kind of workload. Interesting that nobody
commented on block size; I kind of expected that when writing my initial post
(the database uses 8KB blocks, tweakable in ZFS but apparently not in btrfs).
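
For anyone wanting to poke at the small-synchronous-write behaviour Chris
describes above, here is a minimal sketch of that kind of workload - 8KB O_SYNC
writes at random offsets. This is our illustration of the IO pattern, not code
taken from Oracle; the filename and sizes are arbitrary:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK   8192      /* Oracle-style 8KB block */
#define NBLOCKS 4096      /* 32MB test file */
#define NWRITES 10000

int main(void)
{
    /* O_SYNC makes every write durable before returning, roughly like a
       database writing its datafile blocks synchronously. */
    int fd = open("testfile", O_CREAT | O_RDWR | O_SYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[BLOCK];
    memset(buf, 'x', sizeof(buf));
    srand(42);

    for (int i = 0; i < NWRITES; i++) {
        off_t off = (off_t)(rand() % NBLOCKS) * BLOCK;
        if (pwrite(fd, buf, BLOCK, off) != BLOCK) {
            perror("pwrite");
            return 1;
        }
    }

    close(fd);
    return 0;
}

Timing a run of this on btrfs versus ext4/ZFS is one way to isolate the
sync-write path discussed in this thread.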

/Casper



Experiences: Why BTRFS had to yield for ZFS

2012-09-17 Thread Casper Bang
Abstract
For database testing purposes, a COW filesystem was needed in order to
facilitate snapshotting and rollback, so as to provide mirrors of
our production database at fixed intervals (every night and on
demand).

Platform
An HP Proliant 380P (2x Intel Xeon E5-2620, 12 cores for a total
of 24 threads) with built-in Smart Array SAS/SATA (Gen8) controllers
was combined with 10x consumer Samsung 830 512GB SSDs (SATA III, 6Gb/s),
running Oracle (Unbreakable) Linux x64 2.6.39-200.29.3.el6uek.x86_64 #1 SMP
Tue Aug 28 13:03:31 EDT 2012 and Oracle Database Standard Edition
10.2.0.4 64-bit.

Setup
The OS was installed on the first disk (sda) and the remaining 9 (sdb - sdj)
were pooled into some 4.4TB for holding the Oracle datafiles. An
initial backup of the 1.5TB prod database would get restored as
a (shut down) sync instance on the test server, on the COW filesystem.
A script on the test server would then apply Oracle archive files
from the production environment to this Oracle sync database every
10 minutes, effectively keeping it near up-to-date with production.
The most reliable way to do this was with a simple NFS mount (rather
than rsync or Samba). The idea, then, was that it would be very fast
and easy to make a new snapshot of the sync database, start it up, and
voila, you'd have a new instance ready to play with. A desktop machine
with ext4 partitions set a lower boundary for applying archivelog
data at around 1200 kB/s - we expected an order of magnitude higher
performance on the server.

BTRFS experiences
We used the native BTRFS from the kernel, with atime off and ssd mode on. BTRFS
proved to be very fast at reading for a large TRDBMS (a 2x speedup
compared to a SAN). However, applying archivelogs on a BTRFS filesystem
proved to scale poorly: starting out with a decent apply rate, it
would eventually end down around 400-500 kB/s. BTRFS had to be
abandoned due to this, since the script would never be able to finish
applying archivelogs as new ones arrived. The desktop machine with
traditional spinning drives formatted for BTRFS showed a similar
scenario, so hardware (server, controller and disks) was excluded as a
cause.
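
As a rough sketch, creating and mounting such a pool could look like the
following (device names, raid profile and mount point here are illustrative
assumptions, not our exact commands):

mkfs.btrfs -d raid0 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
    /dev/sdg /dev/sdh /dev/sdi /dev/sdj
mount -o noatime,ssd /dev/sdb /oradata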

ZFS experiences
We then tried using ZFS via custom-built SPL/ZFS 0.6.0-rc10 modules,
with recordsize equal to that of the Oracle database (8K); compression
off, quota off, dedup off, checksum on and atime on.
ZFS proved to be on par with a SAN when it comes to reading for a
large TRDBMS. Thankfully, ZFS did not degrade much in archivelog apply
performance, and proved to have a lower boundary of 15MB/s.
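
Again as a sketch, the corresponding pool and dataset setup would look roughly
like this (pool and dataset names are illustrative assumptions):

zpool create orapool /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
    /dev/sdg /dev/sdh /dev/sdi /dev/sdj
zfs create -o recordsize=8K -o compression=off -o quota=none -o dedup=off \
    -o checksum=on -o atime=on orapool/oradata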

Conclusion
We had hoped to be able to utilize BTRFS, due to its license and
inclusion in the mainline Linux kernel. However, for practical
purposes, we're not able to make use of BTRFS due to its performance
when writing - especially considering this is even without mixing in
snapshotting. While ZFS doesn't give us quite the boost in read
performance we had expected from SSDs, it seems more optimized for
writing and will allow us to complete our project of getting clones
of a production database environment up and running in a snap.

Take it for what it's worth: a couple of developers' experiences with
BTRFS. We are not likely to go back and change things now that it works,
but we are curious as to why we see such big differences between the
two filesystems. Any comments and/or feedback appreciated.

Regards,
Jesper and Casper