Experiences: Why BTRFS had to yield for ZFS

2012-09-17 Thread Casper Bang
Abstract
For database testing purposes, we needed a COW filesystem to
facilitate snapshotting and rollback, so that we could provide mirrors of
our production database at fixed intervals (every night and on
demand).

Platform
An HP ProLiant 380P (2x Intel Xeon E5-2620, 12 cores and 24 threads in
total) with built-in Smart Array SAS/SATA (Gen8) controllers
was combined with 10x consumer Samsung 830 512GB SSDs (SATA III, 6Gb/s).
It ran Oracle (Unbreakable) Linux x64 2.6.39-200.29.3.el6uek.x86_64 #1 SMP
Tue Aug 28 13:03:31 EDT 2012 and Oracle Database Standard Edition
10.2.0.4 64-bit.

Setup
The OS was installed on the first disk (sda), and the remaining 9 (sdb - sdj)
were pooled into roughly 4.4TB to hold the Oracle datafiles. An
initial backup of the 1.5TB production database was restored as
a (shut down) sync instance on the test server's COW filesystem.
A script on the test server would then apply Oracle archive files
from the production environment to this sync database every
10 minutes, keeping it nearly up to date with production.
The most reliable way to get the archive files across was a simple NFS
mount (rather than rsync or Samba). The idea was that it would then be
very fast and easy to snapshot the sync database, start it up, and
voilà, you'd have a new instance ready to play with. A desktop machine
with ext4 partitions set a lower bound of around 1200 kb/s for applying
archivelog data; we expected an order of magnitude higher
performance on the server.
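A rough sketch of the nightly/on-demand snapshot step described above, in Python for illustration only. It just builds the command lines rather than running them. The dataset name `tank/oradata/sync`, the subvolume path `/oradata/sync` and the `_<tag>` naming scheme are hypothetical, since the thread does not give the real ones, and it assumes the sync instance is shut down (and the apply script idle) while the snapshot is taken.

```python
import shlex

# Hypothetical names; the thread does not state the real pool, dataset or paths.
ZFS_SYNC_DATASET = "tank/oradata/sync"   # ZFS dataset holding the shut-down sync instance
BTRFS_SYNC_SUBVOL = "/oradata/sync"      # equivalent BTRFS subvolume


def zfs_clone_commands(tag: str) -> list[list[str]]:
    """Snapshot the sync dataset, then expose the snapshot as a writable clone."""
    snapshot = f"{ZFS_SYNC_DATASET}@{tag}"
    clone = f"{ZFS_SYNC_DATASET}_{tag}"
    return [
        ["zfs", "snapshot", snapshot],      # read-only point-in-time copy
        ["zfs", "clone", snapshot, clone],  # writable dataset backed by the snapshot
    ]


def btrfs_clone_commands(tag: str) -> list[list[str]]:
    """BTRFS snapshots taken without -r are writable, so one command suffices."""
    return [["btrfs", "subvolume", "snapshot",
             BTRFS_SYNC_SUBVOL, f"{BTRFS_SYNC_SUBVOL}_{tag}"]]


if __name__ == "__main__":
    for cmd in zfs_clone_commands("20120917") + btrfs_clone_commands("20120917"):
        print(shlex.join(cmd))  # dry run: print instead of executing
```

Starting an Oracle instance on the clone (instance name, control file locations and so on) is a separate step that this sketch does not cover.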

BTRFS experiences
We used the native BTRFS from the kernel, with atime off and ssd mode. BTRFS
proved to be very fast at reading for a large TRDBMS (2x speedup
compared to a SAN). However, applying archivelogs on a BTRFS filesystem
scaled poorly: the apply rate started out decent but eventually dropped
to around 400-500 kb/s. BTRFS had to be abandoned because of this,
since the script would never be able to finish applying archivelogs
before new ones arrived. The desktop machine with traditional spinning
drives formatted with BTRFS showed a similar pattern, so we ruled out
hardware (server, controller and disks) as a cause.

ZFS experiences
We then tried ZFS via custom-built SPL/ZFS 0.6.0-rc10 modules, with
recordsize equal to the Oracle database block size (8K), compression
off, quota off, dedup off, checksum on and atime on.
ZFS proved to be on par with a SAN when it comes to reading for a
large TRDBMS. Thankfully, ZFS did not degrade much in archivelog apply
performance, and showed a lower bound of 15MB/s.
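For comparison, the BTRFS and ZFS settings listed in the two sections above can be written down as commands. This Python sketch only assembles the argument lists; the device `/dev/sdb`, mount point `/oradata` and dataset `tank/oradata` are hypothetical, and mapping "quota off" to `quota=none` is an assumption about how it was configured.

```python
# Settings as stated in the thread; device, mount point and dataset are hypothetical.
BTRFS_MOUNT_OPTIONS = ["noatime", "ssd"]  # "atime off, ssd mode"

ZFS_PROPERTIES = {
    "recordsize": "8K",   # match Oracle's 8K database block size
    "compression": "off",
    "quota": "none",      # "quota off" (assumed mapping)
    "dedup": "off",
    "checksum": "on",
    "atime": "on",
}


def btrfs_mount_command(device: str, mountpoint: str) -> list[str]:
    """mount(8) invocation with the BTRFS options used in the test."""
    return ["mount", "-t", "btrfs", "-o", ",".join(BTRFS_MOUNT_OPTIONS),
            device, mountpoint]


def zfs_create_command(dataset: str) -> list[str]:
    """zfs create invocation with one -o property=value pair per setting."""
    cmd = ["zfs", "create"]
    for prop, value in ZFS_PROPERTIES.items():
        cmd += ["-o", f"{prop}={value}"]
    cmd.append(dataset)
    return cmd


print(" ".join(btrfs_mount_command("/dev/sdb", "/oradata")))
print(" ".join(zfs_create_command("tank/oradata")))
```

Setting `recordsize` when the dataset is created matters, because ZFS applies a changed recordsize only to files written afterwards.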

Conclusion
We had hoped to be able to use BTRFS, due to its license and
inclusion in the Linux mainline kernel. However, for practical
purposes, we're not able to make use of BTRFS due to its write
performance, especially considering this is even without mixing in
snapshotting. While ZFS doesn't give us quite the boost in read
performance we had expected from SSDs, it seems more optimized for
writing and will allow us to complete our project of getting clones
of a production database environment up and running in a snap.

Take it for what it's worth: a couple of developers' experiences with
BTRFS. We are not likely to go back and change things now that it works,
but we are curious as to why we see such big differences between the
two filesystems. Any comments and/or feedback appreciated.

Regards,
Jesper and Casper
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-17 Thread Ralf Hildebrandt
* Casper Bang:

> Oracle (Unbreakable) Linux x64 2.6.39-200.29.3.el6uek.x86_64 #1 SMP

And the btrfs was that from vanilla 2.6.39 (i.e. over a year old)?

-- 
Ralf Hildebrandt   Charite Universitätsmedizin Berlin
ralf.hildebra...@charite.de   Campus Benjamin Franklin
http://www.charite.de  Hindenburgdamm 30, 12203 Berlin
Geschäftsbereich IT, Abt. Netzwerk fon: +49-30-450.570.155


Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-17 Thread Casper Bang
Ralf Hildebrandt writes:

> 
> * Casper Bang:
> 
> > Oracle (Unbreakable) Linux x64 2.6.39-200.29.3.el6uek.x86_64 #1 SMP
> 
> And the btrfs was that from vanilla 2.6.39 (i.e. over a year old)?
> 

We're using the latest available kernel for our Oracle Unbreakable
Linux 6.3, from Aug 28. We have no other option, since the Oracle database
software needs to run on a certified distro. I have no idea how to check
the version Oracle actually compiles with; only the tools package has
easy-to-grasp version info.

In any event, I think it unlikely that the performance differences
we see are the result of missing performance tweaks: we're talking an order
of magnitude here, and that smells of a design difference.



Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-17 Thread Avi Miller
Hi,

On 17/09/2012, at 7:55 PM, Casper Bang wrote:

> We're using the latest available kernel for our Oracle Unbreakable 
> Linux 6.3 from Aug 28. We have no other option, since the Oracle database
> software needs to run on a certified distro. 

Oracle Database is not certified to run on either btrfs or ZFS on Linux, so if 
certification is an issue, you can't use either filesystem. Out of interest, 
have you done a performance benchmark with ASM using ASMlib on the same 
platform? 

--
Oracle 
Avi Miller | Principal Program Manager | +61 (412) 229 687
Oracle Linux and Virtualization
417 St Kilda Road, Melbourne, Victoria 3004 Australia


Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-17 Thread Casper Bang
> Oracle Database is not certified to run on either btrfs or ZFS on Linux, so
> if certification is an issue, you can't use either filesystem.

Right, I had missed that - only ZFS on Solaris is officially supported I 
suppose. We had to draw the line somewhere, and an Oracle OS with an Oracle 
database with an Oracle filesystem seemed like a good platform. If the BTRFS 
pieces are indeed a year old in the latest official binary kernel from last 
month, that just makes me wonder why Oracle didn't use these latest bits. 
Again, I'm inclined to think we're dealing with a design difference between 
ZFS and BTRFS rather than a missing performance optimization. You'd know that 
better than I. :)

> Out of interest, have you done a performance benchmark with ASM using ASMlib
> on the same platform? 

Sorry, no. Our experience with ASM is limited; we concluded at some point
that we like being able to handle the files in a plain mountable filesystem.



Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-17 Thread Avi Miller
Hi,

On 17/09/2012, at 8:47 PM, Casper Bang wrote:

> month, that just makes me wonder why Oracle didn't use these latest bits. 

We used the most stable release of btrfs that was available when the
UEK was developed. Keep in mind that while it's versioned as
2.6.39, it's actually 3.0.16 under the hood. It's just that some userspace
doesn't like having a kernel version that doesn't start with "2.6".

>> Out of interest, have you done a performance benchmark with ASM using ASMlib
>> on the same platform? 
> 
> Sorry, no. Our experience with ASM is limited; we concluded at some point
> that we like being able to handle the files in a plain mountable filesystem.

Perhaps, but ASM would provide all the functionality you require, including 
snapshots and rollback, at the highest possible performance. Certainly a lot 
higher than both ZFS and btrfs. And it's fully certified and supported by 
Oracle.

As an alternative, why not consider using Oracle VM on the machine and creating 
database VMs instead? You can then use the snapshot capability of Oracle VM 
while still running supported and certified filesystems inside each guest.

(We should also probably take this discussion off-list, as it has drifted away
from btrfs proper.) Feel free to reply to me directly if you want.



Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-17 Thread Anand Jain

> A script on the test server, would then apply Oracle archive files
> from the production environment to this Oracle sync database, every
> 10'th minute, effectively making it near up-to-date with production.
>
> The most reliable way to do this was with a simple NFS mount (rather
> than rsync or samba). The idea then was, that it would be very fast
> and easy to make a new snapshot of the sync database, start it up, and
> voila you'd have a new instance ready to play with. A desktop machine

Could you share the archive-log-apply script itself, or provide more
details about it? (It will help to understand the workload in question.)

Thanks, Anand


Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-18 Thread Andrew McGlashan
Hi,

On 17/09/2012 8:05 PM, Avi Miller wrote:
> Oracle Database is not certified to run on either btrfs or ZFS on Linux, so 
> if certification is an issue, you can't use either filesystem. Out of 
> interest, have you done a performance benchmark with ASM using ASMlib on the 
> same platform? 

I thought that Oracle considered BTRFS to be production ready.  It
surprises me that running an Oracle database on BTRFS is not a supported
configuration.

Cheers

-- 
Kind Regards
AndrewM

Andrew McGlashan
Broadband Solutions now including VoIP

Current Land Line No: 03 9012 2102
Mobile: 04 2574 1827 Fax: 03 9012 2178

National No: 1300 85 3804

Affinity Vision Australia Pty Ltd
http://affinityvision.com.au
http://securemywireless.com.au
http://adsl2choice.net.au

In Case of Emergency --  http://affinityvision.com.au/ice.html


Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-18 Thread Avi Miller
Hi,

On 19/09/2012, at 2:48 AM, Andrew McGlashan wrote:

> On 17/09/2012 8:05 PM, Avi Miller wrote:
>> Oracle Database is not certified to run on either btrfs or ZFS on Linux, so 
>> if certification is an issue, you can't use either filesystem. Out of 
>> interest, have you done a performance benchmark with ASM using ASMlib on the 
>> same platform? 
> 
> I thought that Oracle considered BTRFS to be production ready.  It
> surprises me that running an Oracle database on BTRFS is not a supported
> configuration.


The Oracle Linux team considers btrfs production-ready and we support it for 
production purposes for customers. However, we have nothing to do with Database 
and their certification process, and the Database (and other) product teams 
have not certified it for use with their products yet. This is also why product 
certification lags: we have nothing to do with individual product certification 
processes on various operating systems/platforms. 

So, while I'm aware that the database team is planning to certify btrfs "at
some point", I suspect with Oracle OpenWorld coming up in a few weeks' time,
they have other things on their plate right now. :)



Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-18 Thread Gregory Farnum
On Mon, Sep 17, 2012 at 1:45 AM, Casper Bang wrote:
>
> BTRFS experiences
> We used native BTRFS from kernel; with atime off, ssd mode. BTRFS
> proved to be very fast at reading for a large TRDBMS (2x speedup
> compared to a SAN). However, applying archivelog on a BTRFS filesystem
> proved to scale poorly, by starting out with a decent apply rate it
> would eventually end down around 400-500 kb/s. BTRFS had to be
> abandoned due to this, since the script would never be able to finish
> applying archivelog as new ones arrived. The desktop machine with
> traditional spinning drives formatted for BTRFS showed a similar
> scenario, so hardware (server, controller and disks) was excluded as a
> cause.

Can you say more about this decent apply rate ending up at
400-500kb/s? We've been seeing degrading performance in our workloads
but thought it was due to snapshot abuse (i.e., large writes start out
at, say, 110MB/s and get slower the longer we run them, though we've
never run them long enough to drop below about half the starting speed).


Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-19 Thread Casper Bang
> Anand Jain writes:
>   archive-log-apply script - if you could, can you share the
>   script itself ? or provide more details about the script.
>   (It will help to understand the work-load in question).

Our setup entails a whole bunch of scripts, but the apply script looks like
this (orion is the production environment, pandium is the shadow):
http://pastebin.com/k4T7deap

The script invokes rman passing rman_recover_database.rcs:

connect target /
run {
  crosscheck archivelog all;
  delete noprompt expired archivelog all;
  catalog start with '/backup/oracle/flash_recovery_area/FROM_PROD/archivelog' noprompt;
  recover database;
}

We receive a 1GB archivelog roughly every 20 minutes, depending on the
workload of the production environment. The apply rate starts out fine with
btrfs > ext4 > zfs, but ends up with ZFS > ext4 > btrfs. The following numbers
are from our consumer spinning-platter disk test, but they are equally
representative of the SSD numbers we got.
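As a back-of-the-envelope check, a 1GB archivelog every 20 minutes sets the sustained apply rate needed just to keep pace. The Python sketch below assumes that 1GB means 1024 MiB and that the "kb/s" figures quoted in the original post are kilobytes per second (treated here as KiB/s); both are assumptions, and the KB/KiB difference is small next to the gaps involved.

```python
LOG_SIZE_MIB = 1024       # "1GB archivelog", assumed to be 1 GiB
LOG_INTERVAL_S = 20 * 60  # roughly every 20 minutes


def required_rate_kib_s(size_mib: float, interval_s: float) -> float:
    """Sustained apply rate (KiB/s) needed to keep up with log arrival."""
    return size_mib * 1024 / interval_s


need = required_rate_kib_s(LOG_SIZE_MIB, LOG_INTERVAL_S)

# Apply rates mentioned in the thread, assuming "kb/s" means kilobytes per second.
observed = {
    "btrfs, degraded (low)": 400,
    "btrfs, degraded (high)": 500,
    "ext4 desktop lower bound": 1200,
    "ZFS lower bound": 15 * 1024,  # 15MB/s
}

print(f"needed: {need:.1f} KiB/s")
for name, rate in observed.items():
    print(f"{name}: {rate} KiB/s, keeps up: {rate >= need}")
```

On these assumptions, even the top of the degraded BTRFS range covers only about 57% of the required rate (500 / 873.8), which fits the observation that the apply script could never catch up.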

Ext4 starts out with a realtime to SCN ratio of about 3.4 and ends up around
a factor of 2.2.

ZFS starts out with a realtime to SCN ratio of about 7.5 and ends up around a
factor of 4.4.

Btrfs starts out with a realtime to SCN ratio of about 2.2 and ends up around
a factor of 0.8. This of course means we will never be able to catch up with
production, as btrfs can't apply the logs as fast as they're created.

It was even worse with btrfs on our 10xSSD server, where 20 min. of realtime
work would end up taking some 5h to apply (a factor of 0.06), which is
obviously useless to us.
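The catch-up argument can be made explicit. The Python sketch below assumes the realtime to SCN ratio means hours of production work applied per hour of wall-clock apply time, which is consistent with the remark that a ratio below 1 can never catch up; the function names are illustrative.

```python
import math


def apply_ratio(production_minutes: float, apply_minutes: float) -> float:
    """Production time covered per unit of wall-clock apply time."""
    return production_minutes / apply_minutes


def backlog_growth_per_hour(ratio: float) -> float:
    """Hours of unapplied production work added per wall-clock hour.

    Production generates 1 hour of work per hour; applying removes `ratio` hours.
    A negative result means the backlog is shrinking.
    """
    return 1.0 - ratio


def hours_to_catch_up(backlog_hours: float, ratio: float) -> float:
    """Wall-clock hours to clear an existing backlog while production keeps running."""
    if ratio <= 1.0:
        return math.inf  # the backlog never shrinks
    return backlog_hours / (ratio - 1.0)


print(round(apply_ratio(20, 5 * 60), 3))       # 20 min of work applied in 5 h
print(round(backlog_growth_per_hour(0.8), 2))  # btrfs end-of-run ratio
print(round(hours_to_catch_up(1.0, 2.2), 2))   # ext4 end-of-run ratio, 1 h behind
```

Under this reading, 20 minutes applied in 5 hours is a ratio of about 0.067 (rounded down to 0.06 in the message above), a ratio of 0.8 adds roughly 12 minutes of backlog every hour, and at 2.2 a one-hour backlog clears in about 50 minutes.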

I should point out that during this process we also had to move some large
backup sets around, and several times we saw btrfs consume massive IO
without ever finishing a simple mv command.

I'm inclined to believe we've hit some weak corner, perhaps in combination
with SSDs, but it led us to compare with ext4 and ZFS, and to choose ZFS over
btrfs for this, as it solves our problem.



Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-19 Thread Fajar A. Nugraha
On Wed, Sep 19, 2012 at 2:28 PM, Casper Bang wrote:
>> Anand Jain writes:
>>   archive-log-apply script - if you could, can you share the
>>   script itself ? or provide more details about the script.
>>   (It will help to understand the work-load in question).
>
> Our setup entails a whole bunch of scripts, but the apply script looks like
> this (orion is the production environment, pandium is the shadow):
> http://pastebin.com/k4T7deap
>
> The script invokes rman passing rman_recover_database.rcs:

IIRC there were some patches post-3.0 that relate to sync. If Oracle
DB uses sync writes (or calls sync somewhere, which it should), it
might help to re-run the test with a more recent kernel. The kernel-ml
repository might help.

> Ext4 starts out with a realtime to SCN ratio of about 3.4 and ends down around a
> factor 2.2.
>
> ZFS starts out with a realtime to SCN ratio of about 7.5 and ends down around a
> factor 4.4.

So zfsonlinux is actually faster than ext4 for that purpose? Cool!

>
> Btrfs starts out with a realtime to SCN ratio of about 2.2 and ends down around
> a factor 0.8. This of course means we will never be able to catch up with
> production, as btrfs can't apply these as fast as they're created.
>
> It was even worse with btrfs on our 10xSSD server, where 20 min. of realtime
> work would end up taking some 5h to get applied (factor 0.06), obviously useless
> to us.

Just wondering, did you use "discard" option by any chance? In my
experience it makes btrfs MUCH slower.

-- 
Fajar


Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-19 Thread Casper Bang
> IIRC there were some patches post-3.0 which relates to sync. If oracle
> db uses sync writes (or call sync somewhere, which it should), it
> might help to re-run the test with more recent kernel. kernel-ml
> repository might help.

Yeah, there doesn't seem to be a shortage of patches coming into btrfs
(just looking around the mailing list), so that doesn't surprise me.
Indeed, reading about race conditions, deadlocks and locks being held too
long does not serve to promote btrfs as particularly production ready.

> > Ext4 starts out with a realtime to SCN ratio of about 3.4 and ends down around a
> > factor 2.2.
> >
> > ZFS starts out with a realtime to SCN ratio of about 7.5 and ends down around a
> > factor 4.4.
> 
> So zfsonlinux is actually faster than ext4 for that purpuse? coool !

Yes, rather amazingly fast. Again, it seems to us that ZFS is optimized for
writes while btrfs is optimized for reads.

> Just wondering, did you use "discard" option by any chance? In my
> experience it makes btrfs MUCH slower.

I actually don't remember when we added it (we started out without it),
but I don't recall seeing a major difference. We should disable it, however,
since the stupid fancy HP RAID controller refuses to pass on TRIM and SMART
commands anyway (and the proprietary HP SSD tools refuse to access
non-enterprise HP SSDs).
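Since it was unclear whether `discard` was actually in effect, one quick way to check on Linux is to read the options of the mounted filesystem from `/proc/mounts`. A minimal Python sketch; the mount point `/oradata` is hypothetical, and octal escapes such as `\040` (used for spaces in paths) are not decoded.

```python
import os


def mount_options(mountpoint: str, mounts_text: str) -> set[str]:
    """Return the mount options for `mountpoint` from /proc/mounts-formatted text.

    Each line is: device mountpoint fstype options dump pass.
    If the same mountpoint appears several times, the last (topmost) entry wins.
    """
    options: set[str] = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            options = set(fields[3].split(","))
    return options


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        opts = mount_options("/oradata", f.read())  # hypothetical mount point
    print("discard:", "discard" in opts, "ssd:", "ssd" in opts, "noatime:", "noatime" in opts)
```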



Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-19 Thread Chris Mason
On Mon, Sep 17, 2012 at 02:45:08AM -0600, Casper Bang wrote:
> Abstract
> For database testing purposes, a COW filesystem was needed in order to
> facilitate snapshotting and rollback, such as to provide mirrors of
> our production database at fixed intervals (every night and by
> demand).

Thanks for taking the time to write this up and follow through on the thread.
It's always interesting to hear about situations where btrfs doesn't work
well.

There are three basic problems with the database workloads on btrfs.
First is that we have higher latencies on writes because we are feeding
everything through helper threads for crcs.  Usually the extra latencies
don't show up because we have enough work in the pipeline to keep the
drive busy.

I don't believe the UEK kernels have the recent changes to do some of
the crc work inline (without handing off) for smaller synchronous IOs.

Second, on O_SYNC writes btrfs will write both the file metadata and
data into a special tree so we can be crash safe.  For big files this
tends to spend a lot of time looking for the extents in the file that
have changed.

Josef fixed that up and it is queued for the next merge window.

The third problem is that lots of random writes tend to make lots of
metadata.  If this doesn't fit in ram, we can end up doing many reads
that slow things down.  We're working on this now as well, but recent
kernels change how we cache things and should improve the results.
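To reproduce the kind of workload described above outside of Oracle, a small synchronous random-write test is one option. The Python sketch below is Unix-only (it uses `os.O_SYNC` and `os.pwrite`) and issues 8K overwrites at random offsets into a file that was filled once beforehand; the file size, write count and path are arbitrary choices, and this is not the benchmark anyone in the thread ran.

```python
import os
import random
import time

BLOCK_SIZE = 8192  # the 8K Oracle block size discussed in the thread


def prefill(path: str, n_blocks: int) -> None:
    """Write the whole file once so later writes overwrite existing extents."""
    zero = bytes(BLOCK_SIZE)
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            f.write(zero)
        f.flush()
        os.fsync(f.fileno())


def sync_random_write_rate(path: str, n_blocks: int, n_writes: int, seed: int = 0) -> float:
    """Return bytes/s for n_writes random 8K O_SYNC overwrites within the first n_blocks."""
    rng = random.Random(seed)
    payload = os.urandom(BLOCK_SIZE)
    fd = os.open(path, os.O_WRONLY | os.O_SYNC)
    try:
        start = time.perf_counter()
        for _ in range(n_writes):
            offset = rng.randrange(n_blocks) * BLOCK_SIZE
            os.pwrite(fd, payload, offset)  # with O_SYNC, returns after data is written through
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return n_writes * BLOCK_SIZE / elapsed


if __name__ == "__main__":
    target = "sync_test.dat"  # hypothetical path: put it on the filesystem under test
    try:
        prefill(target, 2_048)  # 16 MiB
        rate = sync_random_write_rate(target, 2_048, 500)
        print(f"{rate / 1024:.0f} KiB/s")
    finally:
        if os.path.exists(target):
            os.remove(target)
```

Results on tmpfs, or behind a RAID controller with a write-back cache, will say little about the underlying disks, and a test this small will not reach the metadata-exceeds-RAM situation described in the third point.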

-chris


Re: Experiences: Why BTRFS had to yield for ZFS

2012-09-19 Thread Casper Bang
Chris Mason writes:
> There are three basic problems with the database workloads on btrfs.
> First is that we have higher latencies on writes because we are feeding
> everything through helper threads for crcs.  Usually the extra latencies
> don't show up because we have enough work in the pipeline to keep the
> drive busy.
> 
> I don't believe the UEK kernels have the recent changes to do some of
> the crc work inline (without handing off) for smaller synchronous IOs.
> 
> Second, on O_SYNC writes btrfs will write both the file metadata and
> data into a special tree so we can be crash safe.  For big files this
> tends to spend a lot of time looking for the extents in the file that
> have changed.
> 
> Josef fixed that up and it is queued for the next merge window.
> 
> The third problem is that lots of random writes tend to make lots of
> metadata.  If this doesn't fit in ram, we can end up doing many reads
> that slow things down.  We're working on this now as well, but recent
> kernels change how we cache things and should improve the results.

That's good to hear. Personally I'd rather use btrfs than ZFS, but it seems we
were a tad early to the party with this kind of workload. Interestingly,
nobody commented on block size; I kind of expected that when writing my
initial post (the database uses 8KB blocks, tweakable in ZFS but apparently
not in btrfs).

/Casper



Re: Experiences: Why BTRFS had to yield for ZFS

2012-10-08 Thread Casper Bang
> Thanks for taking the time to write this up follow through the thread.
> It's always interesting to hear situations where btrfs doesn't work
> well.

I feel I should update my previous thread about performance issues using btrfs
in light of recent findings. We have discovered that, in all likelihood, what
we experienced and described was not a problem with btrfs per se, but the
result of a more general issue that btrfs was just really good at exposing
(by using threads more aggressively than zfs?!).

Various benchmarks in Java (thread-pool setup/shutdown) and C (pthreads
creation and joining) have shown that our Xeon/E5-2620 server with the latest
Oracle Unbreakable Linux is very slow at creating new threads (benchmarks
available upon request).

Java threading benchmark on Xeon/E5-2620 @ 2.0GHz:
Oracle Unbreakable Linux: 1m49s realtime, 3m17s sys-time
Ubuntu:                   5s realtime, 3.9s sys-time
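The Java and C benchmarks themselves were not posted. As a rough stand-in, this Python sketch times creating, starting and joining trivial threads, once serially and once as a batch (a crude pool setup/shutdown). CPython's `threading` module uses native pthreads on Linux, but interpreter overhead means only relative comparisons between machines or kernels are meaningful; the thread counts are arbitrary.

```python
import threading
import time


def thread_churn_seconds(n_threads: int) -> float:
    """Wall-clock seconds to create, start and join n_threads no-op threads serially."""
    start = time.perf_counter()
    for _ in range(n_threads):
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
    return time.perf_counter() - start


def thread_batch_seconds(n_threads: int) -> float:
    """Start n_threads no-op threads at once, then join them all."""
    start = time.perf_counter()
    threads = [threading.Thread(target=lambda: None) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"serial:  {thread_churn_seconds(10_000):.2f} s")
    print(f"batched: {thread_batch_seconds(1_000):.2f} s")
```

For scale, the Java figures above differ by a factor of about 22 in real time (1m49s is 109 s, versus 5 s), so a gap of that order between the two installations should be visible even with a crude test like this one.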

We are not sure how to continue investigating why the Oracle Linux kernel
performs so poorly (scheduler, kernel config, etc.?), but it seems pretty
obvious that this issue should be raised with Oracle rather than the btrfs
developers, though we'll probably look into using another OS entirely. As
such, apologies for creating the noise: btrfs was not to blame!

If you do have a suspicion or insight on the matter (perhaps you work for
Oracle, or know OUK?), of course we'd love to follow up off-list.

Kind regards,
Casper



Re: Experiences: Why BTRFS had to yield for ZFS

2012-10-08 Thread Avi Miller
Hi,

On 09/10/2012, at 1:38 AM, Casper Bang wrote:

> If you do have a suspicion or insight on the matter (perhaps work for Oracle,
> or know OUK?), of course we'd love a followup offline this list.


I've sent an email to Casper to follow this up offline.

Thanks,
Avi
