On Thu, Jul 23, 2009 at 05:04:49PM -0500, Steven Pratt wrote:
> Chris Mason wrote:
>> On Thu, Jul 23, 2009 at 01:35:21PM -0500, Steven Pratt wrote:
>>   
>>> I have re-run the raid tests, re-creating the fileset between each of
>>> the random write workloads, and performance now matches the previous
>>> newformat results.  The bad news is that the huge gain I had attributed
>>> to the newformat release does not really exist.  All of the previous
>>> results (except for the newformat run) were not re-creating the
>>> fileset, so the gain in performance was due only to having a fresh set
>>> of files, not to any code changes.
>>>     
>>
>> Thanks for doing all of these runs.  This is still a little different
>> from what I have here: my initial runs are very fast, and after 10 or
>> so they level out to relatively low performance on random writes.  With
>> nodatacow, it stays even.
>>
>>   
> Right, I do not see this problem with nodatacow.
>
>>> So, I have done two new sets of runs to look into this further.  One
>>> is a 3-hour run of single-threaded random writes to the RAID system,
>>> which I have compared to ext3.  Performance results are here:
>>> http://btrfs.boxacle.net/repository/raid/longwrite/longwrite/Longrandomwrite.html
>>>
>>> and graphing of all the iostat data can be found here:
>>>
>>> http://btrfs.boxacle.net/repository/raid/longwrite/summary.html
>>>
>>> The iostat graphs for btrfs are interesting for a number of reasons.
>>> First, it takes about 3000 seconds (or 50 minutes) for btrfs to reach
>>> steady state.  Second, if you look at write throughput from the device
>>> view vs. the btrfs/application view, we see that an application
>>> throughput of 21.5MB/sec requires 63MB/sec of actual disk writes.
>>> That is an overhead of roughly 3 to 1, vs an overhead of ~0 for ext3.
>>> Also, looking at the change in iops vs MB/sec, we see that while btrfs
>>> starts out with reasonably sized IOs, it quickly deteriorates to an
>>> average IO size of only 13KB.  Remember, the starting fileset is only
>>> 100GB on a 2.1TB filesystem, all writes are overwrites of existing
>>> file data, and the workload is single threaded, so there is no reason
>>> this should fragment.  It seems like the allocator is having a problem
>>> doing sequential allocations.
>>>     
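
As a side note, the 3-to-1 figure falls straight out of the iostat numbers
quoted above; here is a quick back-of-the-envelope check (the throughput and
request-size values are copied from the report, the variable names are just
placeholders):

    # Back-of-the-envelope check of the iostat figures quoted above.
    app_mb_s = 21.5    # random-write throughput seen by the benchmark
    disk_mb_s = 63.0   # aggregate write throughput reported by iostat
    avg_io_kb = 13.0   # average write request size reported by iostat

    write_amplification = disk_mb_s / app_mb_s      # ~2.9, i.e. roughly 3:1
    implied_write_iops = disk_mb_s * 1024 / avg_io_kb

    print(f"write amplification: {write_amplification:.1f}x")
    print(f"implied write iops:  {implied_write_iops:.0f}")
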
>>
>> There are two things happening.  First, the default allocation scheme
>> isn't very well suited to this; mount -o ssd will perform better.  But
>> over the long term, random overwrites to the file cause a lot of writes
>> to the extent allocation tree.  That's really what -o nodatacow is
>> saving us from.  There are optimizations we can do, but we're holding
>> off on those in favor of enospc and other pressing things.
>>   
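
To make the extent-tree point concrete, here is a deliberately oversimplified
toy model (not btrfs code; the per-extent metadata cost is a made-up constant,
and real extent tree updates are batched) of how CoW turns pure data
overwrites into extra device-level writes, while nodatacow does not:

    # Deliberately oversimplified toy model -- not btrfs internals.
    BLOCK_KB = 4      # size of one application write (assumed)
    METADATA_KB = 4   # metadata dirtied per relocated extent (assumed)

    def device_write_kb(app_writes, cow):
        # With CoW the block is relocated and allocation metadata is dirtied;
        # with nodatacow only the data itself is rewritten.
        data_kb = app_writes * BLOCK_KB
        metadata_kb = app_writes * METADATA_KB if cow else 0
        return data_kb + metadata_kb

    app_writes = 100_000
    for cow in (True, False):
        total = device_write_kb(app_writes, cow)
        amp = total / (app_writes * BLOCK_KB)
        label = "cow" if cow else "nodatacow"
        print(f"{label}: {total} KB at the device ({amp:.1f}x the data written)")
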
> Well, I have -o ssd data that I can upload, but it was worse than
> without.  I do understand about timing and priorities.
>
>> But, with all of that said, Josef has some really important allocator
>> improvements.  I've put them out along with our pending patches into the
>> experimental branch of the btrfs-unstable tree.  Could you please give
>> this branch a try both with and without the ssd mount option?
>>
>>   
> Sure, will try to get to it tomorrow.
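
In case it helps with tomorrow's runs, something like the following is
roughly how the two cases can be driven back to back; a minimal sketch,
assuming a scratch device at /dev/sdb1 and a mount point at /mnt/btrfs (both
placeholders, as is the benchmark command -- this is not the actual
boxacle.net harness):

    #!/usr/bin/env python3
    # Sketch of running the same workload with and without -o ssd.
    import subprocess

    DEVICE = "/dev/sdb1"        # placeholder block device
    MOUNTPOINT = "/mnt/btrfs"   # placeholder mount point
    BENCHMARK = ["echo", "run the random write workload here"]  # placeholder

    def run_case(options):
        # Fresh filesystem for every case, so each run starts from a new fileset.
        subprocess.run(["mkfs.btrfs", DEVICE], check=True)
        mount_cmd = ["mount", "-t", "btrfs"]
        if options:
            mount_cmd += ["-o", options]
        mount_cmd += [DEVICE, MOUNTPOINT]
        subprocess.run(mount_cmd, check=True)
        try:
            subprocess.run(BENCHMARK, check=True)
        finally:
            subprocess.run(["umount", MOUNTPOINT], check=True)

    for options in ("", "ssd"):   # default allocator vs -o ssd
        run_case(options)
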

Sorry, I missed a fix in the experimental branch.  I'll push out a
rebased version in a few minutes.

-chris
