On Tue, Mar 25, 2014 at 12:13:50PM +0000, Martin wrote:
> On 25/03/14 01:49, Marc MERLIN wrote:
> > I had a tree with several hundred thousand files (less than 1 million)
> > on top of md raid5.
> > 
> > It took 18H to rm it in 3 tries:

I ran another test after typing the original Email:
gargamel:/mnt/dshelf2/backup/polgara# time du -sh 20140312-feisty/; time find 20140312-feisty/ | wc -l
17G     20140312-feisty/
real    245m19.491s
user    0m2.108s
sys     1m0.508s

728507 <- number of files
real    11m41.853s <- 11min to re-stat them when they should all be in cache ideally
user    0m1.040s
sys     0m4.360s

4 hours to stat 700K files. That's bad...
Even 11 minutes to re-stat them just to count them looks bad too.

> > I checked that btrfs scrub is not running.
> > What else can I check from here?
> 
> "noatime" set?

I have relatime
gargamel:/mnt/dshelf2/backup/polgara# df .
Filesystem           1K-blocks       Used  Available Use% Mounted on
/dev/mapper/dshelf2 7814041600 3026472436 4760588292  39% /mnt/dshelf2/backup

gargamel:/mnt/dshelf2/backup/polgara# grep /mnt/dshelf2/backup /proc/mounts
/dev/mapper/dshelf2 /mnt/dshelf2/backup btrfs rw,relatime,compress=lzo,space_cache 0 0
 
> What's your cpu hardware wait time?
 
Sorry, not sure how to get that.
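
If you just mean I/O wait, I think something like this shows it (guessing
here; the "wa" column is the percentage of CPU time spent waiting on I/O):

gargamel:~# vmstat 1 5

and the iostat output further down has an %iowait column as well.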
 
> And is not *the 512kByte raid chunk* going to give you horrendous write
> amplification?! For example, rm updates a few bytes in one 4kByte
> metadata block and the system has to then do a read-modify-write on
> 512kBytes...

That's probably not great, but
1) rm -rf should batch a lot of writes together before they hit the block
layer, so with the caching layer in between I'm not sure that is too much
of a problem (some quick math on this below, after the iostat output)

2) this does not explain 4 hours to just run du with relatime, which
shouldn't generate any writes, correct?
iostat seems to confirm:

gargamel:~# iostat /dev/md8 1 20
Linux 3.14.0-rc5-amd64-i915-preempt-20140216c (gargamel.svh.merlins.org)        03/25/2014      _x86_64_        (4 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle  
          75.19    0.00   10.13    8.61    0.00    6.08
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
md8              98.00       392.00         0.00        392          0
md8              96.00       384.00         0.00        384          0
md8              83.00       332.00         0.00        332          0
md8             153.00       612.00         0.00        612          0
md8              82.00       328.00         0.00        328          0
md8              55.00       220.00         0.00        220          0
md8              69.00       276.00         0.00        276          0
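
If I'm doing the math right on the chunk size point above: a 4kB metadata
update inside a 512kB chunk means up to 512kB gets read-modify-written
(plus parity on raid5), so roughly 128x write amplification in the worst
case. Easy to double-check what the array actually uses (going by what
mdadm reports):

gargamel:~# mdadm --detail /dev/md8 | grep -i chunk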

> Also, the 64MByte-chunk write-intent bitmap will add a lot of head seeks
> to anything you do on that raid. (The bitmap would be better on a separate
> SSD or other separate drive.)

That's true for writing, but not reading, right?
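
If it does turn out to matter, I believe the write-intent bitmap can be
moved off the array without rebuilding it, something like this (untested
here, and the bitmap file path is just an example on some other drive):

gargamel:~# mdadm --grow --bitmap=none /dev/md8
gargamel:~# mdadm --grow --bitmap=/some/other/drive/md8-bitmap /dev/md8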
 
> So... That sort of setup is fine for archived data that is effectively
> read-only. You'll see poor performance for small writes/changes.

So I agree with you that the write case can be improved, especially since
I also have a layer of dmcrypt in the middle:
gargamel:/mnt/dshelf2/backup/polgara# cryptsetup luksDump /dev/md8
LUKS header information for /dev/md8
Cipher name:    aes
Cipher mode:    xts-plain64
Hash spec:      sha1
Payload offset: 8192

(I used cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64)
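
If I have the units right, --align-payload is in 512-byte sectors, so 8192
puts the LUKS payload at 4MiB, which is a multiple of the 512kB raid chunk,
so the crypt layer itself shouldn't be adding any misalignment.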

I'm still not convinced that a lot of file I/O doesn't get coalesced in
memory before hitting the disk in bigger blocks, but maybe not.

If I were to recreate this array entirely, what would you use for the raid
creation and cryptsetup?

More generally, before I go through all that trouble (it will likely
take 1 week of data copying back and forth), I'd like to debug why my reads are
so slow first.

Thanks,
Marc

On Tue, Mar 25, 2014 at 02:57:57PM +0100, Xavier Nicollet wrote:
> On 25 March 2014 at 12:13, Martin wrote:
> > On 25/03/14 01:49, Marc MERLIN wrote:
> > > It took 18H to rm it in 3 tries:
> 
> > And is not *the 512kByte raid chunk* going to give you horrendous write
> > amplification?! For example, rm updates a few bytes in one 4kByte
> > metadata block and the system has to then do a read-modify-write on
> > 512kBytes...
> 
> My question may be naive, but would it be possible to have a syscall or
> something to do a fast "rm -rf" or du?

Well, that wouldn't hurt either, even if it wouldn't address my underlying 
problem.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901
