Re: [Openstack-operators] [swift] data distribution imbalance - what are the .data.XXXXX files?

2016-07-20 Thread Blair Bethwaite
Romain,

Thanks a lot for that pointer; it was surprisingly difficult to find
anything about this via the usual web searches. After a little scripting
and a bit of coordinated effort we're looking much healthier.

We noticed the worst-offending directories had multiple "nested" rsync
temp files left behind, i.e., the replicator on a peer host was trying
to copy existing rsync partial files (e.g.
./objects/27913/91c/1b4263d0e93452b461572bbf57f9591c/.1467136978.92866.data.1RdQku.bCyhfi.6JlGol),
thus quickly compounding the problem!
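
Something along these lines will surface those nested partials (a rough
sketch rather than exactly what we ran; the /srv/node/*/objects layout and
the 6-character rsync suffix are assumptions based on the filenames above):

# hidden rsync partials look like .<timestamp>.data.<6 random chars>;
# "nested" ones carry two or more such suffixes, so require an extra
# dot-separated 6-character group after ".data."
find /srv/node/*/objects -type f -name '.*.data.*.??????' \
    -printf '%s %p\n' | sort -rn | head -20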

Cheers,

On 20 July 2016 at 20:25, Romain LE DISEZ  wrote:
> Hi Blair,
>
> They are temporary files left behind by rsync when the replicator tried to
> replicate a partition and failed for some reason.
>
> You can safely delete them as long as their mtime is a bit old (do not delete
> a file that is currently being replicated). Since 2.7, Swift takes care of that:
> https://github.com/openstack/swift/blob/master/CHANGELOG#L226
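>
> A minimal example of such a cleanup, assuming the naming seen in the
> listing below and a one-day mtime cutoff chosen purely for illustration:
>
>   # list first, then delete hidden rsync partials untouched for over a day
>   find /srv/node/*/objects -type f -name '.*.data.*' -mtime +1 -print
>   find /srv/node/*/objects -type f -name '.*.data.*' -mtime +1 -delete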
>
>
>
> On Wednesday 20 July 2016, 10:17 CEST, Blair Bethwaite wrote:
>
>> Hi all,
>>
>> As per the subject, I'm wondering where these files come from, e.g.:
>>
>> root@stor010:/srv/node/sdc1/objects# ls -la ./109794/359/6b389b24749b7046344ffd2a42aab359
>> total 1195784
>> drwxr-xr-x 2 swift swift      4096 Jun  8 04:11 .
>> drwxr-xr-x 3 swift swift        53 May 22 05:05 ..
>> -rw------- 1 swift swift     20480 Jun  8 04:11 1463857426.65100.data
>> -rw------- 1 swift swift   6225920 Jun  3 00:42 .1463857426.65100.data.aCtGLk
>> -rw------- 1 swift swift 197754880 Jun  2 11:49 .1463857426.65100.data.AMQhPo
>> -rw------- 1 swift swift  33980416 Jun  3 00:41 .1463857426.65100.data.CkpDSv
>> -rw------- 1 swift swift   7634944 Jun  3 04:02 .1463857426.65100.data.CkpDSv.CtrQws
>> -rw------- 1 swift swift 189399040 Jun  1 18:42 .1463857426.65100.data.CRFb2k
>> -rw------- 1 swift swift  47644672 Jun  2 11:51 .1463857426.65100.data.dKsUZI
>> -rw------- 1 swift swift 157122560 Jun  3 13:57 .1463857426.65100.data.GpmbOK
>> -rw------- 1 swift swift 174489600 Jun  2 11:50 .1463857426.65100.data.MAoI3y
>> -rw------- 1 swift swift 174358528 Jun  3 00:42 .1463857426.65100.data.Pbsk7S
>> -rw------- 1 swift swift  31064064 Jun  1 18:42 .1463857426.65100.data.xlmmie
>>
>> We have a geo-replicated cluster that is currently suffering from
>> major outliers in disk usage, i.e.:
>>
>> [2016-07-20 18:08:33] Checking disk usage now
>> Distribution Graph:
>>   0%    2
>>  32%    1
>>  34%   19
>>  35%   49
>>  36%  127
>>  37%  111
>>  38%   40
>>  39%   12
>>  40%    6
>>  41%    6
>>  42%    2
>>  43%    4
>>  44%    3
>>  45%    1
>>  46%    3
>>  47%    4
>>  48%    3
>>  50%    1
>>  51%    2
>>  52%    1
>>  53%    1
>>  54%    1
>>  56%    3
>>  58%    1
>>  62%    2
>>  63%    1
>>  71%    1
>>  73%    1
>>  75%    1
>>  76%    1
>>  78%    2
>>  88%    1
>>  92%    2
>>  95%    1
>>  96%    1
>> 100%    3
>> Disk usage: space used: 395001580875776 of 995614295568384
>> Disk usage: space free: 600612714692608 of 995614295568384
>> Disk usage: lowest: 0.0%, highest: 100.0%, avg: 39.6741572147%
>>
>> It looks like this is attributable to a handful of object directories
>> with lots of .data.XXXXX files in them, whereas >99% of object dirs
>> just have a single .data file. For example, this is from one of the
>> disks at ~60% full:
>>
>> root@stor010:/srv/node/sdc1/objects# find . -mindepth 4 -type f -printf "%h\n" | sort | uniq -c | sort -rnk 1 | head -20
>>
>> 733 ./151107/3b5/9390f9c2ceee07f059a0d1f651e423b5
>>  11 ./109794/359/6b389b24749b7046344ffd2a42aab359
>>   9 ./248385/60c/f2907cb0b290def6f614bf46a715a60c
>>   5 ./222791/888/d991c8db1e2f1e724c1a4f52914f7888
>>   4 ./257772/140/fbbb1c017a841e6e821ed707025fe140
>>   4 ./231068/ca6/e1a706f50dd99f97fafeba6bd1f47ca6
>>   4 ./215734/80c/d2adbf087b09ca24cc546497d265180c
>>   3 ./248166/8b9/f259b3b0113b522ec5ef5753588438b9
>>   3 ./221060/383/d7e101fd86eeef65f8c89f3f99ce4383
>>   2 ./38609/203/25b46d47af8ee700e87ceab33748b203
>>   2 ./27961/a78/1b4e5665c16e1da8a70cc093a2bbba78
>>   2 ./158466/d43/9ac0b8bd4e592fe6f3360a731bac7d43
>>   2 ./141275/588/89f6efb554e964aeebffa9dad9e17588
>>   1 ./99980/fad/61a3214454426f7b30fc62773eb3bfad
>>   1 ./99980/fa4/61a311d12248a2e4ca8d2f61a3adafa4
>>   1 ./99980/ed9/61a32b429174f6099b0b35b0103e4ed9
>>   1 ./99980/e76/61a3129de66d7541ce127f07a7737e76
>>   1 ./99980/e3d/61a329332713fdacedb10834901f6e3d
>>   1 ./99980/e1f/61a306b87307636cc569bd0bc047ee1f
>>   1 ./99980/db8/61a30a00a9b7edd3318be25bbed47db8
>>
>> root@stor010:/srv/node/sdc1/objects# du -sh ./151107/3b5/9390f9c2ceee07f059a0d1f651e423b5
>> 957G    ./151107/3b5/9390f9c2ceee07f059a0d1f651e423b5
>>
>> We're on version 2.5.0.7.
