[ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Götz Reinicke - IT Koordinator
Hi folks,

we plan to use SSD OSDs in our first cluster layout instead of SAS
OSDs (more IO is needed than space).

Short question: what would influence the performance more, more cores or
more GHz per core?

Or is it as always: depends on the total of OSDs/nodes/replication level/etc. ... :)

If needed, I can give some more detailed information on the layout.

Thanks for the feedback. Götz
-- 
Götz Reinicke
IT-Koordinator

Tel. +49 7141 969 82420
E-Mail goetz.reini...@filmakademie.de

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
www.filmakademie.de

Eintragung Amtsgericht Stuttgart HRB 205016

Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
Staatssekretär im Ministerium für Wissenschaft,
Forschung und Kunst Baden-Württemberg

Geschäftsführer: Prof. Thomas Schadt



smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how to use the setomapval to change rbd size info?

2016-01-20 Thread 张鹏
I want to change the omap value that stores an RBD image's size, so I did the following:

1. Create an RBD image named zp3 with a size of 10G
[root@lab8106 rbdre]# rbd create zp3 --size 10G

2. Look at the RBD information
[root@lab8106 rbdre]# rbd info zp3
rbd image 'zp3':
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.39652e242dd4
format: 2
features: layering
flags:

3. Query the size omap value
[root@lab8106 rbdre]# rados -p rbd getomapval rbd_header.39652e242dd4 size
value (8 bytes) :
00000000 : 00 00 00 80 02 00 00 00 : ........

As you can see, the value of size is 00 00 00 80 02 00 00 00, shown as a hex dump.

4. Set the size to a random value, 111111111 (I don't know how to choose the
value to set; that is exactly my problem)
[root@lab8106 rbdre]# rados -p rbd setomapval rbd_header.39652e242dd4 size 111111111

5. Query the size omap value again
[root@lab8106 rbdre]#  rados -p rbd getomapval rbd_header.39652e242dd4 size
value (9 bytes) :
00000000 : 31 31 31 31 31 31 31 31 31  : 111111111

6. Check the RBD size info again
[root@lab8106 rbdre]# rbd info zp3
rbd image 'zp3':
size 3148 PB in 845114819781 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.39652e242dd4
format: 2
features: layering
flags:

=
So my question is: how can I set the rbd size omap value back to
00000000 : 00 00 00 80 02 00 00 00
using: rados -p rbd setomapval rbd_header.39652e242dd4 size (value)

How do I write that value?


Thank you for your help.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Christian Balzer

Hello,

On Wed, 20 Jan 2016 10:01:19 +0100 Götz Reinicke - IT Koordinator wrote:

> Hi folks,
> 
> we plan to use more ssd OSDs in our first cluster layout instead of SAS
> osds. (more IO is needed than space)
> 
> short question: What would influence the performance more? more Cores or
> more GHz/Core.
> 
> Or is it as always: Depeds on the total of
> OSDs/nodes/repl-level/etc ... :)
>

While there certainly is a "depends" in there, my feeling is that faster
cores are more helpful than many slower ones.
And this is how I spec'ed my first SSD nodes: 1 fast core (Intel, thus 2
pseudo-cores) per OSD.
The reasoning is simple: an individual OSD thread will (hopefully) run on
one core and thus be faster, with less latency(!).

> If needed, I can give some more detailed information on the layout.
> 
Might be interesting for other sanity checks, if you don't mind.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bucket type and crush map

2016-01-20 Thread Ivan Grcic
Hi Pedro,

you have to take your pool size into account, which is probably 3.
That way you get 840 * 3 / 6 = 420 (PGs * pool size / number of OSDs).

Please read: 
http://docs.ceph.com/docs/master/rados/operations/placement-groups/#choosing-the-number-of-placement-groups
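
A quick way to sanity-check this on a live cluster (a sketch, not from the
thread): read pg_num and size for each pool, then apply the same formula.

ceph osd pool get <pool> pg_num    # placement groups in the pool
ceph osd pool get <pool> size      # replica count of the pool
# sum pg_num * size over all pools and divide by the number of OSDs, e.g.:
echo $(( 840 * 3 / 6 ))            # -> 420, exactly what the warning reports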

Regards,
Ivan

On Mon, Jan 18, 2016 at 9:18 PM, Pedro Benites  wrote:
> Hello,
>
> I have configured osd_crush_chooseleaf_type = 3 (rack), and I have 6 osd in
> three hosts and three racks, my tree y this:
>
> datacenter datacenter1
> -7  5.45999 rack rack1
> -2  5.45999 host storage1
>  0  2.73000 osd.0up 1.0  1.0
>  3  2.73000 osd.3up 1.0  1.0
> -8  5.45999 rack rack2
> -3  5.45999 host storage2
>  1  2.73000 osd.1up 1.0  1.0
>  4  2.73000 osd.4up 1.0  1.0
> -6  5.45999 datacenter datacenter2
> -9  5.45999 rack rack3
> -4  5.45999 host storage3
>  2  2.73000 osd.2up 1.0  1.0
>  5  2.73000 osd.5up 1.0  1.0
>
>
> But when I created my fourth pool I got the message "too many PGs per OSD
> (420 > max 300)"
> I dont understand that message because I have 840 PG and 6 OSD  or 140
> PGs/OSD,
> Why I got 420 in the warm?
>
>
> Regards,
> Pedro.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to use the setomapval to change rbd size info?

2016-01-20 Thread Ilya Dryomov
On Wed, Jan 20, 2016 at 10:48 AM, 张鹏  wrote:
> i want change the omapval of a rbd size  so i do some thing like :
>
> 1、create a rbd name zp3 with size 10G
> [root@lab8106 rbdre]# rbd create zp3 --size 10G
>
> 2、see rbd information
> [root@lab8106 rbdre]# rbd info zp3
> rbd image 'zp3':
> size 10240 MB in 2560 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.39652e242dd4
> format: 2
> features: layering
> flags:
>
> 3、inquire the rbd size omapval
> [root@lab8106 rbdre]# rados -p rbd getomapval rbd_header.39652e242dd4 size
> value (8 bytes) :
>  : 00 00 00 80 02 00 00 00 : 
>
> as i see  the value of size is  00 00 00 80 02 00 00 00   ;a hex dump valume
>
> 4、set rbd size with a radom value 1 (i dont know how to choose value
> do set it  that is my problem)
> [root@lab8106 rbdre]# rados -p rbd setomapval rbd_header.39652e242dd4 size
> 1
>
> 5、inquire the rbd size omapval again
> [root@lab8106 rbdre]#  rados -p rbd getomapval rbd_header.39652e242dd4 size
> value (9 bytes) :
>  : 31 31 31 31 31 31 31 31 31  : 1
>
> 6、inquire the rbd  size info  again
> [root@lab8106 rbdre]# rbd info zp3
> rbd image 'zp3':
> size 3148 PB in 845114819781 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.39652e242dd4
> format: 2
> features: layering
> flags:
>
> =
> so my question is  how can i set the rbd size  omapval to be:
>  : 00 00 00 80 02 00 00 00
>  rados -p rbd setomapval rbd_header.39652e242dd4 size  (value)
>
> the value  how to write it?

$ echo -en \\x00\\x00\\x00\\x80\\x02\\x00\\x00\\x00 | rados -p rbd
setomapval rbd_header.39652e242dd4 size
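
For an arbitrary size, those 8 bytes are just the image size in bytes as a
64-bit little-endian integer (the 10 GiB image above is 0x0000000280000000).
A generic way to build the escape sequence is sketched below; this is not from
the thread, just a bash example of the same idea:

size_bytes=$((10 * 1024 * 1024 * 1024))   # 10 GiB, matching the original image
hex=$(printf '%016x' "$size_bytes")
le=''
for i in 14 12 10 8 6 4 2 0; do           # emit the bytes in little-endian order
    le="$le\\x${hex:$i:2}"
done
echo -en "$le" | rados -p rbd setomapval rbd_header.39652e242dd4 size
rbd info zp3                              # should report 10240 MB again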

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Tomasz Kuzemko
Hi,
my team did some benchmarks in the past to answer this question. I don't
have the results at hand, but the conclusion was that it depends on how many
disks/OSDs you have in a single host: above 9 there was more benefit
from more cores than from GHz (6-core 3.5GHz vs 10-core 2.4GHz, AFAIR).

--
Tomasz Kuzemko
tomasz.kuze...@corp.ovh.com

On 20.01.2016 10:01, Götz Reinicke - IT Koordinator wrote:
> Hi folks,
> 
> we plan to use more ssd OSDs in our first cluster layout instead of SAS
> osds. (more IO is needed than space)
> 
> short question: What would influence the performance more? more Cores or
> more GHz/Core.
> 
> Or is it as always: Depeds on the total of OSDs/nodes/repl-level/etc ... :)
> 
> If needed, I can give some more detailed information on the layout.
> 
>   Thansk for feedback . Götz
> -- 
> Götz Reinicke
> IT-Koordinator
> 
> Tel. +49 7141 969 82420
> E-Mail goetz.reini...@filmakademie.de
> 
> Filmakademie Baden-Württemberg GmbH
> Akademiehof 10
> 71638 Ludwigsburg
> www.filmakademie.de
> 
> Eintragung Amtsgericht Stuttgart HRB 205016
> 
> Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
> Staatssekretär im Ministerium für Wissenschaft,
> Forschung und Kunst Baden-Württemberg
> 
> Geschäftsführer: Prof. Thomas Schadt
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Christian Balzer
> Sent: 20 January 2016 10:31
> To: ceph-us...@ceph.com
> Subject: Re: [ceph-users] SSD OSDs - more Cores or more GHz
> 
> 
> Hello,
> 
> On Wed, 20 Jan 2016 10:01:19 +0100 Götz Reinicke - IT Koordinator wrote:
> 
> > Hi folks,
> >
> > we plan to use more ssd OSDs in our first cluster layout instead of
> > SAS osds. (more IO is needed than space)
> >
> > short question: What would influence the performance more? more Cores
> > or more GHz/Core.
> >
> > Or is it as always: Depeds on the total of OSDs/nodes/repl-level/etc
> > ... :)
> >
> 
> While there certainly is a "depends" in there, my feeling is that faster cores
> are more helpful than many, slower ones

I would say it depends on whether your objective is to get as much IO out of the 
SSD at high queue depths, or whether you need very low latency at low queue depths.

For the former, more cores are better, as you can spread the requests over all 
the cores. The latter needs very fast clock speeds. Maybe something like a Xeon 
E3 4x 3.6GHz with one or two SSDs per node.

Of course there are chips with lots of cores and reasonably fast clock speeds, 
but expect to pay a lot for them.

> And this is how I spec'ed my first SSD nodes, 1 fast core (Intel, thus 2
> pseudo-cores) per OSD.
> The reasoning is simple, an individual OSD thread will run (hopefully) on one
> core and thus be faster, with less latency(!).
> 
> > If needed, I can give some more detailed information on the layout.
> >
> Might be interesting for other sanity checks, if you don't mind.
> 
> Regards,
> 
> Christian
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CentOS 7 iscsi gateway using lrbd

2016-01-20 Thread Nick Fisk
Thanks for your input Mike, a couple of questions if I may

1. Are you saying that this rbd backing store is not in mainline and is only in 
SUSE kernels? I.e. can I use this lrbd on Debian/Ubuntu/CentOS?
2. Does this have any positive effect on the abort/reset death loop a number of 
us were seeing when using LIO+krbd and ESXi?
3. Can you still use something like bcache over the krbd?
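
For reference, the stock upstream path Mike describes below (LIO's normal
block backstore on top of a krbd mapping, instead of the out-of-tree
target_core_rbd module) looks roughly like this. It is only a sketch with
made-up names, assuming targetcli-fb syntax where the old iblock backstore
shows up as "block":

rbd map rbd/myimage                                    # exposes /dev/rbd0
targetcli /backstores/block create rbd0 /dev/rbd0      # plain block backstore on the krbd device
targetcli /iscsi create iqn.2016-01.com.example:rbd0   # hypothetical target IQN
targetcli /iscsi/iqn.2016-01.com.example:rbd0/tpg1/luns create /backstores/block/rbd0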



> -Original Message-
> From: Mike Christie [mailto:mchri...@redhat.com]
> Sent: 19 January 2016 21:34
> To: Василий Ангапов ; Ilya Dryomov
> 
> Cc: Nick Fisk ; Tyler Bishop
> ; Dominik Zalewski
> ; ceph-users 
> Subject: Re: [ceph-users] CentOS 7 iscsi gateway using lrbd
> 
> Everyone is right - sort of :)
> 
> It is that target_core_rbd module that I made that was rejected upstream,
> along with modifications from SUSE which added persistent reservations
> support. I also made some modifications to rbd so target_core_rbd and krbd
> could share code. target_core_rbd uses rbd like a lib. And it is also
> modifications to the targetcli related tool and libs, so you can use them to
> control the new rbd backend. SUSE's lrbd then handles setup/management
> of across multiple targets/gatways.
> 
> I was going to modify targetcli more and have the user just pass in the rbd
> info there, but did not get finished. That is why in that suse stuff you still
> make the krbd device like normal. You then pass that to the target_core_rbd
> module with targetcli and that is how that module knows about the rbd
> device.
> 
> The target_core_rbd module was rejected upstream, so I stopped
> development and am working on the approach suggested by those
> reviewers which instead of going from lio->target_core_rbd->krbd goes
> lio->target_core_iblock->linux block layer->krbd. With this approach you
> just use the normal old iblock driver and krbd and then I am modifying them
> to just work and do the right thing.
> 
> 
> On 01/19/2016 05:45 AM, Василий Ангапов wrote:
> > So is it a different approach that was used here by Mike Christie:
> > http://www.spinics.net/lists/target-devel/msg10330.html ?
> > It seems to be a confusion because it also implements target_core_rbd
> > module. Or not?
> >
> > 2016-01-19 18:01 GMT+08:00 Ilya Dryomov :
> >> On Tue, Jan 19, 2016 at 10:34 AM, Nick Fisk  wrote:
> >>> But interestingly enough, if you look down to where they run the
> targetcli ls, it shows a RBD backing store.
> >>>
> >>> Maybe it's using the krbd driver to actually do the Ceph side of the
> communication, but lio plugs into this rather than just talking to a dumb 
> block
> device???
> >>
> >> It does use krbd driver.
> >>
> >> Thanks,
> >>
> >> Ilya


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Oliver Dzombic
Hi,

Cores > Frequency

If you think about recovery/scrubbing tasks, it's better when a CPU core
can be assigned to do this, compared to a situation where the same CPU core
needs to recover/scrub and still deliver the productive content at the same
time.

The more you can create a situation where an OSD has its "own" CPU core,
the better. Modern CPUs are so fast anyway that even SSDs can't push the
CPUs to their limit.

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 20.01.2016 um 10:01 schrieb Götz Reinicke - IT Koordinator:
> Hi folks,
> 
> we plan to use more ssd OSDs in our first cluster layout instead of SAS
> osds. (more IO is needed than space)
> 
> short question: What would influence the performance more? more Cores or
> more GHz/Core.
> 
> Or is it as always: Depeds on the total of OSDs/nodes/repl-level/etc ... :)
> 
> If needed, I can give some more detailed information on the layout.
> 
>   Thansk for feedback . Götz
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Jan Schermer
The OSD is able to use more than one core to do the work, so increasing the 
number of cores will increase throughput.
However, if you care about latency, then that is always tied to speed = frequency.

If the question was "should I get 40GHz in 8 cores or in 16 cores", then the 
answer will always be "in 8 cores".
However, higher-frequency CPUs are much pricier than lower-clocked ones with more 
cores, so you will get higher "throughput" for less money if you scale the cores 
and not the frequency.

If you need to run more OSDs on one host than you have cores, this gets a 
bit tricky because of NUMA and the Linux scheduler, which you should tune. If the 
number of OSDs is small enough, I would always prefer the faster (frequency) 
CPU over a slower one.

Jan


> On 20 Jan 2016, at 13:01, Tomasz Kuzemko  wrote:
> 
> Hi,
> my team did some benchmarks in the past to answer this question. I don't
> have results at hand, but conclusion was that it depends on how many
> disks/OSDs you have in a single host: above 9 there was more benefit
> from more cores than GHz (6-core 3.5GHz vs 10-core 2.4GHz AFAIR).
> 
> --
> Tomasz Kuzemko
> tomasz.kuze...@corp.ovh.com
> 
> On 20.01.2016 10:01, Götz Reinicke - IT Koordinator wrote:
>> Hi folks,
>> 
>> we plan to use more ssd OSDs in our first cluster layout instead of SAS
>> osds. (more IO is needed than space)
>> 
>> short question: What would influence the performance more? more Cores or
>> more GHz/Core.
>> 
>> Or is it as always: Depeds on the total of OSDs/nodes/repl-level/etc ... :)
>> 
>> If needed, I can give some more detailed information on the layout.
>> 
>>  Thansk for feedback . Götz
>> -- 
>> Götz Reinicke
>> IT-Koordinator
>> 
>> Tel. +49 7141 969 82420
>> E-Mail goetz.reini...@filmakademie.de
>> 
>> Filmakademie Baden-Württemberg GmbH
>> Akademiehof 10
>> 71638 Ludwigsburg
>> www.filmakademie.de
>> 
>> Eintragung Amtsgericht Stuttgart HRB 205016
>> 
>> Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
>> Staatssekretär im Ministerium für Wissenschaft,
>> Forschung und Kunst Baden-Württemberg
>> 
>> Geschäftsführer: Prof. Thomas Schadt
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Jan Schermer
This is very true, but do you actually exclusively pin the cores to the OSD 
daemons so they don't interfere?
I don't think many people do that; it wouldn't work with more than a handful of 
OSDs.
The OSD might typically only need <100% of one core, but during startup or some 
reshuffling it's beneficial
to allow it to get more (>400%), and that will interfere with whatever else was 
pinned there...
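
For anyone who does want to experiment with pinning, a rough sketch using
standard Linux tools; the pgrep pattern and core numbers are hypothetical and
depend on your init system and NUMA layout:

numactl --hardware                        # inspect sockets/NUMA nodes first
pid=$(pgrep -f 'ceph-osd.* -i 0')         # assumes this matches only osd.0's command line
taskset -cp 0,1 "$pid"                    # restrict osd.0 to cores 0 and 1

cgroups/cpusets (or numactl --cpunodebind at daemon start) achieve the same
thing more permanently.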

Jan

> On 20 Jan 2016, at 13:07, Oliver Dzombic  wrote:
> 
> Hi,
> 
> Cores > Frequency
> 
> If you think about recovery / scrubbing tasks its better when a cpu core
> can be assigned to do this.
> 
> Compared to a situation where the same cpu core needs to recovery/scrub
> and still deliver the productive content at the same time.
> 
> The more you can create a situation where an osd has its "own" cpu core,
> the better it is. Modern CPU's are anyway so fast, that even SSDs cant
> run the CPU's to their limit.
> 
> -- 
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 20.01.2016 um 10:01 schrieb Götz Reinicke - IT Koordinator:
>> Hi folks,
>> 
>> we plan to use more ssd OSDs in our first cluster layout instead of SAS
>> osds. (more IO is needed than space)
>> 
>> short question: What would influence the performance more? more Cores or
>> more GHz/Core.
>> 
>> Or is it as always: Depeds on the total of OSDs/nodes/repl-level/etc ... :)
>> 
>> If needed, I can give some more detailed information on the layout.
>> 
>>  Thansk for feedback . Götz
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] S3 upload to RadosGW slows after few chunks

2016-01-20 Thread Rishiraj Rana
Hey guys, I am having an issue with S3 uploads to Ceph wherein the upload seems 
to crawl after the first few chunks of a multipart upload. The test file is 38M 
in size, and the upload was tried with the s3cmd default chunk size of 15M, then 
again with the chunk size set to 5M, and then again with multipart disabled. 
Each time the upload halted at a seemingly random point and I cannot figure out 
why.

[10.0.144.2] $ /opt/utilities/s3cmd/s3cmd put ACCOUNT\ 
EXTRACT-GENERATED.PDF.test.2 's3://'

 -> s3:// /  [part 1 of 3, 15MB]

 15728640 of 15728640   100% in2s 7.43 MB/s  done

 -> s3:// /   [part 2 of 3, 15MB]

 4096 of 15728640 0% in0s31.68 kB/s

ERROR:  Upload of 'ACCOUNT EXTRACT-GENERATED.PDF.test.2' part 2 failed.

Use

  /  2~T2OI6bU-TYE_dOtm3tPInd tYTek-V0r

to abort the upload, or

 ./s3cmd --upload-id 2~T2OI6bU-TYE_dOtm3tPIndtYTek-V0r put ...

to continue the upload.
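
In case it helps narrow this down, a couple of things worth trying (just a
sketch; the gateway log path depends on the "log file" setting in your rgw
section):

s3cmd --debug --multipart-chunk-size-mb=5 put bigfile s3://bucket/   # verbose client-side trace
# and tail the radosgw log on the gateway host while the upload stalls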

See ya!

Rishiraj Rana



Sent from Outlook Mobile
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Oliver Dzombic
Hi Jan,

actually the Linux kernel does this automatically anyway (sending new
processes to "empty"/lightly used cores).

A single scrubbing/recovery or whatever process won't take more than
100% CPU (one core), because technically these processes are not able to
run multi-threaded.

Of course, if you configure your Ceph to have (up to) 8 backfill
processes, then 8 processes will start, which can utilize (up to) 8
CPU cores.

But still, a single process won't be able to use more than one CPU core.
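
For reference, the options that bound this parallel backfill/recovery work per
OSD are osd_max_backfills and osd_recovery_max_active; the values below are
only illustrative, not recommendations:

[osd]
osd max backfills = 1            # concurrent backfills per OSD
osd recovery max active = 3      # concurrent recovery ops per OSD

# or changed at runtime:
# ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3'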

---

In a situation where you have 2x E5-2620v3, for example, you have 2 x 6
cores x 2 HT units = 24 threads (vCores).

So if you run 24 OSDs in such a system, every OSD will (mathematically) have
its "own" CPU core automatically.

Such a combination will perform better than a single E5 CPU with a much
higher frequency (but the same total number of cores).

CPUs of this kind are so fast that the physical drive (no matter if
SAS/SSD/SATA) will not be able to overload the CPU (no matter which CPU of
this kind you use).

It's like playing games: if the game is running smoothly, it does not matter
whether it runs on a 4 GHz machine at 40% utilization or on a 2 GHz machine
at 80% utilization. It's running smoothly; it cannot do better :-)

So if your data is coming in as fast as the drive can physically deliver it,
it does not matter whether the CPU runs at 2, 3, 4 or 200 GHz. It is already
the maximum of what the drive can deliver.

So as long as the drives don't get faster, the CPUs do not need to be faster.

The Ceph storage is usually just delivering data, not running a commercial
webserver or anything else besides that.

So when you are deciding which CPU to choose, you only have to think about
how fast your drives are, so that the CPU does not become the bottleneck.

And the more cores you have, the lower the chance that different requests
will block each other.


So all in all, cores > frequency, always (as long as you use fast, up-to-date
CPUs). If you are using old CPUs, you of course have to make sure that the
performance of the CPU (which, by the way, does not only depend on the
frequency) is sufficient so that it does not throttle the drives' data output.



-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 20.01.2016 um 13:10 schrieb Jan Schermer:
> This is very true, but do you actually exclusively pin the cores to the OSD 
> daemons so they don't interfere?
> I don't think may people do that, it wouldn't work with more than a handful 
> of OSDs.
> The OSD might typicaly only need <100% of one core, but during startup or 
> some reshuffling it's beneficial
> to allow it to get more (>400%), and that will interfere with whatever else 
> was pinned there...
> 
> Jan
> 
>> On 20 Jan 2016, at 13:07, Oliver Dzombic  wrote:
>>
>> Hi,
>>
>> Cores > Frequency
>>
>> If you think about recovery / scrubbing tasks its better when a cpu core
>> can be assigned to do this.
>>
>> Compared to a situation where the same cpu core needs to recovery/scrub
>> and still deliver the productive content at the same time.
>>
>> The more you can create a situation where an osd has its "own" cpu core,
>> the better it is. Modern CPU's are anyway so fast, that even SSDs cant
>> run the CPU's to their limit.
>>
>> -- 
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Anschrift:
>>
>> IP Interactive UG ( haftungsbeschraenkt )
>> Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 beim Amtsgericht Hanau
>> Geschäftsführung: Oliver Dzombic
>>
>> Steuer Nr.: 35 236 3622 1
>> UST ID: DE274086107
>>
>>
>> Am 20.01.2016 um 10:01 schrieb Götz Reinicke - IT Koordinator:
>>> Hi folks,
>>>
>>> we plan to use more ssd OSDs in our first cluster layout instead of SAS
>>> osds. (more IO is needed than space)
>>>
>>> short question: What would influence the performance more? more Cores or
>>> more GHz/Core.
>>>
>>> Or is it as always: Depeds on the total of OSDs/nodes/repl-level/etc ... :)
>>>
>>> If needed, I can give some more detailed information on the layout.
>>>
>>> Thansk for feedback . Götz
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Jan Schermer
I'm using Ceph with all SSDs, I doubt you have to worry about speed that
much with HDD (it will be abysmall either way).
With SSDs you need to start worrying about processor caches and memory
colocation in NUMA systems, linux scheduler is not really that smart right now.
Yes, the process will get its own core, but it might be a different core every
time it spins up, this increases latencies considerably if you start hammering
the OSDs on the same host.

But as always, YMMV ;-)

Jan


> On 20 Jan 2016, at 13:28, Oliver Dzombic  wrote:
> 
> Hi Jan,
> 
> actually the linux kernel does this automatically anyway ( sending new
> processes to "empty/low used" cores ).
> 
> A single scrubbing/recovery or what ever process wont take more than
> 100% CPU ( one core ) because technically this processes are not able to
> run multi thread.
> 
> Of course, if you configure your ceph to have ( up to ) 8 backfill
> processes, then 8 processes will start, which can utilize ( up to ) 8
> CPU cores.
> 
> But still, the single process wont be able to use more than one cpu core.
> 
> ---
> 
> In a situation where you have 2x E5-2620v3 for example, you have 2x 6
> Cores x 2 HT Units = 24 Threads ( vCores ).
> 
> So if you use inside such a system 24 OSD's every OSD will have (
> mathematically ) its "own" CPU Core automatically.
> 
> Such a combination will perform better compared if you are using 1x E5
> CPU with a much higher frequency ( but still the same amout of cores ).
> 
> This kind of CPU's are so fast, that the physical HDD ( no matter if
> SAS/SSD/ATA ) will not be able to overload the cpu ( no matter which cpu
> you use of this kind ).
> 
> Its like if you are playing games. If the game is running smooth, it
> does not matter if its running on a 4 GHz machine on 40% utilization or
> on a 2 GHz machine with 80% utilization. Is running smooth, it can not
> do better :-)
> 
> So if your data is coming as fast as the HDD can physical deliver it,
> its not important if the cpu runs with 2, 3, 4, 200 Ghz. Its already the
> max of what the HDD can deliver.
> 
> So as long as the HDD's dont get faster, the CPU's does not need to be
> faster.
> 
> The Ceph storage is usually just delivering data, not running a
> commercial webserver/what ever beside that.
> 
> So if you are deciding what CPU you have to choose, you only have to
> think about how fast your HDD devices are. So that the CPU does not
> become the bottleneck.
> 
> And the more cores you have, the lower is the chance, that different
> requests will block each other.
> 
> 
> 
> So all in all, Core > Frequency, always. ( As long as you use fast/up to
> date CPUs ). If you are using old cpu's, of course you have to make sure
> that the performance of the cpu ( which does by the way not only depend
> on the frequency ) is sufficient that its not breaking the HDD data output.
> 
> 
> 
> -- 
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 20.01.2016 um 13:10 schrieb Jan Schermer:
>> This is very true, but do you actually exclusively pin the cores to the OSD 
>> daemons so they don't interfere?
>> I don't think may people do that, it wouldn't work with more than a handful 
>> of OSDs.
>> The OSD might typicaly only need <100% of one core, but during startup or 
>> some reshuffling it's beneficial
>> to allow it to get more (>400%), and that will interfere with whatever else 
>> was pinned there...
>> 
>> Jan
>> 
>>> On 20 Jan 2016, at 13:07, Oliver Dzombic  wrote:
>>> 
>>> Hi,
>>> 
>>> Cores > Frequency
>>> 
>>> If you think about recovery / scrubbing tasks its better when a cpu core
>>> can be assigned to do this.
>>> 
>>> Compared to a situation where the same cpu core needs to recovery/scrub
>>> and still deliver the productive content at the same time.
>>> 
>>> The more you can create a situation where an osd has its "own" cpu core,
>>> the better it is. Modern CPU's are anyway so fast, that even SSDs cant
>>> run the CPU's to their limit.
>>> 
>>> -- 
>>> Mit freundlichen Gruessen / Best regards
>>> 
>>> Oliver Dzombic
>>> IP-Interactive
>>> 
>>> mailto:i...@ip-interactive.de
>>> 
>>> Anschrift:
>>> 
>>> IP Interactive UG ( haftungsbeschraenkt )
>>> Zum Sonnenberg 1-3
>>> 63571 Gelnhausen
>>> 
>>> HRB 93402 beim Amtsgericht Hanau
>>> Geschäftsführung: Oliver Dzombic
>>> 
>>> Steuer Nr.: 35 236 3622 1
>>> UST ID: DE274086107
>>> 
>>> 
>>> Am 20.01.2016 um 10:01 schrieb Götz Reinicke - IT Koordinator:
 Hi folks,
 
 we plan to use more ssd OSDs in our first cluster layout instead of SAS
 osds. (more IO is needed than space)
 
 short question: What would influence the performance more? more Cores or
 more GHz/Core.
 
 Or is it as

Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Wade Holler
Great commentary.

While it is fundamentally true that higher clock speed equals lower
latency, in my practical experience we are more often interested in
latency at the concurrency profile of the applications.

So in this regard I favor more cores when I have to choose, such that we
can support more concurrent operations at a queue depth of 0.

Cheers
Wade
On Wed, Jan 20, 2016 at 7:58 AM Jan Schermer  wrote:

> I'm using Ceph with all SSDs, I doubt you have to worry about speed that
> much with HDD (it will be abysmall either way).
> With SSDs you need to start worrying about processor caches and memory
> colocation in NUMA systems, linux scheduler is not really that smart right
> now.
> Yes, the process will get its own core, but it might be a different core
> every
> time it spins up, this increases latencies considerably if you start
> hammering
> the OSDs on the same host.
>
> But as always, YMMV ;-)
>
> Jan
>
>
> > On 20 Jan 2016, at 13:28, Oliver Dzombic  wrote:
> >
> > Hi Jan,
> >
> > actually the linux kernel does this automatically anyway ( sending new
> > processes to "empty/low used" cores ).
> >
> > A single scrubbing/recovery or what ever process wont take more than
> > 100% CPU ( one core ) because technically this processes are not able to
> > run multi thread.
> >
> > Of course, if you configure your ceph to have ( up to ) 8 backfill
> > processes, then 8 processes will start, which can utilize ( up to ) 8
> > CPU cores.
> >
> > But still, the single process wont be able to use more than one cpu core.
> >
> > ---
> >
> > In a situation where you have 2x E5-2620v3 for example, you have 2x 6
> > Cores x 2 HT Units = 24 Threads ( vCores ).
> >
> > So if you use inside such a system 24 OSD's every OSD will have (
> > mathematically ) its "own" CPU Core automatically.
> >
> > Such a combination will perform better compared if you are using 1x E5
> > CPU with a much higher frequency ( but still the same amout of cores ).
> >
> > This kind of CPU's are so fast, that the physical HDD ( no matter if
> > SAS/SSD/ATA ) will not be able to overload the cpu ( no matter which cpu
> > you use of this kind ).
> >
> > Its like if you are playing games. If the game is running smooth, it
> > does not matter if its running on a 4 GHz machine on 40% utilization or
> > on a 2 GHz machine with 80% utilization. Is running smooth, it can not
> > do better :-)
> >
> > So if your data is coming as fast as the HDD can physical deliver it,
> > its not important if the cpu runs with 2, 3, 4, 200 Ghz. Its already the
> > max of what the HDD can deliver.
> >
> > So as long as the HDD's dont get faster, the CPU's does not need to be
> > faster.
> >
> > The Ceph storage is usually just delivering data, not running a
> > commercial webserver/what ever beside that.
> >
> > So if you are deciding what CPU you have to choose, you only have to
> > think about how fast your HDD devices are. So that the CPU does not
> > become the bottleneck.
> >
> > And the more cores you have, the lower is the chance, that different
> > requests will block each other.
> >
> > 
> >
> > So all in all, Core > Frequency, always. ( As long as you use fast/up to
> > date CPUs ). If you are using old cpu's, of course you have to make sure
> > that the performance of the cpu ( which does by the way not only depend
> > on the frequency ) is sufficient that its not breaking the HDD data
> output.
> >
> >
> >
> > --
> > Mit freundlichen Gruessen / Best regards
> >
> > Oliver Dzombic
> > IP-Interactive
> >
> > mailto:i...@ip-interactive.de
> >
> > Anschrift:
> >
> > IP Interactive UG ( haftungsbeschraenkt )
> > Zum Sonnenberg 1-3
> > 63571 Gelnhausen
> >
> > HRB 93402 beim Amtsgericht Hanau
> > Geschäftsführung: Oliver Dzombic
> >
> > Steuer Nr.: 35 236 3622 1
> > UST ID: DE274086107
> >
> >
> > Am 20.01.2016 um 13:10 schrieb Jan Schermer:
> >> This is very true, but do you actually exclusively pin the cores to the
> OSD daemons so they don't interfere?
> >> I don't think may people do that, it wouldn't work with more than a
> handful of OSDs.
> >> The OSD might typicaly only need <100% of one core, but during startup
> or some reshuffling it's beneficial
> >> to allow it to get more (>400%), and that will interfere with whatever
> else was pinned there...
> >>
> >> Jan
> >>
> >>> On 20 Jan 2016, at 13:07, Oliver Dzombic 
> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Cores > Frequency
> >>>
> >>> If you think about recovery / scrubbing tasks its better when a cpu
> core
> >>> can be assigned to do this.
> >>>
> >>> Compared to a situation where the same cpu core needs to recovery/scrub
> >>> and still deliver the productive content at the same time.
> >>>
> >>> The more you can create a situation where an osd has its "own" cpu
> core,
> >>> the better it is. Modern CPU's are anyway so fast, that even SSDs cant
> >>> run the CPU's to their limit.
> >>>
> >>> --
> >>> Mit freundlichen Gruessen / Best regards
> >>>
> >>> Oliver Dzombic
> >>> IP-Interactive
> >>>

Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Oliver Dzombic
Hi,

to be honest, I never made real benchmarks about that.

But I doubt that the higher frequency of a CPU will have a "real"
impact on Ceph's performance.

I mean, yes, mathematically it is true, just like Wade pointed out:
more frequency = less latency.

But when we compare CPUs of the same model with different frequencies,
how much time (in nanoseconds) do we actually save?
I really have no numbers here.

But the difference between a 2.1 GHz and a 2.9 GHz part (low-end Xeon E5 /
high-end Xeon E5), when it comes to the delay in memory or whatever
allocation, will be, inside a Linux OS, quite small. And I mean
nanosecond-tiny, practically non-existent small.
But again, that's just my guess. Of course, if we talk about completely
different CPU models (E5 vs. i7 vs. AMD vs. whatever), we will have
different L1/L2 caches, different architecture/RAM/everything.

But we are talking here about pure frequency. So we compare
identical CPU models, just with different frequencies.

And there, the difference, especially inside an OS and inside a
productive environment, must be nearly non-existent.

I cannot imagine how hard an OSD/HDD would need to be hammered, without the
server being totally overloaded in general, for the higher frequency to make
a measurable difference.


But again, I have no numbers/benchmarks here that could prove this pure
theory of mine.

In the end, more cores will usually mean more GHz in sum.

So maybe the whole discussion is very theoretical, because usually we
won't run into a situation where we have to choose frequency vs. cores,
simply because more cores always means more frequency in sum.

The exception is comparing totally different CPU models and generations, and
that is even more theoretical and maybe pointless, since different CPU
generations have totally different internal architectures, which has a great
impact on overall performance (aside from the number of cores and the
frequency).

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 20.01.2016 um 14:14 schrieb Wade Holler:
> Great commentary.
> 
> While it is fundamentally true that higher clock speed equals lower
> latency, I'm my practical experience we are more often interested in
> latency at the concurrency profile of the applications.
> 
> So in this regard I favor more cores when I have to choose, such that we
> can support more concurrent operations at a queue depth of 0.
> 
> Cheers
> Wade
> On Wed, Jan 20, 2016 at 7:58 AM Jan Schermer  > wrote:
> 
> I'm using Ceph with all SSDs, I doubt you have to worry about speed that
> much with HDD (it will be abysmall either way).
> With SSDs you need to start worrying about processor caches and memory
> colocation in NUMA systems, linux scheduler is not really that smart
> right now.
> Yes, the process will get its own core, but it might be a different
> core every
> time it spins up, this increases latencies considerably if you start
> hammering
> the OSDs on the same host.
> 
> But as always, YMMV ;-)
> 
> Jan
> 
> 
> > On 20 Jan 2016, at 13:28, Oliver Dzombic  > wrote:
> >
> > Hi Jan,
> >
> > actually the linux kernel does this automatically anyway ( sending new
> > processes to "empty/low used" cores ).
> >
> > A single scrubbing/recovery or what ever process wont take more than
> > 100% CPU ( one core ) because technically this processes are not
> able to
> > run multi thread.
> >
> > Of course, if you configure your ceph to have ( up to ) 8 backfill
> > processes, then 8 processes will start, which can utilize ( up to ) 8
> > CPU cores.
> >
> > But still, the single process wont be able to use more than one
> cpu core.
> >
> > ---
> >
> > In a situation where you have 2x E5-2620v3 for example, you have 2x 6
> > Cores x 2 HT Units = 24 Threads ( vCores ).
> >
> > So if you use inside such a system 24 OSD's every OSD will have (
> > mathematically ) its "own" CPU Core automatically.
> >
> > Such a combination will perform better compared if you are using 1x E5
> > CPU with a much higher frequency ( but still the same amout of
> cores ).
> >
> > This kind of CPU's are so fast, that the physical HDD ( no matter if
> > SAS/SSD/ATA ) will not be able to overload the cpu ( no matter
> which cpu
> > you use of this kind ).
> >
> > Its like if you are playing games. If the game is running smooth, it
> > does not matter if its running on a 4 GHz machine on 40%
> utilization or
> > on a 2 GHz machine with 80% utilization. Is runni

Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Mark Nelson
"It depends" is the right answer, imho.  There are advantages to building 
smaller single-socket high-frequency nodes.  The CPUs are cheap, which 
helps offset the lower-density node cost, and as has been mentioned in 
this thread you don't have to deal with NUMA pinning and other annoying 
complications, which ultimately can cost you more pain in the long run 
than it's worth.


On the other hand, if you are trying to squeeze as many SSDs into a 
single box as possible and know what you are doing regarding NUMA 
pinning, you'll probably benefit from dual CPUs with lots of cores.


Each kind of setup has its place.  In our QA lab we just bought 
high-frequency single-socket systems with a single P3700 NVMe to chew 
through the nightly Ceph testing.  We also have dual-socket nodes with 
lots of cores, multiple NVMe drives, and multiple hard drives in the 
same box.


Mark

On 01/20/2016 07:14 AM, Wade Holler wrote:

Great commentary.

While it is fundamentally true that higher clock speed equals lower
latency, I'm my practical experience we are more often interested in
latency at the concurrency profile of the applications.

So in this regard I favor more cores when I have to choose, such that we
can support more concurrent operations at a queue depth of 0.

Cheers
Wade
On Wed, Jan 20, 2016 at 7:58 AM Jan Schermer mailto:j...@schermer.cz>> wrote:

I'm using Ceph with all SSDs, I doubt you have to worry about speed that
much with HDD (it will be abysmall either way).
With SSDs you need to start worrying about processor caches and memory
colocation in NUMA systems, linux scheduler is not really that smart
right now.
Yes, the process will get its own core, but it might be a different
core every
time it spins up, this increases latencies considerably if you start
hammering
the OSDs on the same host.

But as always, YMMV ;-)

Jan


 > On 20 Jan 2016, at 13:28, Oliver Dzombic mailto:i...@ip-interactive.de>> wrote:
 >
 > Hi Jan,
 >
 > actually the linux kernel does this automatically anyway (
sending new
 > processes to "empty/low used" cores ).
 >
 > A single scrubbing/recovery or what ever process wont take more than
 > 100% CPU ( one core ) because technically this processes are not
able to
 > run multi thread.
 >
 > Of course, if you configure your ceph to have ( up to ) 8 backfill
 > processes, then 8 processes will start, which can utilize ( up to ) 8
 > CPU cores.
 >
 > But still, the single process wont be able to use more than one
cpu core.
 >
 > ---
 >
 > In a situation where you have 2x E5-2620v3 for example, you have 2x 6
 > Cores x 2 HT Units = 24 Threads ( vCores ).
 >
 > So if you use inside such a system 24 OSD's every OSD will have (
 > mathematically ) its "own" CPU Core automatically.
 >
 > Such a combination will perform better compared if you are using
1x E5
 > CPU with a much higher frequency ( but still the same amout of
cores ).
 >
 > This kind of CPU's are so fast, that the physical HDD ( no matter if
 > SAS/SSD/ATA ) will not be able to overload the cpu ( no matter
which cpu
 > you use of this kind ).
 >
 > Its like if you are playing games. If the game is running smooth, it
 > does not matter if its running on a 4 GHz machine on 40%
utilization or
 > on a 2 GHz machine with 80% utilization. Is running smooth, it
can not
 > do better :-)
 >
 > So if your data is coming as fast as the HDD can physical deliver it,
 > its not important if the cpu runs with 2, 3, 4, 200 Ghz. Its
already the
 > max of what the HDD can deliver.
 >
 > So as long as the HDD's dont get faster, the CPU's does not need
to be
 > faster.
 >
 > The Ceph storage is usually just delivering data, not running a
 > commercial webserver/what ever beside that.
 >
 > So if you are deciding what CPU you have to choose, you only have to
 > think about how fast your HDD devices are. So that the CPU does not
 > become the bottleneck.
 >
 > And the more cores you have, the lower is the chance, that different
 > requests will block each other.
 >
 > 
 >
 > So all in all, Core > Frequency, always. ( As long as you use
fast/up to
 > date CPUs ). If you are using old cpu's, of course you have to
make sure
 > that the performance of the cpu ( which does by the way not only
depend
 > on the frequency ) is sufficient that its not breaking the HDD
data output.
 >
 >
 >
 > --
 > Mit freundlichen Gruessen / Best regards
 >
 > Oliver Dzombic
 > IP-Interactive
 >
 > mailto:i...@ip-interactive.de 
 >
 > Anschrift:
 >
 > IP Interactive UG ( haftungsbeschraenkt )
 > Zum Sonnenberg 1-3
 > 63571 Gelnhausen
 >
 

Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Jan Schermer
Let's separate those issues.

1) NUMA, memory colocation, CPU-ethernet latencies: this is all pretty minor
_until_ you hit some limit.

2) Performance of a single IO hitting an OSD: this is something quite
different.
In my case (with an old Ceph release, so possibly a less optimal codepath) my IO 
latency is about 2ms for a single IO. *99%* of those 2ms is spent somewhere in 
OSD code, so if my CPU were overclocked to ~5GHz, latency would drop to 1ms - just 
like that.  This is on SSDs.
With HDDs, I would need to add ~7ms.
While you are waiting for this IO, you can do other work, and after a while you 
get that interrupt saying "hey, I got your data".


With HDDs and a moderate workload, you are very unlikely to see an issue of type 
1 - even if your memory is quite remote and has 2x the latency of local memory 
- this is absolutely insignificant compared to the latency of the HDD itself.
But once you move into SSD territory, this is a completely different story. 
Round-trip times to/from memory, waiting for the scheduler to give time to the 
OSD, waiting for the TLB cache to warm up, competing for QPI link bandwidth... this 
all gets much closer. You don't have that 7ms "gap" in which to dispatch some 
other IO, or to do some maintenance or some larger "throughput" job like writing 
sequential IO for a while. You get that "I got your data" interrupt much, much 
sooner. And every such interruption makes the processor cache go away, shuffles 
some memory, spins up the scheduler to give priority to the original thread 
waiting for the data, etc...

Essentially, you start spending more time coordinating the work, and at some 
point performance drops sharply if you can't get all the work done in one 
timeslice - you get more and more interruptions, which makes the problem even 
worse.
Of course it's not all that "dumb" and it will still work, but the performance 
you get can be a fraction of what is possible if you can avoid this. Newer kernels 
help immensely (I had close to 3M context switches on an old CentOS 6 kernel, 
while on a newer Ubuntu kernel I get only 250K on the same hardware. And the real 
load is literally at 50% of what it was!)
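
If you want to see those numbers on your own system, the standard tools are
enough; a sketch, assuming sysstat is installed and the pgrep pattern matches a
single OSD:

vmstat 1                                        # "cs" column = context switches/s, system-wide
pidstat -w -p $(pgrep -f 'ceph-osd.* -i 0') 1   # voluntary/involuntary switches for osd.0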

With faster (freq) CPU, you are more likely to get the work done without 
interruption (and thus without *causing* another interruption later).

Jan




> On 20 Jan 2016, at 14:32, Oliver Dzombic  wrote:
> 
> Hi,
> 
> to be honest, i never made real benchmarks about that.
> 
> But to me, i doubt that the higher frequency of a cpu will have a "real"
> impact on ceph's performance.
> 
> I mean, yes, mathematically, just like Wade pointed out, its true.
>> frequency = < latency
> 
> But when we compare CPU's of the same model, with different frequencies.
> 
> How much time ( in nano seconds ), do we save ?
> I mean i have really no numbers here.
> 
> But the difference between a 2,1 GHz and a 2,9 GHz ( Low End Xeon E5 /
> High End Xeon E5 )
> ( when it comes to delay in "memory/what ever" allocation ), will be,
> inside an Linux OS, quiet small. And i mean nano seconds tiny/non
> existing small.
> But again, thats just my guess. Of course, if we talk about complete
> different CPU Models ( E5 vs. I7 vs. AMD vs. what ever ) we will have
> different 1st/2nd level Caches in CPU, different
> Architecture/RAM/everything.
> 
> But we are talking here about pure frequency issues. So we compare
> identical CPU Models, just with different frequencies.
> 
> And there, the difference, especially inside an OS and inside a
> productive environment must be nearly not existing.
> 
> I can not imagine how much an OSD / HDD needs to be hammered, that a
> server is in general not totally overloaded and that the higher
> frequency will make a measureable difference.
> 
> 
> 
> But again, i have here no numbers/benchmarks that could proove this pure
> theory of mine.
> 
> In the very end, more cores will usually mean more GHz frequency in sum.
> 
> So maybe the whole discussion is very theoretically, because usually we
> wont run in a situation where we have to choose frequency vs. cores.
> 
> Simply because more cores always means more frequency in sum.
> 
> Except you compare totally different cpu models and generations, and
> this is even more worst theoretically and maybe pointless since the
> different cpu generations have totally different inner architecture
> which has a great impact in overall performance ( aside from numbers of
> frequency and cores ).
> 
> -- 
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 20.01.2016 um 14:14 schrieb Wade Holler:
>> Great commentary.
>> 
>> While it is fundamentally true that higher clock speed equals lower
>> latency, I'm my practical experience we are mor

Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Götz Reinicke - IT Koordinator
Am 20.01.16 um 11:30 schrieb Christian Balzer:
> 
> Hello,
> 
> On Wed, 20 Jan 2016 10:01:19 +0100 Götz Reinicke - IT Koordinator wrote:
> 
>> Hi folks,
>>
>> we plan to use more ssd OSDs in our first cluster layout instead of SAS
>> osds. (more IO is needed than space)
>>
>> short question: What would influence the performance more? more Cores or
>> more GHz/Core.
>>
>> Or is it as always: Depeds on the total of
>> OSDs/nodes/repl-level/etc ... :)
>>
> 
> While there certainly is a "depends" in there, my feeling is that faster
> cores are more helpful than many, slower ones.
> And this is how I spec'ed my first SSD nodes, 1 fast core (Intel, thus 2
> pseudo-cores) per OSD.
> The reasoning is simple, an individual OSD thread will run (hopefully) on
> one core and thus be faster, with less latency(!).
> 
>> If needed, I can give some more detailed information on the layout.
>>
> Might be interesting for other sanity checks, if you don't mind.

With pleasure. The basic setup was: 6 nodes with 24 SATA OSDs each, simple
replication (size = 2), two datacenters, 20Gbit bonded LAN.

We calculated with 2x Intel Xeon E5-2630v3 8-core 2.4GHz per OSD node =
32 vCPUs each (hyperthreaded).

On the other hand, if we talk about one core per OSD, two E5-2620v3 6-core
CPUs would do per node.

But next to the total cost we are facing anyway (upgrading the network, more
fibers, etc.), the cost of the extra cores does not hurt. And having some cores
"as spare" sounded good, as we would install the mons on the OSD nodes for now.

Later we discussed the design with an external consultant, added the new
"OMG I forgot to tell you that we need" requirements, and ended up with
three pool storage classes: fast and big for small IO, fast and not so big
(VM images), and a cache-tiered EC pool for big video files.

Using the 6-node, two-datacenter layout we ended up with a nice OSD
spread and layout, which sounds good to me.

The most important "new" thing was: using SSDs for the pools, not just
for the journal and OS.

So I ended up with the question whether the CPUs would have any good or bad
influence on that new SSD-ish design.

Thanks for all the feedback and thoughts/comments so far.


CONCLUSION so far: From my POV the 2630v3 is good for us.

Cheers . Götz

-- 
Götz Reinicke
IT-Koordinator

Tel. +49 7141 969 82420
E-Mail goetz.reini...@filmakademie.de

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
www.filmakademie.de

Eintragung Amtsgericht Stuttgart HRB 205016

Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
Staatssekretär im Ministerium für Wissenschaft,
Forschung und Kunst Baden-Württemberg

Geschäftsführer: Prof. Thomas Schadt




smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to set a new Crushmap in production

2016-01-20 Thread Vincent Godin
Hi,

I need to import a new crushmap in production (the old one is the default
one) to define two datacenters and to isolate SSDs from SATA disks. What is
the best way to do this without starting a hurricane on the platform?

Until now, I was just using hosts (SATA OSDs) in one datacenter with the
default rule, so I created a new rule in the new crushmap that does the same
job in one datacenter on a defined SATA chassis. Here is the process I'm going
to follow (sketched as commands below), but I really need your advice:

1 - set the noout flag
2 - import the new crushmap
3 - change the rule number for the existing pool to the new one
4 - unset the noout flag
5 - pray ...
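
In command form, a sketch of the process above (pool name and rule id are
placeholders; crush_ruleset is the pool setting that selects the rule):

ceph osd set noout
crushtool -c newmap.txt -o newmap.bin                  # compile the edited map
ceph osd setcrushmap -i newmap.bin
ceph osd pool set <pool> crush_ruleset <new_rule_id>
ceph osd unset noout
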
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Nick Fisk
See this benchmark I did last year

http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/
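
For anyone who wants to repeat a quick latency-vs-clock comparison on their own
hardware, a simple starting point (a sketch, not the benchmark above) is a
single-threaded small-write test against a pool, watching the average latency
it reports:

rados bench -p rbd 30 write -t 1 -b 4096   # 30 seconds of 4K writes at queue depth 1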


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Oliver Dzombic
> Sent: 20 January 2016 13:33
> To: ceph-us...@ceph.com
> Subject: Re: [ceph-users] SSD OSDs - more Cores or more GHz
> 
> Hi,
> 
> to be honest, i never made real benchmarks about that.
> 
> But to me, i doubt that the higher frequency of a cpu will have a "real"
> impact on ceph's performance.
> 
> I mean, yes, mathematically, just like Wade pointed out, its true.
> > frequency = < latency
> 
> But when we compare CPU's of the same model, with different frequencies.
> 
> How much time ( in nano seconds ), do we save ?
> I mean i have really no numbers here.
> 
> But the difference between a 2,1 GHz and a 2,9 GHz ( Low End Xeon E5 / High
> End Xeon E5 ) ( when it comes to delay in "memory/what ever" allocation ),
> will be, inside an Linux OS, quiet small. And i mean nano seconds tiny/non
> existing small.
> But again, thats just my guess. Of course, if we talk about complete different
> CPU Models ( E5 vs. I7 vs. AMD vs. what ever ) we will have different 1st/2nd
> level Caches in CPU, different Architecture/RAM/everything.
> 
> But we are talking here about pure frequency issues. So we compare
> identical CPU Models, just with different frequencies.
> 
> And there, the difference, especially inside an OS and inside a productive
> environment must be nearly not existing.
> 
> I can not imagine how much an OSD / HDD needs to be hammered, that a
> server is in general not totally overloaded and that the higher frequency will
> make a measureable difference.
> 
> 
> 
> But again, i have here no numbers/benchmarks that could proove this pure
> theory of mine.
> 
> In the very end, more cores will usually mean more GHz frequency in sum.
> 
> So maybe the whole discussion is very theoretically, because usually we
> wont run in a situation where we have to choose frequency vs. cores.
> 
> Simply because more cores always means more frequency in sum.
> 
> Except you compare totally different cpu models and generations, and this is
> even more worst theoretically and maybe pointless since the different cpu
> generations have totally different inner architecture which has a great impact
> in overall performance ( aside from numbers of frequency and cores ).
> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 20.01.2016 um 14:14 schrieb Wade Holler:
> > Great commentary.
> >
> > While it is fundamentally true that higher clock speed equals lower
> > latency, I'm my practical experience we are more often interested in
> > latency at the concurrency profile of the applications.
> >
> > So in this regard I favor more cores when I have to choose, such that
> > we can support more concurrent operations at a queue depth of 0.
> >
> > Cheers
> > Wade
> > On Wed, Jan 20, 2016 at 7:58 AM Jan Schermer  > > wrote:
> >
> > I'm using Ceph with all SSDs, I doubt you have to worry about speed that
> > much with HDD (it will be abysmall either way).
> > With SSDs you need to start worrying about processor caches and
> memory
> > colocation in NUMA systems, linux scheduler is not really that smart
> > right now.
> > Yes, the process will get its own core, but it might be a different
> > core every
> > time it spins up, this increases latencies considerably if you start
> > hammering
> > the OSDs on the same host.
> >
> > But as always, YMMV ;-)
> >
> > Jan
> >
> >
> > > On 20 Jan 2016, at 13:28, Oliver Dzombic  > > wrote:
> > >
> > > Hi Jan,
> > >
> > > actually the linux kernel does this automatically anyway ( sending new
> > > processes to "empty/low used" cores ).
> > >
> > > A single scrubbing/recovery or what ever process wont take more than
> > > 100% CPU ( one core ) because technically this processes are not
> > able to
> > > run multi thread.
> > >
> > > Of course, if you configure your ceph to have ( up to ) 8 backfill
> > > processes, then 8 processes will start, which can utilize ( up to ) 8
> > > CPU cores.
> > >
> > > But still, the single process wont be able to use more than one
> > cpu core.
> > >
> > > ---
> > >
> > > In a situation where you have 2x E5-2620v3 for example, you have 2x 6
> > > Cores x 2 HT Units = 24 Threads ( vCores ).
> > >
> > > So if you use inside such a system 24 OSD's every OSD will have (
> > > mathematically ) its "own" CPU Core automatically.
> > >
>

Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Mark Nelson

Excellent testing Nick!

Mark

On 01/20/2016 08:18 AM, Nick Fisk wrote:

See this benchmark I did last year

http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Oliver Dzombic
Sent: 20 January 2016 13:33
To: ceph-us...@ceph.com
Subject: Re: [ceph-users] SSD OSDs - more Cores or more GHz

Hi,

to be honest, i never made real benchmarks about that.

But to me, i doubt that the higher frequency of a cpu will have a "real"
impact on ceph's performance.

I mean, yes, mathematically, just like Wade pointed out, its true.

frequency = < latency


But when we compare CPU's of the same model, with different frequencies.

How much time ( in nano seconds ), do we save ?
I mean i have really no numbers here.

But the difference between a 2,1 GHz and a 2,9 GHz ( Low End Xeon E5 / High
End Xeon E5 ) ( when it comes to delay in "memory/what ever" allocation ),
will be, inside an Linux OS, quiet small. And i mean nano seconds tiny/non
existing small.
But again, thats just my guess. Of course, if we talk about complete different
CPU Models ( E5 vs. I7 vs. AMD vs. what ever ) we will have different 1st/2nd
level Caches in CPU, different Architecture/RAM/everything.

But we are talking here about pure frequency issues. So we compare
identical CPU Models, just with different frequencies.

And there, the difference, especially inside an OS and inside a productive
environment must be nearly not existing.

I can not imagine how much an OSD / HDD needs to be hammered, that a
server is in general not totally overloaded and that the higher frequency will
make a measureable difference.



But again, i have here no numbers/benchmarks that could proove this pure
theory of mine.

In the very end, more cores will usually mean more GHz frequency in sum.

So maybe the whole discussion is very theoretically, because usually we
wont run in a situation where we have to choose frequency vs. cores.

Simply because more cores always means more frequency in sum.

Except you compare totally different cpu models and generations, and this is
even more worst theoretically and maybe pointless since the different cpu
generations have totally different inner architecture which has a great impact
in overall performance ( aside from numbers of frequency and cores ).

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 20.01.2016 um 14:14 schrieb Wade Holler:

Great commentary.

While it is fundamentally true that higher clock speed equals lower
latency, I'm my practical experience we are more often interested in
latency at the concurrency profile of the applications.

So in this regard I favor more cores when I have to choose, such that
we can support more concurrent operations at a queue depth of 0.

Cheers
Wade
On Wed, Jan 20, 2016 at 7:58 AM Jan Schermer mailto:j...@schermer.cz>> wrote:

 I'm using Ceph with all SSDs, I doubt you have to worry about speed that
 much with HDD (it will be abysmall either way).
 With SSDs you need to start worrying about processor caches and

memory

 colocation in NUMA systems, linux scheduler is not really that smart
 right now.
 Yes, the process will get its own core, but it might be a different
 core every
 time it spins up, this increases latencies considerably if you start
 hammering
 the OSDs on the same host.

 But as always, YMMV ;-)

 Jan


 > On 20 Jan 2016, at 13:28, Oliver Dzombic mailto:i...@ip-interactive.de>> wrote:
 >
 > Hi Jan,
 >
 > actually the linux kernel does this automatically anyway ( sending new
 > processes to "empty/low used" cores ).
 >
 > A single scrubbing/recovery or what ever process wont take more than
 > 100% CPU ( one core ) because technically this processes are not
 able to
 > run multi thread.
 >
 > Of course, if you configure your ceph to have ( up to ) 8 backfill
 > processes, then 8 processes will start, which can utilize ( up to ) 8
 > CPU cores.
 >
 > But still, the single process wont be able to use more than one
 cpu core.
 >
 > ---
 >
 > In a situation where you have 2x E5-2620v3 for example, you have 2x 6
 > Cores x 2 HT Units = 24 Threads ( vCores ).
 >
 > So if you use inside such a system 24 OSD's every OSD will have (
 > mathematically ) its "own" CPU Core automatically.
 >
 > Such a combination will perform better compared if you are using 1x E5
 > CPU with a much higher frequency ( but still the same amout of
 cores ).
 >
 > This kind of CPU's are so fast, that the physical HDD ( no matter if
 > SAS/SSD/ATA ) will

Re: [ceph-users] CRUSH Rule Review - Not replicating correctly

2016-01-20 Thread deeepdish
Hi Robert,

Just wanted to let you know that after applying your crush suggestion and 
allowing the cluster to rebalance itself, I now have symmetrical data distribution. 
In keeping 5 monitors, my rationale is availability. I have 3 compute nodes 
+ 2 storage nodes. I was thinking that making all of them a monitor would 
provide additional backup. Based on your earlier comments, can you provide 
guidance on how much latency is induced by having excess monitors deployed?

Thanks.


> On Jan 18, 2016, at 12:36 , Robert LeBlanc  wrote:
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> Not that I know of.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Mon, Jan 18, 2016 at 10:33 AM, deeepdish  wrote:
>> Thanks Robert.   Will definitely try this.   Is there a way to implement 
>> “gradual CRUSH” changes?   I noticed whenever cluster wide changes are 
>> pushed (crush map, for instance) the cluster immediately attempts to align 
>> itself disrupting client access / performance…
>> 
>> 
>>> On Jan 18, 2016, at 12:22 , Robert LeBlanc  wrote:
>>> 
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA256
>>> 
>>> I'm not sure why you have six monitors. Six monitors buys you nothing
>>> over five monitors other than more power being used, and more latency
>>> and more headache. See
>>> http://docs.ceph.com/docs/hammer/rados/configuration/mon-config-ref/#monitor-quorum
>>>  
>>> 
>>> for some more info. Also, I'd consider 5 monitors overkill for this
>>> size cluster, I'd recommend three.
>>> 
>>> Although this is most likely not the root cause of your problem, you
>>> probably have an error here: "root replicated-T1" is pointing to
>>> b02s08 and b02s12 and "site erbus" is also pointing to b02s08 and
>>> b02s12. You probably meant to have "root replicated-T1" pointing to
>>> erbus instead.
>>> 
>>> Where I think your problem is, is in your "rule replicated" section.
>>> You can try:
>>> step take replicated-T1
>>> step choose firstn 2 type host
>>> step chooseleaf firstn 2 type osdgroup
>>> step emit
>>> 
>>> What this does is choose two hosts from the root replicated-T1 (which
>>> happens to be both hosts you have), then chooses an OSD from two
>>> osdgroups on each host.
>>> 
>>> I believe the problem with your current rule set is that firstn 0 type
>>> host tries to select four hosts, but only two are available. You
>>> should be able to see that with 'ceph pg dump', where only two osds
>>> will be listed in the up set.
>>> 
>>> I hope that helps.
>>> -BEGIN PGP SIGNATURE-
>>> Version: Mailvelope v1.3.3
>>> Comment: https://www.mailvelope.com 
>>> 
>>> wsFcBAEBCAAQBQJWnR9kCRDmVDuy+mK58QAA5hUP/iJprG4nGR2sJvL//8l+
>>> V6oLYXTCs8lHeKL3ZPagThE9oh2xDMV37WR3I/xMNTA8735grl8/AAhy8ypW
>>> MDOikbpzfWnlaL0SWs5rIQ5umATwv73Fg/Mf+K2Olt8IGP6D0NMIxfeOjU6E
>>> 0Sc3F37nDQFuDEkBYjcVcqZC89PByh7yaId+eOgr7Ot+BZL/3fbpWIZ9kyD5
>>> KoPYdPjtFruoIpc8DJydzbWdmha65DkB65QOZlI3F3lMc6LGXUopm4OP4sQd
>>> txVKFtTcLh97WgUshQMSWIiJiQT7+3D6EqQyPzlnei3O3gACpkpsmUteDPpn
>>> p8CDeJtIpgKnQZjBwfK/bUQXdIGem8Y0x/PC+1ekIhkHCIJeW2sD3mFJduDQ
>>> 9loQ9+IsWHfQmEHLMLdeNzRXbgBY2djxP2X70fXTg31fx+dYvbWeulYJHiKi
>>> 1fJS4GdbPjoRUp5k4lthk3hDTFD/f5ZuowLDIaexgISb0bIJcObEn9RWlHut
>>> IRVi0fUuRVIX3snGMOKjLmSUe87Od2KSEbULYPTLYDMo/FsWXWHNlP3gVKKd
>>> lQJdxcwXOW7/v5oayY4wiEE6NF4rCupcqt0nPxxmbehmeRPxgkWCKJJs3FNr
>>> VmUdnrdpfxzR5c8dmOELJnpNS6MTT56B8A4kKmqbbHCEKpZ83piG7uwqc+6f
>>> RKkQ
>>> =gp/0
>>> -END PGP SIGNATURE-
>>> 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> 
>>> 
>>> On Sun, Jan 17, 2016 at 6:31 PM, deeepdish  wrote:
 Hi Everyone,
 
 Looking for a double check of my logic and crush map..
 
 Overview:
 
 - osdgroup bucket type defines failure domain within a host of 5 OSDs + 1
 SSD.   Therefore 5 OSDs (all utilizing the same journal) constitute an
 osdgroup bucket.   Each host has 4 osdgroups.
 - 6 monitors
 - Two node cluster
 - Each node:
 - 20 OSDs
 -  4 SSDs
 - 4 osdgroups
 
 Desired Crush Rule outcome:
 - Assuming a pool with min_size=2 and size=4, all each node would contain a
 redundant copy of each object.   Should any of the hosts fail, access to
 data would be uninterrupted.
 
 Current Crush Rule outcome:
 - There are 4 copies of each object, however I don’t believe each node has 
 a
 redundant copy of each object, when a node fails, data is NOT accessible
 until ceph rebuilds itself / node becomes accessible again.
 
 I susepct my crush is not right, and to remedy it may take some time and
 cause cluster to be unresponsive / unavailable.Is there a way / method
 to apply substantial crush changes gradually to a cluster?
 
 Thanks for your help.
 
>>

[ceph-users] ceph fuse closing stale session while still operable

2016-01-20 Thread Oliver Dzombic
Hi,

I am testing on a CentOS 6 x64 minimal install.

I am mounting successfully:

ceph-fuse -m 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789,10.0.0.4:6789
/ceph-storage/


[root@cn201 log]# df
Filesystem1K-blocksUsed   Available Use% Mounted on
/dev/sda1  74454192 122864469436748   2% /
tmpfs  16433588   016433588   0% /dev/shm
ceph-fuse  104468783104 55774867456 48693915648  54% /ceph-storage


It's all fine.

Then i start a (bigger) write:

dd if=/dev/zero bs=256M count=16 of=/ceph-storage/test/dd1

After a second it reaches:

[root@cn201 test]# ls -la /ceph-storage/test/dd1
-rw-r--r-- 1 root root 104726528 Jan 20 15:34 /ceph-storage/test/dd1

and remains there, no byte further.

#ps ax shows:

 1573 pts/0S+ 0:00 dd if=/dev/zero bs=256M count=16
of=/ceph-storage/test/dd1


---

The kernel log just shows:

fuse init (API version 7.14)

after the mount.


---

On a ceph cluster node I can see:


[root@ceph2 ceph]# cat ceph-mds.ceph2.log
2016-01-20 15:34:07.728239 7f3832ddb700  0 log_channel(cluster) log
[INF] : closing stale session client.21176728 10.0.0.91:0/1635 after
301.302291


But I can still work with the mount: df, ls, even touch work
perfectly. Just writing bigger amounts of data somehow freezes.



I had this issue already with my last tests with

Centos 7
Debian 7
Debian 8
Ubuntu 14

All in x64




Since I think this is no general bug, I assume I have made a setup mistake.

So this is my setup for the current test with CentOS 6:

1. centos netinstall x64 minimal
2. yum update -y
3.
rpm -i
http://download.ceph.com/rpm-hammer/el6/noarch/ceph-release-1-1.el6.noarch.rpm

rpm -i
http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

yum -y install yum-plugin-priorities

sed -i -e "s/enabled=1/enabled=1\npriority=1/g" /etc/yum.repos.d/ceph.repo

yum -y install ceph-fuse

4. deactivate selinux
5. network config:

[root@cn201 test]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
HWADDR=0C:C4:7A:16:EE:3F
TYPE=Ethernet
UUID=19df403f-c1f2-4a39-a458-5596af108ca6
BOOTPROTO=none
ONBOOT=yes
IPADDR0="10.0.0.91"
PREFIX0="24"
MTU=9000


6. Copy ceph.client.admin.keyring to /etc/ceph/

7. mounting:

#ceph-fuse -m 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789,10.0.0.4:6789
/ceph-storage

8. testing:

#dd if=/dev/zero bs=256M count=16 of=/ceph-storage/test/dd1


---


So before i switch now all in debug mode:

Anyone any idea ? At least theoretically all fine and should work ?
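
If I do end up going to debug mode, the plan would be to add something like
this to /etc/ceph/ceph.conf on the client (debug levels picked arbitrarily,
so treat it as a sketch) and then re-run the ceph-fuse mount:

[client]
    debug client = 20
    debug ms = 1
    log file = /var/log/ceph/ceph-fuse.log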

Thank you !



-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] jemalloc-enabled packages on trusty?

2016-01-20 Thread Zoltan Arnold Nagy
Hi,

Has someone published prebuilt debs for trusty from hammer with jemalloc 
compiled-in instead of tcmalloc or does everybody need to compile it 
themselves? :-)
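
If the answer turns out to be "build it yourself", my rough understanding is
that on the hammer branch it would go something like this (flag names from
memory, so please double-check ./configure --help before relying on them):

git clone -b hammer https://github.com/ceph/ceph.git && cd ceph
./install-deps.sh
./autogen.sh
./configure --without-tcmalloc --with-jemalloc
make -j$(nproc)    # or dpkg-buildpackage -us -uc for debs, with the same flags wired into debian/rules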

Cheers,
Zoltan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph monitors 100% full filesystem, refusing start

2016-01-20 Thread Wido den Hollander
Hello,

I have an issue with a (not in production!) Ceph cluster which I'm
trying to resolve.

On Friday the network links between the racks failed and this caused all
monitors to lose connection.

Their leveldb stores kept growing and they are currently 100% full. They
all have a few hundred MB left.

Starting the 'compact on start' doesn't work since the FS is 100% full:

error: monitor data filesystem reached concerning levels of
available storage space (available: 0% 238 MB)
you may adjust 'mon data avail crit' to a lower value to make this go
away (default: 0%)

One of the 5 monitors is now running but that's not enough.

Any ideas how to compact this leveldb? I can't free up any more space
right now on these systems. Getting bigger disks in is also going to
take a lot of time.

Any tools outside the monitors to use here?

Keep in mind, this is a pre-production cluster. We would like to keep
the cluster and fix this as a good exercise of stuff which could go
wrong. Dangerous tools are allowed!

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitors 100% full filesystem, refusing start

2016-01-20 Thread Zoltan Arnold Nagy
Hi Wido,

So one out of the 5 monitors is running fine then? Did that one have more space 
for its leveldb?

> On 20 Jan 2016, at 16:15, Wido den Hollander  wrote:
> 
> Hello,
> 
> I have an issue with a (not in production!) Ceph cluster which I'm
> trying to resolve.
> 
> On Friday the network links between the racks failed and this caused all
> monitors to loose connection.
> 
> Their leveldb stores kept growing and they are currently 100% full. They
> all have a few hunderd MB left.
> 
> Starting the 'compact on start' doesn't work since the FS is 100%
> full.error: monitor data filesystem reached concerning levels of
> available storage space (available: 0% 238 MB)
> you may adjust 'mon data avail crit' to a lower value to make this go
> away (default: 0%)
> 
> On of the 5 monitors is now running but that's not enough.
> 
> Any ideas how to compact this leveldb? I can't free up any more space
> right now on these systems. Getting bigger disks in is also going to
> take a lot of time.
> 
> Any tools outside the monitors to use here?
> 
> Keep in mind, this is a pre-production cluster. We would like to keep
> the cluster and fix this as a good exercise of stuff which could go
> wrong. Dangerous tools are allowed!
> 
> -- 
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitors 100% full filesystem, refusing start

2016-01-20 Thread Nick Fisk
Is there anything you can do with a USB key/NFS mount? I.e. copy the leveldb onto
it, remount it in the proper location, compact, and then copy it back to the primary disk?
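
Something along these lines, perhaps (mon id "a" and /mnt/usb are placeholders,
and I'm going from memory on the ceph-monstore-tool path argument, so verify first):

service ceph stop mon.a                          # or the upstart/systemd equivalent
cp -a /var/lib/ceph/mon/ceph-a /mnt/usb/ceph-a   # copy the mon data dir off-box
ceph-monstore-tool /mnt/usb/ceph-a compact       # compact the copy where there is room
rm -rf /var/lib/ceph/mon/ceph-a                  # only after checking the copy is sane!
cp -a /mnt/usb/ceph-a /var/lib/ceph/mon/ceph-a
service ceph start mon.a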

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Wido den Hollander
> Sent: 20 January 2016 15:15
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Ceph monitors 100% full filesystem, refusing start
> 
> Hello,
> 
> I have an issue with a (not in production!) Ceph cluster which I'm trying
to
> resolve.
> 
> On Friday the network links between the racks failed and this caused all
> monitors to loose connection.
> 
> Their leveldb stores kept growing and they are currently 100% full. They
all
> have a few hunderd MB left.
> 
> Starting the 'compact on start' doesn't work since the FS is 100%
> full.error: monitor data filesystem reached concerning levels of available
> storage space (available: 0% 238 MB) you may adjust 'mon data avail crit'
to a
> lower value to make this go away (default: 0%)
> 
> On of the 5 monitors is now running but that's not enough.
> 
> Any ideas how to compact this leveldb? I can't free up any more space
right
> now on these systems. Getting bigger disks in is also going to take a lot
of
> time.
> 
> Any tools outside the monitors to use here?
> 
> Keep in mind, this is a pre-production cluster. We would like to keep the
> cluster and fix this as a good exercise of stuff which could go wrong.
> Dangerous tools are allowed!
> 
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitors 100% full filesystem, refusing start

2016-01-20 Thread Joao Eduardo Luis
On 01/20/2016 03:15 PM, Wido den Hollander wrote:
> Hello,
> 
> I have an issue with a (not in production!) Ceph cluster which I'm
> trying to resolve.
> 
> On Friday the network links between the racks failed and this caused all
> monitors to loose connection.
> 
> Their leveldb stores kept growing and they are currently 100% full. They
> all have a few hunderd MB left.

I'm incredibly curious to know what was written to leveldb to make it
grow unbounded. Did the monitors hold quorum? I'm guessing that would
be a 'no', given the network failure you mentioned, hence my morbid
curiosity in figuring out what happened there.

If you don't mind, running a 'ceph-kvstore-tool /path/to/store.db
leveldb list > /tmp/store.dump' could, maybe, shed some light on this
issue (at least it will dump all the keys, and maybe something will be
obvious, don't know). I'd certainly be interested in taking a look at
those stores if you don't mind ;)

> Starting the 'compact on start' doesn't work since the FS is 100%
> full.error: monitor data filesystem reached concerning levels of
> available storage space (available: 0% 238 MB)
> you may adjust 'mon data avail crit' to a lower value to make this go
> away (default: 0%)
> 
> On of the 5 monitors is now running but that's not enough.
> 
> Any ideas how to compact this leveldb? I can't free up any more space
> right now on these systems. Getting bigger disks in is also going to
> take a lot of time.

Running 'ceph-kvstore-tool' may also force leveldb to compact on open,
so you may have a shot there at compaction. If that doesn't work,
'ceph-monstore-tool' has a 'compact' command -- that should help you
sort it out.

  -Joao
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitors 100% full filesystem, refusing start

2016-01-20 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Wido den Hollander
> Sent: 20 January 2016 15:27
> To: Zoltan Arnold Nagy 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph monitors 100% full filesystem, refusing start
> 
> On 01/20/2016 04:22 PM, Zoltan Arnold Nagy wrote:
> > Hi Wido,
> >
> > So one out of the 5 monitors are running fine then? Did that have more
> space for it’s leveldb?
> >
> 
> Yes. That was at 99% full and by cleaning some stuff in /var/cache and
> /var/log I was able to start it.
> 
> It compacted the levelDB database and is now on 1% disk usage.
> 
> Looking at the ceph_mon.cc code:
> 
> if (stats.avail_percent <= g_conf->mon_data_avail_crit) {
> 
> Setting mon_data_avail_crit to 0 does not work since 100% full is equal to 0%
> free..
> 
> There is ~300M free on the other 4 monitors. I just can't start the mon and
> tell it to compact.
> 
> Lessons learned here though, always make sure you have some additional
> space you can clear when you need it.

Slightly unrelated, but before the arrival of virtualisation,  when I used to 
manage MS Exchange servers we always used to copy a DVD ISO onto the DB/Logs 
disk, so that in the event of a disk full scenario we could always instantly 
free up 4GB of space. Maybe something along those lines (dd /dev/zero to a 
file) would be good practice.
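
e.g. something like this on the mon filesystem (path and size arbitrary), which
can be deleted the moment the disk fills up and you need breathing room:

dd if=/dev/zero of=/var/lib/ceph/mon/ballast bs=1M count=4096
# ...and in an emergency:
rm /var/lib/ceph/mon/ballast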

> 
> >> On 20 Jan 2016, at 16:15, Wido den Hollander  wrote:
> >>
> >> Hello,
> >>
> >> I have an issue with a (not in production!) Ceph cluster which I'm
> >> trying to resolve.
> >>
> >> On Friday the network links between the racks failed and this caused
> >> all monitors to loose connection.
> >>
> >> Their leveldb stores kept growing and they are currently 100% full.
> >> They all have a few hunderd MB left.
> >>
> >> Starting the 'compact on start' doesn't work since the FS is 100%
> >> full.error: monitor data filesystem reached concerning levels of
> >> available storage space (available: 0% 238 MB) you may adjust 'mon
> >> data avail crit' to a lower value to make this go away (default: 0%)
> >>
> >> On of the 5 monitors is now running but that's not enough.
> >>
> >> Any ideas how to compact this leveldb? I can't free up any more space
> >> right now on these systems. Getting bigger disks in is also going to
> >> take a lot of time.
> >>
> >> Any tools outside the monitors to use here?
> >>
> >> Keep in mind, this is a pre-production cluster. We would like to keep
> >> the cluster and fix this as a good exercise of stuff which could go
> >> wrong. Dangerous tools are allowed!
> >>
> >> --
> >> Wido den Hollander
> >> 42on B.V.
> >> Ceph trainer and consultant
> >>
> >> Phone: +31 (0)20 700 9902
> >> Skype: contact42on
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> 
> 
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitors 100% full filesystem, refusing start

2016-01-20 Thread Wido den Hollander
On 01/20/2016 04:22 PM, Zoltan Arnold Nagy wrote:
> Hi Wido,
> 
> So one out of the 5 monitors are running fine then? Did that have more space 
> for it’s leveldb?
> 

Yes. That was at 99% full and by cleaning some stuff in /var/cache and
/var/log I was able to start it.

It compacted the levelDB database and is now on 1% disk usage.

Looking at the ceph_mon.cc code:

if (stats.avail_percent <= g_conf->mon_data_avail_crit) {

Setting mon_data_avail_crit to 0 does not work since 100% full is equal
to 0% free..

There is ~300M free on the other 4 monitors. I just can't start the mon
and tell it to compact.

Lesson learned here, though: always make sure you have some additional
space you can clear when you need it.

>> On 20 Jan 2016, at 16:15, Wido den Hollander  wrote:
>>
>> Hello,
>>
>> I have an issue with a (not in production!) Ceph cluster which I'm
>> trying to resolve.
>>
>> On Friday the network links between the racks failed and this caused all
>> monitors to loose connection.
>>
>> Their leveldb stores kept growing and they are currently 100% full. They
>> all have a few hunderd MB left.
>>
>> Starting the 'compact on start' doesn't work since the FS is 100%
>> full.error: monitor data filesystem reached concerning levels of
>> available storage space (available: 0% 238 MB)
>> you may adjust 'mon data avail crit' to a lower value to make this go
>> away (default: 0%)
>>
>> On of the 5 monitors is now running but that's not enough.
>>
>> Any ideas how to compact this leveldb? I can't free up any more space
>> right now on these systems. Getting bigger disks in is also going to
>> take a lot of time.
>>
>> Any tools outside the monitors to use here?
>>
>> Keep in mind, this is a pre-production cluster. We would like to keep
>> the cluster and fix this as a good exercise of stuff which could go
>> wrong. Dangerous tools are allowed!
>>
>> -- 
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Somnath Roy
Yes, thanks for the data.
BTW, Nick, do we know what is more important, more CPU cores or more frequency?
For example, we have Xeon CPUs available with a bit less frequency but with 
more cores/socket, so which one should we be going with for OSD servers?

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
Nelson
Sent: Wednesday, January 20, 2016 6:54 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] SSD OSDs - more Cores or more GHz

Excellent testing Nick!

Mark

On 01/20/2016 08:18 AM, Nick Fisk wrote:
> See this benchmark I did last year
>
> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/
>
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
>> Of Oliver Dzombic
>> Sent: 20 January 2016 13:33
>> To: ceph-us...@ceph.com
>> Subject: Re: [ceph-users] SSD OSDs - more Cores or more GHz
>>
>> Hi,
>>
>> to be honest, i never made real benchmarks about that.
>>
>> But to me, i doubt that the higher frequency of a cpu will have a "real"
>> impact on ceph's performance.
>>
>> I mean, yes, mathematically, just like Wade pointed out, its true.
>>> frequency = < latency
>>
>> But when we compare CPU's of the same model, with different frequencies.
>>
>> How much time ( in nano seconds ), do we save ?
>> I mean i have really no numbers here.
>>
>> But the difference between a 2,1 GHz and a 2,9 GHz ( Low End Xeon E5 
>> / High End Xeon E5 ) ( when it comes to delay in "memory/what ever" 
>> allocation ), will be, inside an Linux OS, quiet small. And i mean 
>> nano seconds tiny/non existing small.
>> But again, thats just my guess. Of course, if we talk about complete 
>> different CPU Models ( E5 vs. I7 vs. AMD vs. what ever ) we will have 
>> different 1st/2nd level Caches in CPU, different Architecture/RAM/everything.
>>
>> But we are talking here about pure frequency issues. So we compare 
>> identical CPU Models, just with different frequencies.
>>
>> And there, the difference, especially inside an OS and inside a 
>> productive environment must be nearly not existing.
>>
>> I can not imagine how much an OSD / HDD needs to be hammered, that a 
>> server is in general not totally overloaded and that the higher 
>> frequency will make a measureable difference.
>>
>> 
>>
>> But again, i have here no numbers/benchmarks that could proove this 
>> pure theory of mine.
>>
>> In the very end, more cores will usually mean more GHz frequency in sum.
>>
>> So maybe the whole discussion is very theoretically, because usually 
>> we wont run in a situation where we have to choose frequency vs. cores.
>>
>> Simply because more cores always means more frequency in sum.
>>
>> Except you compare totally different cpu models and generations, and 
>> this is even more worst theoretically and maybe pointless since the 
>> different cpu generations have totally different inner architecture 
>> which has a great impact in overall performance ( aside from numbers of 
>> frequency and cores ).
>>
>> --
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Anschrift:
>>
>> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 beim Amtsgericht Hanau
>> Geschäftsführung: Oliver Dzombic
>>
>> Steuer Nr.: 35 236 3622 1
>> UST ID: DE274086107
>>
>>
>> Am 20.01.2016 um 14:14 schrieb Wade Holler:
>>> Great commentary.
>>>
>>> While it is fundamentally true that higher clock speed equals lower 
>>> latency, I'm my practical experience we are more often interested in 
>>> latency at the concurrency profile of the applications.
>>>
>>> So in this regard I favor more cores when I have to choose, such 
>>> that we can support more concurrent operations at a queue depth of 0.
>>>
>>> Cheers
>>> Wade
>>> On Wed, Jan 20, 2016 at 7:58 AM Jan Schermer >> > wrote:
>>>
>>>  I'm using Ceph with all SSDs, I doubt you have to worry about speed 
>>> that
>>>  much with HDD (it will be abysmall either way).
>>>  With SSDs you need to start worrying about processor caches and
>> memory
>>>  colocation in NUMA systems, linux scheduler is not really that smart
>>>  right now.
>>>  Yes, the process will get its own core, but it might be a different
>>>  core every
>>>  time it spins up, this increases latencies considerably if you start
>>>  hammering
>>>  the OSDs on the same host.
>>>
>>>  But as always, YMMV ;-)
>>>
>>>  Jan
>>>
>>>
>>>  > On 20 Jan 2016, at 13:28, Oliver Dzombic >>  > wrote:
>>>  >
>>>  > Hi Jan,
>>>  >
>>>  > actually the linux kernel does this automatically anyway ( sending 
>>> new
>>>  > processes to "empty/low used" cores ).
>>>  >
>>>  > A single scrubbing/recovery or what ever process wont take more than
>>>  > 100% C

Re: [ceph-users] Ceph monitors 100% full filesystem, refusing start

2016-01-20 Thread Wido den Hollander
On 01/20/2016 04:25 PM, Joao Eduardo Luis wrote:
> On 01/20/2016 03:15 PM, Wido den Hollander wrote:
>> Hello,
>>
>> I have an issue with a (not in production!) Ceph cluster which I'm
>> trying to resolve.
>>
>> On Friday the network links between the racks failed and this caused all
>> monitors to loose connection.
>>
>> Their leveldb stores kept growing and they are currently 100% full. They
>> all have a few hunderd MB left.
> 
> I'm incredibly curious to know what was written to leveldb to bring it
> to grow unbounded. Did the monitors hold quorum? I'm guessing that would
> be a 'no', given the network failure you mentioned, hence my morbid
> curiosity in figuring out what happened there.
> 

Yes, quorum got lost. Monitors are in different racks and the core
switching failed. Since it was pre-production people didn't notice until
Tuesday.

> If you don't mind, running a 'ceph-kvstore-tool /path/to/store.db
> leveldb list > /tmp/store.dump' could, maybe, shed some light on this
> issue (at least it will dump all the keys, and maybe something will be
> obvious, don't know). I'd certainly be interested in taking a look at
> those stores if you don't mind ;)
> 

This is an 1800 OSD cluster and a ceph-kvstore-tool  list shows me
a lot, and I mean a lot, of osdmaps.

I think that stuff failed horribly due to the network flapping.

Running just the list already compacted leveldb btw. I have free space
again and the monitors are starting. Waiting for them to form a quorum
again.

>> Starting the 'compact on start' doesn't work since the FS is 100%
>> full.error: monitor data filesystem reached concerning levels of
>> available storage space (available: 0% 238 MB)
>> you may adjust 'mon data avail crit' to a lower value to make this go
>> away (default: 0%)
>>
>> On of the 5 monitors is now running but that's not enough.
>>
>> Any ideas how to compact this leveldb? I can't free up any more space
>> right now on these systems. Getting bigger disks in is also going to
>> take a lot of time.
> 
> Running 'ceph-kvstore-tool' may also force leveldb to compact on open,
> so you may have a shot there at compaction. If that doesn't work,
> 'ceph-monstore-tool' has a 'compact' command -- that should help you
> sort it out.
> 
>   -Joao
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to set a new Crushmap in production

2016-01-20 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I'm not aware of a way of slowing things down other than modifying
osd_max_backfills, osd_backfill_scan_{min,max}, and
osd_recovery_max_active as mentioned in [1] (a sketch of the injectargs
syntax is below). Injecting a new CRUSH map is usually the result of several
changes, and I will do this to prevent several restarts of backfills when a
number of changes need to happen. I don't think setting noout will do
anything for you because your OSDs will not be going down with a CRUSH
change. I didn't realize that you could change the CRUSH rule on an existing
pool, but it is in the man page. You learn something new every day.


[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg26017.html
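
For reference, those can be injected at runtime roughly like this (the values
here are just conservative examples, not a recommendation):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
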
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Jan 20, 2016 at 7:11 AM, Vincent Godin  wrote:
> Hi,
>
> I need to import a new crushmap in production (the old one is the default
> one) to define two datacenters and to isolate SSD from SATA disk. What is
> the best way to do this without starting an hurricane on the platform ?
>
> Till now, i was just using hosts (SATA OSD) on one datacenter with the
> default rule so i create a new rule in the new crushmap to do the same job
> on one datacenter on a defined SATA chassis. Here is the process I'm going
> to follow but i really need your advice :
>
> 1 - set the noout flag
> 2 - import the new crushmap
> 3 - change the rule number for the existing pool to the new one
> 4 - unset the noout flag
> 5- pray ...
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWn7y8CRDmVDuy+mK58QAARAkQAKYdhhwzKAyCm4Fwv+4O
aWjLRqoqaJHVgKHZ8LigNlesFzxeB00nEsysUDsU/AoAzR+4RPYuFKneosYV
HY8Uri4QmChG0JAy/Dh/FffpH2LUmQJ2broo2p31V2ljLIgQl+Hd+8cf9hG/
muZ5DChfj4cRMmoWCcEltt6Oc23O1zGhi5VQRh1LY60jAA/EuVL0XZBLiMcU
Pio7RwH1ZrlJQnuorXEiZY31cgNRrd4UzdQlEMXBRPzU1aj0Tgr2mHikCv59
7Fi7iI0VQLI9LD4HpX84pBahFbHamrw1EI37QaYXJrEdRQmht1YIQJpD2eso
3K3fcuCsfKYCweRydpPAWlzZfeo400CN1qunwM0Bxcm54rvRTju81YzY1yv7
TH7DGphuOeOBRp+7utQzZ2uil1iTDMqNSMJ5tdPBWETqzxULuJKGX1uzCM/Y
zeE9wEfrKax3agYyi9cCqPTT9KhYB8BsPFAobO53a2j/c1dnqvIA0ToqEUyO
kqB0Ze7rG8ZOLKgRkj/ACqC14RnMBBVR3DtmQ6Lfs3aiokUx5IzAp8pR5JI4
J32uCAUVSuUXTmnrozFaxgLgel0HM9XqPiOeXlp2gfuukeb+ENfzNfJk2zTn
cwdf3HyjapRXtZKaHa6XEhoTuqznKDbOAdTlyxlvm/SfR84BW00HbXxAPa/G
/sFU
=sN8w
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Nick Fisk
Hi Somnath,

Unfortunately I don't have any figures or really any way to generate them to 
answer that question.

However, my gut feeling is that it is heavily dependent on the type of workload. 
Faster clocks (GHz) will lower per-request latency, but only up to the point where 
the CPUs start getting busy. Once you get to that point, probably just having more 
total GHz (i.e. cores x GHz) is more important. But if you never generate a high 
enough queue depth, then you will never saturate all the cores and so will miss 
peak performance.
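
To put rough numbers on that (SKU figures quoted from memory, so treat them as 
approximate):

2 x E5-2637 v3 (4 cores @ 3.5GHz)  = ~28GHz total  -> lowest per-request latency, least headroom
2 x E5-2690 v3 (12 cores @ 2.6GHz) = ~62GHz total  -> more aggregate throughput at deep queue depths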

In the example in that article, I was using a queue depth of 1 and so was 
heavily dependent on frequency. In a lot of the tests you have been doing at 
256+ queue depth I would imagine having more total Ghz is better.

Benchmarks aside, in the real world a cluster which is fairly idle but serving 
OLTP workloads would probably be better suited to very fast clocked cores, 
scaling out with more nodes or sockets. For more batch-processing type 
workloads, where combined throughput is more important than individual requests, 
I would put my money on lots of slower cores.

Of course Turbo Boost is going to skew any testing you try to do, as even 
the 22-core monsters can scale individual cores up to ~3.5GHz at low loads. 
The problem with these is that they get a slower turbo as you load them 
up, so your latency will rise more than expected as queue depth increases. 

Hmmm... I think I have confused myself now!!

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Somnath Roy
> Sent: 20 January 2016 16:00
> To: Mark Nelson ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] SSD OSDs - more Cores or more GHz
> 
> Yes, thanks for the data..
> BTW, Nick, do we know what is more important more cpu core or more
> frequency ?
> For example, We have Xeon cpus available with a bit less frequency but with
> more cores /socket , so, which one we should be going with for OSD servers
> ?
> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: Wednesday, January 20, 2016 6:54 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] SSD OSDs - more Cores or more GHz
> 
> Excellent testing Nick!
> 
> Mark
> 
> On 01/20/2016 08:18 AM, Nick Fisk wrote:
> > See this benchmark I did last year
> >
> > http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/
> >
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Oliver Dzombic
> >> Sent: 20 January 2016 13:33
> >> To: ceph-us...@ceph.com
> >> Subject: Re: [ceph-users] SSD OSDs - more Cores or more GHz
> >>
> >> Hi,
> >>
> >> to be honest, i never made real benchmarks about that.
> >>
> >> But to me, i doubt that the higher frequency of a cpu will have a "real"
> >> impact on ceph's performance.
> >>
> >> I mean, yes, mathematically, just like Wade pointed out, its true.
> >>> frequency = < latency
> >>
> >> But when we compare CPU's of the same model, with different
> frequencies.
> >>
> >> How much time ( in nano seconds ), do we save ?
> >> I mean i have really no numbers here.
> >>
> >> But the difference between a 2,1 GHz and a 2,9 GHz ( Low End Xeon E5
> >> / High End Xeon E5 ) ( when it comes to delay in "memory/what ever"
> >> allocation ), will be, inside an Linux OS, quiet small. And i mean
> >> nano seconds tiny/non existing small.
> >> But again, thats just my guess. Of course, if we talk about complete
> >> different CPU Models ( E5 vs. I7 vs. AMD vs. what ever ) we will have
> >> different 1st/2nd level Caches in CPU, different
> Architecture/RAM/everything.
> >>
> >> But we are talking here about pure frequency issues. So we compare
> >> identical CPU Models, just with different frequencies.
> >>
> >> And there, the difference, especially inside an OS and inside a
> >> productive environment must be nearly not existing.
> >>
> >> I can not imagine how much an OSD / HDD needs to be hammered, that a
> >> server is in general not totally overloaded and that the higher
> >> frequency will make a measureable difference.
> >>
> >> 
> >>
> >> But again, i have here no numbers/benchmarks that could proove this
> >> pure theory of mine.
> >>
> >> In the very end, more cores will usually mean more GHz frequency in
> sum.
> >>
> >> So maybe the whole discussion is very theoretically, because usually
> >> we wont run in a situation where we have to choose frequency vs. cores.
> >>
> >> Simply because more cores always means more frequency in sum.
> >>
> >> Except you compare totally different cpu models and generations, and
> >> this is even more worst theoretically and maybe pointless since the
> >> different cpu generations have totally different inner architecture
> >> which has a great impact in overall performance ( aside from numbers of
> frequency and cores ).
> >>
> >> --
> >> Mit f

Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I did some tests on a single OSD server with (I think) 2 Intel
3500 drives and replication 1; the results are in IOPS. I think the
client was on a different host, but it has been a long time since I
did this test. I adjusted the number of cores on the same box and the
frequency of the processor. I also played with C-states for 1 and 8
cores. This showed me that in terms of IOPS, more cores is generally
better than GHz, and that sometimes restricting the C-states to C1 or
lower can impact performance.

In the C-State chart, "All" refers to all available c-states are in
use, "freq-range" refers to allowing the CPU to use any frequency
between the lowest available and the frequency listed in the row. If
"freq range" is not in the header, then the frequency was pinned at
the frequency listed in the row.

https://docs.google.com/a/leblancnet.us/spreadsheets/d/1hgXErxHTh3TiM9aZ2JQKJcFhlGWb4D5imdYNFiTIeeU/edit?usp=sharing
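
For anyone wanting to do similar testing, one way to pin the frequency and cap
C-states on Linux (not necessarily exactly what I used here, so verify on your
platform) is:

cpupower frequency-set -d 2.0GHz -u 2.0GHz
# and to cap C-states, boot with: intel_idle.max_cstate=1 processor.max_cstate=1
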
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWn8NxCRDmVDuy+mK58QAA1x8P/iae87JmwWhT7QGMZ4iw
jn7pjuW+Oqk7NRcsN1yofIAyg3FF5dm+bHqrnV8fQz8QqBwidZgOp9hVdj98
DtJtwjLlFdRrT9q7yveYRmMqj2Yv/CdpsG5YlIXUbzR2ss2UzVw4/P2rJ5jI
XcaMtPcT6M9EkNoYyho/Xy9AFOZX9Yyfq2r6m8/l57gSSA6jJ3wB2Q301VGs
Wo2vruomOjIAocukD0XXM9CoN1O02vlheG7K06aRphNSA1KaKDMdfv+MFaiG
anthyRwQsUquWQf0J9Vje+Ee9gSf5HX6qVuNmJf8cGJNKQgK2NYgA8eN5IsG
T2bZ95aIJ3XUroZyQxco7aaaMLy111rG8PLjkunRTWFgLnFv7eV99wQfXl+v
fy0YpTuLULUtrOA3k2UZUPdCZbBE3w1Y5/wsOhLeHzKkXWyO+nRL9wb34/it
j5Y8u1lN37YvnTQikzbvmYzgZj/OKFA+mr2SXHrB+G3g4k9arL/YDbkTOQjB
G3rlbhV1DHfSq0kNis119VW/ZUvyI+PiZySp+JiH5QwQ77P1NA7/O3V3qECC
tnUdMO8/wg97XF2xY6wDfEgD+EflqrUW3v9Pm1SSk3OMVLn+357tcL9IxaRM
+pXRCdw1VbeFn2ITKu+h8GiyGrY+CctNRB9rxeHJ9uMyhh+DcRUFXLAQf0YO
gvUX
=E4LJ
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Jan 20, 2016 at 2:01 AM, Götz Reinicke - IT Koordinator
 wrote:
> Hi folks,
>
> we plan to use more ssd OSDs in our first cluster layout instead of SAS
> osds. (more IO is needed than space)
>
> short question: What would influence the performance more? more Cores or
> more GHz/Core.
>
> Or is it as always: Depeds on the total of OSDs/nodes/repl-level/etc ... :)
>
> If needed, I can give some more detailed information on the layout.
>
> Thansk for feedback . Götz
> --
> Götz Reinicke
> IT-Koordinator
>
> Tel. +49 7141 969 82420
> E-Mail goetz.reini...@filmakademie.de
>
> Filmakademie Baden-Württemberg GmbH
> Akademiehof 10
> 71638 Ludwigsburg
> www.filmakademie.de
>
> Eintragung Amtsgericht Stuttgart HRB 205016
>
> Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
> Staatssekretär im Ministerium für Wissenschaft,
> Forschung und Kunst Baden-Württemberg
>
> Geschäftsführer: Prof. Thomas Schadt
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW -- 404 on keys in bucket.list() thousands of multipart ids listed as well.

2016-01-20 Thread seapasu...@uchicago.edu



On 1/19/16 4:00 PM, Yehuda Sadeh-Weinraub wrote:

On Fri, Jan 15, 2016 at 5:04 PM, seapasu...@uchicago.edu
 wrote:

I have looked all over and I do not see any explicit mention of
"NWS_NEXRAD_NXL2DP_PAKC_2015010111_20150101115959" in the logs nor do I
see a timestamp from November 4th although I do see log rotations dating
back to october 15th. I don't think it's possible it wasn't logged so I am
going through the bucket logs from the 'radosgw-admin log show --object'
side and I found the following::

4604932 {
4604933 "bucket": "noaa-nexrad-l2",
4604934 "time": "2015-11-04 21:29:27.346509Z",
4604935 "time_local": "2015-11-04 15:29:27.346509",
4604936 "remote_addr": "",
4604937 "object_owner": "b05f707271774dbd89674a0736c9406e",
4604938 "user": "b05f707271774dbd89674a0736c9406e",
4604939 "operation": "PUT",

I'd expect a multipart upload completion to be done with a POST, not a PUT.

Indeed it seems really weird.



4604940 "uri":
"\/noaa-nexrad-l2\/2015\/01\/01\/PAKC\/NWS_NEXRAD_NXL2DP_PAKC_2015010111_20150101115959.tar",
4604941 "http_status": "200",
4604942 "error_code": "",
4604943 "bytes_sent": 19,
4604944 "bytes_received": 0,
4604945 "object_size": 0,

Do you see a zero object_size for other multipart uploads?
I think so. I still don't know how to tell for certain if a radosgw 
object is a multipart object or not. I think all of the objects in 
noaa-nexrad-l2 bucket are multipart::


./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-{
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bucket": 
"noaa-nexrad-l2",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time": "2015-10-16 
19:49:30.579738Z",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time_local": 
"2015-10-16 14:49:30.579738",

./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "remote_addr": "",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user": 
"b05f707271774dbd89674a0736c9406e",

./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out: "operation": "POST",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "uri": 
"\/noaa-nexrad-l2\/2015\/01\/13\/KGRK\/NWS_NEXRAD_NXL2DP_KGRK_2015011304_20150113045959.tar",

./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "http_status": "200",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "error_code": "",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_sent": 331,
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_received": 152,
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "object_size": 0,
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "total_time": 0,
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user_agent": 
"Boto\/2.38.0 Python\/2.7.7 Linux\/2.6.32-573.7.1.el6.x86_64",

./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "referrer": ""
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-}

The object above 
(NWS_NEXRAD_NXL2DP_KGRK_2015011304_20150113045959.tar) pulls down 
without an issue though. Below is a paste for the object 
"NWS_NEXRAD_NXL2DP_KVBX_2015022516_20150225165959.tar", which 404's::

http://pastebin.com/Jtw8z7G4

I see two POSTs, one recorded a minute before the other, for this object, both 
with 0 size though. Does this help at all?
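
One thing I could try, if I have the hammer radosgw-admin syntax right, is to
stat the key and look for a multipart manifest (prefix/parts) in the output:

radosgw-admin object stat --bucket=noaa-nexrad-l2 --object=2015/02/25/KVBX/NWS_NEXRAD_NXL2DP_KVBX_2015022516_20150225165959.tar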


Yehuda


4604946 "total_time": 142640400,
4604947 "user_agent": "Boto\/2.38.0 Python\/2.7.7
Linux\/2.6.32-573.7.1.el6.x86_64",
4604948 "referrer": ""
4604949 }

Does this help at all. The total time seems exceptionally high. Would it be
possible that there is a timeout issue where the put request started a
multipart upload with the correct header and then timed out but the radosgw
took the data anyway?

I am surprised the radosgw returned a 200 let alone placed the key in the
bucket listing.


That said here is another object (different object) that 404s:
1650873 {
1650874 "bucket": "noaa-nexrad-l2",
1650875 "time": "2015-11-05 04:50:42.606838Z",
1650876 "time_local": "2015-11-04 22:50:42.606838",
1650877 "remote_addr": "",
1650878 "object_owner": "b05f707271774dbd89674a0736c9406e",
1650879 "user": "b05f707271774dbd89674a0736c9406e",
1650880 "operation": "PUT",
1650881 "uri":
"\/noaa-nexrad-l2\/2015\/02\/25\/KVBX\/NWS_NEXRAD_NXL2DP_KVBX_2015022516_20150225165959.tar",
1650882 "http_status": "200",
1650883 "error_code": "",
1650884 "bytes_sent": 19,
1650885 "bytes_received": 0,
1650886 "object_size": 0,
1650887 "total_time": 0,
1650888 "user_agent": "Boto\/2.38.0 Python\/2.7.7
Linux\/2.6.32-573.7.1.el6.x86_64",
1650889 "referrer": ""
1650890 }

And this one fails with a 404 as well. Does this help at all? Here is a
successful objec

Re: [ceph-users] Ceph monitors 100% full filesystem, refusing start

2016-01-20 Thread Zoltan Arnold Nagy
Wouldn’t actually blowing away the other monitors then recreating them from 
scratch solve the issue?

Never done this, just thinking out loud. It would grab the osdmap and 
everything from the other monitor and form a quorum, wouldn’t it?

> On 20 Jan 2016, at 16:26, Wido den Hollander  wrote:
> 
> On 01/20/2016 04:22 PM, Zoltan Arnold Nagy wrote:
>> Hi Wido,
>> 
>> So one out of the 5 monitors are running fine then? Did that have more space 
>> for it’s leveldb?
>> 
> 
> Yes. That was at 99% full and by cleaning some stuff in /var/cache and
> /var/log I was able to start it.
> 
> It compacted the levelDB database and is now on 1% disk usage.
> 
> Looking at the ceph_mon.cc code:
> 
> if (stats.avail_percent <= g_conf->mon_data_avail_crit) {
> 
> Setting mon_data_avail_crit to 0 does not work since 100% full is equal
> to 0% free..
> 
> There is ~300M free on the other 4 monitors. I just can't start the mon
> and tell it to compact.
> 
> Lessons learned here though, always make sure you have some additional
> space you can clear when you need it.
> 
>>> On 20 Jan 2016, at 16:15, Wido den Hollander  wrote:
>>> 
>>> Hello,
>>> 
>>> I have an issue with a (not in production!) Ceph cluster which I'm
>>> trying to resolve.
>>> 
>>> On Friday the network links between the racks failed and this caused all
>>> monitors to loose connection.
>>> 
>>> Their leveldb stores kept growing and they are currently 100% full. They
>>> all have a few hunderd MB left.
>>> 
>>> Starting the 'compact on start' doesn't work since the FS is 100%
>>> full.error: monitor data filesystem reached concerning levels of
>>> available storage space (available: 0% 238 MB)
>>> you may adjust 'mon data avail crit' to a lower value to make this go
>>> away (default: 0%)
>>> 
>>> On of the 5 monitors is now running but that's not enough.
>>> 
>>> Any ideas how to compact this leveldb? I can't free up any more space
>>> right now on these systems. Getting bigger disks in is also going to
>>> take a lot of time.
>>> 
>>> Any tools outside the monitors to use here?
>>> 
>>> Keep in mind, this is a pre-production cluster. We would like to keep
>>> the cluster and fix this as a good exercise of stuff which could go
>>> wrong. Dangerous tools are allowed!
>>> 
>>> -- 
>>> Wido den Hollander
>>> 42on B.V.
>>> Ceph trainer and consultant
>>> 
>>> Phone: +31 (0)20 700 9902
>>> Skype: contact42on
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>> 
> 
> 
> -- 
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW -- 404 on keys in bucket.list() thousands of multipart ids listed as well.

2016-01-20 Thread Yehuda Sadeh-Weinraub
On Wed, Jan 20, 2016 at 10:43 AM, seapasu...@uchicago.edu
 wrote:
>
>
> On 1/19/16 4:00 PM, Yehuda Sadeh-Weinraub wrote:
>>
>> On Fri, Jan 15, 2016 at 5:04 PM, seapasu...@uchicago.edu
>>  wrote:
>>>
>>> I have looked all over and I do not see any explicit mention of
>>> "NWS_NEXRAD_NXL2DP_PAKC_2015010111_20150101115959" in the logs nor do
>>> I
>>> see a timestamp from November 4th although I do see log rotations dating
>>> back to october 15th. I don't think it's possible it wasn't logged so I
>>> am
>>> going through the bucket logs from the 'radosgw-admin log show --object'
>>> side and I found the following::
>>>
>>> 4604932 {
>>> 4604933 "bucket": "noaa-nexrad-l2",
>>> 4604934 "time": "2015-11-04 21:29:27.346509Z",
>>> 4604935 "time_local": "2015-11-04 15:29:27.346509",
>>> 4604936 "remote_addr": "",
>>> 4604937 "object_owner": "b05f707271774dbd89674a0736c9406e",
>>> 4604938 "user": "b05f707271774dbd89674a0736c9406e",
>>> 4604939 "operation": "PUT",
>>
>> I'd expect a multipart upload completion to be done with a POST, not a
>> PUT.
>
> Indeed it seems really weird.
>>
>>
>>> 4604940 "uri":
>>>
>>> "\/noaa-nexrad-l2\/2015\/01\/01\/PAKC\/NWS_NEXRAD_NXL2DP_PAKC_2015010111_20150101115959.tar",
>>> 4604941 "http_status": "200",
>>> 4604942 "error_code": "",
>>> 4604943 "bytes_sent": 19,
>>> 4604944 "bytes_received": 0,
>>> 4604945 "object_size": 0,
>>
>> Do you see a zero object_size for other multipart uploads?
>
> I think so. I still don't know how to tell for certain if a radosgw object
> is a multipart object or not. I think all of the objects in noaa-nexrad-l2
> bucket are multipart::
>
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-{
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bucket":
> "noaa-nexrad-l2",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time": "2015-10-16
> 19:49:30.579738Z",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time_local":
> "2015-10-16 14:49:30.579738",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "remote_addr": "",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user":
> "b05f707271774dbd89674a0736c9406e",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out: "operation": "POST",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "uri":
> "\/noaa-nexrad-l2\/2015\/01\/13\/KGRK\/NWS_NEXRAD_NXL2DP_KGRK_2015011304_20150113045959.tar",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "http_status": "200",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "error_code": "",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_sent": 331,
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_received": 152,
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "object_size": 0,
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "total_time": 0,
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user_agent":
> "Boto\/2.38.0 Python\/2.7.7 Linux\/2.6.32-573.7.1.el6.x86_64",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "referrer": ""
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-}
>
> The objects above (NWS_NEXRAD_NXL2DP_KGRK_2015011304_20150113045959.tar)
> pulls down without an issue though. Below is a paste for object
> "NWS_NEXRAD_NXL2DP_KVBX_2015022516_20150225165959.tar" which 404's::
> http://pastebin.com/Jtw8z7G4

Sadly the log doesn't provide all the input, but I can guess what the
operations were:

 - POST (init multipart upload)
 - PUT (upload part)
 - GET (list parts)
 - POST (complete multipart) <-- took > 57 seconds to process
 - POST (complete multipart)
 - HEAD (stat object)

For some reason the complete multipart operation took too long, which
I think triggered a client retry (either that, or an abort). Then
there were two completions racing (or a complete and abort), which
might have caused the issue we're seeing for some reason. E.g., two
completions might have ended up with the second completion noticing
that it's overwriting an existing object (that we just created),
sending the 'old' object to be garbage collected, even though that
'old' object's tail is the same tail the new object uses.


>
> I see two posts one recorded a minute before for this object both with 0
> size though. Does this help at all?

Yes, very much

Thanks,
Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitors 100% full filesystem, refusing start

2016-01-20 Thread Wido den Hollander
On 01/20/2016 08:01 PM, Zoltan Arnold Nagy wrote:
> Wouldn’t actually blowing away the other monitors then recreating them
> from scratch solve the issue?
> 
> Never done this, just thinking out loud. It would grab the osdmap and
> everything from the other monitor and form a quorum, wouldn’t it?
> 

Nope, those monitors will not have the historical OSDMaps that will be
required by OSDs that need to catch up with the cluster.

It might be possible technically by hacking a lot of stuff, but that
won't be easy.

I'm still busy with this btw. The monitors are in an electing state since
2 monitors are still synchronizing and one won't boot anymore :(
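
For the record, the knobs involved here, as far as I remember them (a
rough sketch of what I'd put in ceph.conf once there is disk headroom
again; double-check the option names and values against your release).
A running mon can also be compacted with 'ceph tell mon.<id> compact'.

[mon]
# compact the leveldb store every time the monitor starts
mon compact on start = true
# warn / refuse to run when the mon data filesystem gets this full
mon data avail warn = 30
mon data avail crit = 5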

>> On 20 Jan 2016, at 16:26, Wido den Hollander > > wrote:
>>
>> On 01/20/2016 04:22 PM, Zoltan Arnold Nagy wrote:
>>> Hi Wido,
>>>
>>> So one out of the 5 monitors are running fine then? Did that have
>>> more space for it’s leveldb?
>>>
>>
>> Yes. That was at 99% full and by cleaning some stuff in /var/cache and
>> /var/log I was able to start it.
>>
>> It compacted the levelDB database and is now on 1% disk usage.
>>
>> Looking at the ceph_mon.cc code:
>>
>> if (stats.avail_percent <= g_conf->mon_data_avail_crit) {
>>
>> Setting mon_data_avail_crit to 0 does not work since 100% full is equal
>> to 0% free..
>>
>> There is ~300M free on the other 4 monitors. I just can't start the mon
>> and tell it to compact.
>>
>> Lessons learned here though, always make sure you have some additional
>> space you can clear when you need it.
>>
 On 20 Jan 2016, at 16:15, Wido den Hollander >>> > wrote:

 Hello,

 I have an issue with a (not in production!) Ceph cluster which I'm
 trying to resolve.

 On Friday the network links between the racks failed and this caused all
 monitors to loose connection.

 Their leveldb stores kept growing and they are currently 100% full. They
 all have a few hunderd MB left.

 Starting the 'compact on start' doesn't work since the FS is 100%
 full.error: monitor data filesystem reached concerning levels of
 available storage space (available: 0% 238 MB)
 you may adjust 'mon data avail crit' to a lower value to make this go
 away (default: 0%)

 On of the 5 monitors is now running but that's not enough.

 Any ideas how to compact this leveldb? I can't free up any more space
 right now on these systems. Getting bigger disks in is also going to
 take a lot of time.

 Any tools outside the monitors to use here?

 Keep in mind, this is a pre-production cluster. We would like to keep
 the cluster and fix this as a good exercise of stuff which could go
 wrong. Dangerous tools are allowed!

 -- 
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
>>
>>
>> -- 
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW -- 404 on keys in bucket.list() thousands of multipart ids listed as well.

2016-01-20 Thread seapasu...@uchicago.edu
So is there any way to prevent this from happening going forward? I mean,
ideally this should never be possible, right? Even if the completed
object is 0 bytes, it should be downloaded as 0 bytes and have a
different md5sum, not report as 7 MB?



On 1/20/16 1:30 PM, Yehuda Sadeh-Weinraub wrote:

On Wed, Jan 20, 2016 at 10:43 AM, seapasu...@uchicago.edu
 wrote:


On 1/19/16 4:00 PM, Yehuda Sadeh-Weinraub wrote:

On Fri, Jan 15, 2016 at 5:04 PM, seapasu...@uchicago.edu
 wrote:

I have looked all over and I do not see any explicit mention of
"NWS_NEXRAD_NXL2DP_PAKC_2015010111_20150101115959" in the logs nor do
I
see a timestamp from November 4th although I do see log rotations dating
back to october 15th. I don't think it's possible it wasn't logged so I
am
going through the bucket logs from the 'radosgw-admin log show --object'
side and I found the following::

4604932 {
4604933 "bucket": "noaa-nexrad-l2",
4604934 "time": "2015-11-04 21:29:27.346509Z",
4604935 "time_local": "2015-11-04 15:29:27.346509",
4604936 "remote_addr": "",
4604937 "object_owner": "b05f707271774dbd89674a0736c9406e",
4604938 "user": "b05f707271774dbd89674a0736c9406e",
4604939 "operation": "PUT",

I'd expect a multipart upload completion to be done with a POST, not a
PUT.

Indeed it seems really weird.



4604940 "uri":

"\/noaa-nexrad-l2\/2015\/01\/01\/PAKC\/NWS_NEXRAD_NXL2DP_PAKC_2015010111_20150101115959.tar",
4604941 "http_status": "200",
4604942 "error_code": "",
4604943 "bytes_sent": 19,
4604944 "bytes_received": 0,
4604945 "object_size": 0,

Do you see a zero object_size for other multipart uploads?

I think so. I still don't know how to tell for certain if a radosgw object
is a multipart object or not. I think all of the objects in noaa-nexrad-l2
bucket are multipart::

./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-{
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bucket":
"noaa-nexrad-l2",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time": "2015-10-16
19:49:30.579738Z",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time_local":
"2015-10-16 14:49:30.579738",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "remote_addr": "",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user":
"b05f707271774dbd89674a0736c9406e",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out: "operation": "POST",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "uri":
"\/noaa-nexrad-l2\/2015\/01\/13\/KGRK\/NWS_NEXRAD_NXL2DP_KGRK_2015011304_20150113045959.tar",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "http_status": "200",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "error_code": "",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_sent": 331,
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_received": 152,
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "object_size": 0,
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "total_time": 0,
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user_agent":
"Boto\/2.38.0 Python\/2.7.7 Linux\/2.6.32-573.7.1.el6.x86_64",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "referrer": ""
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-}

The objects above (NWS_NEXRAD_NXL2DP_KGRK_2015011304_20150113045959.tar)
pulls down without an issue though. Below is a paste for object
"NWS_NEXRAD_NXL2DP_KVBX_2015022516_20150225165959.tar" which 404's::
http://pastebin.com/Jtw8z7G4

Sadly the log doesn't provide all the input, but I can guess what the
operations were:

  - POST (init multipart upload)
  - PUT (upload part)
  - GET (list parts)
  - POST (complete multipart) <-- took > 57 seconds to process
  - POST (complete multipart)
  - HEAD (stat object)

For some reason the complete multipart operation took too long, which
I think triggered a client retry (either that, or an abort). Then
there were two completions racing (or a complete and abort), which
might have caused the issue we're seeing for some reason. E.g., two
completions might have ended up with the second completion noticing
that it's overwriting an existing object (that we just created),
sending the 'old' object to be garbage collected, when that object's
tail is actually its own tail.



I see two posts one recorded a minute before for this object both with 0
size though. Does this help at all?

Yes, very much

Thanks,
Yehuda


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW -- 404 on keys in bucket.list() thousands of multipart ids listed as well.

2016-01-20 Thread Yehuda Sadeh-Weinraub
We'll need to confirm that this is the actual issue, and then have it
fixed. It would be nice to have some kind of a unit test that reproduces
it.

Yehuda

On Wed, Jan 20, 2016 at 1:34 PM, seapasu...@uchicago.edu
 wrote:
> So is there any way to prevent this from happening going forward? I mean
> ideally this should never be possible, right? Even with a complete object
> that is 0 bytes it should be downloaded as 0 bytes and have a different
> md5sum and not report as 7mb?
>
>
>
> On 1/20/16 1:30 PM, Yehuda Sadeh-Weinraub wrote:
>>
>> On Wed, Jan 20, 2016 at 10:43 AM, seapasu...@uchicago.edu
>>  wrote:
>>>
>>>
>>> On 1/19/16 4:00 PM, Yehuda Sadeh-Weinraub wrote:

 On Fri, Jan 15, 2016 at 5:04 PM, seapasu...@uchicago.edu
  wrote:
>
> I have looked all over and I do not see any explicit mention of
> "NWS_NEXRAD_NXL2DP_PAKC_2015010111_20150101115959" in the logs nor
> do
> I
> see a timestamp from November 4th although I do see log rotations
> dating
> back to october 15th. I don't think it's possible it wasn't logged so I
> am
> going through the bucket logs from the 'radosgw-admin log show
> --object'
> side and I found the following::
>
> 4604932 {
> 4604933 "bucket": "noaa-nexrad-l2",
> 4604934 "time": "2015-11-04 21:29:27.346509Z",
> 4604935 "time_local": "2015-11-04 15:29:27.346509",
> 4604936 "remote_addr": "",
> 4604937 "object_owner": "b05f707271774dbd89674a0736c9406e",
> 4604938 "user": "b05f707271774dbd89674a0736c9406e",
> 4604939 "operation": "PUT",

 I'd expect a multipart upload completion to be done with a POST, not a
 PUT.
>>>
>>> Indeed it seems really weird.


> 4604940 "uri":
>
>
> "\/noaa-nexrad-l2\/2015\/01\/01\/PAKC\/NWS_NEXRAD_NXL2DP_PAKC_2015010111_20150101115959.tar",
> 4604941 "http_status": "200",
> 4604942 "error_code": "",
> 4604943 "bytes_sent": 19,
> 4604944 "bytes_received": 0,
> 4604945 "object_size": 0,

 Do you see a zero object_size for other multipart uploads?
>>>
>>> I think so. I still don't know how to tell for certain if a radosgw
>>> object
>>> is a multipart object or not. I think all of the objects in
>>> noaa-nexrad-l2
>>> bucket are multipart::
>>>
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-{
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bucket":
>>> "noaa-nexrad-l2",
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time": "2015-10-16
>>> 19:49:30.579738Z",
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time_local":
>>> "2015-10-16 14:49:30.579738",
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "remote_addr": "",
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user":
>>> "b05f707271774dbd89674a0736c9406e",
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out: "operation": "POST",
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "uri":
>>>
>>> "\/noaa-nexrad-l2\/2015\/01\/13\/KGRK\/NWS_NEXRAD_NXL2DP_KGRK_2015011304_20150113045959.tar",
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "http_status":
>>> "200",
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "error_code": "",
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_sent": 331,
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_received":
>>> 152,
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "object_size": 0,
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "total_time": 0,
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user_agent":
>>> "Boto\/2.38.0 Python\/2.7.7 Linux\/2.6.32-573.7.1.el6.x86_64",
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "referrer": ""
>>> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-}
>>>
>>> The objects above
>>> (NWS_NEXRAD_NXL2DP_KGRK_2015011304_20150113045959.tar)
>>> pulls down without an issue though. Below is a paste for object
>>> "NWS_NEXRAD_NXL2DP_KVBX_2015022516_20150225165959.tar" which 404's::
>>> http://pastebin.com/Jtw8z7G4
>>
>> Sadly the log doesn't provide all the input, but I can guess what the
>> operations were:
>>
>>   - POST (init multipart upload)
>>   - PUT (upload part)
>>   - GET (list parts)
>>   - POST (complete multipart) <-- took > 57 seconds to process
>>   - POST (complete multipart)
>>   - HEAD (stat object)
>>
>> For some reason the complete multipart operation took too long, which
>> I think triggered a client retry (either that, or an abort). Then
>> there were two completions racing (or a complete and abort), which
>> might have caused the issue we're seeing for some reason. E.g., two
>> completions might have ended up with the second completion noticing
>> that it's overwriting an existing object (that we just created),
>> sendi

Re: [ceph-users] RGW -- 404 on keys in bucket.list() thousands of multipart ids listed as well.

2016-01-20 Thread seapasu...@uchicago.edu
I'm working on getting the code they used and trying different timeouts 
in my multipart upload code. Right now I have not created any new 404 
keys though :-(


On 1/20/16 3:44 PM, Yehuda Sadeh-Weinraub wrote:

We'll need to confirm that this is the actual issue, and then have it
fixed. It would be nice to have some kind of a unitest that reproduces
it.

Yehuda

On Wed, Jan 20, 2016 at 1:34 PM, seapasu...@uchicago.edu
 wrote:

So is there any way to prevent this from happening going forward? I mean
ideally this should never be possible, right? Even with a complete object
that is 0 bytes it should be downloaded as 0 bytes and have a different
md5sum and not report as 7mb?



On 1/20/16 1:30 PM, Yehuda Sadeh-Weinraub wrote:

On Wed, Jan 20, 2016 at 10:43 AM, seapasu...@uchicago.edu
 wrote:


On 1/19/16 4:00 PM, Yehuda Sadeh-Weinraub wrote:

On Fri, Jan 15, 2016 at 5:04 PM, seapasu...@uchicago.edu
 wrote:

I have looked all over and I do not see any explicit mention of
"NWS_NEXRAD_NXL2DP_PAKC_2015010111_20150101115959" in the logs nor
do
I
see a timestamp from November 4th although I do see log rotations
dating
back to october 15th. I don't think it's possible it wasn't logged so I
am
going through the bucket logs from the 'radosgw-admin log show
--object'
side and I found the following::

4604932 {
4604933 "bucket": "noaa-nexrad-l2",
4604934 "time": "2015-11-04 21:29:27.346509Z",
4604935 "time_local": "2015-11-04 15:29:27.346509",
4604936 "remote_addr": "",
4604937 "object_owner": "b05f707271774dbd89674a0736c9406e",
4604938 "user": "b05f707271774dbd89674a0736c9406e",
4604939 "operation": "PUT",

I'd expect a multipart upload completion to be done with a POST, not a
PUT.

Indeed it seems really weird.



4604940 "uri":


"\/noaa-nexrad-l2\/2015\/01\/01\/PAKC\/NWS_NEXRAD_NXL2DP_PAKC_2015010111_20150101115959.tar",
4604941 "http_status": "200",
4604942 "error_code": "",
4604943 "bytes_sent": 19,
4604944 "bytes_received": 0,
4604945 "object_size": 0,

Do you see a zero object_size for other multipart uploads?

I think so. I still don't know how to tell for certain if a radosgw
object
is a multipart object or not. I think all of the objects in
noaa-nexrad-l2
bucket are multipart::

./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-{
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bucket":
"noaa-nexrad-l2",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time": "2015-10-16
19:49:30.579738Z",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time_local":
"2015-10-16 14:49:30.579738",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "remote_addr": "",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user":
"b05f707271774dbd89674a0736c9406e",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out: "operation": "POST",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "uri":

"\/noaa-nexrad-l2\/2015\/01\/13\/KGRK\/NWS_NEXRAD_NXL2DP_KGRK_2015011304_20150113045959.tar",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "http_status":
"200",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "error_code": "",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_sent": 331,
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_received":
152,
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "object_size": 0,
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "total_time": 0,
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user_agent":
"Boto\/2.38.0 Python\/2.7.7 Linux\/2.6.32-573.7.1.el6.x86_64",
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "referrer": ""
./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-}

The objects above
(NWS_NEXRAD_NXL2DP_KGRK_2015011304_20150113045959.tar)
pulls down without an issue though. Below is a paste for object
"NWS_NEXRAD_NXL2DP_KVBX_2015022516_20150225165959.tar" which 404's::
http://pastebin.com/Jtw8z7G4

Sadly the log doesn't provide all the input, but I can guess what the
operations were:

   - POST (init multipart upload)
   - PUT (upload part)
   - GET (list parts)
   - POST (complete multipart) <-- took > 57 seconds to process
   - POST (complete multipart)
   - HEAD (stat object)

For some reason the complete multipart operation took too long, which
I think triggered a client retry (either that, or an abort). Then
there were two completions racing (or a complete and abort), which
might have caused the issue we're seeing for some reason. E.g., two
completions might have ended up with the second completion noticing
that it's overwriting an existing object (that we just created),
sending the 'old' object to be garbage collected, when that object's
tail is actually its own tail.



I see two posts one recorded a minute before for this object both with 0
size though. Does this help at all?

Yes, very much

Thanks,
Yehuda




Re: [ceph-users] RGW -- 404 on keys in bucket.list() thousands of multipart ids listed as well.

2016-01-20 Thread Yehuda Sadeh-Weinraub
Keep in mind that if the problem is that the tail is being sent to
garbage collection, you'll only see the 404 after a few hours. A
quicker way to check is to list the gc entries (with
--include-all).
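
For example, something along these lines (the grep pattern is just the
key name from this thread; the tail/shadow oids in the gc list should
contain it):

radosgw-admin gc list --include-all > /tmp/gc-list.json
grep -c NWS_NEXRAD_NXL2DP_KVBX_2015022516_20150225165959 /tmp/gc-list.json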

Yehuda

On Wed, Jan 20, 2016 at 1:52 PM, seapasu...@uchicago.edu
 wrote:
> I'm working on getting the code they used and trying different timeouts in
> my multipart upload code. Right now I have not created any new 404 keys
> though :-(
>
>
> On 1/20/16 3:44 PM, Yehuda Sadeh-Weinraub wrote:
>>
>> We'll need to confirm that this is the actual issue, and then have it
>> fixed. It would be nice to have some kind of a unitest that reproduces
>> it.
>>
>> Yehuda
>>
>> On Wed, Jan 20, 2016 at 1:34 PM, seapasu...@uchicago.edu
>>  wrote:
>>>
>>> So is there any way to prevent this from happening going forward? I mean
>>> ideally this should never be possible, right? Even with a complete object
>>> that is 0 bytes it should be downloaded as 0 bytes and have a different
>>> md5sum and not report as 7mb?
>>>
>>>
>>>
>>> On 1/20/16 1:30 PM, Yehuda Sadeh-Weinraub wrote:

 On Wed, Jan 20, 2016 at 10:43 AM, seapasu...@uchicago.edu
  wrote:
>
>
> On 1/19/16 4:00 PM, Yehuda Sadeh-Weinraub wrote:
>>
>> On Fri, Jan 15, 2016 at 5:04 PM, seapasu...@uchicago.edu
>>  wrote:
>>>
>>> I have looked all over and I do not see any explicit mention of
>>> "NWS_NEXRAD_NXL2DP_PAKC_2015010111_20150101115959" in the logs
>>> nor
>>> do
>>> I
>>> see a timestamp from November 4th although I do see log rotations
>>> dating
>>> back to october 15th. I don't think it's possible it wasn't logged so
>>> I
>>> am
>>> going through the bucket logs from the 'radosgw-admin log show
>>> --object'
>>> side and I found the following::
>>>
>>> 4604932 {
>>> 4604933 "bucket": "noaa-nexrad-l2",
>>> 4604934 "time": "2015-11-04 21:29:27.346509Z",
>>> 4604935 "time_local": "2015-11-04 15:29:27.346509",
>>> 4604936 "remote_addr": "",
>>> 4604937 "object_owner":
>>> "b05f707271774dbd89674a0736c9406e",
>>> 4604938 "user": "b05f707271774dbd89674a0736c9406e",
>>> 4604939 "operation": "PUT",
>>
>> I'd expect a multipart upload completion to be done with a POST, not a
>> PUT.
>
> Indeed it seems really weird.
>>
>>
>>> 4604940 "uri":
>>>
>>>
>>>
>>> "\/noaa-nexrad-l2\/2015\/01\/01\/PAKC\/NWS_NEXRAD_NXL2DP_PAKC_2015010111_20150101115959.tar",
>>> 4604941 "http_status": "200",
>>> 4604942 "error_code": "",
>>> 4604943 "bytes_sent": 19,
>>> 4604944 "bytes_received": 0,
>>> 4604945 "object_size": 0,
>>
>> Do you see a zero object_size for other multipart uploads?
>
> I think so. I still don't know how to tell for certain if a radosgw
> object
> is a multipart object or not. I think all of the objects in
> noaa-nexrad-l2
> bucket are multipart::
>
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-{
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bucket":
> "noaa-nexrad-l2",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time":
> "2015-10-16
> 19:49:30.579738Z",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "time_local":
> "2015-10-16 14:49:30.579738",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "remote_addr": "",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user":
> "b05f707271774dbd89674a0736c9406e",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out: "operation":
> "POST",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "uri":
>
>
> "\/noaa-nexrad-l2\/2015\/01\/13\/KGRK\/NWS_NEXRAD_NXL2DP_KGRK_2015011304_20150113045959.tar",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "http_status":
> "200",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "error_code": "",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_sent": 331,
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "bytes_received":
> 152,
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "object_size": 0,
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "total_time": 0,
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "user_agent":
> "Boto\/2.38.0 Python\/2.7.7 Linux\/2.6.32-573.7.1.el6.x86_64",
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out- "referrer": ""
> ./2015-10-16-14-default.384153.1-noaa-nexrad-l2.out-}
>
> The objects above
> (NWS_NEXRAD_NXL2DP_KGRK_2015011304_20150113045959.tar)
> pulls down without an issue though. Below is a paste for object
> "NWS_NEXRAD_NXL2DP_KVBX_2015022516_20150225165959.tar" which
> 404'

Re: [ceph-users] Ceph monitors 100% full filesystem, refusing start

2016-01-20 Thread Zoltan Arnold Nagy
Wouldn’t this be the same operation as growing the number of monitors from 
let’s say 3 to 5 in an already running, production cluster, which AFAIK is 
supported?

Just in this case it’s not 3->5 but 1->X :)

> On 20 Jan 2016, at 22:04, Wido den Hollander  wrote:
> 
> On 01/20/2016 08:01 PM, Zoltan Arnold Nagy wrote:
>> Wouldn’t actually blowing away the other monitors then recreating them
>> from scratch solve the issue?
>> 
>> Never done this, just thinking out loud. It would grab the osdmap and
>> everything from the other monitor and form a quorum, wouldn’t it?
>> 
> 
> Nope, those monitors will not have any historical OSDMaps which will be
> required by OSDs which need to catch up with the cluster.
> 
> It might be possible technically by hacking a lot of stuff, but that
> won't be easy.
> 
> I'm still busy with this btw. The monitors are in a electing state since
> 2 monitors are still synchronizing and one won't boot anymore :(
> 
>>> On 20 Jan 2016, at 16:26, Wido den Hollander >> > wrote:
>>> 
>>> On 01/20/2016 04:22 PM, Zoltan Arnold Nagy wrote:
 Hi Wido,
 
 So one out of the 5 monitors are running fine then? Did that have
 more space for it’s leveldb?
 
>>> 
>>> Yes. That was at 99% full and by cleaning some stuff in /var/cache and
>>> /var/log I was able to start it.
>>> 
>>> It compacted the levelDB database and is now on 1% disk usage.
>>> 
>>> Looking at the ceph_mon.cc code:
>>> 
>>> if (stats.avail_percent <= g_conf->mon_data_avail_crit) {
>>> 
>>> Setting mon_data_avail_crit to 0 does not work since 100% full is equal
>>> to 0% free..
>>> 
>>> There is ~300M free on the other 4 monitors. I just can't start the mon
>>> and tell it to compact.
>>> 
>>> Lessons learned here though, always make sure you have some additional
>>> space you can clear when you need it.
>>> 
> On 20 Jan 2016, at 16:15, Wido den Hollander  > wrote:
> 
> Hello,
> 
> I have an issue with a (not in production!) Ceph cluster which I'm
> trying to resolve.
> 
> On Friday the network links between the racks failed and this caused all
> monitors to loose connection.
> 
> Their leveldb stores kept growing and they are currently 100% full. They
> all have a few hunderd MB left.
> 
> Starting the 'compact on start' doesn't work since the FS is 100%
> full.error: monitor data filesystem reached concerning levels of
> available storage space (available: 0% 238 MB)
> you may adjust 'mon data avail crit' to a lower value to make this go
> away (default: 0%)
> 
> On of the 5 monitors is now running but that's not enough.
> 
> Any ideas how to compact this leveldb? I can't free up any more space
> right now on these systems. Getting bigger disks in is also going to
> take a lot of time.
> 
> Any tools outside the monitors to use here?
> 
> Keep in mind, this is a pre-production cluster. We would like to keep
> the cluster and fix this as a good exercise of stuff which could go
> wrong. Dangerous tools are allowed!
> 
> -- 
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
 
>>> 
>>> 
>>> -- 
>>> Wido den Hollander
>>> 42on B.V.
>>> Ceph trainer and consultant
>>> 
>>> Phone: +31 (0)20 700 9902
>>> Skype: contact42on
>> 
> 
> 
> -- 
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph fuse closing stale session while still operable

2016-01-20 Thread Gregory Farnum
On Wed, Jan 20, 2016 at 6:58 AM, Oliver Dzombic  wrote:
> Hi,
>
> i am testing on centos 6 x64 minimal install.
>
> i am mounting successfully:
>
> ceph-fuse -m 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789,10.0.0.4:6789
> /ceph-storage/
>
>
> [root@cn201 log]# df
> Filesystem1K-blocksUsed   Available Use% Mounted on
> /dev/sda1  74454192 122864469436748   2% /
> tmpfs  16433588   016433588   0% /dev/shm
> ceph-fuse  104468783104 55774867456 48693915648  54% /ceph-storage
>
>
> Its all fine.
>
> Then i start a (bigger) write:
>
> dd if=/dev/zero bs=256M count=16 of=/ceph-storage/test/dd1
>
> After a second it reaches:
>
> [root@cn201 test]# ls -la /ceph-storage/test/dd1
> -rw-r--r-- 1 root root 104726528 Jan 20 15:34 /ceph-storage/test/dd1
>
> and remains there, no byte further.
>
> #ps ax shows:
>
>  1573 pts/0S+ 0:00 dd if=/dev/zero bs=256M count=16
> of=/ceph-storage/test/dd1
>
>
> ---
>
> The Kernellog just shows:
>
> fuse init (API version 7.14)
>
> after the mount.
>
>
> ---
>
> on a ceph clusternode i can see:
>
>
> [root@ceph2 ceph]# cat ceph-mds.ceph2.log
> 2016-01-20 15:34:07.728239 7f3832ddb700  0 log_channel(cluster) log
> [INF] : closing stale session client.21176728 10.0.0.91:0/1635 after
> 301.302291
>
>
> But still i can work with the mount. df, ls, even touch  works
> perfectly. Just writing bigger amounts of data somehow freeze.

What's the output of "ceph -s"? What this sounds like is that you
aren't able to flush any data out to RADOS and so it's blocking on the
dirty page limits.
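
If you have an admin socket configured for the client, you can also look
at what the ceph-fuse process is doing on the RADOS side, e.g. (the
socket path below is just an example and depends on your [client]
settings):

ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok objecter_requests
ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok perf dump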

>
>
>
> I had this issue already with my last tests with
>
> Centos 7
> Debian 7
> Debian 8
> Ubuntu 14
>
> All in x64
>
> 
>
>
> So as i think that this is no general bug, i assume i have a setup mistake.
>
> So this is my setup for the current setup with centos 6:
>
> 1. centos netinstall x64 minimal
> 2. yum update -y
> 3.
> rpm -i
> http://download.ceph.com/rpm-hammer/el6/noarch/ceph-release-1-1.el6.noarch.rpm
>
> rpm -i
> http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
>
> yum -y install yum-plugin-priorities
>
> sed -i -e "s/enabled=1/enabled=1\npriority=1/g" /etc/yum.repos.d/ceph.repo
>
> yum -y install ceph-fuse
>
> 4. deactivate selinux
> 5. network config:
>
> [root@cn201 test]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
> DEVICE=eth1
> HWADDR=0C:C4:7A:16:EE:3F
> TYPE=Ethernet
> UUID=19df403f-c1f2-4a39-a458-5596af108ca6
> BOOTPROTO=none
> ONBOOT=yes
> IPADDR0="10.0.0.91"
> PREFIX0="24"
> MTU=9000
>
>
> 6. Copy ceph.client.admin.keyring to /etc/ceph/

What are the contents of this file? In particular, does it have access
permissions on both the mds and the OSD? (Which ones?)
-Greg

>
> 7. mountint:
>
> #ceph-fuse -m 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789,10.0.0.4:6789
> /ceph-storage
>
> 8. testing:
>
> #dd if=/dev/zero bs=256M count=16 of=/ceph-storage/test/dd1
>
>
> ---
>
>
> So before i switch now all in debug mode:
>
> Anyone any idea ? At least theoretically all fine and should work ?
>
> Thank you !
>
>
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph fuse closing stale session while still operable

2016-01-20 Thread Oliver Dzombic
Hi Greg,

thank you for your time!

# ceph -s

   cluster 
 health HEALTH_WARN
62 requests are blocked > 32 sec
noscrub,nodeep-scrub flag(s) set
 monmap e9: 4 mons at
{ceph1=10.0.0.1:6789/0,ceph2=10.0.0.2:6789/0,ceph3=10.0.0.3:6789/0,ceph4=10.0.0.4:6789/0}
election epoch 4526, quorum 0,1,2,3 ceph1,ceph2,ceph3,ceph4
 mdsmap e62: 1/1/1 up {0=ceph4=up:active}, 2 up:standby
 osdmap e12522: 18 osds: 18 up, 18 in
flags noscrub,nodeep-scrub
  pgmap v9518008: 1940 pgs, 6 pools, 20554 GB data, 5191 kobjects
53232 GB used, 46396 GB / 99629 GB avail
1940 active+clean
  client io 56771 kB/s rd, 18844 kB/s wr, 2037 op/s


The warning comes from the noscrub flags (necessary at the moment because
the iscsi rbd reacts very badly to scrubbing).

-

[root@cn201 ~]# cat /etc/ceph/ceph.client.admin.keyring
[client.admin]
key = mysuperkey123


-


The cluster is very active and working perfectly.

But the other access goes over the rbd kernel module and is working fine.

Only the access via ceph-fuse is causing trouble.

Is it possible that it's a problem with flushing pages, affecting only
ceph-fuse and not the rest?

What is the best way to check that?

Thank you !

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph fuse closing stale session while still operable

2016-01-20 Thread Gregory Farnum
On Wed, Jan 20, 2016 at 4:03 PM, Oliver Dzombic  wrote:
> Hi Greg,
>
> thank you for your time!
>
> #ceph-s
>
>cluster 
>  health HEALTH_WARN
> 62 requests are blocked > 32 sec
> noscrub,nodeep-scrub flag(s) set
>  monmap e9: 4 mons at
> {ceph1=10.0.0.1:6789/0,ceph2=10.0.0.2:6789/0,ceph3=10.0.0.3:6789/0,ceph4=10.0.0.4:6789/0}
> election epoch 4526, quorum 0,1,2,3 ceph1,ceph2,ceph3,ceph4
>  mdsmap e62: 1/1/1 up {0=ceph4=up:active}, 2 up:standby
>  osdmap e12522: 18 osds: 18 up, 18 in
> flags noscrub,nodeep-scrub
>   pgmap v9518008: 1940 pgs, 6 pools, 20554 GB data, 5191 kobjects
> 53232 GB used, 46396 GB / 99629 GB avail
> 1940 active+clean
>   client io 56771 kB/s rd, 18844 kB/s wr, 2037 op/s
>
>
> The warn comes from the noscrub flags ( neccessary currently because the
> iscsi rbd reacts >very< bad on it )
>
> -
>
> [root@cn201 ~]# cat /etc/ceph/ceph.client.admin.keyring
> [client.admin]
> key = mysuperkey123

Okay, when you run "ceph auth list" what security capabilities does it
say the client.admin key has?

>
>
> -
>
>
> The cluster is very active and working perfectly.
>
> But the other access goes over the rbd kernel module and is working fine.
>
> Just the access via ceph-fuse is causing trouble.
>
> Is it possible, that its a problem with flushing pages, just concerning
> ceph-fuse and not the rest ?
>
> How can i check that at best ?

Yes, that's a possibility. My best guess is that you somehow set it up
in a way that doesn't have permission to write to the pool being used
for the CephFS storage.
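
For comparison, a key that can do CephFS I/O should show something like
the following in "ceph auth list" (the client.cephfs name and the pool
name below are only examples, adjust to whatever your data pool is
called):

client.admin
        key: <...>
        caps: [mds] allow
        caps: [mon] allow *
        caps: [osd] allow *

# or create a dedicated client with just enough caps, e.g.:
ceph auth get-or-create client.cephfs mon 'allow r' mds 'allow' osd 'allow rwx pool=data'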

>
> Thank you !
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-20 Thread Francois Lafont
Hi,

On 19/01/2016 07:24, Adam Tygart wrote:
> It appears that with --apparent-size, du adds the "size" of the
> directories to the total as well. On most filesystems this is the
> block size, or the amount of metadata space the directory is using. On
> CephFS, this size is fabricated to be the size sum of all sub-files.
> i.e. a cheap/free 'du -sh $folder'

Ah ok, interesting. I have tested and I have noticed however that size
of a directory is not updated immediately. For instance, if I change
the size of the regular file in a directory (of cephfs) the size of the
size doesn't change immediately after.

> $ stat /homes/mozes/tmp/sbatten
>   File: '/homes/mozes/tmp/sbatten'
>   Size: 138286  Blocks: 0  IO Block: 65536  directory
> Device: 0h/0d   Inode: 1099523094368  Links: 1
> Access: (0755/drwxr-xr-x)  Uid: (163587/   mozes)   Gid: (163587/mozes_users)
> Access: 2016-01-19 00:12:23.331201000 -0600
> Modify: 2015-10-14 13:38:01.098843320 -0500
> Change: 2015-10-14 13:38:01.098843320 -0500
>  Birth: -
> $ stat /tmp/sbatten/
>   File: '/tmp/sbatten/'
>   Size: 4096Blocks: 8  IO Block: 4096   directory
> Device: 803h/2051d  Inode: 9568257 Links: 2
> Access: (0755/drwxr-xr-x)  Uid: (163587/   mozes)   Gid: (163587/mozes_users)
> Access: 2016-01-19 00:12:23.331201000 -0600
> Modify: 2015-10-14 13:38:01.098843320 -0500
> Change: 2016-01-19 00:17:29.658902081 -0600
>  Birth: -
> 
> $ du -s --apparent-size -B1 /homes/mozes/tmp/sbatten
> 276572  /homes/mozes/tmp/sbatten
> $ du -s -B1 /homes/mozes/tmp/sbatten
> 147456  /homes/mozes/tmp/sbatten
> 
> $ du -s -B1 /tmp/sbatten
> 225280  /tmp/sbatten
> $ du -s --apparent-size -B1 /tmp/sbatten
> 142382  /tmp/sbatten
> 
> Notice how the apparent-size version is *exactly* the Size from the
> stat + the size from the "proper" du?

Err... exactly? Are you sure?

138286 + 147456 = 285742 which is != 276572, no?
Anyway thx for your help Adam.


-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-20 Thread Francois Lafont
On 21/01/2016 03:40, Francois Lafont wrote:

> Ah ok, interesting. I have tested and I have noticed however that size
> of a directory is not updated immediately. For instance, if I change
> the size of the regular file in a directory (of cephfs) the size of the
> size doesn't change immediately after.
  

Misprint. The "size of the directory" of course.
   ^


-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-20 Thread Gregory Farnum
On Wed, Jan 20, 2016 at 6:40 PM, Francois Lafont  wrote:
> Hi,
>
> On 19/01/2016 07:24, Adam Tygart wrote:
>> It appears that with --apparent-size, du adds the "size" of the
>> directories to the total as well. On most filesystems this is the
>> block size, or the amount of metadata space the directory is using. On
>> CephFS, this size is fabricated to be the size sum of all sub-files.
>> i.e. a cheap/free 'du -sh $folder'
>
> Ah ok, interesting. I have tested and I have noticed however that size
> of a directory is not updated immediately. For instance, if I change
> the size of the regular file in a directory (of cephfs) the size of the
> size doesn't change immediately after.

It's updated lazily so it's not instantaneous, but it should be pretty
fast. Probably within 30 seconds, and usually a lot less.
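
You can watch the recursive accounting directly via the virtual xattrs
if you're curious, e.g. (path taken from Adam's example, assuming
getfattr is installed):

getfattr -n ceph.dir.rbytes /homes/mozes/tmp/sbatten
getfattr -n ceph.dir.rfiles /homes/mozes/tmp/sbatten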
-Greg

>
>> $ stat /homes/mozes/tmp/sbatten
>>   File: '/homes/mozes/tmp/sbatten'
>>   Size: 138286  Blocks: 0  IO Block: 65536  directory
>> Device: 0h/0d   Inode: 1099523094368  Links: 1
>> Access: (0755/drwxr-xr-x)  Uid: (163587/   mozes)   Gid: (163587/mozes_users)
>> Access: 2016-01-19 00:12:23.331201000 -0600
>> Modify: 2015-10-14 13:38:01.098843320 -0500
>> Change: 2015-10-14 13:38:01.098843320 -0500
>>  Birth: -
>> $ stat /tmp/sbatten/
>>   File: '/tmp/sbatten/'
>>   Size: 4096Blocks: 8  IO Block: 4096   directory
>> Device: 803h/2051d  Inode: 9568257 Links: 2
>> Access: (0755/drwxr-xr-x)  Uid: (163587/   mozes)   Gid: (163587/mozes_users)
>> Access: 2016-01-19 00:12:23.331201000 -0600
>> Modify: 2015-10-14 13:38:01.098843320 -0500
>> Change: 2016-01-19 00:17:29.658902081 -0600
>>  Birth: -
>>
>> $ du -s --apparent-size -B1 /homes/mozes/tmp/sbatten
>> 276572  /homes/mozes/tmp/sbatten
>> $ du -s -B1 /homes/mozes/tmp/sbatten
>> 147456  /homes/mozes/tmp/sbatten
>>
>> $ du -s -B1 /tmp/sbatten
>> 225280  /tmp/sbatten
>> $ du -s --apparent-size -B1 /tmp/sbatten
>> 142382  /tmp/sbatten
>>
>> Notice how the apparent-size version is *exactly* the Size from the
>> stat + the size from the "proper" du?
>
> Err... exactly? Are you sure?
>
> 138286 + 147456 = 285742 which is != 276572, no?
> Anyway thx for your help Adam.
>
>
> --
> François Lafont
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CentOS 7 iscsi gateway using lrbd

2016-01-20 Thread Mike Christie
On 01/20/2016 06:07 AM, Nick Fisk wrote:
> Thanks for your input Mike, a couple of questions if I may
> 
> 1. Are you saying that this rbd backing store is not in mainline and is only 
> in SUSE kernels? Ie can I use this lrbd on Debian/Ubuntu/CentOS?

The target_core_rbd backing store is not upstream and only in SUSE kernels.

lrbd is the management tool that basically distributes the configuration
info to the nodes you want to run LIO on. In that README you see it uses
the target_core_rbd module by default, but last I looked there is code
to support iblock too. So you should be able to use this with other
distros that do not have target_core_rbd.
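
In that case the export is just the usual krbd + block/iblock path,
roughly like this (pool/image/IQN names are made up, ACLs/portals are
omitted, older targetcli calls the backstore "iblock" while targetcli-fb
calls it "block", and the device path depends on your udev rules):

rbd map rbd/iscsi-lun0
targetcli /backstores/block create name=iscsi-lun0 dev=/dev/rbd/rbd/iscsi-lun0
targetcli /iscsi create iqn.2016-01.com.example:gw1
targetcli /iscsi/iqn.2016-01.com.example:gw1/tpg1/luns create /backstores/block/iscsi-lun0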

When I'm done porting my code to an iblock-based approach I'm going
to test out the lrbd iblock support and fix it up if it needs anything.

> 2. Does this have any positive effect on the abort/reset death loop a number 
> of us were seeing when using LIO+krbd and ESXi?

The old code and my new approach do not really help. However, on
Monday, Ilya and I were talking about this problem, and he gave me some
hints on how to add code to cancel/cleanup commands so we will be able
to handle aborts/resets properly and so we will not fall into that problem.


> 3. Can you still use something like bcache over the krbd?

Not initially. I had been doing active/active across nodes by default,
and you cannot layer bcache over krbd as-is in that setup.




> 
> 
> 
>> -Original Message-
>> From: Mike Christie [mailto:mchri...@redhat.com]
>> Sent: 19 January 2016 21:34
>> To: Василий Ангапов ; Ilya Dryomov
>> 
>> Cc: Nick Fisk ; Tyler Bishop
>> ; Dominik Zalewski
>> ; ceph-users 
>> Subject: Re: [ceph-users] CentOS 7 iscsi gateway using lrbd
>>
>> Everyone is right - sort of :)
>>
>> It is that target_core_rbd module that I made that was rejected upstream,
>> along with modifications from SUSE which added persistent reservations
>> support. I also made some modifications to rbd so target_core_rbd and krbd
>> could share code. target_core_rbd uses rbd like a lib. And it is also
>> modifications to the targetcli related tool and libs, so you can use them to
>> control the new rbd backend. SUSE's lrbd then handles setup/management
>> of across multiple targets/gatways.
>>
>> I was going to modify targetcli more and have the user just pass in the rbd
>> info there, but did not get finished. That is why in that suse stuff you 
>> still
>> make the krbd device like normal. You then pass that to the target_core_rbd
>> module with targetcli and that is how that module knows about the rbd
>> device.
>>
>> The target_core_rbd module was rejected upstream, so I stopped
>> development and am working on the approach suggested by those
>> reviewers which instead of going from lio->target_core_rbd->krbd goes
>> lio->target_core_iblock->linux block layer->krbd. With this approach you
>> just use the normal old iblock driver and krbd and then I am modifying them
>> to just work and do the right thing.
>>
>>
>> On 01/19/2016 05:45 AM, Василий Ангапов wrote:
>>> So is it a different approach that was used here by Mike Christie:
>>> http://www.spinics.net/lists/target-devel/msg10330.html ?
>>> It seems to be a confusion because it also implements target_core_rbd
>>> module. Or not?
>>>
>>> 2016-01-19 18:01 GMT+08:00 Ilya Dryomov :
 On Tue, Jan 19, 2016 at 10:34 AM, Nick Fisk  wrote:
> But interestingly enough, if you look down to where they run the
>> targetcli ls, it shows a RBD backing store.
>
> Maybe it's using the krbd driver to actually do the Ceph side of the
>> communication, but lio plugs into this rather than just talking to a dumb 
>> block
>> device???

 It does use krbd driver.

 Thanks,

 Ilya
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph scale testing

2016-01-20 Thread Somnath Roy
Hi,
Here is the copy of the ppt I presented in today's performance meeting..

https://docs.google.com/presentation/d/1j4Lcb9fx0OY7eQlQ_iUI6TPVJ6t_orZWKJyhz0S_3ic/edit?usp=sharing

Thanks & Regards
Somnath
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph scale testing

2016-01-20 Thread Alexandre DERUMIER
Thanks Somnath !

- Original Message -
From: "Somnath Roy" 
To: "ceph-devel" , "ceph-users" 

Sent: Thursday, 21 January 2016 05:03:59
Subject: Ceph scale testing

Hi, 
Here is the copy of the ppt I presented in today's performance meeting.. 

https://docs.google.com/presentation/d/1j4Lcb9fx0OY7eQlQ_iUI6TPVJ6t_orZWKJyhz0S_3ic/edit?usp=sharing
 

Thanks & Regards 
Somnath 
-- 
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
the body of a message to majord...@vger.kernel.org 
More majordomo info at http://vger.kernel.org/majordomo-info.html 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com