[ceph-users] Re: RGW access logs with bucket name

2023-10-28 Thread Dan van der Ster
Hi Boris,

I found that you need to use debug_rgw=10 to see the bucket name :-/

e.g.
2023-10-28T19:55:42.288+ 7f34dde06700 10 req 3268931155513085118
0.0s s->object=... s->bucket=xyz-bucket-123

Did you find a more convenient way in the meantime? I think we should
log bucket name at level 1.
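
In case it is useful, a minimal sketch of bumping that level at runtime. It
assumes the cluster uses the centralized config database, and the daemon name
below is only a placeholder, so adjust both to your setup:

# raise rgw verbosity so the s->bucket lines show up (10 is very chatty)
ceph config set client.rgw debug_rgw 10
# or per daemon via its admin socket (daemon name is hypothetical):
ceph daemon client.rgw.myhost.rgw0 config set debug_rgw 10

Remember to set it back down afterwards.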

Cheers, Dan

--
Dan van der Ster
CTO

Clyso GmbH
p: +49 89 215252722 | a: Vancouver, Canada
w: https://clyso.com | e: dan.vanders...@clyso.com

Try our Ceph Analyzer: https://analyzer.clyso.com

On Thu, Mar 30, 2023 at 4:15 AM Boris Behrens  wrote:
>
> Sadly not.
> I only see the path/query of a request, but not the hostname.
> So when a bucket is accessed via hostname (https://bucket.TLD/object?query)
> I only see the object and the query (GET /object?query).
> When a bucket is accessed via path (https://TLD/bucket/object?query) I can
> also see the bucket in the log (GET bucket/object?query).
>
> On Thu, 30 Mar 2023 at 12:58, Szabo, Istvan (Agoda) <
> istvan.sz...@agoda.com> wrote:
>
> > The beast log's http request lines include the full URL, beginning with
> > the bucket name, don't they?
> >
> > Istvan Szabo
> > Staff Infrastructure Engineer
> > ---
> > Agoda Services Co., Ltd.
> > e: istvan.sz...@agoda.com
> > ---
> >
> > On 2023. Mar 30., at 17:44, Boris Behrens  wrote:
> >
> > Bringing up that topic again:
> > is it possible to log the bucket name in the rgw client logs?
> >
> > Currently I am only able to know the bucket name when someone accesses the
> > bucket via https://TLD/bucket/object instead of https://bucket.TLD/object.
> >
> > On Tue, 3 Jan 2023 at 10:25, Boris Behrens wrote:
> >
> > Hi,
> >
> > I am looking to move our logs from
> >
> > /var/log/ceph/ceph-client...log to our log aggregator.
> >
> >
> > Is there a way to have the bucket name in the log file?
> >
> >
> > Or can I write the rgw_enable_ops_log output to a file? Maybe I could work
> > with this.
> >
> >
> > Cheers and happy new year
> >
> > Boris
> >
> >
> >
> >
> > --
> > This time, as an exception, the self-help group "UTF-8 Problems" will meet
> > in the large hall.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
>
>
> --
> This time, as an exception, the self-help group "UTF-8 Problems" will meet
> in the large hall.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stickyness of writing vs full network storage writing

2023-10-28 Thread Anthony D'Atri
Well said, Herr Kraftmayer.

- aad

> On Oct 28, 2023, at 4:22 AM, Joachim Kraftmayer - ceph ambassador 
>  wrote:
> 
> Hi,
> 
> I know similar requirements, the motivation and the need behind them.
> We have chosen a clear approach to this, one that also keeps the whole setup
> from becoming too complicated to operate.
> 1.) Everything that doesn't require strong consistency we do with other 
> tools, especially when it comes to NVMe, PCIe 5.0 and newer technologies with 
> high IOPs and low latencies.
> 
> 2.) Everything that requires high data security, strong consistency and
> failure domains higher than host we do with Ceph.
> 
> Joachim
> 
> ___
> ceph ambassador DACH
> ceph consultant since 2012
> 
> Clyso GmbH - Premier Ceph Foundation Member
> 
> https://www.clyso.com/
> 
>> On 27.10.23 at 17:58, Anthony D'Atri wrote:
>> Ceph is all about strong consistency and data durability.  There can also be 
>> a distinction between performance of the cluster in aggregate vs a single 
>> client, especially in a virtualization scenario where to avoid the 
>> noisy-neighbor dynamic you deliberately throttle iops and bandwidth per 
>> client.
>> 
>>> For my discussion I am assuming nowadays PCIe based NVMe drives, which are 
>>> capable of writing about 8GiB/s, which is about 64GBit/s.
>> Written how, though?  Benchmarks sometimes are written with 100% sequential 
>> workloads, top-SKU CPUs that mortals can't afford, and especially with a 
>> queue depth of like 256.
>> 
>> With most Ceph deployments, the IO a given drive experiences is often pretty 
>> much random and with lower QD.  And depending on the drive, significant read 
>> traffic may impact write bandwidth to a degree.  At . Mountpoint 
>> (Vancouver BC 2018) someone gave a presentation about the difficulties 
>> saturating NVMe bandwidth.
>> 
>>> Now consider the situation that you have 5 nodes, each with 4 of those
>>> drives; this alone will make all small and mid-sized companies go bankrupt
>>> ;-) just from buying the corresponding networking switches.
>> Depending where you get your components...
>> 
>> * You probably don't need "mixed-use" (~3 DWPD) drives, for most purposes 
>> "read intensive" (~1DWPD) (or less, sometimes) are plenty.  But please 
>> please please stick with real enterprise-class drives.
>> 
>> * Chassis brands mark up their storage (and RAM) quite a bit.  You can often 
>> get SSDs elsewhere for half of what they cost from your chassis manufacturer.
>> 
>>>   But the server hardware is still simple commodity hardware which can
>>> easily saturate any given commodity network hardware.
>>> If I want to be able to use the full 64GBit/s I would require at least
>>> 100GBit/s networking or tons of trunked ports and cabling with lower
>>> bandwidth switches.
>> Throughput and latency are different things, though.  Also, are you assuming 
>> here the traditional topology of separate public and 
>> cluster/private/replication networks?  With modern networking (and Ceph 
>> releases) that is often overkill and you can leave out the replication 
>> network.
>> 
>> Also, would your clients have the same networking provisioned?  If you're
>> 
>>>   If we now also consider distributing the nodes over racks, buildings at
>>> the same location or distributed datacenters, the costs will be even more
>>> painful.
>> Don't you already have multiple racks?  They don't need to be dedicated only 
>> to Ceph.
>> 
>>> The ceph commit requirement will be 2 copies on different OSDs (comparable 
>>> to a mirrored drive) and in total 3 or 4 copies on the cluster (comparable 
>>> to a RAID with multiple disk redundancy)
>> Not entirely comparable, but the distinctions mostly don't matter here.
>> 
>>> In all our tests so far, we could not control how ceph persists these 2
>>> copies. It will always try to persist them somehow over the network.
>>> Q1: Is this behavior mandatory?
>> It's a question of how important the data is, and how bad it would be to 
>> lose some.
>> 
>>>   Our common workload, and afaik nearly all webservice based applications 
>>> are:
>>> - a short burst of high bandwidth (e.g. multiple MiB/s or even GiB/s)
>>> - and probably mostly a 1 write to 4 read or even 1:6 ratio on utilizing
>>> the cluster
>> QLC might help your costs, look into the D5-P5430, D5-P5366, etc.  Though 
>> these days if you shop smart you can get TLC for close to the same cost.  Won't 
>> always be true though, and you can't get a 60TB TLC SKU ;)
>> 
>>> Hope I could explain the situation here well enough.
>>> Now assuming my ideal world with ceph:
>>> if ceph would do:
>>> 1. commit 2 copies to local drives on the node where the ceph client is
>>> connected
>>> 2. after commit sync (optimized/queued) the data over the network to 
>>> fulfill the common needs of ceph storage with 4 copies
>> You could I think craft a CRUSH rule to do that.  Default for replicated 
>> pools FWIW is 3 copies not 4.
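
For what it's worth, a minimal and untested sketch of such a rule, assuming a
replicated pool with size 4 (the rule id is arbitrary, pick a free one). Note
that CRUSH hashes the placement, so it cannot pin the "first" host to whichever
node the client happens to run on:

rule replicated_2x2 {
    id 10
    type replicated
    step take default
    step choose firstn 2 type host
    step chooseleaf firstn 2 type osd
    step emit
}

Applied with "ceph osd pool set <pool> crush_rule replicated_2x2" and "ceph osd
pool set <pool> size 4". Writes are still only acknowledged once all four
replicas have committed, so this does not give the "commit locally first, sync
later" semantics asked for above.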

[ceph-users] Re: Ceph - Error ERANGE: (34) Numerical result out of range

2023-10-28 Thread Eugen Block
So this is a new host (you didn't provide the osd tree)? In that case  
I would compare the ceph.conf files between a working and this failing  
host, and paste it here (mask sensitive data). It looks like the  
connection to the MONs is successful though, and "ceph-volume create"  
worked as well. You could try to avoid a crush update on start:


[osd]
osd crush update on start = false

Or you could also try to manually assign the location:

[osd.301]
osd crush location = "root=ssd"

Try one option at a time to see which one works (if at all).
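
To see whether the OSD made it into the tree at all, and to place it by hand
if it did not (the weight and bucket names below are placeholders, take them
from your actual CRUSH map):

ceph osd tree | grep osd.301
ceph osd crush create-or-move osd.301 1.0 root=<root> host=<hostname>

That is roughly what the OSD does itself on startup when
osd_crush_update_on_start is enabled.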


Quoting Pardhiv Karri:


Hi Eugen,

Thank you for the reply. For some reason I'm not getting individual replies,
only the digest. Below is the ceph -s output (hostnames renamed) and the
command I am using to create a bluestore OSD. It should create the OSD under
a bucket for its host and the OSD should come up, but it is not creating the
host bucket, just a rogue OSD which stays down.

[root@hbmon1 ~]# ceph -s
  cluster:
id: f1579737-d2c9-49ab-a6fa-8ca952488120
health: HEALTH_WARN
116896/167701779 objects misplaced (0.070%)

  services:
mon: 3 daemons, quorum hbmon1,hbmon2,hbmon3
mgr: hbmon2(active), standbys: hbmon1, hbmon3
osd: 721 osds: 717 up, 716 in; 60 remapped pgs
rgw: 1 daemon active

  data:
pools:   13 pools, 32384 pgs
objects: 55.90M objects, 324TiB
usage:   973TiB used, 331TiB / 1.27PiB avail
pgs: 116896/167701779 objects misplaced (0.070%)
 32294 active+clean
 59active+remapped+backfill_wait
 27active+clean+scrubbing+deep
 3 active+clean+scrubbing
 1 active+remapped+backfilling

  io:
client:   237MiB/s rd, 635MiB/s wr, 10.66kop/s rd, 6.98kop/s wr
recovery: 12.9MiB/s, 1objects/s

 [root@hbmon1 ~]#


Command used to create OSD, "ceph-volume lvm create --data /dev/sda"



Debug log output of OSD creation command.

 [root@dra1361 ~]# ceph-volume lvm create --data /dev/sda
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd
--keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new
21e9a327-ada5-4734-ab5d-7be333d4f3cf
Running command: vgcreate --force --yes
ceph-81236ab2-f6e0-4cc3-9815-95c8dd16c6ef /dev/sda
 stdout: Physical volume "/dev/sda" successfully created.
 stdout: Volume group "ceph-81236ab2-f6e0-4cc3-9815-95c8dd16c6ef"
successfully created
Running command: lvcreate --yes -l 100%FREE -n
osd-block-21e9a327-ada5-4734-ab5d-7be333d4f3cf
ceph-81236ab2-f6e0-4cc3-9815-95c8dd16c6ef
 stdout: Logical volume "osd-block-21e9a327-ada5-4734-ab5d-7be333d4f3cf"
created.
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-301
--> Absolute path not found for executable: restorecon
--> Ensure $PATH environment variable contains common executable locations
Running command: chown -h ceph:ceph
/dev/ceph-81236ab2-f6e0-4cc3-9815-95c8dd16c6ef/osd-block-21e9a327-ada5-4734-ab5d-7be333d4f3cf
Running command: chown -R ceph:ceph /dev/dm-0
Running command: ln -s
/dev/ceph-81236ab2-f6e0-4cc3-9815-95c8dd16c6ef/osd-block-21e9a327-ada5-4734-ab5d-7be333d4f3cf
/var/lib/ceph/osd/ceph-301/block
Running command: ceph --cluster ceph --name client.bootstrap-osd --keyring
/var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o
/var/lib/ceph/osd/ceph-301/activate.monmap
 stderr: 2023-10-27 19:48:57.789631 7ff36a340700  2 Event(0x7ff3640e2950
nevent=5000 time_id=1).set_owner idx=0 owner=140683435575040
2023-10-27 19:48:57.789713 7ff369b3f700  2 Event(0x7ff36410f670 nevent=5000
time_id=1).set_owner idx=1 owner=140683427182336
2023-10-27 19:48:57.789771 7ff36933e700  2 Event(0x7ff36413c4e0 nevent=5000
time_id=1).set_owner idx=2 owner=140683418789632
 stderr: 2023-10-27 19:48:57.790044 7ff36c135700  1  Processor -- start
2023-10-27 19:48:57.790100 7ff36c135700  1 -- - start start
2023-10-27 19:48:57.790352 7ff36c135700  1 -- - --> 10.51.228.32:6789/0 --
auth(proto 0 38 bytes epoch 0) v1 -- 0x7ff364175e70 con 0
2023-10-27 19:48:57.790368 7ff36c135700  1 -- - --> 10.51.228.33:6789/0 --
auth(proto 0 38 bytes epoch 0) v1 -- 0x7ff3641762b0 con 0
 stderr: 2023-10-27 19:48:57.791313 7ff369b3f700  1 --
10.51.228.213:0/2678799534 learned_addr learned my addr
10.51.228.213:0/2678799534
 stderr: 2023-10-27 19:48:57.791740 7ff36933e700  2 --
10.51.228.213:0/2678799534 >> 10.51.228.32:6789/0 conn(0x7ff36417f4e0 :-1
s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=1)._process_connection got
newly_acked_seq 0 vs out_seq 0
2023-10-27 19:48:57.791763 7ff369b3f700  2 -- 10.51.228.213:0/2678799534 >>
10.51.228.33:6789/0 conn(0x7ff36417be80 :-1 s=STATE_CONNECTING_WAIT_ACK_SEQ
pgs=0 cs=0 l=1)._process_connection got newly_acked_seq 0 vs out_seq 0
 stderr: 2023-10-27 19:48:57.792414 7ff353fff700  1 --
10.51.228.213:0/2678799534 <== mon.1 10.51.228.33:6789/0 1  mon_map
magic: 0 v1  442+0+0 (171445244 0 0) 0x7ff360001690 con 0x7ff36417be80

[ceph-users] Re: Problem with upgrade

2023-10-28 Thread Eugen Block
Ah yes, this is a real classic  ;-) I assume that after bootstrapping  
the first node no update to the ceph.conf was done. Anyway, good luck  
with the rest of the upgrade!


Quoting Jorge Garcia:


I think I figured it out. The problem was that my ceph.conf file only
listed the first machine in mon_initial_members and in mon_host. I'm not
sure why. I added the other monitors, restarted the monitors and the
managers, and everything is now working as expected. I have now upgraded
all the monitors and all the managers to Pacific and Rocky 9. Now on to the
OSDs. Well, maybe next week...
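
For anyone hitting the same thing, the fix boiled down to listing all
monitors in ceph.conf on every node, roughly like this (names and addresses
are placeholders), and then restarting the mon and mgr daemons:

[global]
mon_initial_members = mon1, mon2, mon3
mon_host = 10.0.0.1, 10.0.0.2, 10.0.0.3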

On Thu, Oct 26, 2023 at 5:37 PM Tyler Stachecki 
wrote:


On Thu, Oct 26, 2023, 8:11 PM Jorge Garcia  wrote:


Oh, I meant that "ceph -s" just hangs. I didn't even try to look at the
I/O. Maybe I can do that, but the "ceph -s" hang just freaked me out.

Also, I know that the recommended order is mon->mgr->osd->mds->rgw, but
when you run mgr on the same hardware as the monitors, it's hard to not
upgrade both at the same time. Particularly if you're upgrading the whole
machine at once. Here's where upgrading to the new container method will
help a lot! FWIW, the managers seem to be running fine.



I recently did something like this, so I understand that it's difficult.
Most of my testing and prep-work was centered around exactly this problem,
which was avoided by first upgrading mons/mgrs to an interim OS while
remaining on Octopus -- solely for the purposes of opening an avenue from
Octopus to Quincy separate from the OS upgrade.

In my pre-prod testing, trying to upgrade the mons/mgrs without that
middle step that allowed mgrs to be upgraded separately did result in `ceph
-s` locking up. Client I/O remained non-impacted in this state though.

Maybe look at which mgr is active and/or try stopping all but the Octopus
mgr when stopping the mon as well?

Cheers,
Tyler



On Thu, Oct 26, 2023 at 4:57 PM Tyler Stachecki <
stachecki.ty...@gmail.com> wrote:


On Thu, Oct 26, 2023 at 6:52 PM Jorge Garcia 
wrote:
>
> Hi Tyler,
>
> Maybe you didn't read the full message, but in the message you will
notice that I'm doing exactly that, and the problem just occurred when I
was doing the upgrade from Octopus to Pacific. I'm nowhere near  
Quincy yet.

The original goal was to move from Nautilus to Quincy, but I have gone to
Octopus (no problems) and now to Pacific (problems).

I did not, apologies -- though do see my second message about mon/mgr
ordering...

When you say "the cluster becomes unresponsive" -- does the client I/O
lock up, or do you mean that `ceph -s` and such hangs?

May help to look to Pacific mons via the asok and see if they respond
in such a state (and their status) if I/O is not locked up and you can
afford to leave it in that state for a couple minutes:
$ ceph daemon mon.name mon_status

Cheers,
Tyler




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stickyness of writing vs full network storage writing

2023-10-28 Thread Joachim Kraftmayer - ceph ambassador

Hi,

I know similar requirements, the motivation and the need behind them.
We have chosen a clear approach to this, one that also keeps the whole setup
from becoming too complicated to operate.
1.) Everything that doesn't require strong consistency we do with other 
tools, especially when it comes to NVMe, PCIe 5.0 and newer technologies 
with high IOPs and low latencies.


2.) Everything that requires high data security, strong consistency and
failure domains higher than host we do with Ceph.


Joachim

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 27.10.23 at 17:58, Anthony D'Atri wrote:

Ceph is all about strong consistency and data durability.  There can also be a 
distinction between performance of the cluster in aggregate vs a single client, 
especially in a virtualization scenario where to avoid the noisy-neighbor 
dynamic you deliberately throttle iops and bandwidth per client.


For my discussion I am assuming nowadays PCIe based NVMe drives, which are 
capable of writing about 8GiB/s, which is about 64GBit/s.

Written how, though?  Benchmarks sometimes are written with 100% sequential 
workloads, top-SKU CPUs that mortals can't afford, and especially with a queue 
depth of like 256.

With most Ceph deployments, the IO a given drive experiences is often pretty 
much random and with lower QD.  And depending on the drive, significant read 
traffic may impact write bandwidth to a degree.  At . Mountpoint (Vancouver 
BC 2018) someone gave a presentation about the difficulties saturating NVMe 
bandwidth.


Now consider the situation that you have 5 nodes, each with 4 of those drives;
this alone will make all small and mid-sized companies go bankrupt ;-) just from
buying the corresponding networking switches.

Depending where you get your components...

* You probably don't need "mixed-use" (~3 DWPD) drives, for most purposes "read 
intensive" (~1DWPD) (or less, sometimes) are plenty.  But please please please stick with real 
enterprise-class drives.

* Chassis brands mark up their storage (and RAM) quite a bit.  You can often 
get SSDs elsewhere for half of what they cost from your chassis manufacturer.


   But the server hardware is still simple commodity hardware which can easily
saturate any given commodity network hardware.
If I want to be able to use the full 64GBit/s I would require at least 100GBit/s
networking or tons of trunked ports and cabling with lower bandwidth switches.

Throughput and latency are different things, though.  Also, are you assuming 
here the traditional topology of separate public and 
cluster/private/replication networks?  With modern networking (and Ceph 
releases) that is often overkill and you can leave out the replication network.

Also, would your clients have the same networking provisioned?  If you're


   If we now also consider distributing the nodes over racks, buildings at the
same location or distributed datacenters, the costs will be even more painful.

Don't you already have multiple racks?  They don't need to be dedicated only to 
Ceph.


The ceph commit requirement will be 2 copies on different OSDs (comparable to a 
mirrored drive) and in total 3 or 4 copies on the cluster (comparable to a RAID 
with multiple disk redundancy)

Not entirely comparable, but the distinctions mostly don't matter here.


In all our tests so far, we could not control how ceph persists these 2 copies.
It will always try to persist them somehow over the network.
Q1: Is this behavior mandatory?

It's a question of how important the data is, and how bad it would be to lose 
some.


   Our common workload, and afaik nearly all webservice based applications are:
- a short burst of high bandwidth (e.g. multiple MiB/s or even GiB/s)
- and probably mostly a 1 write to 4 read or even 1:6 ratio on utilizing the cluster

QLC might help your costs, look into the D5-P5430, D5-P5366, etc.  Though these 
days if you shop smart you can get TLC for close to the same cost.  Won't always 
be true though, and you can't get a 60TB TLC SKU ;)


Hope I could explain the situation here well enough.
 Now assuming my ideal world with ceph:
if ceph would do:
1. commit 2 copies to local drives on the node where the ceph client is connected
2. after commit sync (optimized/queued) the data over the network to fulfill 
the common needs of ceph storage with 4 copies

You could I think craft a CRUSH rule to do that.  Default for replicated pools 
FWIW is 3 copies not 4.


3. maybe optionally move 1 copy away from the initial node which still holds the
2 local copies...

I don't know of an elegant way to change placement after the fact.


   this behaviour would ensure that:
- the perceived performance of the OSD clients will be the full bandwidth of the
local NVMes, since 2 copies are delivered to the local NVMes with 64GBit/s and
the latency would be comparable to writing locally
- we