On Thu, Jan 28, 2021 at 7:34 PM Schoonjans, Tom (RFI,RAL,-) < tom.schoonj...@rfi.ac.uk> wrote:
> Hi Yuval,
>
> Together with Tom Byrne I ran some more tests today while keeping an eye
> on the logs as well.
>
> We immediately noticed that the nodes were logging errors when uploading
> files like:
>
> 2021-01-28 16:10:45.825 7f56ff5cf700 1 ====== starting new request req=0x7f56ff5c87f0 =====
> 2021-01-28 16:10:45.828 7f5721e14700 1 AMQP connect: exchange mismatch
> 2021-01-28 16:10:45.828 7f5721e14700 1 ERROR: failed to create push endpoint: amqp://<username>:<password>@<my.rabbitmq.server>:5672 due to: pubsub endpoint configuration error: AMQP: failed to create connection to: amqp://<username>:<password>@<my.rabbitmq.server>:5672
> 2021-01-28 16:10:45.828 7f571ee0e700 1 ====== req done req=0x7f571ee077f0 op status=0 http_status=200 latency=0.0569997s ======
>
> Which resulted in no connections being established to the RabbitMQ server.
>
> Tom then restarted the Ceph services on one gateway node, which led to
> events being sent to RabbitMQ without blocking, but only if this particular
> node was picked up by the boto3 upload request in the round-robin DNS.
>
> Restarting the Ceph service on all nodes fixed the problem and I got a
> nice steady stream of events to my consumer Python script!
>

we should fix it. no restart should be needed if one of the connection
parameters was wrong

> I did notice that any events that were sent while my consumer script was
> not running are lost, as they are not picked up after I restart the script.
> Any thoughts on this?
>

this is strange. in our code [1] we don't require immediate transfer of
messages. how is the exchange declared?
can you check if this is happening when you send messages from a python
producer as well?

[1] https://github.com/ceph/ceph/blob/master/src/rgw/rgw_amqp.cc#L575

> Many thanks!!
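on the lost events: one possible cause (an assumption, nothing in this thread confirms it) is that the consumer script declares an exclusive or auto-delete queue. in that case the queue and its binding go away when the script stops, and the exchange has nowhere to route messages until the script is back. below is a minimal Pika sketch of a consumer that binds a named, durable queue instead. the exchange "ceph-test" and topic "ceph-topic-test" come from the RGW log earlier in this thread; the queue name "ceph-events", the broker URL and the use of the topic name as routing key are placeholders/assumptions.

```python
"""Consumer sketch: a named queue that stays bound while the consumer is offline."""

EXCHANGE = "ceph-test"           # exchange name from the RGW log in this thread
ROUTING_KEY = "ceph-topic-test"  # assumption: RGW publishes with the topic name as routing key
QUEUE = "ceph-events"            # hypothetical queue name


def setup_durable_queue(channel, exchange=EXCHANGE, queue=QUEUE, routing_key=ROUTING_KEY):
    """Declare and bind a queue that outlives the consumer process."""
    # passive=True only checks that the exchange exists; it does not create or alter it.
    channel.exchange_declare(exchange=exchange, passive=True)
    # durable=True: the queue definition survives a broker restart.
    # exclusive=False, auto_delete=False: the queue is kept when the consumer disconnects.
    channel.queue_declare(queue=queue, durable=True, exclusive=False, auto_delete=False)
    channel.queue_bind(queue=queue, exchange=exchange, routing_key=routing_key)
    return queue


def on_event(channel, method, properties, body):
    """Print the notification and acknowledge it to the broker."""
    print(body.decode("utf-8", errors="replace"))
    channel.basic_ack(delivery_tag=method.delivery_tag)


def main(url="amqp://user:pass@host:5672/%2F"):  # placeholder URL
    import pika  # imported here so the helper above can be tried without a broker

    connection = pika.BlockingConnection(pika.URLParameters(url))
    channel = connection.channel()
    queue = setup_durable_queue(channel)
    channel.basic_consume(queue=queue, on_message_callback=on_event, auto_ack=False)
    channel.start_consuming()
```

call main() with the real broker URL to run it. if events sent while the script was stopped show up after a restart with this queue, the earlier loss was most likely on the queue side and not in the radosgw.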
>
> Best,
>
> Tom
>
>
> Dr Tom Schoonjans
>
> Research Software Engineer - HPC and Cloud
>
> Rosalind Franklin Institute
> Harwell Science & Innovation Campus
> Didcot
> Oxfordshire
> OX11 0FA
> United Kingdom
>
> https://www.rfi.ac.uk
>
> The Rosalind Franklin Institute is a registered charity in England and
> Wales, No. 1179810 Company Limited by Guarantee Registered in England
> and Wales, No.11266143. Funded by UK Research and Innovation through
> the Engineering and Physical Sciences Research Council.
>
> On 27 Jan 2021, at 16:21, Yuval Lifshitz <ylifs...@redhat.com> wrote:
>
> On Wed, Jan 27, 2021 at 5:34 PM Schoonjans, Tom (RFI,RAL,-) <
> tom.schoonj...@rfi.ac.uk> wrote:
>
>> Looks like there’s already a ticket open for AMQP SSL support:
>> https://tracker.ceph.com/issues/42902 (you opened it ;-))
>>
>> I will give it a try myself if I have some time, but don’t hold your breath
>> with lockdown and home schooling. Also I am not much of a C++ coder.
>>
>> I need to go over the logs with Tom Byrne to see why it is not working
>> properly. And perhaps I will be able to come up with a fix then.
>>
>> However this is what I have run into so far today:
>>
>> 1. After configuring a bucket with a topic using the non-SSL port, I
>> tried a couple of uploads to this bucket. They all hung, which seemed
>> like something was very wrong, so I Ctrl-C’ed every time. After some time I
>> figured out from the RabbitMQ admin UI that Ceph was indeed connecting to
>> it, and the connections remained so I killed them from the UI.
>>
>
> sending the notification to the rabbitmq server is synchronous with the
> upload to the bucket. so, if the server is slow or not acking the
> notification, the upload request would hang. note that the upload itself is
> done first, but the reply to the client does not happen until the rabbitmq
> server acks.
>
> would be great if you can share the radosgw logs.
> maybe the issue is related to the user/password method we use?
> we use:
> AMQP_SASL_METHOD_PLAIN
>
> one possible workaround would be to set "amqp-ack-level" to "none". in
> this case the radosgw does not wait for an ack
>
> in "pacific" you could use "persistent topics" where the notifications are
> sent asynchronously to the endpoint.
>
>> 2. I then wrote a python script with Pika to consume the events, hoping
>> that would stop the blocking. I had some minor success with this. Usually
>> the first three or four uploaded files would generate events that I could
>> consume with my script.
>>
>
> the radosgw is waiting for an ack from the broker, not the end consumer,
> so this should not have mattered...
> did you actually see any notifications delivered to the consumer?
>
>> However, the rest would block for ever. I repeated this a couple of times
>> but always the same result. I noticed that after I stopped uploading,
>> removed the bucket and the topic, the connection from Ceph in the RabbitMQ
>> UI remained. I killed it but it came back seconds later from another port
>> on the Ceph cluster. I ended up playing whack-a-mole with this until no
>> more connections would be established from Ceph to RabbitMQ. I probably
>> killed 100 or so of them.
>>
>
> once you remove the bucket no new notifications can be sent. if you
> create the bucket again you may see notifications again (this is fixed in
> "pacific").
> either way, even if the connection to the rabbitmq server is still
> open, no new notifications should be sent there. just having the
> connection should not be an issue but would be nice to fix that as well:
> https://tracker.ceph.com/issues/49033
>
>> 3. After this I couldn’t get any events sent anymore. There is no more
>> blocking when uploading, files get written but nothing else happens. No
>> connections are made anymore from Ceph to RabbitMQ.
>>
>> Hope this helps…
>>
>
> yes, this is very helpful!
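for reference, a rough boto3 sketch of the "amqp-ack-level" workaround mentioned above. it creates the topic through the radosgw SNS-compatible API with the ack level set to "none". the attribute names follow the bucket notification docs; the endpoint URL, credentials and region value are placeholders, and depending on the radosgw version the client may need extra configuration that is not shown here, so treat it as a starting point only.

```python
"""Sketch: create an RGW topic whose AMQP endpoint does not wait for broker acks."""

# the two ack levels that appear in this thread
KNOWN_ACK_LEVELS = ("none", "broker")


def amqp_topic_attributes(push_endpoint, exchange, ack_level="none"):
    """Build the Attributes dict passed to create_topic for an AMQP endpoint."""
    if ack_level not in KNOWN_ACK_LEVELS:
        raise ValueError("unexpected amqp-ack-level: %r" % (ack_level,))
    return {
        "push-endpoint": push_endpoint,
        "amqp-exchange": exchange,      # the exchange must already exist on the broker
        "amqp-ack-level": ack_level,    # "none": radosgw does not wait for an ack
    }


def create_topic(endpoint_url, access_key, secret_key, topic_name, attributes):
    """Create the topic on the gateway and return its ARN."""
    import boto3  # imported here so the helper above runs without boto3 installed

    sns = boto3.client(
        "sns",
        endpoint_url=endpoint_url,          # placeholder: URL of a radosgw node
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name="default",              # placeholder region value
    )
    return sns.create_topic(Name=topic_name, Attributes=attributes)["TopicArn"]
```

usage would look like create_topic("http://<rgw-host>:<port>", "<access>", "<secret>", "ceph-topic-test", amqp_topic_attributes("amqp://user:pass@host:5672", "ceph-test")). with "none" the upload reply no longer depends on the broker, at the cost of not knowing whether the notification arrived.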
>
>> Best,
>>
>> Tom
>>
>> On 27 Jan 2021, at 13:04, Yuval Lifshitz <ylifs...@redhat.com> wrote:
>>
>> On Wed, Jan 27, 2021 at 11:33 AM Schoonjans, Tom (RFI,RAL,-) <
>> tom.schoonj...@rfi.ac.uk> wrote:
>>
>>> Hi Yuval,
>>>
>>> Switching to non-SSL connections to RabbitMQ allowed us to get things
>>> working, although currently it’s not very reliable.
>>>
>>
>> can you please add more about that? what reliability issues did you see?
>>
>>> I will open a new ticket over this if we can’t fix things ourselves.
>>>
>>
>> this would be great. we have ssl support for kafka and http endpoints, so,
>> if you decide to give it a try you can look at them as examples.
>> and let me know if you have questions or need help.
>>
>>> I will open an issue on the tracker as soon as my account request has
>>> been approved :-)
>>>
>>> Best,
>>>
>>> Tom
>>>
>>> On 26 Jan 2021, at 20:02, Yuval Lifshitz <ylifs...@redhat.com> wrote:
>>>
>>> On Tue, Jan 26, 2021 at 9:48 PM Schoonjans, Tom (RFI,RAL,-) <
>>> tom.schoonj...@rfi.ac.uk> wrote:
>>>
>>>> Hi Yuval,
>>>>
>>>> I worked on this earlier today with Tom Byrne and I think I may be able
>>>> to provide some more information.
>>>>
>>>> I set up the RabbitMQ server myself, and created the exchange with type
>>>> 'topic' before configuring the bucket.
>>>>
>>>> Not sure if this matters, but the RabbitMQ endpoint is reached over
>>>> SSL, using certificates generated with Let's Encrypt.
>>>>
>>>
>>> it actually does. we don't support amqp over ssl.
>>> feel free to open a tracker for that - as we should probably support
>>> that!
>>> but note that it would probably be backported only to later versions
>>> than nautilus.
>>>
>>>> Many thanks,
>>>>
>>>> Tom
>>>>
>>>> On 26 Jan 2021, at 19:37, Yuval Lifshitz <ylifs...@redhat.com> wrote:
>>>>
>>>> Hi Tom,
>>>> Did you create the exchange in rabbitmq? The RGW does not create it and
>>>> assumes it is already created.
>>>> Could you increase the log level in RGW and see if there are more log
>>>> messages that have "AMQP" in them?
>>>>
>>>> Thanks,
>>>>
>>>> Yuval
>>>>
>>>> On Tue, Jan 26, 2021 at 7:33 PM Byrne, Thomas (STFC,RAL,SC) <
>>>> tom.by...@stfc.ac.uk> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We've been trying to get RGW Bucket notifications working with a
>>>>> RabbitMQ endpoint on our Nautilus 14.2.15 cluster. The gateway host can
>>>>> communicate with the RabbitMQ server just fine, but when RGW tries to send
>>>>> a message to the endpoint, the message never appears in the queue, and we
>>>>> get this error in the RGW logs:
>>>>>
>>>>> 2021-01-26 16:28:17.271 7f0468b1f700 1 push to endpoint AMQP(0.9.1) Endpoint
>>>>> URI: amqp://user:pass@host:5671
>>>>> Topic: ceph-topic-test
>>>>> Exchange: ceph-test
>>>>> Ack Level: broker failed, with error: -4098
>>>>>
>>>>> We've confirmed the URI is correct, and that the gateway host can send
>>>>> messages to RabbitMQ via a standalone script (using the same
>>>>> information as in the URI). Does anyone have any hints about how to dig
>>>>> into this?
>>>>>
>>>>> Cheers,
>>>>> Tom
>>>>>
>>>>> This email and any attachments are intended solely for the use of the
>>>>> named recipients. If you are not the intended recipient you must not use,
>>>>> disclose, copy or distribute this email or any of its attachments and
>>>>> should notify the sender immediately and delete this email from your
>>>>> system. UK Research and Innovation (UKRI) has taken every reasonable
>>>>> precaution to minimise risk of this email or any attachments containing
>>>>> viruses or malware but the recipient should carry out its own virus and
>>>>> malware checks before opening the attachments. UKRI does not accept any
>>>>> liability for any losses or damages which the recipient may sustain due to
>>>>> presence of any viruses. Opinions, conclusions or other information in
>>>>> this message and attachments that are not related directly to UKRI
>>>>> business are solely those of the author and do not represent the views
>>>>> of UKRI.
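the URI in the original report points at port 5671, which is the conventional AMQP-over-TLS port, while the replies in this thread state that the radosgw does not support amqp over ssl (as of Nautilus). a small standard-library sketch that flags such an endpoint before it is configured on a topic; the function name and the warning texts are made up for illustration.

```python
"""Sketch: sanity-check an RGW AMQP push-endpoint URI before using it."""
from urllib.parse import urlsplit

AMQPS_DEFAULT_PORT = 5671  # conventional port for AMQP over TLS


def check_push_endpoint(uri):
    """Return a list of human-readable warnings (empty if nothing looks wrong)."""
    parts = urlsplit(uri)
    warnings = []
    if parts.scheme != "amqp":
        warnings.append("scheme is %r, expected plain 'amqp'" % parts.scheme)
    if parts.port == AMQPS_DEFAULT_PORT:
        warnings.append(
            "port %d is normally the TLS listener; per this thread the "
            "radosgw cannot talk AMQP over SSL" % AMQPS_DEFAULT_PORT
        )
    return warnings


for problem in check_push_endpoint("amqp://user:pass@host:5671"):
    print("warning:", problem)
```

this only inspects the URI text. a broker can of course run a plain listener on 5671 or TLS on another port, so a clean result does not prove the endpoint is reachable.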
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io