Hey there,

Sounds like you've made great progress! Glad you were able to get past some of 
the problems you were seeing before.
If you're making code changes to get it all working on Azure, or just have some 
steps we could add to the documentation, we'd love for you to send that across 
to us in some PRs.

Because you've got a fairly long list of things below, I've interspersed 
comments and answers as best I can. Hope this helps :)

Cheers,
Adam


From: Davis, Matthew [mailto:matthew.davi...@team.telstra.com]
Sent: 14 August 2018 05:41
To: Adam Lindley 
<adam.lind...@metaswitch.com<mailto:adam.lind...@metaswitch.com>>
Subject: RE: [Project Clearwater] Issues with clearwater-docker homestead and 
homestead-prov under Kubernetes

Hi Adam,
I have managed to deploy it on Azure and make a call.


  *   It's not fully functional though. After about a minute or two the call 
always ends without me actively hanging up. One device notices but the other 
doesn't.
  *   If I hang up the call on one end, the other end doesn't notice
  *   There's a big delay in the audio. I'm wondering, is this because the 
client is configured to use TCP not 
UDP<https://clearwater.readthedocs.io/en/latest/Making_your_first_call.html#configure-your-client>?
[AJL] It sounds like some of the traffic may be disrupted. Have you tried 
running a packet capture on/near one of the devices, and making sure all of the 
messages are getting through?
As for the audio delay, I'm not sure what the issue would be here. Project 
Clearwater doesn't handle media packets, so your best bet is probably to take a 
look at a packet capture. It could be routing delays or something like that in 
your network.

I'm just wondering, how is state stored? I thought it was stored in Cassandra, 
but Cassandra has no persistent volumes. If I restart the Cassandra pods, do my 
accounts all get deleted? It seems so. I suspect that this may have caused some 
of my troubles.
[AJL] At the moment, the Project Clearwater clearwater-docker project doesn't 
configure Cassandra with persistent volumes, so yes, if you restart the pods 
the data will be lost. In other state stores, like Astaire, you'll see we have 
a pre-stop command that attempts to remove the node safely from the cluster. 
This is aimed at a single node being stopped and started though, not the whole 
set. If you're interested in prototyping that, we'd love to see it :)
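To make the idea concrete, here's a rough sketch of what persisting Cassandra's data could look like. This is illustrative only, not something clearwater-docker ships; the claim name and size are assumptions, though /var/lib/cassandra is Cassandra's default data directory:

```yaml
# Illustrative only: a PersistentVolumeClaim for Cassandra data,
# plus the mount the pod spec would need to reference it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cassandra-data           # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi              # size is a placeholder
---
# In the Cassandra container spec, something like:
#   volumeMounts:
#     - name: data
#       mountPath: /var/lib/cassandra   # Cassandra's default data dir
#   volumes:
#     - name: data
#       persistentVolumeClaim:
#         claimName: cassandra-data
```

With that in place, restarting the pod would reattach the same volume instead of starting from an empty data directory.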


A summary of the changes I made:

  *   Bono and ellis services need to be load balancers. Why are they NodePorts 
by default? I don't understand how NodePorts could possibly work in any use 
case.
[AJL] I don't think I follow your assertion that they need to be load balancers 
here? Is this down to some of the networking in Azure? If you can cover a bit 
more on what you don't understand I may be able to answer more.

  *   I did not enable HTTP Application Routing in Azure
  *   Instead of configuring bono and ellis as subdomains of a common parent 
domain, I assigned separate DNS records for each of them. (For some reason 
Azure won't let you create a subdomain off one of their domains, only 
individual hosts, unless you do it through HTTP Application Routing. Odd.) I 
followed the relevant parts of these instructions for that: 
https://docs.microsoft.com/en-us/azure/aks/ingress
[AJL] Interesting, and thanks for the link. Still haven't had time to try Azure 
myself yet, but I'll remember this for when I do.

  *   The dependency issue with homestead-prov turned out to not matter
  *   The test command that works is: rake test[default.svc.cluster.local] 
PROXY=bono-aks-cw-01.australiaeast.cloudapp.azure.com SIGNUP_CODE=secret 
ELLIS=ellis-aks-cw-01.australiaeast.cloudapp.azure.com
  *   The SIP client I was using (Twinkle, on Ubuntu) is broken. Don't use that
  *   The SIP client recommended by the docs (Zoiper) is also broken. Audio (in 
and out) cuts out often. This is a known bug with Zoiper. You need to restart 
your phone for this to work. Also, Zoiper can't cope with Android/Windows 
phone's power saving settings. You must have the device unlocked with the app 
in the foreground.
  *   I could not get Linphone (it says "authorisation error"), or any other 
SIP client, to work.
  *   Just wondering, are there any SIP clients out there which:
     *   Work
     *   Are free
     *   Don't require every permission under the sun
I couldn't find any.
[AJL] We haven't done much work with softclients running on android handsets, 
through Project Clearwater, so sadly I can't help here much, though others on 
the mailing list might be able to. Might be worth asking in a separate thread, 
as this is a bit buried here.

  *   The .env file is at the root of the git repo: 
https://github.com/Metaswitch/clearwater-docker/blob/master/.env . I did not 
need to change it in the end.
[AJL] Ah, yep. Found it now. Think it's more for docker based deployments, and 
it may be providing nothing atm. May look into that...

  *   I've found that in some of my deployments, when I try to manually create 
a user in the web GUI, I get an error. The response to the POST request is 
'{"status": 503, "message": "Service Unavailable", "reason": "No available 
numbers", "detail": {}, "error": true}'. I followed 
these<https://clearwater.readthedocs.io/en/stable/Manual_Install.html?highlight=available%20numbers#provision-telephone-numbers-in-ellis>
 instructions to fix that. But perhaps that's why I've had some "403: 
RestClient::Forbidden" failures with rake recently.
[AJL] Interesting. Would be worth looking at the logs on ellis when this is 
happening (/var/log/ellis/ initially). If those look OK, the issue may be 
down the line in homestead-prov. As for the 403s, I think that would likely be 
a different issue; potentially something going wrong with the shared secret? 
Hard to say. If you start by looking into the logs on ellis and bono you may 
be able to track the issue down more.

  *   The Making a 
Call<https://clearwater.readthedocs.io/en/latest/Making_your_first_call.html#configure-your-client>
 docs say "STUN/TURN/ICE:". What is that line meant to be? Is that a typo? Is 
there a line break which shouldn't be there? If I try connecting with Linphone 
with ICE enabled it says "Authorisation error". If I want to enable STUN/TURN, 
I need some other fields which aren't specified by the docs.
[AJL] That looks like a bug in our auto-generation step. The original docs are 
all in markdown. If you look at 
https://github.com/Metaswitch/clearwater-readthedocs/blob/master/docs/Making_your_first_call.md#configure-your-client
 you'll see the proper indentation, which should make it a bit clearer for you.

I'm trying to deploy it on Openstack Kubernetes now.
The Cassandra pods won't come up, because etcd is failing. It's possibly a 
network issue.
[AJL] Sounds possible. Good luck debugging, and let us know how it goes.

Regards,

Matthew Davis
Telstra Graduate Engineer
CTO | Cloud SDN NFV

From: Adam Lindley [mailto:adam.lind...@metaswitch.com]
Sent: Tuesday, 5 June 2018 2:53 AM
To: Davis, Matthew 
<matthew.davi...@team.telstra.com<mailto:matthew.davi...@team.telstra.com>>; 
clearwater@lists.projectclearwater.org<mailto:clearwater@lists.projectclearwater.org>
Subject: RE: [Project Clearwater] Issues with clearwater-docker homestead and 
homestead-prov under Kubernetes

Hey Matthew,

Quite a lot of questions, so hopefully all the answers are clear. Is the issue 
still the same Rest Client failure?
At a high level, a lot of this looks like networking issues on the Kubernetes 
rig side, not the Clearwater side.  I don't have any experience with Weave, 
which I believe you said your set up was using for the overlay network, and so 
my ability to help there may be fairly limited. Definitely keen to get 
everything up and running, but especially without access to the rig to poke 
around it may be more difficult.

Also, a couple of thoughts on diags:

  *   At the moment I don't think there's much interesting in the clearwater 
logs, as there's no network traffic hitting the processes. I suspect at the 
moment you're just sending over short snippets as the rest of the log is nearly 
identical, however when we get past network related issues, it will be much 
more useful to have the full log, or much larger excerpts at least.
  *   The tcpdump output you've sent over is pretty difficult to get anything 
from, with it both just being the CLI output and with no context on what the 
IPs involved are. If it's possible for future runs, can you grab the full 
packet output by having tcpdump write to a file? Something like `tcpdump -i any 
port xxxx -w dump.pcap` should do nicely. We can then open that up and see the 
full contents of all the messages. This will make debugging issues much easier.


So, to answer your questions:

  *   When I run `nc 10.3.1.76 32060 -v` I see nothing (not success, not 
failure. Just no output and it hangs)
Is the `32060` here just a bad copy-paste, as you've got `30060` below? If not, 
that's the issue: you're probing the wrong port.
If it is just a typo and `30060` also hangs, then this seems a likely symptom 
of some network issue.
On our rig, I see:
```
root@bono-2597123395-s919s:/# nc 10.230.16.1 -v 30060
Connection to 10.230.16.1 30060 port [tcp/*] succeeded!
```
And my bono-depl says the same, and is working, so I don't think that's the 
problem.
If you're unable to get traffic to connect to the bono pod/process, we're not 
going to be able to get any further, so that'll be the thing to look into. 
You'll need to track the packets here, and see where they're hitting some 
issues.
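If it's useful, here's a quick way to script that connectivity check from the test box. Just a sketch, assuming bash: it uses bash's built-in /dev/tcp redirection, so it works even on boxes without nc. The IP and port in the usage comment are the placeholders from your command:

```shell
#!/usr/bin/env bash
# Report whether a TCP connect to host:port succeeds, using bash's
# built-in /dev/tcp (no nc required). Prints "open" or "closed".
probe() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Example (placeholder address/port): probe the bono NodePort
# probe 10.3.1.76 30060
```

If this prints "closed" (or times out) from the test box but "open" from inside the cluster, the problem is in the path between them, i.e. the NodePort exposure or the overlay network, rather than in bono itself.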


Changing the PUBLIC IP to match my rig IP.
Where is this? In bono-depl.yaml? Anywhere else? (should I have it in .env?) 
Should this include the port number?
I have only changed the depl file here. Nowhere else.
What is this `.env` file? I have not come across it, and don't think it's got 
any place here. Can you give some more info on what it is and how/where you're 
using it?

Which IP should this be? Currently I have it equal to the Kubernetes API IP 
address.
When I run `kubectl config view | grep 10` I see `server: 
https://10.3.1.76:6443`, so I set PUBLIC_IP to 10.3.1.76.
What do you see? I think 6443 might be a non-standard port for Kubernetes. 
Could that be a source of the problem?
I'm afraid I'm not entirely certain here. This may well be some of the answer a 
bit later on down the line.
Looking at the network setup on our host, I have this IP as the IP of interface 
cbr0. As I mentioned before, our rig is set up with directly routable 
networking, rather than e.g. weave or flannel overlay networks. The IP I am 
using is of the default gateway on the pod network.
However, I don't think this is the reason you are seeing nothing reach your 
bono pod. I've just run a test setting the public IP to the kubernetes host VM 
IP, and am seeing the same results as before, with the live tests passing.

My .env file currently says "PUBLIC_IP=" (without an address)
As above, I'm not really sure what this is, or what you're using it for.

Changed the image pull source to our internal one: Hmm, maybe my images were 
built wrong? Once my pull requests are merged, I'll do a fresh clone and 
rebuild.
I doubt this will help. The clearwater software is running fine, just receiving 
no data. If you aren't able to get packets routed into the bono pod, we won't 
be hitting any of the software in there, so images aren't the issue. If you're 
able to confirm traffic is reaching bono, but being rejected, then we can look 
at this.

Changed the zone in the config map to my own one, 'ajl.svc.cw-k8s.test': Why 
are you using the non-default zone ajl.svc.cw-k8s.test? Where did that come 
from? I thought my original Azure issue was because I used a zone which wasn't 
default.svc.cluster.local. What is that parameter? What does it mean? Why is 
the default what it is?
Our rig is configured to allow multiple users running in their own namespaces. 
This simply stops my deployment clashing with any other.

My test command is:
rake test[default.svc.cw-k8s.test] PROXY=10.3.1.76 PROXY_PORT=30060 
SIGNUP_CODE='secret' ELLIS=10.3.1.76:30080 TESTS="Basic call - mainline"

I'm wrapping secret in single quotes. Without the quotes the result is the 
same.
This looks reasonable. However, if you're still getting the Rest client error, 
I suspect something strange is happening. It would be very helpful to have full 
pcap files, possibly on the test box, bono, and ellis all at the same time. Can 
you get these using a command like the one above?
You should be able to open them up in Wireshark, and take a deeper look into 
them; see if you can find anything unusual in the messages between the test box 
and ellis.
As I said before, a lot of this seems to be network related, and so even with 
diagnostics like these I'm going to be less able to help. You may well have 
better luck, and a faster turnaround, digging into them there, when you'll be 
able to see what's missing and grab that too. Follow the flow between the test 
box and clearwater, see where the messages start dropping or getting error 
responses back, and you should be able to track it down a fair bit. If you're 
not sure how to go about that, ping back :)

Every pod (except etcd) has a shared_config file which says:

# Keys
signup_key=secret
turn_workaround=secret
ellis_api_key=secret
ellis_cookie_key=secret
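One quick sanity check on those secrets: the components validate requests against the same shared secrets, so every pod's copy of shared_config needs to agree. A small bash sketch for comparing copies pulled down from each pod (the file names in the usage comment are placeholders; on a pod the file normally lives at /etc/clearwater/shared_config):

```shell
#!/usr/bin/env bash
# Check that every copy of shared_config agrees on a given key.
# Prints "ok" if all files hold the same value, "mismatch" otherwise.
same_key() {
  local key=$1; shift
  local first_set=0 first="" value
  for f in "$@"; do
    # Pull the value for this key out of each file (empty if absent)
    value=$(grep "^${key}=" "$f" | cut -d= -f2-)
    if [ "$first_set" -eq 0 ]; then first=$value; first_set=1; fi
    if [ "$value" != "$first" ]; then
      echo "mismatch"
      return 1
    fi
  done
  echo "ok"
}

# e.g. after copying each pod's file down with kubectl cp:
# same_key signup_key sprout.conf bono.conf ellis.conf
```

A mismatch here (particularly in ellis_api_key or signup_key) would be one plausible explanation for 403s from ellis.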


From: Davis, Matthew [mailto:matthew.davi...@team.telstra.com]
Sent: 04 June 2018 09:24
To: Adam Lindley 
<adam.lind...@metaswitch.com<mailto:adam.lind...@metaswitch.com>>; 
clearwater@lists.projectclearwater.org<mailto:clearwater@lists.projectclearwater.org>
Subject: RE: [Project Clearwater] Issues with clearwater-docker homestead and 
homestead-prov under Kubernetes

I forgot to mention,

When I run `nc 10.3.1.76 32060 -v` I see nothing (not success, not failure. 
Just no output and it hangs)

Also, my bono-depl.yaml file says "containerPort: 5060" not 30060. Is that 
right?



When I run a tcpdump on the test machine I see dump.txt (attached)
(I'm not sure how this mailing list will cope with attachments)

The bono logs from that time are attached as bono_log.txt. (I inserted comments 
with ### to make it clear when the test was running.)

For some reason there is no soft link called /var/log/ellis/ellis_current.txt
The contents of ellis-err.log is 'No handlers could be found for logger 
"metaswitch.utils"'. I'm not sure whether that appeared during the test or not.
ellis_20180604T070000Z.txt is empty. (No new entries during the test)

I modified the log level on both ellis and bono, and restarted the bono and 
ellis service respectively, prior to running the test.

Regards,
Matt

...
_______________________________________________
Clearwater mailing list
Clearwater@lists.projectclearwater.org
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
