Re: [Project Clearwater] Issues with clearwater-docker homestead and homestead-prov under Kubernetes

Adam Lindley Mon, 04 Jun 2018 09:54:38 -0700

Hey Matthew,

Quite a lot of questions, so hopefully all the answers are clear. Is the issue 
still the same Rest Client failure?
At a high level, a lot of this looks like networking issues on the Kubernetes 
rig side, not the Clearwater side.  I don't have any experience with Weave, 
which I believe you said your setup was using for the overlay network, and so 
my ability to help there may be fairly limited. Definitely keen to get 
everything up and running, but especially without access to the rig to poke 
around it may be more difficult.


Also, a couple of thoughts on diags:

  *   At the moment I don't think there's much of interest in the clearwater 
logs, as there's no network traffic hitting the processes. I suspect you're 
just sending over short snippets because the rest of the log is nearly 
identical. However, once we get past the network-related issues, it will be 
much more useful to have the full log, or at least much larger excerpts.
  *   The tcpdump output you've sent over is pretty difficult to get anything 
from, as it's just the CLI output and there's no context on what the IPs 
involved are. If possible, for future runs can you grab the full packet output 
by having tcpdump write to a file? Something like 
`tcpdump -i any port xxxx -w dump.pcap` should do nicely. We can then open that 
up and see the full contents of all the messages, which will make debugging 
issues much easier.
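
For reference, a capture like that could be wrapped in a small helper. This is only a sketch: `start_capture` and the default `dump.pcap` file name are made up for illustration, and it assumes tcpdump is installed and run with enough privileges to capture.

```sh
# Hypothetical helper: write full packets for one port to a pcap file.
# Usage: start_capture PORT [OUTFILE]
start_capture() {
    port="$1"
    outfile="${2:-dump.pcap}"
    # -i any : listen on all interfaces
    # -n     : don't resolve names, so the raw IPs stay visible
    # -s 0   : capture whole packets rather than truncating them
    # -w     : write raw packets to a file instead of printing a summary
    tcpdump -i any -n -s 0 -w "$outfile" port "$port"
}

# Example (stop with Ctrl-C once the test run has finished):
#   start_capture 30060 bono.pcap
```

The resulting file can be opened directly in Wireshark.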


So, to answer your questions:

  *   When I run ` nc 10.3.1.76 32060 -v` I see nothing (not success, not 
failure. Just no output and it hangs)
Is the `32060` here just a bad copy-paste, as you've got `30060` below? If not, 
that's the issue.
If it was just a typo, then a hang like this is a likely symptom of some 
network issue.
On our rig, I see:
```
root@bono-2597123395-s919s:/# nc 10.230.16.1 -v 30060
Connection to 10.230.16.1 30060 port [tcp/*] succeeded!
```
My bono-depl.yaml says the same as yours (`containerPort: 5060`) and is 
working, so I don't think that's the problem.
If you're unable to get traffic to connect to the bono pod/process, we're not 
going to be able to get any further, so that'll be the thing to look into. 
You'll need to track the packets here, and see where they're hitting some 
issues.
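
To stop the check hanging indefinitely, nc can be given a timeout. A minimal sketch, assuming a netcat that supports `-z` and `-w` (flags vary a little between netcat variants); `check_port` is a made-up name:

```sh
# Hypothetical helper: probe a TCP port and give up after a timeout
# instead of hanging forever.
# Usage: check_port HOST PORT [TIMEOUT_SECONDS]
check_port() {
    host="$1"
    port="$2"
    timeout="${3:-5}"
    # -z : just test the connection, send no data
    # -v : report what happened (on stderr)
    # -w : give up after this many seconds
    if nc -z -v -w "$timeout" "$host" "$port"; then
        echo "reachable: $host:$port"
    else
        echo "NOT reachable: $host:$port"
        return 1
    fi
}

# Example:
#   check_port 10.3.1.76 30060
```

Running it from the test box, from the Kubernetes host, and from inside another pod should help narrow down where the packets stop.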


Changing the PUBLIC IP to match my rig IP.
Where is this? In bono-depl.yaml? Anywhere else? (should I have it in .env?) 
Should this include the port number?
I have only changed the depl file here. Nowhere else.
What is this `.env` file? I have not come across these, and don't think it's 
got any place here. Can you give some more info on what it is and how/where 
you're using it?

Which IP should this be? Currently I have it equal to the Kubernetes API IP 
address.
When I run `kubectl config view | grep 10` I see 
`server: https://10.3.1.76:6443` so I set PUBLIC_IP to 10.3.1.76.
What do you see? I think 6443 might be a non-standard port for Kubernetes. 
Could that be a source of the problem?
I'm afraid I'm not entirely certain here. It may well turn out to be part of 
the answer a bit further down the line.
Looking at the network setup on our host, I have this IP as the IP of interface 
cbr0. As I mentioned before, our rig is set up with directly routable 
networking, rather than e.g. weave or flannel overlay networks. The IP I am 
using is of the default gateway on the pod network.
However, I don't think this is the reason you are seeing nothing reach your 
bono pod. I've just run a test setting the public IP to the kubernetes host VM 
IP, and am seeing the same results as before, with the live tests passing.
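
If it helps to confirm that the address in PUBLIC_IP really is one of the host addresses Kubernetes knows about, something along these lines could do it. A rough sketch only: the function names are made up, and it assumes kubectl is pointed at the right cluster and that the nodes report an `InternalIP` address.

```sh
# Hypothetical helper: print the InternalIP address of every node,
# separated by spaces.
node_ips() {
    kubectl get nodes \
        -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'
}

# Hypothetical helper: succeed if the given IP is one of the node addresses.
# Usage: is_node_ip IP
is_node_ip() {
    wanted="$1"
    for ip in $(node_ips); do
        if [ "$ip" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

# Example:
#   is_node_ip 10.3.1.76 && echo "PUBLIC_IP is a node address"
```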

My .env file currently says "PUBLIC_IP=" (without an address)
As above, I'm not really sure what this is, or what you're using it for.

Changed the image pull source to our internal one: Hmm, maybe my images were 
built wrong? Once my pull requests are merged, I'll do a fresh clone and 
rebuild.
I doubt this will help. The clearwater software is running fine, just receiving 
no data. If you aren't able to get packets routed into the bono pod, we won't 
be hitting any of the software in there, so images aren't the issue. If you're 
able to confirm traffic is reaching bono, but being rejected, then we can look 
at this.

Changed the zone in the config map to my own one, 'ajl.svc.cw-k8s.test': Why 
are you using the non-default zone ajl.svc.cw-k8s.test? Where did that come 
from? I thought my original Azure issue was because I used a zone which wasn't 
default.svc.cluster.local. What is that parameter? What does it mean? Why is 
the default what it is?
Our rig is configured to allow multiple users running in their own namespaces. 
This simply stops my deployment clashing with any other.
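
Both zones mentioned in this thread have the shape `<namespace>.svc.<cluster domain>`. As a small illustration of that pattern (the `zone_for` helper is made up, and the `cluster.local` default is an assumption based on the default zone quoted in this thread):

```sh
# Hypothetical helper: build a zone name from a namespace and cluster domain.
# Usage: zone_for NAMESPACE [CLUSTER_DOMAIN]
zone_for() {
    namespace="$1"
    cluster_domain="${2:-cluster.local}"
    printf '%s.svc.%s\n' "$namespace" "$cluster_domain"
}

zone_for ajl cw-k8s.test    # prints ajl.svc.cw-k8s.test
zone_for default            # prints default.svc.cluster.local
```

In the passing run in this thread, the same zone value was used in the config map and in the `rake test[...]` argument.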

My test command is:
rake test[default.svc.cw-k8s.test] PROXY=10.3.1.76 PROXY_PORT=30060 
SIGNUP_CODE='secret' ELLIS=10.3.1.76:30080 TESTS="Basic call - mainline"

I'm wrapping secret in single apostrophes. Without the apostrophes the result 
is the same.
This looks reasonable. However, if you're still getting the Rest client error, 
I suspect something strange is happening. It would be very helpful to have full 
pcap files, possibly from the test box, bono, and ellis all at the same time. 
Can you get these using a command like the one above?
You should be able to open them up in Wireshark and take a deeper look into 
them; see if you can find anything unusual in the messages between the test box 
and ellis.
As I said before, a lot of this seems to be network related, and so even with 
diagnostics like these I'm going to be less able to help. You may well have 
better luck, and a faster turnaround, digging into them there, where you'll be 
able to see what's missing and grab that too. Follow the flow between test box 
and clearwater, and see where the messages start dropping or getting error 
responses back, and you should be able to track it down a fair bit. If you're 
not sure how to go about that, ping back :)
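
If Wireshark isn't to hand, tcpdump can also read a capture file back. A minimal sketch, assuming a pcap file already exists; `show_port_traffic` is a made-up name:

```sh
# Hypothetical helper: print the packets in a capture file for one TCP port,
# with payloads shown as ASCII so HTTP requests and responses are readable.
# Usage: show_port_traffic PCAP_FILE PORT
show_port_traffic() {
    pcap="$1"
    port="$2"
    # -r : read from a file instead of an interface
    # -n : don't resolve names
    # -A : print each packet's payload in ASCII
    tcpdump -r "$pcap" -n -A tcp port "$port"
}

# Example: pick out the HTTP request and status lines on the ellis port
#   show_port_traffic dump.pcap 30080 | grep 'HTTP/1.1'
```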

Every pod (except etcd) has a shared_config file which says:

# Keys
signup_key=secret
turn_workaround=secret
ellis_api_key=secret
ellis_cookie_key=secret


From: Davis, Matthew [mailto:matthew.davi...@team.telstra.com]
Sent: 04 June 2018 09:24
To: Adam Lindley <adam.lind...@metaswitch.com>; 
clearwater@lists.projectclearwater.org
Subject: RE: [Project Clearwater] Issues with clearwater-docker homestead and 
homestead-prov under Kubernetes

I forgot to mention,

When I run ` nc 10.3.1.76 32060 -v` I see nothing (not success, not failure. 
Just no output and it hangs)

Also, my bono-depl.yaml file says "containerPort: 5060" not 30060. Is that 
right?



When I run a tcp dump on the test machine I see dump.txt (attached)
(I'm not sure how this mailing list will cope with attachments)

The bono logs from that time are attached as bono_log.txt. (I inserted comments 
with ### to make it clear when the test was running.)

For some reason there is no soft link called /var/log/ellis/ellis_current.txt
The contents of ellis-err.log is "No handlers could be found for logger 
"metaswitch.utils"" I'm not sure whether that appeared during the test or not.
ellis_20180604T070000Z.txt is empty. (No new entries during the test)

I modified the log level on both ellis and bono, and restarted the bono and 
ellis service respectively, prior to running the test.

Regards,
Matt



From: Davis, Matthew
Sent: Monday, 4 June 2018 5:32 PM
To: 'Adam Lindley' <adam.lind...@metaswitch.com>; 
clearwater@lists.projectclearwater.org
Subject: RE: [Project Clearwater] Issues with clearwater-docker homestead and 
homestead-prov under Kubernetes

Hi Adam,
Let's focus on the 3 differences:

Changing the PUBLIC IP to match my rig IP.
Where is this? In bono-depl.yaml? Anywhere else? (should I have it in .env?) 
Should this include the port number?
Which IP should this be? Currently I have it equal to the Kubernetes API IP 
address.
When I run `kubectl config view | grep 10` I see 
`server: https://10.3.1.76:6443` so I set PUBLIC_IP to 10.3.1.76.
What do you see? I think 6443 might be a non-standard port for Kubernetes. 
Could that be a source of the problem?
My .env file currently says "PUBLIC_IP=" (without an address)

Changed the image pull source to our internal one: Hmm, maybe my images were 
built wrong? Once my pull requests are merged, I'll do a fresh clone and 
rebuild.

Changed the zone in the config map to my own one, 'ajl.svc.cw-k8s.test': Why 
are you using the non-default zone ajl.svc.cw-k8s.test? Where did that come 
from? I thought my original Azure issue was because I used a zone which wasn't 
default.svc.cluster.local. What is that parameter? What does it mean? Why is 
the default what it is?

My test command is:
rake test[default.svc.cw-k8s.test] PROXY=10.3.1.76 PROXY_PORT=30060 
SIGNUP_CODE='secret' ELLIS=10.3.1.76:30080 TESTS="Basic call - mainline"

I'm wrapping secret in single apostrophes. Without the apostrophes the result 
is the same.

Every pod (except etcd) has a shared_config file which says:

# Keys
signup_key=secret
turn_workaround=secret
ellis_api_key=secret
ellis_cookie_key=secret


Thanks,
Matt

From: Adam Lindley [mailto:adam.lind...@metaswitch.com]
Sent: Friday, 1 June 2018 3:29 AM
To: Davis, Matthew <matthew.davi...@team.telstra.com>; 
clearwater@lists.projectclearwater.org
Subject: RE: [Project Clearwater] Issues with clearwater-docker homestead and 
homestead-prov under Kubernetes

Hey Matthew,

Sorry to hear it isn't working yet. Looking at the output of rake test, you're 
hitting a Forbidden in the RestClient, so it's something in the HTTP calls. 
I think your issue may just be a mismatch in the value of 'SIGNUP_CODE' that 
you're passing in to the rake test command, and what's on your deployment. This 
should default to 'secret', and I don't see anything in your configmap that 
would change that, so can you double check the test command includes " 
SIGNUP_CODE='secret' ". If you have made any changes to the signup key then 
obviously it'll need to match those. If you want to double check the value, 
take a look in the /etc/clearwater/shared_config file in one of the pods.
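
As a sketch of that check (the `config_value` helper is made up, and it assumes the file holds plain `key=value` lines, as in the shared_config excerpt quoted elsewhere in this thread):

```sh
# Hypothetical helper: read a key's value from a shared_config style file.
# Usage: config_value KEY [FILE]
config_value() {
    key="$1"
    file="${2:-/etc/clearwater/shared_config}"
    # Print whatever follows "KEY=" on each line that starts with it.
    sed -n "s/^${key}=//p" "$file"
}

# Example, run inside one of the pods:
#   config_value signup_key
# Or from outside (the pod name is a placeholder):
#   kubectl exec <ellis-pod> -- grep '^signup_key=' /etc/clearwater/shared_config
```

The value printed is what needs to be passed as SIGNUP_CODE.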

I have double checked a deployment on our rig, copying your provided yaml files 
over directly to make sure there isn't anything odd in there, and the live 
tests were able to run fine, following the deployment steps you're using. I did 
have some trouble getting the script to work as copied over, but I think that's 
just Outlook formatting quotes wrong etc. When I ran the commands manually, it 
all worked as expected. The only changes I made were:

  *   Changing the PUBLIC IP to match my rig IP.
  *   Changed the image pull source to our internal one
  *   Changed the zone in the config map to my own one, 'ajl.svc.cw-k8s.test'

With this all deployed, the following test command passed with no issue:
    rake test[ajl.svc.cw-k8s.test] PROXY=10.230.16.1 PROXY_PORT=30060 
SIGNUP_CODE='secret' ELLIS=10.230.16.1:30080 TESTS="Basic call - mainline"

If the issue isn't the signup key, can we try getting some more diags that we 
can take a look into? In particular, I think we would benefit from:

  *   A packet capture on the node you are running the live tests on, when you 
hit the errors below
  *   The bono logs, at debug level, from the same time. To set up debug 
logging, you need to add 'log_level=5' to /etc/clearwater/user_settings 
(creating if needed), and restart the bono service
  *   The ellis logs from the same time
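
The debug logging step could be scripted roughly like this. It is a sketch under assumptions: `set_log_level` is a made-up helper, and rewriting the file through a temporary copy may not preserve its original permissions.

```sh
# Hypothetical helper: make sure FILE ends up with exactly one
# log_level=LEVEL line, creating the file if needed.
# Usage: set_log_level LEVEL [FILE]
set_log_level() {
    level="$1"
    file="${2:-/etc/clearwater/user_settings}"
    touch "$file"
    # Copy everything except any existing log_level line, then append ours.
    # grep exits non-zero when nothing is left to copy, hence the "|| true".
    grep -v '^log_level=' "$file" > "$file.tmp" || true
    echo "log_level=$level" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Example, inside the bono container:
#   set_log_level 5 && service bono restart
```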

Running the tcpdump on the test node should mean we get to see the full set of 
flows, and you can likely read through that yourself to work out any further 
issues hiding behind this one.
Any other diagnostics you can gather would obviously also be useful, but with 
the above, assuming traffic is reaching the pods, we should be able to work out 
the issue.

On your connectivity tests, you won't be able to connect to the bono service 
using 'nc localhost 30060', because that attempts to connect using the 
localhost IP. We have set the bono service up to listen on the 'PUBLIC_IP', 
i.e. the host IP. If you try running 'nc 10.3.1.76 30060 -v' you should see 
successful connection. (Or on whichever host IP you have configured it to 
listen).
The log output you are seeing on restarting Bono is also benign. It is 
again simply an artefact of some behaviour we want to have in VMs, but that is 
not needed in these containers. You can safely ignore this output.

Good luck, and let us know where you get with debugging.
Cheers,
Adam

From: Davis, Matthew [mailto:matthew.davi...@team.telstra.com]
Sent: 30 May 2018 08:27
To: Adam Lindley <adam.lind...@metaswitch.com>; 
clearwater@lists.projectclearwater.org
Subject: RE: [Project Clearwater] Issues with clearwater-docker homestead and 
homestead-prov under Kubernetes

Hi Adam,
# Openstack Install

I only mentioned the helm charts just in case the almost empty charts on my 
machine were the source of the error. I personally have no experience with 
Helm, so I can't help you with any development.
I applied that latest change to the bono service port number. It still doesn't 
work.

How can I check whether the rake tests are failing on the bono side, as opposed 
to failing on the ellis side? Maybe the reason tcpdump shows nothing in Bono 
during the rake tests is because the tests failed to create a user in ellis, 
and never got to the bono part?

Rake output:
```
Basic Call - Mainline (TCP) - Failed
  RestClient::Forbidden thrown:
   - 403 Forbidden
     - /home/ubuntu/.rvm/gems/ruby-1.9.3-p551/gems/rest-client-1.8.0/lib/restclient/abstract_response.rb:74:in `return!'
     - /home/ubuntu/.rvm/gems/ruby-1.9.3-p551/gems/rest-client-1.8.0/lib/restclient/request.rb:495:in `process_result'
...
```
If I go inside the ellis container and run `nc localhost 80 -v` I see that it 
establishes a connection.
If I go inside the bono container and run `nc localhost 5060 -v` or 
`nc localhost 30060 -v` it fails to connect. So from within the bono pod I cannot 
connect to the localhost. To me that suggests that the problem is caused by 
something inside bono, not the networking between pods. What happens when you 
try `nc localhost 32060 -v` in your deployment?
The logs inside bono are an echo of an error message from sprout. Does that 
matter?

```
30-05-2018 06:46:59.432 UTC [7f45527e4700] Status sip_connection_pool.cpp:428: Recycle TCP connection slot 4
30-05-2018 06:47:06.222 UTC [7f4575ae0700] Status alarm.cpp:244: Reraising all alarms with a known state
30-05-2018 06:47:06.222 UTC [7f4575ae0700] Status alarm.cpp:37: sprout issued 1012.3 alarm
30-05-2018 06:47:06.222 UTC [7f4575ae0700] Status alarm.cpp:37: sprout issued 1013.3 alarm
```

Those timestamps don't correspond to the rake tests. They just happen every 30 
seconds.

When I restart the bono service it says: `63: ulimit: open files: cannot modify 
limit: Invalid argument`
Does that matter? (I've seen that error message everywhere. I have no idea what 
it means)

I've appended the yaml files to the end of this email.

# Azure Install

I had a chat with Microsoft. It seems that your hunch was correct. HTTP 
Application Routing only works on ports 80 and 443. Furthermore, I cannot simply 
route SIP calls through port 443 because the routing does some HTTP-specific 
packet inspection. So I'll have to give up on that approach and go for a 
more vanilla, manually configured NodePort approach (either still on AKS but 
without HTTP Application Routing, or on Openstack). So I'm even more keen to 
solve the aforementioned issues.


# Yamls and stuff

I'm pasting them again just in case I've forgotten something. 10.3.1.76 is the 
ip address of my cluster.

Here's a script I'm using to tear down and rebuild everything. (Just in case 
`kubectl apply -f something.yaml` doesn't actually propagate the change fully) 
The while loops in this script are just to wait until the previous step has 
finished.

```
set -x
cd clearwater-docker-master/kubernetes
# Tear down the old deployment; errors are tolerated here (set -e comes later)
kubectl delete -f ./
kubectl delete configmap env-vars
set -e
echo 'waiting until old pods are all deleted'
# Count the non-header lines of `kubectl get pods` until none remain
while [ $(kubectl get pods | grep -v '^NAME' | wc -l) -ne 0 ]
do
   sleep 5
done
echo "creating new pods"
kubectl create configmap env-vars --from-literal=ZONE=default.svc.cluster.local
kubectl apply -f ./
# Wait until the bono pod reports both of its containers ready
while [ $(kubectl get pods | grep "2/2" | grep bono | wc -l) -ne 1 ]
do
   sleep 5
done
BONO=$(kubectl get pods | grep "2/2" | grep bono | awk '{ print $1 }')
echo "Bono is up as $BONO"
kubectl exec -it "$BONO" -- apt-get -y install vim
# Make bono listen on 30060 instead of 5060, then restart it
kubectl exec -it "$BONO" -- sed -i -e 's/--pcscf=5060,5058/--pcscf=30060,5058/g' /etc/init.d/bono
kubectl exec -it "$BONO" -- service bono restart
# Wait until no pod is listed with zero ready containers
while [ $(kubectl get pods | grep "0/" | wc -l) -ne 0 ]
do
   sleep 5
done
echo "All pods are up now"
kubectl get pods
echo "Done"
```
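
As an alternative to polling `kubectl get pods`, `kubectl rollout status` blocks until a deployment has finished rolling out. A sketch only: `wait_for_deployments` is a made-up name, and the `--timeout` flag assumes a reasonably recent kubectl.

```sh
# Hypothetical helper: block until each named deployment has rolled out,
# or fail as soon as one of them does not make it within the timeout.
# Usage: wait_for_deployments TIMEOUT NAME...
wait_for_deployments() {
    timeout="$1"
    shift
    for name in "$@"; do
        kubectl rollout status "deployment/$name" --timeout="$timeout" || return 1
    done
}

# Example:
#   wait_for_deployments 300s bono ellis
```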

`kubectl get services`

```
NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                         AGE
astaire          ClusterIP   None           <none>        11311/TCP                                       1h
bono             NodePort    10.0.168.197   <none>        3478:32214/TCP,30060:30060/TCP,5062:30144/TCP   1h
cassandra        ClusterIP   None           <none>        7001/TCP,7000/TCP,9042/TCP,9160/TCP             1h
chronos          ClusterIP   None           <none>        7253/TCP                                        1h
ellis            NodePort    10.0.53.199    <none>        80:30080/TCP                                    1h
etcd             ClusterIP   None           <none>        2379/TCP,2380/TCP,4001/TCP                      1h
homer            ClusterIP   None           <none>        7888/TCP                                        1h
homestead        ClusterIP   None           <none>        8888/TCP                                        1h
homestead-prov   ClusterIP   None           <none>        8889/TCP                                        1h
kubernetes       ClusterIP   10.0.0.1       <none>        443/TCP                                         5d
ralf             ClusterIP   None           <none>        10888/TCP                                       1h
sprout           ClusterIP   None           <none>        5052/TCP,5054/TCP                               1h
```

bono-depl.yaml

```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: bono
spec:
  replicas: 1
  selector:
    matchLabels:
      service: bono
  template:
    metadata:
      labels:
        service: bono
        snmp: enabled
    spec:
      containers:
      - image: "mlda065/bono:latest"
        imagePullPolicy: Always
        name: bono
        ports:
        - containerPort: 22
        - containerPort: 3478
        - containerPort: 5060
        - containerPort: 5062
        - containerPort: 5060
          protocol: "UDP"
        - containerPort: 5062
          protocol: "UDP"
        envFrom:
        - configMapRef:
              name: env-vars
        env:
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PUBLIC_IP
          value: 10.3.1.76
        livenessProbe:
          exec:
            command: ["/bin/bash", "/usr/share/kubernetes/liveness.sh", "3478 5062"]
          initialDelaySeconds: 30
        readinessProbe:
          exec:
            command: ["/bin/bash", "/usr/share/kubernetes/liveness.sh", "3478 5062"]
        volumeMounts:
        - name: bonologs
          mountPath: /var/log/bono
      - image: busybox
        name: tailer
        command: [ "tail", "-F", "/var/log/bono/bono_current.txt" ]
        volumeMounts:
        - name: bonologs
          mountPath: /var/log/bono
      volumes:
      - name: bonologs
        emptyDir: {}
      imagePullSecrets:
      - name: ~
      restartPolicy: Always
```

bono-svc.yaml

```
apiVersion: v1
kind: Service
metadata:
  name: bono
spec:
  type: NodePort
  ports:
  - name: "3478"
    port: 3478
  - name: "5060"
    port: 30060
    nodePort: 30060
  - name: "5062"
    port: 5062
  selector:
    service: bono
```
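
One thing worth knowing when reading this service definition: a port entry with no `targetPort` forwards to the same number as `port`, so the "5060" entry here sends traffic to 30060 inside the pod rather than 5060. A sketch for printing what the cluster actually has (the `service_ports` name is made up, and it assumes kubectl access to the cluster):

```sh
# Hypothetical helper: print "name port targetPort nodePort" for each port
# of a service, one port per line.
# Usage: service_ports SERVICE_NAME
service_ports() {
    kubectl get service "$1" \
        -o jsonpath='{range .spec.ports[*]}{.name}{" "}{.port}{" "}{.targetPort}{" "}{.nodePort}{"\n"}{end}'
}

# Example:
#   service_ports bono
```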

Ellis deployment

```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: ellis
spec:
  replicas: 1
  template:
    metadata:
      labels:
        service: ellis
    spec:
      containers:
      - image: "mlda065/ellis:latest"
        imagePullPolicy: Always
        name: ellis
        ports:
        - containerPort: 22
        - containerPort: 80
        envFrom:
        - configMapRef:
              name: env-vars
        env:
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        livenessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 30
        readinessProbe:
          tcpSocket:
            port: 80
      imagePullSecrets:
      - name: ~
      restartPolicy: Always
```

ellis-svc.yaml

```
apiVersion: v1
kind: Service
metadata:
  name: ellis
spec:
  type: NodePort
  ports:
  - name: "http"
    port: 80
    nodePort: 30080
  selector:
    service: ellis
```

`kubectl describe configmap env-vars`

```
Name:         env-vars
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","data":{"ZONE":"default.svc.cluster.local"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"env-vars","namespace":"default"}...

Data
====
ZONE:
----
default.svc.cluster.local
Events:  <none>
```



Thanks,

Matt
Telstra Graduate Engineer
CTO | Cloud SDN NFV

From: Adam Lindley [mailto:adam.lind...@metaswitch.com]
Sent: Friday, 25 May 2018 6:42 PM
To: Davis, Matthew <matthew.davi...@team.telstra.com>; 
clearwater@lists.projectclearwater.org
Subject: RE: [Project Clearwater] Issues with clearwater-docker homestead and 
homestead-prov under Kubernetes

Hi Matthew,

Our Helm support is a recent addition, and came from another external 
contributor. See the Pull Request at 
https://github.com/Metaswitch/clearwater-docker/pull/85 for the details :)
As it stands at the moment, the chart is good enough for deploying and 
re-creating a full standard deployment through Helm, but I don't believe it 
handles as many of the complexities of upgrading a clearwater deployment as it 
potentially could.

We haven't yet done any significant work in setting up Helm charts, or 
integrating with them in a more detailed manner, so if that's something you're 
interested in as well, we'd love to work with you to get some more enhancements 
in. Especially if you have other expert contacts who know more in this area.

(I'm removing some of the thread in the email below, to keep us below the list 
limits. The online archives will keep all the info though)

Cheers,
Adam

_______________________________________________
Clearwater mailing list
Clearwater@lists.projectclearwater.org
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
