Dom,

You've mentioned the number of requests made from the frontend to the 
backend, but how many requests are you making from the backend express app 
to other microservices internally? You mention a 20-40ms latency which, I 
agree, seems abnormally high. If you're making 10 such sequential requests, 
that would explain the low 'journey' performance.
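
To put rough numbers on that, here is a minimal sketch of the arithmetic. The function names are mine, not from your codebase, and it assumes every call takes about the same time and that the concurrent calls don't depend on each other's results:

```javascript
// Back-of-envelope latency for the downstream calls behind one page.
// All names here are illustrative, not taken from pay-frontend.

// Calls made one after another: the waits add up.
function sequentialLatencyMs(callCount, perCallMs) {
  return callCount * perCallMs;
}

// Independent calls issued together: the page waits roughly for the slowest,
// which is about one call's worth if they all take similar time.
function concurrentLatencyMs(callCount, perCallMs) {
  return callCount > 0 ? perCallMs : 0;
}

// Hypothetical helper: start every call first, then wait for all of them.
// `calls` is an array of functions that each return a Promise.
function runConcurrently(calls) {
  return Promise.all(calls.map((call) => call()));
}

console.log(sequentialLatencyMs(10, 20), sequentialLatencyMs(10, 40)); // 200 400
console.log(concurrentLatencyMs(10, 40)); // 40
```

Note this only shortens the wall-clock time of a page. It does not reduce the CPU spent per request, so it won't fix a CPU-bound node on its own.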

Things to look for:
 - How is your Docker networking set up? Swarm? k8s? If each microservice 
is running on each node of your production cluster, it may be choosing to 
connect to a remote node rather than the instance on the same host. Try 
adding an isotope (a unique request ID generated at the outermost layer and 
forwarded/logged in every microservice) to see where the request is 
actually traveling. (Tip: Cloudflare sends a CF-RAY header. It's unique per 
request and you can use it this way.)
 - Network routing. Ideally your edge nodes/LB would be externally 
accessible and internal microservice nodes *not* externally accessible. If 
the upstream nodes have external IPs, your DNS may be resolving to the 
external IP, which would mean a longer network path through AWS networking 
(ALB?) and higher latency. 'traceroute' is your friend here.
 - Are the requests to internal microservices very small? If the size of 
the request/response to/from the internal microservices is smaller than the 
HTTP headers sent across, you should consider a different RPC mechanism.
 - Do you need HTTPS on internal requests? Again, size of total request vs. 
size of payload should be balanced. Terminating SSL on the edge (perhaps in 
an ALB) would take the TLS handshake and encryption overhead (bytes and 
CPU) off the internal requests.
 - Not your fault? Is one of your microservices making a request to a slow 
or rate-limited external service? Sending emails, generating PDFs, running 
credit card transactions, etc. can be slow, so you should run them 
asynchronously.
 - Slow EC2 instance? Sometimes they are just bunk and only perform at 50% 
of what others do. It's an AWS mystery. Just kill the slow node and create 
a new one.
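
The request-ID ("isotope") idea from the first item could be sketched roughly like this. It is only an illustration under some assumptions: the `x-request-id` header name and all function names are my own choices, and the middleware uses the Express `(req, res, next)` signature so it should drop into your app, but I haven't tried it against pay-frontend.

```javascript
const crypto = require('crypto');
const os = require('os');

// Assumption: a header name of our choosing; any consistent name works.
const REQUEST_ID_HEADER = 'x-request-id';

// Express-style middleware. Reuses Cloudflare's CF-RAY value or an ID set by
// an upstream service when present, otherwise generates a random one.
function assignRequestId(req, res, next) {
  const incoming = req.headers['cf-ray'] || req.headers[REQUEST_ID_HEADER];
  req.requestId = incoming || crypto.randomBytes(16).toString('hex');
  res.setHeader(REQUEST_ID_HEADER, req.requestId);
  next();
}

// Headers to pass along on every downstream http.request call.
function downstreamHeaders(req, extraHeaders) {
  return Object.assign({}, extraHeaders, { [REQUEST_ID_HEADER]: req.requestId });
}

// One log line per hop: same ID everywhere, plus the host that handled it.
function hopLogLine(req, serviceName, elapsedMs) {
  return JSON.stringify({
    requestId: req.requestId,
    service: serviceName,
    host: os.hostname(),
    elapsedMs,
  });
}
```

Register it with `app.use(assignRequestId)` in the frontend and in each microservice, pass `downstreamHeaders(req)` as the `headers` option on outgoing calls, and log `hopLogLine(...)` when each service finishes. Searching the logs for one ID then shows which hosts a single journey touched and where the time went.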

Alternative RPC mechanisms:
 - Gearman (http://gearman.org/) is particularly useful if you have a 
mixed-language environment. It's fast, stable, supports retry for failed 
nodes, and sends ~10,000 emails a minute for Craigslist.
 - gRPC (https://grpc.io/docs/tutorials/basic/node.html) uses protobuf for 
high-throughput, low-latency RPC. Fast, stable, supported by Google.
 - ZeroMQ (https://www.npmjs.com/package/zmq) is more of a socket transport 
than RPC mechanism, but depending on what your upstream microservices are 
doing this can be useful. It can also maintain a socket between services so 
setup/teardown time of the socket is minimized. Downside: bare-bones - 
you'll need to build many features yourself. Upside: Crazy fast. Used by 
high-frequency traders for stock market bots.
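
Before switching transports, it may be worth measuring how much of each internal request is HTTP head rather than payload. The sketch below is a rough estimate only: the function names are hypothetical, it counts an HTTP/1.1 request head as plain text, and it ignores response headers and TLS overhead.

```javascript
// Rough size of an HTTP/1.1 request head: request line, headers, blank line.
function requestHeadBytes(method, path, headers) {
  const requestLine = `${method} ${path} HTTP/1.1\r\n`;
  const headerLines = Object.keys(headers)
    .map((name) => `${name}: ${headers[name]}\r\n`)
    .join('');
  return Buffer.byteLength(requestLine + headerLines + '\r\n');
}

// Fraction of the request bytes (before TLS) that are head rather than body.
function headOverheadRatio(method, path, headers, body) {
  const head = requestHeadBytes(method, path, headers);
  const payload = Buffer.byteLength(body);
  return head / (head + payload);
}

console.log(requestHeadBytes('GET', '/', { Host: 'a' })); // 27
```

If the ratio for your typical internal calls is well above 0.5, the headers outweigh the data, which is the case where one of the mechanisms above is most likely to pay off.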

Debugging/rearchitecting stuff like this is my jam. Email me if you want to 
talk.

HTH,
Mikkel
Oblivious.io <http://www.oblivious.io/?r=nodejs>


On Thursday, September 13, 2018 at 7:10:22 AM UTC-7, 
domini...@digital.cabinet-office.gov.uk wrote:
>
> We are trying to debug a poorly performing node application and would 
> appreciate any help or advice from this community. We have a node 
> application that serves as the user facing frontend for a payment platform 
> - code here https://github.com/alphagov/pay-frontend. We are in the 
> process of assessing and expanding our capacity to meet increasing need. 
> We have a target of being able to serve X payment journeys per second. 
> A payment journey comprises 4 pages, two of which require a form 
> submission.
> Each page in the journey entails some communication between the node 
> application in question (that we helpfully call frontend) and other 
> microservices to establish the current status of the payment etc, on 
> average around 2 http calls per page.
> By carrying out performance tests (using Gatling) we have found that in 
> order to meet our target of X tx/s, we have to provision around X/2 
> frontend nodes, i.e. each frontend node appears capable of processing 
> around 2 payment journeys per second on average.
> This seems wrong - by my reckoning it is wrong by orders of magnitude.
>
> *Details about our tech stack*
> We are on AWS, and the frontends run in docker containers on c5.large EC2 
> instances.
> We use HTTPS internally
> We are running node 8 in production
> The application is an express app
> We use http.request to make downstream requests, but have also 
> experimented with using request, with no appreciable difference.
> There are no major CPU-heavy processes in our frontend app, and event loop 
> latency under normal load is fine
>
> *What we have found so far*
> The frontend nodes are CPU bound
> Under strain/near breaking point, profiling reveals the frontends seem to 
> be spending a large amount of time doing things related to making 
> downstream http requests, but nothing obviously ludicrous. 
> Whilst there is no obvious memory leak, the heap dump deltas show a 
> proportionately large number of Sockets hanging around - I think this is 
> just due to keepalives though
> Even not under heavy load, the network latency for a request seems high 
> for an internal request - we are seeing average latency of ~20-40ms, vs 
> around 2-5ms for a Java app that is more or less identical in the calls 
> it's making.
> A breakdown of the phases of a request (gained from the request library's 
> timing facility) reveals that under low load, on average, socket wait, DNS 
> lookup and TCP connection take practically no time - the bulk of the time 
> is spent waiting for the server response
> Under load it appears to be the time to establish a tcp connection and the 
> time to 'firstByte' that contribute to overall increase in http request time
>
> *Things we have tried*
> We have tried configuring the standard agent with different values of 
> maxSockets, maxFreeSockets...
> We have tried using different agents 
> We have tried disabling socket pooling entirely
> We have tried two different client libs - the core http module, and 
> request.
> We have matched the number of workers in our cluster to the number of CPUs
>
> Some of these things have yielded gains of ~10%, but I am still convinced 
> there is something fundamentally wrong with the architecture and 
> configuration of the application - the throughput just seems too low.
>
> I realise I haven't given enough detail to solve anything here, but if 
> anyone has any guidance on approaches that have worked for them, other 
> knobs to twiddle, guidance on better interpretation of profiling and heap 
> dumps, or any other useful pointers I would be very grateful.
>
> Dom
>
>
