Domino Valdano created MADLIB-1379:
--------------------------------------

             Summary: Fix madpack slowness by using a single session
                 Key: MADLIB-1379
                 URL: https://issues.apache.org/jira/browse/MADLIB-1379
             Project: Apache MADlib
          Issue Type: Improvement
          Components: Madpack
            Reporter: Domino Valdano


In the current implementation of madpack, the python function for sending 
queries to the database, `run_query` is causing extreme slowness on some 
platforms, especially during startup and shutdown.

For example, on a MacBookPro with OSX 16.7.0 and a minimal 3-segment Greenplum 
demo cluster:

```

> time src/bin/madpack -p greenplum -c /madlib install-check -t nosuchmodule

madpack.py: INFO : Detected Greenplum DB version 5.10.0.

real 0m46.365s
user 0m0.207s
sys 0m0.136s

```

It takes madpack 46 seconds just to figure out that there is no module named 
`nosuchmodule` in madlib, so it can't run install-check on it.

It also takes about 46 seconds to run a single install-check sql file that 
*does* exist but only contains the line `SELECT 1`.  The general pattern is 
that `madpack` will hang for a long time, then run `install-check` or 
`dev-check`, report PASS or FAIL's, then hang for a long time again, and then 
finally exit.

This makes it almost unusable for rapid iterative development, for example when 
writing test files and checking whether they still PASS when minor changes are 
made.

I'm not sure how many platforms are affected by this slowdown, but I have 
noticed it happens on my MacBook Pro's with greenplum but not postgres.

The main reason for the slowness is that there are a series of setup sql 
commands issued one by one to `run_query`, and for each command (`DROP USER`, 
`CREATE USER`, GRANT USAGE`, `GRANT ALL`, etc.).  Instead of using a single 
connection to the db, each call to `run_query` opens a new connection and then 
closes it.  In particular, it spawns an entirely new child process, redirects 
the input and output streams, constructs a `psql` command filled with all the 
right command-line options, just to execute the single line of sql.  

On `postgres`, may be a little inefficient, but it doesn't matter too much 
because there is just a single machine with a single instance running.  
Probably slows it down a little, but hardly noticeable. 

However, on `greenplum` the situation is far worse.  Each time `run_query` 
creates a new `psql` client session, it first connects to the master postgres 
instance.  The master spawns a new child postgres process to handle that 
particular client.  It then signals all of the segments to create new child 
postgres processes on all the remote hosts to also handle that specific client. 
 This involves sending encrypted network traffic back and forth between every 
host in the cluster over an ssh connection, followed by each of those hosts 
spawning as many new postgres processes as there are segments per host.

The 46-second delay I've seen is for only a single machine with 1 master and 3 
segments, so I think that means we could expect delays of several minutes or 
more on multi-host clusters.

I have verified that with greenplum on OSX, each time I open `psql` there is a 
3 second delay during the first sql command issued that involves segments, but 
no delay for any commands that come after that since all of the extra postgres 
processes have already been created.  Because of how `run_query` is written, 
that necessitates a `3-second` delay for every SQL command issued, and it 
issues at least a dozen or so, only one of them being the actual execution of 
the test script.

An easy solution to this is to rewrite `run_query` so that it uses a single 
client connection instead of spawning a new client in a new child process for 
each query.  This is only a few lines using `psycopg`.  But if we don't want to 
have that as a requirement, we could have it start running `psql` the first 
time, but save the input and output file descriptors so that with subsequent 
queries it can send data back and forth to psql.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Reply via email to