[ 
https://issues.apache.org/jira/browse/MADLIB-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1379:
------------------------------------
    Fix Version/s:     (was: v1.17)
                   v2.0

> Fix madpack slowness by using a single session
> ----------------------------------------------
>
>                 Key: MADLIB-1379
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1379
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Madpack
>            Reporter: Domino Valdano
>            Assignee: Domino Valdano
>            Priority: Minor
>             Fix For: v2.0
>
>
> In the current implementation of madpack, the Python function that sends 
> queries to the database, `run_query`, causes extreme slowness on some 
> platforms, especially during startup and shutdown.
> For example, on a MacBook Pro with OSX 16.7.0 and a minimal 3-segment 
> Greenplum demo cluster:
> ```
> > time src/bin/madpack -p greenplum -c /madlib install-check -t nosuchmodule
> madpack.py: INFO : Detected Greenplum DB version 5.10.0.
> real 0m46.365s
> user 0m0.207s
> sys 0m0.136s
> ```
> It takes madpack 46 seconds just to figure out that there is no module named 
> `nosuchmodule` in MADlib, so it can't run install-check on it.
> It also takes about 46 seconds to run a single install-check SQL file that 
> *does* exist but contains only the line `SELECT 1`.  The general pattern is 
> that `madpack` hangs for a long time, then runs `install-check` or 
> `dev-check`, reports PASSes or FAILs, then hangs for a long time again, and 
> finally exits.
> This makes it almost unusable for rapid iterative development, for example 
> when writing test files and checking whether they still PASS when minor 
> changes are made.
> I'm not sure how many platforms are affected by this slowdown, but I have 
> noticed it happens on my MacBook Pros with greenplum but not postgres.
> The main reason for the slowness is that a series of setup SQL commands 
> (`DROP USER`, `CREATE USER`, `GRANT USAGE`, `GRANT ALL`, etc.) is issued one 
> by one through `run_query`.  Instead of using a single connection to the db, 
> each call to `run_query` opens a new connection and then closes it.  In 
> particular, it spawns an entirely new child process, redirects the input and 
> output streams, and constructs a `psql` command filled with all the right 
> command-line options, just to execute a single line of SQL.
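> To make the cost concrete, here is a rough sketch of that per-query pattern 
> (purely illustrative, not the actual madpack code; the function name and 
> arguments are made up):
> ```
> import subprocess
> 
> def run_query_per_call(sql, host, port, dbname, user):
>     # Illustrative only: each call builds a fresh psql command line,
>     # spawns a new child process, runs one statement, and exits.
>     cmd = ['psql', '-h', host, '-p', str(port), '-d', dbname, '-U', user,
>            '-qAt', '-c', sql]
>     result = subprocess.run(cmd, capture_output=True, text=True)
>     if result.returncode != 0:
>         raise RuntimeError(result.stderr)
>     return result.stdout
> ```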
> On `postgres`, this may be a little inefficient, but it doesn't matter too 
> much because there is just a single machine with a single instance running.  
> It probably slows things down a little, but it's hardly noticeable.
> However, on `greenplum` the situation is far worse.  Each time `run_query` 
> creates a new `psql` client session, it first connects to the master postgres 
> instance.  The master spawns a new child postgres process to handle that 
> particular client.  It then signals all of the segments to create new child 
> postgres processes on all the remote hosts to also handle that specific 
> client.  This involves sending encrypted network traffic back and forth 
> between every host in the cluster over an ssh connection, followed by each of 
> those hosts spawning as many new postgres processes as there are segments per 
> host.
> The 46-second delay I've seen is for only a single machine with 1 master and 
> 5 segments, so I think that means we could expect delays of several minutes 
> or more on multi-host clusters.
> I have verified that with greenplum on OSX, each time I open `psql` there is 
> a 5-second delay during the first SQL command issued that involves segments, 
> but no delay for any commands that come after that, since all of the extra 
> postgres processes have already been created.  Because of how `run_query` is 
> written, that means a 5-second delay for every SQL command issued, and it 
> issues at least a dozen or so, only one of them being the actual execution of 
> the test script.
> An easy solution is to rewrite `run_query` so that it uses a single client 
> connection instead of spawning a new client in a new child process for each 
> query.  This is only a few lines using `psycopg`.  But if we don't want to 
> add that as a requirement, we could start `psql` the first time and save its 
> input and output file descriptors, so that subsequent queries can send data 
> back and forth to the same `psql` process.
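> As a rough sketch of the single-session idea using `psycopg2` (the class and 
> method names here are placeholders, not a proposed API):
> ```
> import psycopg2
> 
> class SingleSessionRunner(object):
>     """Open one connection up front and reuse it for every query,
>     instead of spawning a psql child process per query."""
>     def __init__(self, dbname, user, host, port):
>         self.conn = psycopg2.connect(dbname=dbname, user=user,
>                                      host=host, port=port)
>         self.conn.autocommit = True
> 
>     def run_query(self, sql):
>         with self.conn.cursor() as cur:
>             cur.execute(sql)
>             # fetchall() only applies to statements that return rows
>             return cur.fetchall() if cur.description is not None else None
> 
>     def close(self):
>         self.conn.close()
> ```
> With something like this, the segment processes on a Greenplum cluster are 
> created once, when the connection is opened, instead of once per setup 
> command.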



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
