[ 
https://issues.apache.org/jira/browse/MADLIB-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Domino Valdano updated MADLIB-1379:
-----------------------------------
    Description: 
In the current implementation of madpack, the Python function for sending 
queries to the database, `run_query`, is causing extreme slowness on some 
platforms, especially during startup and shutdown.

For example, on a MacBook Pro with OSX 16.7.0 and a minimal 3-segment Greenplum 
demo cluster:

```
> time src/bin/madpack -p greenplum -c /madlib install-check -t nosuchmodule

madpack.py: INFO : Detected Greenplum DB version 5.10.0.

real    0m46.365s
user    0m0.207s
sys     0m0.136s
```

It takes madpack 46 seconds just to figure out that there is no module named 
`nosuchmodule` in madlib, so it can't run install-check on it.

It also takes about 46 seconds to run a single install-check sql file that 
*does* exist but contains only the line `SELECT 1`.  The general pattern is 
that `madpack` hangs for a long time, then runs `install-check` or 
`dev-check`, reports PASSes or FAILs, hangs for a long time again, and then 
finally exits.

This makes it almost unusable for rapid iterative development, for example when 
writing test files and checking whether they still PASS after minor changes.

I'm not sure how many platforms are affected by this slowdown, but I have 
noticed it happens on my MacBook Pros with greenplum but not with postgres.

The main reason for the slowness is that a series of setup sql commands 
(`DROP USER`, `CREATE USER`, `GRANT USAGE`, `GRANT ALL`, etc.) is issued one 
by one through `run_query`.  Instead of using a single connection to the db, 
each call to `run_query` opens a new connection and then closes it.  In 
particular, it spawns an entirely new child process, redirects the input and 
output streams, and constructs a `psql` command filled with all the right 
command-line options, just to execute a single line of sql.
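For illustration, the per-query pattern looks roughly like this (a simplified 
sketch, not the actual madpack code; the `con_args` dict and its keys are 
hypothetical):

```
import subprocess

def run_query(sql, con_args):
    # Sketch of the problematic pattern: every call builds a fresh psql
    # command line and spawns a brand-new child process, so every single
    # statement pays the full connection-setup cost (and, on Greenplum,
    # triggers new backend processes on every segment).
    cmd = ['psql',
           '-h', con_args['host'],
           '-p', str(con_args['port']),
           '-d', con_args['database'],
           '-U', con_args['user'],
           '-c', sql]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return out
```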

On `postgres`, this may be a little inefficient, but it doesn't matter too much 
because there is just a single machine with a single instance running.  It 
probably slows things down a little, but hardly noticeably.

However, on `greenplum` the situation is far worse.  Each time `run_query` 
creates a new `psql` client session, it first connects to the master postgres 
instance.  The master spawns a new child postgres process to handle that 
particular client, then signals all of the segments to create new child 
postgres processes on the remote hosts for that same client.  This involves 
sending encrypted network traffic back and forth between every host in the 
cluster over an ssh connection, followed by each of those hosts spawning as 
many new postgres processes as there are segments per host.

The 46-second delay I've seen is on a single machine with 1 master and 5 
segments, so I think we could expect delays of several minutes or more on 
multi-host clusters.

I have verified that with greenplum on OSX, each time I open `psql` there is a 
5-second delay during the first sql command that involves the segments, but no 
delay for any commands after that, since all of the extra postgres processes 
have already been created.  Because of how `run_query` is written, that means 
a 5-second delay for every SQL command issued, and it issues at least a dozen 
or so, only one of them being the actual execution of the test script.

An easy solution is to rewrite `run_query` so that it uses a single client 
connection instead of spawning a new client in a new child process for each 
query.  This is only a few lines using `psycopg`.  But if we don't want to 
have that as a requirement, we could instead start `psql` once, the first time 
a query is run, and save its input and output file descriptors so that 
subsequent queries can be sent back and forth to the same `psql` process.
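
As a rough sketch of the first approach (assuming `psycopg2` as the driver; 
the `DbSession` class and the `con_args` keys are hypothetical, not the actual 
madpack interface):

```
import psycopg2

class DbSession(object):
    """Open one connection up front and reuse it for every query."""

    def __init__(self, con_args):
        self.conn = psycopg2.connect(host=con_args['host'],
                                     port=con_args['port'],
                                     dbname=con_args['database'],
                                     user=con_args['user'])
        self.conn.autocommit = True  # mimic psql's per-statement commits

    def run_query(self, sql):
        with self.conn.cursor() as cur:
            cur.execute(sql)
            # Only statements that return rows have a result description.
            return cur.fetchall() if cur.description is not None else None

    def close(self):
        self.conn.close()
```

And a sketch of the fallback approach, keeping one long-lived `psql` child 
process and writing each statement to its stdin.  The sentinel `\echo` used 
to frame each result is just one possible way to detect where a statement's 
output ends, not something madpack does today:

```
import subprocess

class PsqlSession(object):
    """Feed all queries through a single long-lived psql process."""

    SENTINEL = '__MADPACK_QUERY_DONE__'  # arbitrary end-of-result marker

    def __init__(self, con_args):
        self.proc = subprocess.Popen(
            ['psql', '-h', con_args['host'], '-p', str(con_args['port']),
             '-d', con_args['database'], '-U', con_args['user'],
             '-A', '-t', '-q'],          # unaligned, tuples-only, quiet
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            universal_newlines=True)

    def run_query(self, sql):
        # Send the statement, then \echo a sentinel so we know when to
        # stop reading its output without closing the pipe.
        self.proc.stdin.write(sql.rstrip().rstrip(';') + ';\n')
        self.proc.stdin.write('\\echo ' + self.SENTINEL + '\n')
        self.proc.stdin.flush()
        lines = []
        for line in iter(self.proc.stdout.readline, ''):
            if line.strip() == self.SENTINEL:
                break
            lines.append(line)
        return ''.join(lines)

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()
```

Either way, the dozen or so setup commands and the test script itself would 
share one connection, so the per-connection segment fan-out cost is paid only 
once.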

> Fix madpack slowness by using a single session
> ----------------------------------------------
>
>                 Key: MADLIB-1379
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1379
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Madpack
>            Reporter: Domino Valdano
>            Assignee: Domino Valdano
>            Priority: Minor
>             Fix For: v1.17
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
