[
https://issues.apache.org/jira/browse/MADLIB-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frank McQuillan updated MADLIB-1379:
------------------------------------
Priority: Minor (was: Major)
> Fix madpack slowness by using a single session
> ----------------------------------------------
>
> Key: MADLIB-1379
> URL: https://issues.apache.org/jira/browse/MADLIB-1379
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Madpack
> Reporter: Domino Valdano
> Assignee: Domino Valdano
> Priority: Minor
> Fix For: v1.17
>
>
> In the current implementation of madpack, the python function for sending
> queries to the database, `run_query` is causing extreme slowness on some
> platforms, especially during startup and shutdown.
> For example, on a MacBookPro with OSX 16.7.0 and a minimal 3-segment
> Greenplum demo cluster:
> ```
> > time src/bin/madpack -p greenplum -c /madlib install-check -t nosuchmodule
> madpack.py: INFO : Detected Greenplum DB version 5.10.0.
> real 0m46.365s
> user 0m0.207s
> sys 0m0.136s
> ```
> It takes madpack 46 seconds just to figure out that there is no module named
> `nosuchmodule` in madlib, so it can't run install-check on it.
> It also takes about 46 seconds to run a single install-check sql file that
> *does* exist but only contains the line `SELECT 1`. The general pattern is
> that `madpack` will hang for a long time, then run `install-check` or
> `dev-check`, report PASS or FAIL's, then hang for a long time again, and then
> finally exit.
> This makes it almost unusable for rapid iterative development, for example
> when writing test files and checking whether they still PASS when minor
> changes are made.
> I'm not sure how many platforms are affected by this slowdown, but I have
> noticed it happens on my MacBook Pros with greenplum but not postgres.
> The main reason for the slowness is that a series of setup sql commands
> (`DROP USER`, `CREATE USER`, `GRANT USAGE`, `GRANT ALL`, etc.) are issued
> one by one to `run_query`. Instead of using a single connection to the db,
> each call to `run_query` opens a new connection and then closes it. In
> particular, it spawns an entirely new child process, redirects the input
> and output streams, and constructs a `psql` command filled with all the
> right command-line options, just to execute a single line of sql.
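> A rough illustration of the current pattern (the names here are
> illustrative, not madpack's actual code): every statement pays the full
> process-spawn cost, because the child exits as soon as the statement
> completes. `cat` stands in for the `psql` invocation so the sketch runs
> without a database.
> ```
> import subprocess
>
> def run_query_per_process(sql):
>     # madpack builds a full `psql` command line with connection options
>     # here; the shape is the same: one new child process per statement.
>     result = subprocess.run(["cat"], input=sql,
>                             capture_output=True, text=True)
>     return result.stdout
>
> # Each of these calls spawns (and tears down) a separate child process:
> outputs = [run_query_per_process(q)
>            for q in ("DROP USER ...;", "CREATE USER ...;")]
> ```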
> On `postgres`, this may be a little inefficient, but it doesn't matter too
> much because there is just a single machine with a single instance running.
> It probably slows things down a little, but hardly noticeably.
> However, on `greenplum` the situation is far worse. Each time `run_query`
> creates a new `psql` client session, it first connects to the master postgres
> instance. The master spawns a new child postgres process to handle that
> particular client. It then signals all of the segments to create new child
> postgres processes on all the remote hosts to also handle that specific
> client. This involves sending encrypted network traffic back and forth
> between every host in the cluster over an ssh connection, followed by each of
> those hosts spawning as many new postgres processes as there are segments per
> host.
> The 46-second delay I've seen is for only a single machine with 1 master and
> 3 segments, so I think that means we could expect delays of several minutes
> or more on multi-host clusters.
> I have verified that with greenplum on OSX, each time I open `psql` there is
> a 3 second delay during the first sql command issued that involves segments,
> but no delay for any commands that come after that since all of the extra
> postgres processes have already been created. Because of how `run_query` is
> written, that means a 3-second delay for every SQL command issued, and it
> issues at least a dozen or so, only one of them being the actual execution
> of the test script.
> An easy solution to this is to rewrite `run_query` so that it uses a single
> client connection instead of spawning a new client in a new child process for
> each query. This is only a few lines using `psycopg`. But if we don't want
> to have that as a requirement, we could have it start running `psql` the
> first time, but save the input and output file descriptors so that with
> subsequent queries it can send data back and forth to psql.
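> The second alternative (start `psql` once and keep the pipes) could be
> sketched roughly as below. This is an assumption about the shape of the
> fix, not the actual patch, and `cat` again stands in for the `psql`
> command line so the sketch runs without a database:
> ```
> import subprocess
>
> class SingleSession(object):
>     """Keep one child alive and reuse its stdin/stdout for all queries."""
>
>     def __init__(self, cmd=("cat",)):  # real fix would pass the psql cmd
>         self.proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
>                                      stdout=subprocess.PIPE, text=True)
>
>     def run_query(self, sql):
>         # Send one statement and read its output; the spawn cost is paid
>         # only once, when the session is created.
>         self.proc.stdin.write(sql + "\n")
>         self.proc.stdin.flush()
>         return self.proc.stdout.readline().rstrip("\n")
>
>     def close(self):
>         self.proc.stdin.close()
>         self.proc.wait()
>
> session = SingleSession()
> results = [session.run_query(q) for q in ("SELECT 1;", "SELECT 2;")]
> session.close()
> ```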
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)