Thomas Tauber-Marshall created IMPALA-9436:
----------------------------------------------

             Summary: impala-shell is very slow for large query text sizes
                 Key: IMPALA-9436
                 URL: https://issues.apache.org/jira/browse/IMPALA-9436
             Project: IMPALA
          Issue Type: Improvement
          Components: Clients
    Affects Versions: Impala 3.4.0
            Reporter: Thomas Tauber-Marshall


While working on better support for large SQL queries in IMPALA-9414, I found 
that impala-shell is very slow at processing large query text.

To test this, I generated a 1MB SQL file that refers to a non-existent table 
(so that the time to run the query itself would be negligible). Running this 
query file with impala-shell on my local machine takes about 20s, of which 
about 13s are spent in parse_query_text(), which uses some sqlparse functions 
to try to split the query text into multiple queries.
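A measurement like the one above can be approximated outside the shell. The 
following is only a sketch: make_large_query() and time_split() are 
hypothetical helper names, the generated query is not the file used in the 
original test, and time.perf_counter() assumes Python 3. The splitter under 
test is passed in as a function so the helpers need nothing beyond the 
standard library.

```python
import time


def make_large_query(target_bytes=1024 * 1024, table="nonexistent_table"):
    """Build a single SELECT of at most target_bytes characters.

    The table name is a placeholder; the query is only meant to be large,
    not to return anything useful.
    """
    prefix = "select 1 from {0} where x in (".format(table)
    suffix = ")"
    item = "'aaaaaaaaaaaaaaaa',"
    # Number of literals that fit between the prefix and the suffix.
    count = max(1, (target_bytes - len(prefix) - len(suffix)) // len(item))
    body = (item * count).rstrip(",")
    return prefix + body + suffix


def time_split(split_fn, text):
    """Run split_fn(text) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = split_fn(text)
    return result, time.perf_counter() - start
```

With sqlparse installed, time_split(sqlparse.split, make_large_query()) times 
sqlparse's statement splitter on its own. Whether that is exactly what 
parse_query_text() calls is an assumption here, so the number is only a rough 
proxy for the 13s reported above.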

This seems like an unreasonable overhead and could definitely be improved. Some 
ideas for how to do that:
1. Be more clever with our use of sqlparse to get better perf. This probably 
has limited value (e.g. strip_comments() already tries to be very clever but 
is still pretty slow).
2. Find a different Python library for SQL parsing that is faster (this may 
not exist).
3. Add some C++ into the shell instead of always doing everything in pure 
Python (not sure how easy/convenient this is to integrate with the shell 
packaging).
4. Try to write our own SQL parsing code, which could be optimized for the 
small number of things we actually need, e.g. we don't need full 
tokenization, just splitting of multiple queries (likely to be bug-prone).
5. Do some simple hacks, such as skipping the query splitting entirely if 
there isn't a ';' in the query text (this would leave some unfortunate perf 
cliffs, e.g. add a ';' to a string literal in your query and suddenly 
everything gets a lot slower).
6. Add an interface in Impala that allows submitting multiple queries at 
once, e.g. ExecuteStatements(), which returns a list of query_ids (might be a 
lot of work to modify impala-server, the parser, etc. to support this).
7. Add an interface in Impala that accepts query text, parses it, and returns 
it in split form without actually executing it, which would limit the amount 
of changes needed vs. option 6.
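
To make options 4 and 5 concrete, here is a rough sketch of a single-pass 
splitter with the "no ';'" fast path. split_statements() is a hypothetical 
name, not existing impala-shell code. It assumes that quoted strings use ', " 
or ` with backslash escapes and that comments are either "--" to end of line 
or "/* ... */", and it is not a replacement for a real tokenizer. It is also 
exactly the kind of hand-written code that option 4 warns may be bug-prone.

```python
def split_statements(text):
    """Split text on ';' characters that lie outside quotes and comments."""
    # Option 5 fast path: without any ';' there is at most one statement.
    if ";" not in text:
        stripped = text.strip()
        return [stripped] if stripped else []

    statements = []
    start = 0  # index where the current statement begins
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch in ("'", '"', "`"):
            # Skip to the matching closing quote, stepping over
            # backslash-escaped characters (assumed escape convention).
            i += 1
            while i < n and text[i] != ch:
                if text[i] == "\\":
                    i += 1
                i += 1
            i += 1
        elif text.startswith("--", i):
            # Line comment: skip to just past the end of the line.
            end = text.find("\n", i)
            i = n if end == -1 else end + 1
        elif text.startswith("/*", i):
            # Block comment: skip to just past the closing "*/".
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif ch == ";":
            # Statement boundary: keep the statement if it is non-empty.
            stmt = text[start:i].strip()
            if stmt:
                statements.append(stmt)
            i += 1
            start = i
        else:
            i += 1

    tail = text[start:].strip()
    if tail:
        statements.append(tail)
    return statements
```

Because the slow path is still a single linear scan, a ';' inside a string 
literal only costs one pass over the text, which should soften the perf cliff 
described in option 5. Comments are kept as part of the statement they appear 
in rather than stripped.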



