> SQL 1= "select A from table where G="value"" > SQL 2= "select A from table where G="value" order by Z" Note that SQL 2 does not only add an ORDER BY, but also another column to read, Z, so 3 columns have to be read instead of 2. This leads to extra work for both Impala and Kudu (it is a column oriented storage engine, so has to read more data if there are more columns).
A fairer way to test the effect of order by would be to always add Z to the query, e.g. select A, Z from table where G="value" select A, Z from table where G="value" order by Z On Thu, Oct 31, 2019 at 3:51 AM Tim Armstrong <[email protected]> wrote: > > 1) Why is Impala slow with order by? > Impala's sort implementation is generally very fast, but the bottleneck > for a query can be in many different places. > > Like Shant said, the query profile will contain all the info needed to > determine where time was spent in the query. We don't have enough > information now to suggest why your query is performing the way it is. I'm > surprised that a sort operation on a single row would make that much > difference in query runtime. > > > 2) Can order by be made faster in clustered mode, that mean made to > be parallelized ? > Yes, sorts are parallelised across the cluster, *except* for the final > merge stage where the rows are collected and streamed back to the client. > > If you want to play around with increasing the parallelism of the sort, > there is a query option mt_dop that switches to a different multithreading > mode where you can increase the number of threads executing the plan, > including sort. E.g. if you set mt_dop=4 then on each node you will get 4 > threads doing the sort (assuming there are enough input files that the work > can be partitioned). > > > 3) Is it a good idea to use order by with Impala? if so have any body > use it with a larger data set with good performance. > If you're running a SELECT with an ORDER BY that returns a lot of rows, > the bottleneck is most likely the output - Impala can probably sort the > rows much faster than the client can fetch them. > > This actually makes it kinda hard to benchmark sort performance. We have > some targeted benchmarks that use analytic functions to test sort > performance to avoid the problem (it works because the query plans for > analytic functions involve a sort). See > https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test > > > 4) Is there any other solutions to do fast order by queries within few > seconds. (Interactive query engines) > This is exactly what Impala is good at. > > On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian <[email protected]> > wrote: > >> Hi Dinushka! >> >> That's awesome you're working with Apache Impala for your internship! You >> should know the Impala website provides some incredible documentation. In >> particular check out the performance section to learn about how to >> benchmark queries. >> https://impala.apache.org/docs/build/html/topics/impala_performance.html >> >> As you mentioned since it's in "standalone mode" all sorts of things can >> be going, however the easiest way to get some insights to do look at the >> explain plan and profile for the queries you run. For examples take a look >> at this page >> https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html >> >> Any chance you can share the EXPLAIN and PROFILE statement output of your >> queries? You should also take a look at the output of the SUMMARY statement >> for yourself in case anything obvious stands out to you. >> >> Thanks! >> >> On Wed, Oct 30, 2019 at 9:10 PM Dinushka <[email protected]> wrote: >> >>> Hi, >>> >>> I'm a student doing an internship, I have been given a task to do DB >>> performance testing for kudu with Impala for our data and use case. >>> >>> Sample dataset is about 150 million records with 150 columns and total >>> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by >>> hash (X =4,Y=2,Z=2) >>> >>> SQL 1= "select A from table where G="value"" >>> SQL 2= "select A from table where G="value" order by Z" >>> >>> I'm testing kudu and Impala in standalone mode and have 2 queries which >>> will only return one row. One with "order by" and other without "order >>> by". >>> >>> When I do testing, I found that Impala with order by is about 15% to >>> 35% slow. when you have order by in the SQL. >>> >>> In large row counts queries, it's time can be about 2-20 times more. >>> >>> 1) Why is Impala slow with order by? >>> >>> 2) Can order by be made faster in clustered mode, that mean made to be >>> parallelized >>> ? >>> >>> 3) Is it a good idea to use order by with Impala? if so have any body >>> use it with a larger data set with good performance. >>> >>> 4) Is there any other solutions to do fast order by queries within few >>> seconds. (Interactive query engines) >>> >>> >>> Thank you >>> >>
