Re: Impala slow at order by

Csaba Ringhofer Mon, 04 Nov 2019 08:27:45 -0800

> SQL 1= "select  A from table where  G="value""
> SQL 2= "select  A from table where  G="value" order by Z"
Note that SQL 2 does not only add an ORDER BY, but also another column to
read, Z, so 3 columns have to be read instead of 2.
This leads to extra work for both Impala and Kudu (it is a column oriented
storage engine, so has to read more data if there are more columns).


A fairer way to test the effect of order by would be to always add Z to the
query, e.g.
select  A, Z from table where  G="value"
select  A, Z from table where  G="value" order by Z


On Thu, Oct 31, 2019 at 3:51 AM Tim Armstrong <[email protected]>
wrote:

> > 1) Why is Impala slow with order by?
> Impala's sort implementation is generally very fast, but the bottleneck
> for a query can be in many different places.
>
> Like Shant said, the query profile will contain all the info needed to
> determine where time was spent in the query. We don't have enough
> information now to suggest why your query is performing the way it is. I'm
> surprised that a sort operation on a single row would make that much
> difference in query runtime.
>
> > 2) Can order by  be made faster in clustered mode, that mean made to
> be parallelized ?
> Yes, sorts are parallelised across the cluster, *except* for the final
> merge stage where the rows are collected and streamed back to the client.
>
> If you want to play around with increasing the parallelism of the sort,
> there is a query option mt_dop that switches to a different multithreading
> mode where you can increase the number of threads executing the plan,
> including sort. E.g. if you set mt_dop=4 then on each node you will get 4
> threads doing the sort (assuming there are enough input files that the work
> can be partitioned).
>
> > 3) Is it a good idea to use order by with Impala? if so have any body
> use it with a larger data set with good performance.
> If you're running a SELECT with an ORDER BY that returns a lot of rows,
> the bottleneck is most likely the output - Impala can probably sort the
> rows much faster than the client can fetch them.
>
> This actually makes it kinda hard to benchmark sort performance. We have
> some targeted benchmarks that use analytic functions to test sort
> performance to avoid the problem (it works because the query plans for
> analytic functions involve a sort). See
> https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test
>
> > 4) Is there any other solutions to do fast order by queries within few
> seconds. (Interactive query engines)
> This is exactly what Impala is good at.
>
> On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian <[email protected]>
> wrote:
>
>> Hi Dinushka!
>>
>> That's awesome you're working with Apache Impala for your internship! You
>> should know the Impala website provides some incredible documentation. In
>> particular check out the performance section to learn about how to
>> benchmark queries.
>> https://impala.apache.org/docs/build/html/topics/impala_performance.html
>>
>> As you mentioned since it's in "standalone mode" all sorts of things can
>> be going, however the easiest way to get some insights to do look at the
>> explain plan and profile for the queries you run. For examples take a look
>> at this page
>> https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html
>>
>> Any chance you can share the EXPLAIN and PROFILE statement output of your
>> queries? You should also take a look at the output of the SUMMARY statement
>> for yourself in case anything obvious stands out to you.
>>
>> Thanks!
>>
>> On Wed, Oct 30, 2019 at 9:10 PM Dinushka <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'm a student doing an internship, I have been given a task to do DB
>>> performance testing for kudu with Impala for our data and use case.
>>>
>>> Sample dataset is about 150 million records with 150 columns  and total
>>> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by
>>> hash (X =4,Y=2,Z=2)
>>>
>>> SQL 1= "select  A from table where  G="value""
>>> SQL 2= "select  A from table where  G="value" order by Z"
>>>
>>> I'm testing kudu and Impala in standalone mode and have 2 queries which
>>> will only return one row. One with "order by" and other without "order
>>> by".
>>>
>>> When I do testing, I found that Impala with order by is about 15% to
>>> 35% slow. when you have order by in the SQL.
>>>
>>> In large row counts queries, it's time can be about 2-20 times more.
>>>
>>> 1) Why is Impala slow with order by?
>>>
>>> 2) Can order by  be made faster in clustered mode, that mean made to be 
>>> parallelized
>>> ?
>>>
>>> 3) Is it a good idea to use order by with Impala? if so have any body
>>> use it with a larger data set with good performance.
>>>
>>> 4) Is there any other solutions to do fast order by queries within few
>>> seconds. (Interactive query engines)
>>>
>>>
>>> Thank you
>>>
>>

Re: Impala slow at order by

Reply via email to