We have a wide table with 100M records, created with the following DDL:

    CREATE TABLE traces (
        rowkey VARCHAR PRIMARY KEY,
        time   VARCHAR,
        number VARCHAR,
        ...            -- +40 more columns
    );
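There is also a functional index on UPPER(number) (it comes up again in the last question below); its DDL is roughly the sketch here -- the index name is a placeholder, not the real one:

    -- functional index on UPPER(number); index name approximated
    CREATE INDEX traces_upper_number_idx
        ON traces (UPPER(number));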
We want to select from a large subset of this data (~30M records match the predicate) with the query:

    SELECT *all columns*
    FROM traces
    WHERE (UPPER(number) LIKE 'PO %')
    ORDER BY time DESC, rowkey
    LIMIT 101;

This times out after 15 minutes and puts a huge load on our cluster. We have an alternate way of selecting the same data:

    SELECT t.rowkey, *all columns*
    FROM traces t
    JOIN (
        SELECT rowkey
        FROM traces
        WHERE (UPPER(number) LIKE 'PO %')
        ORDER BY time DESC, rowkey
        LIMIT 101
    ) ix ON t.rowkey = ix.rowkey
    ORDER BY t.rowkey;

which completes in just under a minute.

Is there a better way to construct this query? When is the self-join a worse choice than the simple SELECT? Given the functional index on UPPER(number), could this potentially be a statistics-based optimizer decision?

-- 
Regards,
Marcin Januszkiewicz
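P.S. In case it is useful: the two statements can be compared with something like the sketch below, assuming a plain EXPLAIN is supported here (plan output omitted; "all columns" again stands for the full explicit column list):

    -- plan for the simple filter + top-N sort
    EXPLAIN
    SELECT *            -- all columns are listed explicitly in the real statement
    FROM traces
    WHERE (UPPER(number) LIKE 'PO %')
    ORDER BY time DESC, rowkey
    LIMIT 101;

    -- plan for the self-join variant
    EXPLAIN
    SELECT t.*          -- all columns are listed explicitly in the real statement
    FROM traces t
    JOIN (
        SELECT rowkey
        FROM traces
        WHERE (UPPER(number) LIKE 'PO %')
        ORDER BY time DESC, rowkey
        LIMIT 101
    ) ix ON t.rowkey = ix.rowkey
    ORDER BY t.rowkey;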
