We have a wide table with 100M records, created with the following DDL:

    CREATE TABLE traces (
        rowkey VARCHAR PRIMARY KEY,
        time   VARCHAR,
        number VARCHAR,
        ...            -- +40 more columns
    );
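There is also a functional index on UPPER(number) (it comes up again in the last question below); its DDL is roughly the sketch here -- the index name is a placeholder, not the real one:

    -- functional index on UPPER(number); index name approximated
    CREATE INDEX traces_upper_number_idx
        ON traces (UPPER(number));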
We want to select from a large subset of this data (~30M records match the predicate) with the query:

    SELECT *all columns*
    FROM traces
    WHERE (UPPER(number) LIKE 'PO %')
    ORDER BY time DESC, rowkey
    LIMIT 101;

This times out after 15 minutes and puts a huge load on our cluster. We have an alternate way of selecting the same data:

    SELECT t.rowkey, *all columns*
    FROM traces t
    JOIN (
        SELECT rowkey
        FROM traces
        WHERE (UPPER(number) LIKE 'PO %')
        ORDER BY time DESC, rowkey
        LIMIT 101
    ) ix ON t.rowkey = ix.rowkey
    ORDER BY t.rowkey;

which completes in just under a minute.

Is there a better way to construct this query? When is the self-join a worse choice than the simple SELECT? Given the functional index on UPPER(number), could this potentially be a statistics-based optimizer decision?

-- 
Regards,
Marcin Januszkiewicz
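P.S. In case it is useful: the two statements can be compared with something like the sketch below, assuming a plain EXPLAIN is supported here (plan output omitted; "all columns" again stands for the full explicit column list):

    -- plan for the simple filter + top-N sort
    EXPLAIN
    SELECT *            -- all columns are listed explicitly in the real statement
    FROM traces
    WHERE (UPPER(number) LIKE 'PO %')
    ORDER BY time DESC, rowkey
    LIMIT 101;

    -- plan for the self-join variant
    EXPLAIN
    SELECT t.*          -- all columns are listed explicitly in the real statement
    FROM traces t
    JOIN (
        SELECT rowkey
        FROM traces
        WHERE (UPPER(number) LIKE 'PO %')
        ORDER BY time DESC, rowkey
        LIMIT 101
    ) ix ON t.rowkey = ix.rowkey
    ORDER BY t.rowkey;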
