I suspect this problem is similar to PHOENIX-4288. On Mon, Oct 30, 2017 at 11:26 PM James Taylor <[email protected]> wrote:
> Please file a JIRA and include the explain plan for each of the queries. I > suspect your index is not being used in the first query due to the > selection of all the columns. You can try hinting the query to force your > index to be used. See > https://phoenix.apache.org/secondary_indexing.html#Index_Usage > > Thanks, > James > > On Mon, Oct 30, 2017 at 7:02 AM, Marcin Januszkiewicz < > [email protected]> wrote: > >> We have a wide table with 100M records created with the following DDL: >> >> CREATE TABLE traces ( >> rowkey VARCHAR PRIMARY KEY, >> time VARCHAR, >> number VARCHAR, >> +40 more columns) >> >> We want to select a large (~30M records) subset of this data with the >> query: >> >> SELECT *all columns* >> FROM traces >> WHERE (UPPER(number) LIKE 'PO %') >> ORDER BY time DESC, ROWKEY >> LIMIT 101; >> >> This times out after 15 minutes and puts a huge load on our cluster. >> We have an alternate way of selecting this data: >> >> SELECT t.rowkey, *all columns* >> FROM TRACES t >> JOIN ( >> SELECT rowkey >> FROM TRACES >> WHERE (UPPER(number) LIKE 'PO %') >> ORDER BY time DESC, ROWKEY >> LIMIT 101 >> ) ix >> ON t.ROWKEY = ix.ROWKEY >> order by t.ROWKEY; >> >> Which completes in just under a minute. >> Is there a better way to construct this query? >> When is using the self-join a worse choice than the simple select? >> Given that we have a functional index on UPPER(number), could this >> potentially be a statistics-based optimizer decision? >> >> -- >> Pozdrawiam, >> Marcin Januszkiewicz >> > >
