Created https://issues.apache.org/jira/browse/PHOENIX-4336
On Tue, Oct 31, 2017 at 8:44 AM, Marcin Januszkiewicz <[email protected]> wrote: > Sorry, I forgot to include the DDL for the index: > CREATE LOCAL INDEX ix_0 ON traces (UPPER(number)) INCLUDE (time, + > some other columns used for filtering). > I'm going to file a JIRA with the details, but local indexes are being > used in both cases. > My suspicion is that the first query will attempt to load all 30M rows > of data from HBase and then sort it, and the second will only sort 30M > (rowkey, time) pairs. > > On Tue, Oct 31, 2017 at 8:02 AM, Sergey Soldatov > <[email protected]> wrote: >> I agree with James that this happens because the index was not involved >> because it doesn't cover all columns. I believe that in the second case, the >> RHT is using the index to create a list of rowkeys and they are used for >> point lookups by skipscan. >> >> bq. When is using the self-join a worse choice than the simple select? >> >> Hash join has it's own limitations: >> 1. RHT is supposed to be small, so it's better to keep LIMIT small (far >> less than 30 mil). >> 2. Client is always involved to collect data from RHT, build the hash join >> cache and send it to all RSes. >> >> >> bq. Is there a better way to construct this query? >> >> Using local index may help in this case. >> >> Thanks, >> Sergey >> >> >> On Mon, Oct 30, 2017 at 11:26 PM, James Taylor <[email protected]> >> wrote: >>> >>> Please file a JIRA and include the explain plan for each of the queries. I >>> suspect your index is not being used in the first query due to the selection >>> of all the columns. You can try hinting the query to force your index to be >>> used. See https://phoenix.apache.org/secondary_indexing.html#Index_Usage >>> >>> Thanks, >>> James >>> >>> On Mon, Oct 30, 2017 at 7:02 AM, Marcin Januszkiewicz >>> <[email protected]> wrote: >>>> >>>> We have a wide table with 100M records created with the following DDL: >>>> >>>> CREATE TABLE traces ( >>>> rowkey VARCHAR PRIMARY KEY, >>>> time VARCHAR, >>>> number VARCHAR, >>>> +40 more columns) >>>> >>>> We want to select a large (~30M records) subset of this data with the >>>> query: >>>> >>>> SELECT *all columns* >>>> FROM traces >>>> WHERE (UPPER(number) LIKE 'PO %') >>>> ORDER BY time DESC, ROWKEY >>>> LIMIT 101; >>>> >>>> This times out after 15 minutes and puts a huge load on our cluster. >>>> We have an alternate way of selecting this data: >>>> >>>> SELECT t.rowkey, *all columns* >>>> FROM TRACES t >>>> JOIN ( >>>> SELECT rowkey >>>> FROM TRACES >>>> WHERE (UPPER(number) LIKE 'PO %') >>>> ORDER BY time DESC, ROWKEY >>>> LIMIT 101 >>>> ) ix >>>> ON t.ROWKEY = ix.ROWKEY >>>> order by t.ROWKEY; >>>> >>>> Which completes in just under a minute. >>>> Is there a better way to construct this query? >>>> When is using the self-join a worse choice than the simple select? >>>> Given that we have a functional index on UPPER(number), could this >>>> potentially be a statistics-based optimizer decision? >>>> >>>> -- >>>> Pozdrawiam, >>>> Marcin Januszkiewicz >>> >>> >>
