Re: [Maria-developers] Performance of index intersection

2019-10-01 Thread David Sickmiller
Thanks so much for looking into this!

Last weekend I started playing with the code, but it looks like I may need
to upgrade to an SSD to get a reasonable build time.

For Zone B I was imagining there would be a way to try increasingly large
jumps, perhaps aided by an index cursor that remembered some information it
saw as it walked through the B*-tree and page directory.

I'm happy to hear an optimization to skip over Zone A looks fairly easy.

As for real data, do full-text indexes return rowid-ordered rows? If so,
queries for a long-tail word (e.g. "hobbit") AND a frequent value (e.g.
soft_deleted=0) should benefit significantly from skipping over Zone A.
Maybe we could use a Wikipedia dump?

Thank you,
David




On Wed, 2 Oct 2019, 8:51 AM jocelyn fournier, 
wrote:

> Hi Sergey!
>
> Actually fetching min & max value for each keys (key1=1 and key2=2 in your
> case) should not cost much?
> Then you can compute the overlap zone (Zone B in your graph) and directly
> perform a jump on the beginning of Zone B and stop at the end of Zone B
> instead of EOF, using the index with the less amount of matching rows.
>
> BR,
>   Jocelyn
>
>
> > Le 1 oct. 2019 à 21:31, Sergey Petrunia  a écrit :
> >
> > Hello,
> >
> > More thoughts on this:
> >
> > Let's consider a query
> >
> > select * from t1 where key1=1 and key2=2
> >
> > let's assume it uses index itnersection.
> >
> > Check the attached picture. It shows the table rowids at the center,
> > key1 on the left, key2 on the right.
> >
> > There are three zones:
> >
> > - Zone A, where rowids matching 'key1=1' have no overlap with any rowids
> for 'key2=2'
> >
> > - Zone B, where rowids from two scans overlap.
> >
> > - Zone C, similar to zone A but at the end of the scan. Here, rowids
> matching
> > key2=2 have no overlap.
> >
> > The current "merge-ordered-streams" algorithm takes care of "Zone C".
> > the code will return EOF as soon as one of the merged streams has
> produced EOF.
> >
> > Now, let's look at Zone B.  Here, the code would scan both indexes
> forward
> > and search for records with key=1 that have matches with key=2.
> >
> > The search forward is this loop in QUICK_ROR_INTERSECT_SELECT::get_next:
> >
> >do
> >{
> >  DBUG_EXECUTE_IF("innodb_quick_report_deadlock",
> >  DBUG_SET("+d,innodb_report_deadlock"););
> >  if (unlikely((error= quick->get_next(
> >  {
> >/* On certain errors like deadlock, trx might be rolled back.*/
> >if (!thd->transaction_rollback_request)
> >  quick_with_last_rowid->file->unlock_row();
> >DBUG_RETURN(error);
> >  }
> >  quick->file->position(quick->record);
> >  cmp= head->file->cmp_ref(quick->file->ref, last_rowid);
> >  if (cmp < 0)
> >  {
> >/* This row is being skipped.  Release lock on it. */
> >quick->file->unlock_row();
> >  }
> >} while (cmp < 0);
> >
> >
> > the quick->get_next() call (after several levels of indirection) do this
> > storage engine API call:
> >
> >  handler->index_next_same();
> >
> > And the question is whether it would be faster if this call was made
> instead:
> >
> >  handler->index_read_map(key_value, rowid_of_match_candidate);
> >
> > The answer is that it's faster to do a jump when we will jump over many
> > records. When we jump over a few, it is actually slower than making
> several
> > index_next_same() calls.  The first reason for this is that doing seeks
> will
> > break InnoDB's prefetch buffer optimization. There might be other
> reasons.
> >
> > I don't have any idea how one could predict the jump size.
> >
> > As for Zone A, we could jump it over with one jump. If we first read the
> rowid
> > from "key2=2", then we could already use it when making a lookup with
> key1=1.
> >
> > But what if we first did an index lookup on key1=1? Then we will miss the
> > chance to jump to the first rowid for that one sees for key2=2...
> >
> > If we assume that the first jump is more important than others, we could
> just
> > look at the first record in each of the merged indexes and then feed
> them all
> > back the maximum rowid we saw as the starting point. This would let us
> jump
> > over zone A.
> >
> > I think, implementing "jump over zone A" is easier than inventing a way
> to do
> > jumps while in zone B. (And this is actually what your idea was).
> >
> >
> > A different question - are there any example datasets or queries we
> > could try this on? One can of course construct an artificial dataset and
> > queries but it would be much nicer to try this on a real dataset and a
> real
> > query.
> >
> > BR
> > -- Sergei P.
> >
> > On Fri, Sep 20, 2019 at 12:31:10PM +0300, Sergey Petrunia wrote:
> >> Hello David,
> >>
> >> On Mon, Sep 16, 2019 at 09:07:09PM +1200, David Sickmiller wrote:
> >>> I've been using MySQL/MariaDB for two decades but have more recently
> been
> >>> working with Elasticsearch.  I knew to expect an inverted index to
> speed up
> >>> querying full 

Re: [Maria-developers] Performance of index intersection

2019-10-01 Thread jocelyn fournier
Hi Sergey!

Actually fetching min & max value for each keys (key1=1 and key2=2 in your 
case) should not cost much?
Then you can compute the overlap zone (Zone B in your graph) and directly 
perform a jump on the beginning of Zone B and stop at the end of Zone B instead 
of EOF, using the index with the less amount of matching rows.

BR,
  Jocelyn


> Le 1 oct. 2019 à 21:31, Sergey Petrunia  a écrit :
> 
> Hello,
> 
> More thoughts on this:
> 
> Let's consider a query 
> 
> select * from t1 where key1=1 and key2=2
> 
> let's assume it uses index itnersection.
> 
> Check the attached picture. It shows the table rowids at the center,
> key1 on the left, key2 on the right.
> 
> There are three zones:
> 
> - Zone A, where rowids matching 'key1=1' have no overlap with any rowids for 
> 'key2=2'
> 
> - Zone B, where rowids from two scans overlap.
> 
> - Zone C, similar to zone A but at the end of the scan. Here, rowids matching 
> key2=2 have no overlap.
> 
> The current "merge-ordered-streams" algorithm takes care of "Zone C".
> the code will return EOF as soon as one of the merged streams has produced 
> EOF.
> 
> Now, let's look at Zone B.  Here, the code would scan both indexes forward
> and search for records with key=1 that have matches with key=2.
> 
> The search forward is this loop in QUICK_ROR_INTERSECT_SELECT::get_next:
> 
>do
>{
>  DBUG_EXECUTE_IF("innodb_quick_report_deadlock",
>  DBUG_SET("+d,innodb_report_deadlock"););
>  if (unlikely((error= quick->get_next(
>  {
>/* On certain errors like deadlock, trx might be rolled back.*/
>if (!thd->transaction_rollback_request)
>  quick_with_last_rowid->file->unlock_row();
>DBUG_RETURN(error);
>  }
>  quick->file->position(quick->record);
>  cmp= head->file->cmp_ref(quick->file->ref, last_rowid);
>  if (cmp < 0)
>  {
>/* This row is being skipped.  Release lock on it. */
>quick->file->unlock_row();
>  }
>} while (cmp < 0);
> 
> 
> the quick->get_next() call (after several levels of indirection) do this
> storage engine API call:
> 
>  handler->index_next_same();
> 
> And the question is whether it would be faster if this call was made instead:
> 
>  handler->index_read_map(key_value, rowid_of_match_candidate);
> 
> The answer is that it's faster to do a jump when we will jump over many 
> records. When we jump over a few, it is actually slower than making several
> index_next_same() calls.  The first reason for this is that doing seeks will
> break InnoDB's prefetch buffer optimization. There might be other reasons.
> 
> I don't have any idea how one could predict the jump size.
> 
> As for Zone A, we could jump it over with one jump. If we first read the rowid
> from "key2=2", then we could already use it when making a lookup with key1=1.
> 
> But what if we first did an index lookup on key1=1? Then we will miss the
> chance to jump to the first rowid for that one sees for key2=2...
> 
> If we assume that the first jump is more important than others, we could just 
> look at the first record in each of the merged indexes and then feed them all
> back the maximum rowid we saw as the starting point. This would let us jump
> over zone A.
> 
> I think, implementing "jump over zone A" is easier than inventing a way to do 
> jumps while in zone B. (And this is actually what your idea was).
> 
> 
> A different question - are there any example datasets or queries we
> could try this on? One can of course construct an artificial dataset and
> queries but it would be much nicer to try this on a real dataset and a real
> query.
> 
> BR
> -- Sergei P.
> 
> On Fri, Sep 20, 2019 at 12:31:10PM +0300, Sergey Petrunia wrote:
>> Hello David,
>> 
>> On Mon, Sep 16, 2019 at 09:07:09PM +1200, David Sickmiller wrote:
>>> I've been using MySQL/MariaDB for two decades but have more recently been
>>> working with Elasticsearch.  I knew to expect an inverted index to speed up
>>> querying full text fields, but I've been surprised (and a bit annoyed) at
>>> how fast ES can query structured data.  (In my case, I'm largely looking
>>> for intersections of a number of varchar fields with lowish cardinality,
>>> e.g. WHERE country = 'US' AND client = 'Microsoft' AND status =
>>> 'Completed'.)
>>> 
>>> Elasticsearch seems to have several things going on, but I think a core
>>> aspect, to use RDMS terminology, is that each column is indexed, and index
>>> unions/intersections are used if the WHERE clause references multiple
>>> columns.
>>> 
>>> I've heard that MySQL/MariaDB has the ability to merge indexes, but I've
>>> rarely observed it in person.  Googling for it yields a bunch of
>>> StackOverflow posts complaining how slow it is, with responses agreeing and
>>> explaining how to disable it.
>>> 
>>> If I'm reading the MySQL/MariaDB code correctly, it looks like MariaDB will
>>> intersect indexes by looping through each index, reading the rowids of all
>>> matching 

Re: [Maria-developers] GSoC: On completing the MDEV-6017 (Add support for Indexes on Expressions)

2019-10-01 Thread Oleksandr Byelkin

Hi, Alexey!

Am 01.10.19 um 14:29 schrieb Alexey Mogilyovkin:
Hello, I would like to finish the task I was working on during GSoC. 
Nikita told me that there are some architectural questions to my code. 
Can you tell about them in more detail?


Here you can find list of related MDEVs (as well as this itself) 
https://jira.mariadb.org/browse/MDEV-12326 , so you can check details of 
implementation of INSERT (it is closed, so should be pushed)




___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp




___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp