Re: Querying on subsets of one-to-many and many-to-many

Daniel BODEA Wed, 27 Oct 2004 08:38:17 -0700

If the query is applied to a duplicates free set of data and making nolinks
> 0, it does indeed deliver the expected results. Since the counters are not
reused, they can even be used textually in the HAVING clause which doesn't
add any extra columns to the final result set.


Two things through. First of all, the HAVING clause doesn't use any indexes
even though it works with integers which should be pretty fast but I'll
mention this later.

The second is that I may have over simplified things in my previous email.
This type of filter is applied to a query that already has several LEFT and
INNER joins applied to it that in turn do not guarantee in any way the
uniqueness of one row from the original table. Certain rows from the
original table may be duplicated at the moment when this filter is applied
so this type of counters is no longer reliable.

I came up with a query that looks something like this:

select A.* from A left join L using (PK_A)
    group by A.PK_A
    having
        isnull(L.PK_A) or
        (
            sum(L.PK_B = 3) and
            sum(L.PK_B = 4) and
           !sum(L.PK_B in (2))
        )

This query if I'm not mistaken basically does the same thing without
returning extra columns and guaranteeing a correct result set even when
duplicates are encountered.

I haven't tested this on large data sets yet but the only penalty should
come from using the HAVING clause. In this particular case this only implies
working on simple booleans and integers but since I'm not familiar with the
internals of MySQL, I'm not sure if this constitutes a significant work load
given the data figures I mentioned previously.

Originally, I also thought of doing this using user variables but I figured
that would have required ordering the initial data set according to some
predefined criteria which wouldn't have left any possibility of ordering the
results according to however I eventually wanted them.

Daniel

----- Original Message ----- 
From: <[EMAIL PROTECTED]>
To: "Daniel BODEA" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Tuesday, October 26, 2004 6:41 PM
Subject: Re: Querying on subsets of one-to-many and many-to-many


> OK, now that I have a better model to work with, let me try again. (And
> stop being so discouraged! This is hard for everyone until they get the
> hang of it!!!  =8-D )
>
> > Given a query that resembles this "select * from A where (no links) or
> > ((links to both 3 and 4) and (no links to 2))"
>
> This may not be the only way to do this but it's what I thought of first
>
> SELECT A.PK_A, A.DATA_A, sum(isnull(L.PK_A)) as nolinks,  sum(if(L.PK_B IN
> (3,4),1,0)) as wanted, sum(if(L.PK_B = 2,1,0)) as rejects
> FROM  A
> LEFT JOIN L
>         on L.PK_A = A.PK_A
> GROUP BY 1,2
> HAVING nolinks >=0 or (wanted = 2  and rejects=0);
>
> This returns three more columns that you originally wanted but it's still
> a one-trip query.
>
> Shawn Green
> Database Administrator
> Unimin Corporation - Spruce Pine
>
>
> "Daniel BODEA" <[EMAIL PROTECTED]> wrote on 10/26/2004 11:55:34 AM:
>
> > Many thanks for the quick answer. I apologize for leaving out a few
> details
> > that should have explained the context in greater detail. I'll use less
> > theory below to further explain what's going on and then I'll make some
> > remarks on your answer.
> >
> > Simplified down to the basics of this problem, we have two tables, A and
> B,
> > both having the same structure:
> >
> > PK_A / PK_B - primary key
> > DATA_A / DATA_B - text
> >
> > Then we have table L with two columns:
> >
> > PK_A - 0..n of primary key for table A
> > PK_B - 0..n of primary key for table B
> >
> > Both A and B contain 3 rows with PKs 1, 2, 3. Data is not an issue. L
> > contains the following mappings:
> >
> > PK_A | PK_B
> > 1 | 3
> > 1 | 4
> > 2 | 2
> > 2 | 3
> > 2 | 4
> >
> > Given a query that resembles this "select * from A where (no links) or
> > ((links to both 3 and 4) and (no links to 2))", we should be able to
> work
> > (order, limit, etc.) on records 1 and 3 from A. The PKs that the WHERE
> > clause works on are those of B not A.
> >
> > I also thought at first that a sequence of LEFT and INNER joins would
> both
> > filter the results correctly and work fast enough for this kind of data
> but
> > two days of joining stuff didn't make it happen and frankly I don't
> think
> > there is any "one query" solution.
> >
> > In your first example you filter on PK_A which is my mistake because the
> > selection is actually made on PK_B. I apologize for not making this
> clearer.
> > Using PK_B in your example though is the first trap I fell into during
> those
> > joining days. Whatever your conditions in the WHERE clause, you only
> apply
> > them to one row at a time which means the relevant part of the query is
> > translated into this "where PK_A has links to any of PK_B in (3, 4)".
> The
> > NOT IN part has no effect since the row 2 | 2 is excluded but the other
> two
> > get through.
> >
> > Disregarding PK_A in the WHERE clause and using PK_B instead in the
> second
> > example, I believe first of all the INNER JOIN filters out all rows that
> > don't have any links (NULL rows) and the WHERE conditions abide to the
> same
> > rules as the previous example so the results are the same. In any
> > circumstance, using IN in this context results in the set of indexes
> being
> > ORed not ANDed since IN returns true for any value in the set.
> >
> > I am totally open to any suggestions for "one query" solutions using
> joins
> > but my trying to find the best way to use a temporary or cache table is
> the
> > result of having proven to myself the (disputable of course)
> impossibility
> > of this thing using one trip only.
> >
> > Daniel
> >
> > ----- Original Message ----- 
> > From: <[EMAIL PROTECTED]>
> > To: "Daniel BODEA" <[EMAIL PROTECTED]>
> > Cc: <[EMAIL PROTECTED]>
> > Sent: Tuesday, October 26, 2004 3:41 PM
> > Subject: Re: Querying on subsets of one-to-many and many-to-many
> >
> >
> > > Under most circumstances, even with data tables as large as yours, I
> have
> > > found that direct JOIN statements have provided acceptable
> performance. On
> > > certain highly normalized databases, where the requirements of the
> query
> > > needed more than 4 or 5 tables joined at once, I have also used your
> > > "cache" table idea to great success.
> > >
> > > However I would like to believe that queries on the relationships
> between
> > > just these three tables could be coded by using JOINS to provide some
> very
> > > fast results. Since you chose not to post your real table structures
> here,
> > > I must create these sample queries using your sample tables.
> > >
> > > select * from REC_A where (no links exist at all) or ((links to REC_B
> > > exist
> > > for both PKs 22 and 34) and (no links exist to REC_B for either PK in
> 56
> > > or
> > > 12)
> > >
> > > CREATE TEMPORARY TABLE tmpUnion (primary key(PK_A))
> > > SELECT a.*
> > > FROM REC_A a
> > > LEFT JOIN LNK
> > >         on LNK.PK_A = a.PK_A
> > > WHERE LNK.PK_A is null;
> > >
> > > INSERT IGNORE tmpUnion
> > > SELECT a.*
> > > FROM REC_A a
> > > INNER JOIN LNK
> > >         on LNK.PK_A = a.PK_A
> > > WHERE LNK.PK_A in (22,34)
> > >         AND LNK.PK_A NOT IN (12,56);
> > >
> > > SELECT * from tmpUnion;
> > > DROP TEMPORARY TABLE tmpUnion;
> > >
> > > Whenever you have an OR like that in your WHERE clause, you are in
> essence
> > > running two or more queries and combining the results. Since you are
> on a
> >
> > > pre-4.0 server, you do not have the UNION operator available to help
> you
> > > so I had to simulate one by using a temporary table. Basically I had
> to
> > > hand-code what the UNION operator automates for you. I defined the
> Primary
> > > Key on the temp table and used the IGNORE option on the INSERT
> statement
> > > to prevent duplicate rows (logically you shouldn't have had any but I
> was
> > > being thorough)
> > >
> > > Here is an equivalent query without the temp table
> > >
> > > SELECT DISTINCT a.*
> > > FROM REC_A a
> > > LEFT JOIN LNK  m1
> > >         ON m1.PK_A = a.PK_A
> > > INNER JOIN LNK m2
> > >         ON m2.PK_A = a.PK_A
> > > WHERE m1.PK_A is null
> > >         OR (
> > >                 m2.PK_A in (22,34)
> > >                 AND m2.PK_A NOT IN (12,56)
> > >         );
> > >
> > >
> > > "Daniel BODEA" <[EMAIL PROTECTED]> wrote on 10/26/2004 09:02:56
> AM:
> > >
> > > > Greetings,
> > > >
> > > > I have several solutions to a problem which requires temporary or
> cache
> > > > tables but the different solutions are time consuming to test for
> > > > performance and I was wondering if anyone had more information on
> this
> > > than
> > > > myself.
> > > >
> > > > The server is 3.23.40 without possibility for upgrade to 4.x because
> of
> > > the
> > > > amount of extra work required for testing, deployment, etc.
> > > >
> > > > The simplified context is that I have several tables for individual
> > > records
> > > > (REC_A, REC_B, etc.) with primary keys and one table for links
> between
> > > > different types of records based on their primary keys (LNK). LNK is
> N
> > > to N
> > > > so there is no constraint on the number of links a record can have
> to
> > > other
> > > > records.
> > > >
> > > > I have to query these relationships in complex ways, one example
> being
> > > > "select * from REC_A where (no links exist at all) or ((links to
> REC_B
> > > exist
> > > > for both PKs 22 and 34) and (no links exist to REC_B for either PK
> in 56
> > > or
> > > > 12))". The number of records is in the hundreds of thousands and the
> > > links
> > > > can get past a million.
> > > >
> > > > Now since this version does not support GROUP_CONCAT and having
> given up
> > > > after a couple of days of trying to torture recursive INNER and LEFT
> > > joins
> > > > to filter out records based on totally incomprehensible ON clauses,
> I
> > > figure
> > > > I need temporary or cache intermediary tables.
> > > >
> > > > The solutions that I see would be to have (1) a generic "flag" table
> > > that a
> > > > complex query would use to first flag the records that should get
> > > through
> > > > and then join and filter with that table to return the list of
> records
> > > but
> > > > that would mean locking and updating the table for each complex
> query,
> > > (2)
> > > > generate an "alternate view" of the LNK table by simulating the
> > > GROUP_CONCAT
> > > > function to aggregate the records in unique rows that contain the
> full
> > > set
> > > > of links. The question for (2) is how to store the full link set in
> one
> > > row
> > > > so that queries be able to make use of indexes. (3) would be a
> variation
> > > of
> > > > (1) by creating a temporary table per complex query that would only
> > > serve
> > > > the purposes of this query and which would contain only those record
> > > indexes
> > > > that should go through so that the complex query can then perform a
> > > simple
> > > > INNER JOIN to filter out the results.
> > > >
> > > > I've also thought about retrieving the full set of indexes for all
> > > matched
> > > > records programmatically and then reinjecting this in the main query
> as
> > > a PK
> > > > IN (...) but that would give really huge index sets that would go
> back
> > > and
> > > > forth between the server and the client so I ruled this one out.
> > > >
> > > > Now unless there is another simpler and miraculous solution that I
> > > haven't
> > > > thought of, I have to find out which one of these alternatives is
> the
> > > best
> > > > one speed-wise so any help on this is sincerely appreciated.
> > > >
> > > > Daniel
> > > >
> > > >
> > > > -- 
> > > > MySQL General Mailing List
> > > > For list archives: http://lists.mysql.com/mysql
> > > > To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
> > > >
> > >
> >
>


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: Querying on subsets of one-to-many and many-to-many

Reply via email to