Re: excessive time spent in statistics status

2007-09-17 Thread Pete Harlan
On Fri, Sep 14, 2007 at 01:33:51AM -0700, Jeremy Cole wrote:
 Hi Shawn, Lucio,
 
 SELECT STRAIGHT_JOIN 
 FROM ...
 LEFT JOIN ...
 WHERE ...
 ...
 
 Just to correct a point here... if a query uses only LEFT JOIN or RIGHT 
 JOIN, the join order is fixed by the query's order itself, so using 
 STRAIGHT_JOIN should have no effect whatsoever.

True, because you say *only* left or right joins.

 Equally important, since the join order is fixed when you use LEFT
 JOIN, you *must* list the joins in the correct order when writing
 the query, otherwise you will see very poor performance.

The word "only" is missing from that sentence, which could be misleading.  MySQL
will reorder the t0, t1 and t2 joins in:

select  ...
from    t0
join t1 on ...
join t2 on ...
left join t3 on ...
where   ...

 MySQL's optimizer cannot reorder the joins because it has the potential 
 to change the result of the query.

Do you have an example in mind?

Thanks,

--Pete




Re: excessive time spent in statistics status

2007-09-17 Thread Baron Schwartz

Just a quick note to add to this conversation:

Pete Harlan wrote:

On Fri, Sep 14, 2007 at 01:33:51AM -0700, Jeremy Cole wrote:
MySQL's optimizer cannot reorder the joins because it has the potential 
to change the result of the query.


Not in all cases.  If a LEFT JOIN query also has a WHERE clause that 
disallows NULL rows from the right-hand table, the optimizer will (may?  I'm 
not sure if it always will, but at least sometimes it will) treat it as an 
INNER JOIN, and it's therefore a candidate for reordering.  I've also seen 
cases where a LEFT JOIN becomes a CROSS JOIN 
(http://bugs.mysql.com/bug.php?id=30842).
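
A minimal sketch of that pattern (hypothetical tables a and b, not from this 
thread):

SELECT a.id, b.val
FROM a
LEFT JOIN b ON b.a_id = a.id
WHERE b.val > 10;   -- rows where b is all-NULL cannot satisfy this test, so
                    -- the optimizer may treat the LEFT JOIN as an INNER JOIN
                    -- and consider reordering a and b

A condition such as WHERE b.val IS NULL OR b.val > 10 keeps the NULL-extended 
rows and therefore cannot be converted this way.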


All RIGHT JOINs are rewritten to equivalent LEFT JOINs, so the same 
optimizations can apply.


As a side note, I've really benefited from attending Timour's talk on 
the optimizer at the conference this spring.  His slides are good 
reading too:

http://conferences.oreillynet.com/presentations/mysql07/katchaounov_timour.pdf

Baron




Re: excessive time spent in statistics status

2007-09-14 Thread Jeremy Cole

Hi Shawn, Lucio,


SELECT STRAIGHT_JOIN 
FROM ...
LEFT JOIN ...
WHERE ...
...


Just to correct a point here... if a query uses only LEFT JOIN or RIGHT 
JOIN, the join order is fixed by the query's order itself, so using 
STRAIGHT_JOIN should have no effect whatsoever.  Equally important, 
since the join order is fixed when you use LEFT JOIN, you *must* list 
the joins in the correct order when writing the query, otherwise you 
will see very poor performance.


MySQL's optimizer cannot reorder the joins because it has the potential 
to change the result of the query.
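
For example (a sketch only; the column names are placeholders, with G as the 
driving table in the sense discussed in this thread): with only LEFT JOINs the 
tables are read exactly in the order written, so the table that the WHERE 
clause restricts should be listed first:

SELECT G.seq, t1.seq, t2.seq
FROM G
LEFT JOIN t1 ON G.t1 = t1.seq
LEFT JOIN t2 ON G.t2 = t2.seq
WHERE G.rank = 1;   -- hypothetical filter on the first (driving) table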


Regards,

Jeremy

--
high performance mysql consulting
www.provenscaling.com




Re: excessive time spent in statistics status

2007-09-14 Thread Lucio Chiappetti
On Fri, 14 Sep 2007, Jeremy Cole wrote:

 Just to correct a point here... if a query uses only LEFT JOIN or RIGHT JOIN,
 the join order is fixed by the query's order itself, so using STRAIGHT_JOIN
 should have no effect whatsoever.  Equally important, since the join order is

but experimentally using SELECT STRAIGHT_JOIN *does make* a difference. If 
I use it there is no time spent in the statistics phase, otherwise there 
is. Shawn can comment more conclusively since he knows the inner workings 
of mysql. I just report a finding by experiment.

 should have no effect whatsoever.  Equally important, since the join order is
 fixed when you use LEFT JOIN, you *must* list the joins in the correct order
 when writing the query, otherwise you will see very poor performance.

However, the correct order is not always easy to determine a priori (as in 
my case, where I have a glorified correlation table G which shall go first, 
while all member tables go next with equal rank (except the first member, 
which is more equal than the others =) ). The user can then build a query 
which uses only SOME of the members. And I've noticed that the order returned 
by EXPLAIN SELECT changed according to the content of the query (e.g. the 
particular WHERE condition).

So for me, experimentally, use of SELECT STRAIGHT_JOIN is an effective 
solution.

-- 
---
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
---
()  ascii ribbon campaign - against html mail 
/\  http://arc.pasp.de/




Re: excessive time spent in statistics status

2007-09-05 Thread Lucio Chiappetti
On Tue, 4 Sep 2007, Shawn Green wrote:
  On Tue, 4 Sep 2007, Lucio Chiappetti wrote:

  For each I tested 3 cases (total 16*3=48) :
  
  a) the query on the virtual table correlated with the external
  (the virtual is my G left join t1 ... left join tn). [...]
  represents my REFERENCE, 

this case will also exploit full (or default) optimization

  b) the query on the VIEW V with members also (which implies a 
  redundant join of V with G left join t1 ... left join tn) and 
  correlated with an external table. [...] I used here SELECT 
  STRAIGHT_JOIN.

  c) the same query of (b) but with a normal select, and preceded by
 setting optimizer_search_depth=0 (auto).

 In your b) test, did you use the SELECT STRAIGHT_JOIN as your outer 
 SELECT statement or within the CREATE VIEW statement? 

No. I left the CREATE VIEW alone (standard LEFT JOINs) in all cases, since 
that will be fully optimized and is what the majority of users will use.
SELECT STRAIGHT_JOIN was used only in (b) [no optimization] and not in 
(c) [residual optimization].

Our inclination would now be to use SELECT STRAIGHT_JOIN, because it is the 
one involving only elegant changes to the code. Consider again that the 
case of view + members also + other table will be rare (very few users) 
compared to the view alone.

Anyhow all this was rather instructive.

Now I'll start experimenting with UNIONs too ...

-- 
---
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
---
()  ascii ribbon campaign - against html mail 
/\  http://arc.pasp.de/




Re: excessive time spent in statistics status

2007-09-04 Thread Lucio Chiappetti
On Mon, 3 Sep 2007, Shawn Green wrote:

 What I hope to do is to help you to make your queries work better by 
 applying hints, modifiers, and limits to the optimizer.

Many thanks, Shawn.

 STRAIGHT_JOIN is identical to JOIN, except ...
 
 You asked what would happen if you replaced all of your view's LEFT JOIN 
 clauses with STRAIGHT_JOIN clauses. [...] This is not what you are 
 trying to achieve.

I realized that. I overlooked "is identical to JOIN" since I generally use 
only LEFT JOINs, to benefit from null returns (to us it is equally important 
to know that a source A has these 1/2/n counterparts, or that it has no 
counterparts in the other catalogue/s).

I originally tested that STRAIGHT_JOIN made the explain select come to an 
end without looping forever, but as soon as I tried a real select I realized 
the different behaviour.

 However, there is another place you can put the STRAIGHT_JOIN modifier: 
 in the SELECT clause

I discovered that too, and it looks promising. I plan to test whether that 
makes any difference on the query result and execution time in some 
representative cases.

 Why are some sets of tables optimized more quickly than the others? My 
 best guess would be that some mix of WHERE conditions and table indexes 
 make it much faster to eliminate combinatorial permutations of JOIN 
 sequences than others.

I also realized that the WHERE part enters the optimization too ... this 
is of course unpredictable a priori, as it depends on the particular user 
needs.

 This is the variable optimizer_search_depth:

This also looks interesting, thanks. I did some quick experimenting. For 
some reason SHOW VARIABLES tells me its normal value is 62. Apparently 
this does not depend on the database I select (I thought my real data 
database with the views and G's could have a higher value than e.g. my 
own administrative DB or the mysql DB, but it is always 62).
 
I tried lowering it to 10/15/20 and even to 0 (which according to 5.2.3
in the manual should be a sort of automatic value). In all cases the 
explain select on our maximal statement (the one which loops forever in 
statistics status under default conditions) reaches an end rather fast, 
and returns the same result (irrespective of 0/10/15/20).

The ORDER returned by playing with optimizer_search_depth is however
different from the one returned with SELECT STRAIGHT_JOIN. The difference 
is just in the order (the type, key and extra columns look the same, in 
general ref or eq_ref and using index) so I suppose it should not make much 
difference.

 You can set this variable on a per-session basis so it would be possible 
 to tune it for each call to your views.

While I'm doing the tests under the mysql linemode client, our production 
environment is a Java jsp interface under tomcat. I have to check with my 
colleague dealing with java programming what a session is for him 
(probably a JDBC connection in a connection pool).

Actually, what looks appealing in optimizer_search_depth vs SELECT 
STRAIGHT_JOIN is the possibility of fixing (no tuning per session) 
optimizer_search_depth to a decent default (0 ?) once and for all, because 
this requires no changes to the code (whereas SELECT STRAIGHT_JOIN would 
require changes, though concentrated in a few places).
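
If we go that way, a minimal sketch (assuming MySQL 5.0 behaviour) would be 
either a runtime global setting or a line in the server option file:

SET GLOBAL optimizer_search_depth = 0;   -- needs SUPER; affects new sessions

# or, permanently, in my.cnf under the [mysqld] section:
#   optimizer_search_depth = 0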

I'll do some experimenting and report back.

-- 
---
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
---
()  ascii ribbon campaign - against html mail 
/\  http://arc.pasp.de/




Re: excessive time spent in statistics status

2007-09-04 Thread Lucio Chiappetti
On Tue, 4 Sep 2007, Lucio Chiappetti wrote:

 I'll do some experimenting and report back.

For lack of better ways of doing a tie-break, I've done the following 
tests (with the linemode client), checking both the results of a query and 
the total time spent. I tested 16 different combinations of arbitrary 
WHERE conditions and correlation with an arbitrary external table, using 
my maximal G (the one with 26 members).

For each I tested 3 cases (total 16*3 = 48), sketched in SQL after this list:

 a) the query on the virtual table correlated with the external table
    (the virtual table is my G left join t1 ... left join tn). This involves
    the lowest number of joins (29) and is of course the fastest (and also
    the easiest for getting access to hidden columns by name), but it is not
    what I intend our users to use, because the way to select columns is
    clumsy (they should use a VIEW on the 26 members instead ... normally,
    with no correlation on an external table, that will be equally fast).
    It represents my REFERENCE, i.e. I checked that the results of the other
    tests are the same as this.

 b) the query on the VIEW V with members also (which implies a redundant
join of V with G left join t1 ... left join tn) and correlated with
an external table. Doubles the joins (59). I used here
SELECT STRAIGHT_JOIN.  

 c) the same query of (b) but with a normal select, and preceded by
setting optimizer_search_depth=0 (auto).
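
Schematically (a sketch only; table and column names abbreviated, with "ext" 
standing for the arbitrary external table):

-- (a) virtual table correlated with the external table
SELECT ...
FROM G LEFT JOIN t1 ON G.t1 = t1.seq ... LEFT JOIN tn ON G.tn = tn.seq
       LEFT JOIN ext ON ...
WHERE ...;

-- (b) view V with members also, STRAIGHT_JOIN as a SELECT modifier
SELECT STRAIGHT_JOIN ...
FROM (G LEFT JOIN t1 ON G.t1 = t1.seq ... LEFT JOIN tn ON G.tn = tn.seq)
     LEFT JOIN V ON G.seq = V.seq
     LEFT JOIN ext ON ...
WHERE ...;

-- (c) as (b) but with a plain SELECT, preceded by
SET SESSION optimizer_search_depth = 0;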

I did each test in a fresh mysql session to prevent cached values from 
affecting the timings.

The good news is that all queries run to completion without getting stuck in 
the statistics state, and that (a), (b), (c) for the same query return the 
same results.

Considering the timing instead :

 (a) are obviously the fastest (from 0.0 to 0.63 seconds at worst)

 (b) the queries with SELECT STRAIGHT_JOIN (not optimized ?) are
     *in general* the slowest but not too slow, i.e. they complete within
     0.8 to 1.9 sec. There are however a couple of cases involving one
     external table in which they are slower (though not unfeasible),
     from 3 to 12 sec in one case, and from 24 to 45 sec in another.

 (c) the queries with optimizer_search_depth=0 have a speed comparable
     with (b). *In general* they are marginally faster than (b) (0.7
     to 1.6 sec) when (b) is reasonably fast.  When (b) is slow,
     however, (c) is TWICE AS SLOW (5 to 23 and 48 to 89 sec).

The explain select does not give obvious clues as to why those particular 
table combinations are slower, and, considering that the queries involving 
view + members also + external concern an absolute minority of users [*], 
it is not worth spending more time investigating.

[*] actually a larger minority (I hope it remains such) of our public
users accesses the DB once to take away all data without performing any 
selection  :-(

My inclination therefore would be to prefer optimizer_search_depth=0
to SELECT STRAIGHT_JOIN *if* it can be easily arranged in our tomcat 
environment, because of the marginal increase in speed despite the 
occasional worsening. But the two look almost equivalent.

-- 
---
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
---
()  ascii ribbon campaign - against html mail 
/\  http://arc.pasp.de/




Re: excessive time spent in statistics status

2007-09-04 Thread Shawn Green

Lucio Chiappetti wrote:

On Tue, 4 Sep 2007, Lucio Chiappetti wrote:


I'll do some experimenting and report back.


In lack of better ways of doing a tie-break, I've done the following 
tests (with the linemode client), checking both the results of a query and 
the total time spent. I tested 16 different combinations of arbitrary 
WHERE conditions and correlation with an arbitrary external tables, using 
my maximal G (the one with 26 members).


For each I tested 3 cases (total 16*3=48) :

 a) the query on the virtual table correlated with the external
(the virtual is my G left join t1 ... left join tn). This involves
the lowest number of joins (29) and is of course the fastest (and 
also the easiest to get access to hidden columns by name, but

is not what I intend our users use because the way to select columns
is clumsy (they should use a VIEW on the 26 members instead ... 
normally with no correlation on external that will be equally fast), 
but represents my REFERENCE, i.e. I checked the results of the other
test to be the same as this.  


 b) the query on the VIEW V with members also (which implies a redundant
join of V with G left join t1 ... left join tn) and correlated with
an external table. Doubles the joins (59). I used here
SELECT STRAIGHT_JOIN.  


 c) the same query of (b) but with a normal select, and preceded by
setting optimizer_search_depth=0 (auto).
snip


In your b) test, did you use the SELECT STRAIGHT_JOIN as your outer 
SELECT statement or within the CREATE VIEW statement? If you only tried 
it one way, you could try it the other, too.




[*] actually a larger minority (I hope it remains such) of our public
users accesses the DB once to take away all data without performing any 
selection  :-(




It saddens me to see people abuse your hard work in this way.


My inclination therefore would be to prefer optimizer_search_depth=0
to SELECT STRAIGHT_JOIN *if* it can be easily arranged in our tomcat 
environment, because of the marginal increase in speed despite the 
occasional worsening. But the two look almost equivalent.




Another option would be to build your view dynamically based on the 
criteria that a user selects through a web-based interface.


One other way to approach this project would be to normalize your 
relationships and have one skinny but very tall table similar to


CREATE TABLE map_table (
  object1_id int
  , object1_type int
  , object2_id int
  , object2_type int
  , confidence tinyint
)

where confidence would be a whole number from 0 to 100.  You would need 
only one of these to replace each G table you are generating now.
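
For example, a hypothetical query against such a table (assuming the object 
types are encoded as integers, say 0 for the X-ray catalogue and 4 for an IR 
catalogue) would pull all IR counterparts of one X-ray source:

SELECT object2_id, confidence
FROM map_table
WHERE object1_type = 0
  AND object1_id = 1521
  AND object2_type = 4
ORDER BY confidence DESC;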


However, that would be a major difference in how you currently use your 
data and I would not suggest this for a near-term solution.

--
Shawn Green, Support Engineer
MySQL Inc., USA, www.mysql.com
Office: Blountville, TN
 Join the Quality Contribution Program Today!
 http://dev.mysql.com/qualitycontribution.html




Re: excessive time spent in statistics status

2007-09-03 Thread Lucio Chiappetti
As I'm new on this list and can't find a policy statement, I assume it 
is OK to reply to the list in discussion-list fashion. If instead it is 
preferred that I reply to each sender privately and post a summary later, 
please let me know and I'll comply.

On Sat, 1 Sep 2007, Shawn Green wrote:
 Lucio Chiappetti wrote:

  I have some queries, involving a largish number of JOIN, which are
  [...] very slow or [...] remain in the statistics status [forever]

  This involved creating a working table G
  [...]
  The simultaneous access was achieved in our interface by a mechanism
  we called virtual tables, which essentially was
  
  SELECT
  some subset of columns in some of the t0...tn
  or some expression thereof
  FROM
  G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
  ...
left join tn on G.tn=tn.seq
  
  We refer to the t0...tn as the member tables of G.

 I normally do not get lost in symbolic descriptions such as yours. 
 However your description of G and how you build it leaves me in the 
 dark.
 
 Can you show me a few sample rows of G (symbolically, if you like) and 
 describe what is in each column G.t0 to G.tn?  What I am curious to know 
 is what do each of these n object have in common that allows them to 
 represented as a single tuple on the G table.

The explanation was already (too tersely ?) contained in my sentence

  This table has columns named t0 t1 ... tn (containing the pointers
  t0.seq t1.seq ... for counterparts associated and validated according
  to some criteria) plus other service columns

Let me give an example. 

t0 is a table of X-ray sources, it contains sky coordinates (ra,dec), 
count rates and fluxes in 5 energy bands, detection probabilities etc.
In addition it contains a column named seq which is just an 
auto_increment sequential number, and is the unique way to identify a 
given source (there is a further complication due to the fact some records 
are redundant detections of the same source, but since they are not used 
in building the G's you'd better ignore this).

t1 might be e.g. a table of optical sources, with sky coordinates, 
magnitudes, and, of course, its own seq. There may be different tables 
of optical sources (t1, t2, t3). 

Or t4 can be a table of radio sources, with sky coordinates, fluxes, and 
its seq.

Or t5 can be a table of URLs into external astronomical sites like SIMBAD 
or NED (if you know them), again with sky coordinates and a seq.

Some of these tables have their own identifiers, but sometimes these 
aren't unique, or aren't numeric. So in general these tables MAY have a 
UNIQUE PRIMARY index, which can be either the original identifier or some 
combination of original columns (for instance a source identifier and a 
field identifier), and a unique auxiliary key which is my auto_increment 
seq (numeric, built at data ingestion). This again is mostly irrelevant to 
you. The point is that each table has a unique auto_increment seq.

A G table will simply contain the seq's of the member tables. For 
ease of use the column names in G will be the table names of the member 
tables. I indicated them as t0 t1 t2 ...; that their actual names are e.g. 
nov06, d1t3, ukidss, radio, simbad is irrelevant.

So a record in G may contain for instance :

 - its own seq 253719 (do not be worried by the fact that the number is 
   large; there are lots of gaps for records removed during construction)
 - the seq in the X-ray table (t0) : 1521
 - the seq in an optical table (t1) : 1229
 - the seq in another optical table (t2) : 42168
 - the seq in the radio table (t3) : null
 - the seq in an IR table (t4) : 9

Another record with seq 260429 can have the same t0=1521, but e.g. the seq 
in the IR table t4=11, and all other t1 t2 t3 null.

Etc. etc.

Essentially G says that X-ray source 1521 can have up to 2 (or 1 or 7 or 
whatever) potential counterparts: one is optical t1=1229, which is the same 
object as optical t2=42168 and as IR t4=9; the other is only IR t4=11, 
etc. etc.

All associations are pre-computed via some sort of other (proximity) 
analysis.

 You also mention other service columns. What kinds of information are 
 you keeping in those?

Information which is irrelevant to the present discussion, except for a 
marginal point (see below). Like for instance a numeric rank which says 
that the association 1521/1229/42168/null/9 is preferred, and the 
association 1521/null/null/null/11 is unlikely, or to be rejected. Or 
flags produced during the identification. Or the chance probabilities that 
the association of a source in t0 and t1 or t2 is real, considering the 
distance and the density of objects having a given magnitude.

  We have different versions of G corresponding to different sets
  of member tables and different association criteria.

 Each G ?  Again, that makes the concept of what a G really is more 
 confusing to me. I understand databases and I know more than a little 
 about stellar cartography, cosmology, and physics. Please don't hold back.

Re: excessive time spent in statistics status

2007-09-03 Thread Shawn Green

Hello Lucio,

Thank you for the excellent description of your problem. I believe I 
completely understand both the data you are handling and the problems 
you are facing.


I would not ask you to change your schema at this point. There is far 
too much work put into it at this phase to suggest a redesign. What I 
hope to do is to help you to make your queries work better by applying 
hints, modifiers, and limits to the optimizer.


http://dev.mysql.com/doc/refman/5.0/en/join.html states:
STRAIGHT_JOIN is identical to JOIN, except that the left table is 
always read before the right table. This can be used for those (few) 
cases for which the join optimizer puts the tables in the wrong order.


You asked what would happen if you replaced all of your view's LEFT JOIN 
clauses with STRAIGHT_JOIN clauses. When used in the FROM clause (as 
part of the definitions of where the data comes from) STRAIGHT_JOIN 
would be equivalent to a JOIN which is equivalent to an INNER JOIN. This 
is not what you are trying to achieve.


However, there is another place you can put the STRAIGHT_JOIN modifier: 
in the SELECT clause 
(http://dev.mysql.com/doc/refman/5.0/en/select.html) . Quoting again:
 STRAIGHT_JOIN forces the optimizer to join the tables in the order in 
which they are listed in the FROM clause. You can use this to speed up a 
query if the optimizer joins the tables in non-optimal order. See 
Section 6.2.1, “Optimizing Queries with EXPLAIN”. STRAIGHT_JOIN also can 
be used in the table_references list. See Section 12.2.7.1, “JOIN Syntax”.


SELECT STRAIGHT_JOIN 
FROM ...
LEFT JOIN ...
WHERE ...
...

This will allow you to keep your LEFT JOINs in the table reference 
portion of your query (everything between FROM and WHERE) but avoid all 
of the permutations the optimizer performs related to trying to analyze 
which table to join first to which other table.  Why are some sets of 
tables optimized more quickly than the others? My best guess would be 
that some mix of WHERE conditions and table indexes make it much faster 
to eliminate combinatorial permutations of JOIN sequences than others.
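
A minimal sketch of the difference between the two placements (table names 
borrowed from your description, columns just placeholders):

-- as a SELECT modifier: the join order follows the FROM clause, but the
-- joins remain LEFT JOINs, so rows with NULL counterparts are kept
SELECT STRAIGHT_JOIN G.seq, t1.seq
FROM G
LEFT JOIN t1 ON G.t1 = t1.seq;

-- as a join operator: equivalent to an inner JOIN with a fixed order,
-- so rows with no counterpart in t1 would be lost (not what you want here)
SELECT G.seq, t1.seq
FROM G STRAIGHT_JOIN t1 ON G.t1 = t1.seq;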


There is another variable you could use to minimize how many table 
permutations the optimizer will examine. This is the variable 
optimizer_search_depth: 
(http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#option_mysqld_optimizer_search_depth)


(My apologies if that link wraps incorrectly). You can set this variable 
on a per-session basis so it would be possible to tune it for each call 
to your views. I know that would be quite the administrative hassle to 
implement setting this variable for each call to your views but I 
provide it here for completeness.
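
In practice that would look something like this (a sketch; V stands for one 
of your views):

SET SESSION optimizer_search_depth = 0;  -- 0 asks the server to pick a depth
SELECT COUNT(*) FROM V;                  -- this and later queries in the same
                                         -- session use the reduced depth
SET SESSION optimizer_search_depth = DEFAULT;  -- back to the global value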


Please try out SELECT STRAIGHT_JOIN and let me know if your situation 
improves.


BTW - we encourage everyone to reply to the full list on all responses 
(unless they are confidential) so that all members can gain from the 
knowledge transfer.

--
Shawn Green, Support Engineer
MySQL Inc., USA, www.mysql.com
Office: Blountville, TN
 Join the Quality Contribution Program Today!
 http://dev.mysql.com/qualitycontribution.html






Re: excessive time spent in statistics status

2007-09-01 Thread Shawn Green

Hello Lucio,

(reply below)
Lucio Chiappetti wrote:
I have some queries, involving a largish number of JOIN, which 
are apparently very slow or even take forever (a mysqladmin processlist 
shows them remain in the statistics status for a long time, in most 
cases I have to kill them after several minutes).


When I first had the problem I googled around and found some reference 
(which I've lost) saying that the statistics status is actually what one 
does with an EXPLAIN SELECT, and that this is done preliminarily to the 
actual query. It also said it might occur with a large number of joins 
because this analysis, for n joins MIGHT try up to n! combinations, unless 
one somehow specified the priorities (but the author did not remember 
how).



You can find those explanations in our manual at 
http://dev.mysql.com/doc/refman/5.0/en/thread-information.html




I thought to have overcome the problem using a feature of the CREATE 
VIEW command (see below), but apparently I simply moved it to an higher n.


Now I tried to see how it scales with the number of joins, and the curious 
things is that e.g. for n=9 it works fast, for n=15 it works slowly, for 
n=18 works fast again and for n=20 takes an infinite time.


I'll first explain my background :

 - I have a number of tables (let's call them t0, t1, t2 ...)
 - all of them have an auto_increment column called seq which is also
   an index

 - one table (t0) is more important (actually it is list of
   celestial X-ray sources while the other are celestial objects
   in other wavebands but this is irrelevant to you).

 - I have precomputed correlation tables among t0 and each of
   the other. These tables are called eg. t0ti, have two columns
   t0 and ti corresponding to the t0.seq and ti.seq of objects which
   are associated. They are indexed on (t0,ti). Note that an
   object in t0 can be associated with 1 or more or zero (t0ti.ti null)
   objects in ti.

 - I originally (already under mysql 3) devised a way to identify 
   counterparts in MORE tables (all these associations are based on 
   spherical distance + other criteria). This involved creating a

   working table G

   This table has columns named t0 t1 ... tn (containing the pointers
   t0.seq t1.seq ... for counterparts associated and validated according 
   to some criteria) plus other service columns


   The simultaneous access was achieved in our interface by a mechanism
   we called virtual tables, which essentially was

 SELECT
  some subset of columns in some of the t0...tn
  or some expression thereof
 FROM
  G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
...
left join tn on G.tn=tn.seq


   We refer to the t0...tn as the member tables of G.



I normally do not get lost in symbolic descriptions such as yours. 
However your description of G and how you build it leaves me in the dark.


Can you show me a few sample rows of G (symbolically, if you like) and 
describe what is in each column G.t0 to G.tn?  What I am curious to know 
is what each of these n objects have in common that allows them to be 
represented as a single tuple in the G table.


You also mention other service columns. What kinds of information are 
you keeping in those?





   We have different versions of G corresponding to different sets
   of member tables and different association criteria.

   The largest of our cases has 26 different members.
snip
 - for each G we define a view as



Each G ?  Again, that makes the concept of what a G really is more 
confusing to me. I understand databases and I know more than a little 
about stellar cartography, cosmology, and physics. Please don't hold back.




snipped (to be revisited later)

Questions :

 - what does explain select actually do and why sometimes hangs ?

 - can this be overcome rearranging the order of the joins (note that
   the bulk of the members are all joined with G), or introducing
   parentheses or with other syntax changes ?

 - or has it to do with some configuration parameter, maybe related
   to what is cached, cache size or other ?
 
Thanks in advance to whoever is able to give hints.




EXPLAIN SELECT simply stops a normal SELECT statement from actually 
performing the data retrieval steps and shows us (the users) a 
description of the techniques the query engine was about to use to get 
at the data. That means that the steps of parsing the query, tokenizing 
the symbols, and optimizing the execution plan still take place. It is 
during this optimization phase that most of your CPU time is being used 
as the engine will work many permutations of joining one table to 
another until it reaches a decision about which plan is less expensive 
than all of the others.
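
Schematically (names from your description), running

EXPLAIN SELECT G.seq, t0.seq
FROM G LEFT JOIN t0 ON G.t0 = t0.seq;

returns one row per table with the chosen access type, key, and row estimate; 
the time that elapses before those rows appear is exactly this parse/optimize 
("statistics") work, even though no data rows are read.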


http://dev.mysql.com/doc/refman/5.0/en/where-optimizations.html
http://dev.mysql.com/doc/refman/5.0/en/left-join-optimization.html
and the rest of the optimization chapter describe this process in 
considerable detail.


excessive time spent in statistics status

2007-08-31 Thread Lucio Chiappetti
I have some queries, involving a largish number of JOINs, which are 
apparently very slow or even take forever (a mysqladmin processlist 
shows them remaining in the statistics status for a long time; in most 
cases I have to kill them after several minutes).

When I first had the problem I googled around and found some reference 
(which I've lost) saying that the statistics status is actually what one 
does with an EXPLAIN SELECT, and that this is done as a preliminary step 
before the actual query. It also said it might occur with a large number of 
joins, because this analysis MIGHT try up to n! combinations for n joins, 
unless one somehow specified the priorities (but the author did not 
remember how).
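
(For a sense of scale, taking the n! worst case at face value: 9 tables give 
9! = 362,880 candidate join orders, 20 tables give 20! or roughly 2.4 x 10^18, 
and 26 members plus G give 27! or roughly 1.1 x 10^28.)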

I thought I had overcome the problem using a feature of the CREATE 
VIEW command (see below), but apparently I simply moved it to a higher n.

Now I tried to see how it scales with the number of joins, and the curious 
thing is that e.g. for n=9 it works fast, for n=15 it works slowly, for 
n=18 it works fast again, and for n=20 it takes an infinite time.

I'll first explain my background :

 - I have a number of tables (let's call them t0, t1, t2 ...)
 - all of them have an auto_increment column called seq which is also
   an index

 - one table (t0) is more important (actually it is a list of
   celestial X-ray sources while the others are celestial objects
   in other wavebands, but this is irrelevant to you).

 - I have precomputed correlation tables between t0 and each of
   the others. These tables are called e.g. t0ti, and have two columns
   t0 and ti corresponding to the t0.seq and ti.seq of objects which
   are associated. They are indexed on (t0,ti). Note that an
   object in t0 can be associated with 1 or more or zero (t0ti.ti null)
   objects in ti.

 - I originally (already under mysql 3) devised a way to identify 
   counterparts in MORE tables (all these associations are based on 
   spherical distance + other criteria). This involved creating a
   working table G

   This table has columns named t0 t1 ... tn (containing the pointers
   t0.seq t1.seq ... for counterparts associated and validated according 
   to some criteria) plus other service columns

   The simultaneous access was achieved in our interface by a mechanism
   we called virtual tables, which essentially was

 SELECT
  some subset of columns in some of the t0...tn
  or some expression thereof
 FROM
  G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
...
left join tn on G.tn=tn.seq

   We refer to the t0...tn as the member tables of G.

   We have different versions of G corresponding to different sets
   of member tables and different association criteria.

   The largest of our cases has 26 different members.

   Our mechanism was such that we defined a subset of columns in
   each of the ti (or expressions thereof like distances etc.) as
   interesting, with an associated alias. Our interface usually
   showed only such virtual columns, but offered the possibility to add
   to the SELECT (naming them manually as ti.colname) also all the
   other member columns, which are normally hidden.

   We also allow correlating a virtual table with a single physical
   table tk (be it member or not) using the t0tk correlation table
   (t0 is the First Member).

 - the above worked and still works, but has some clumsiness. When
   we upgraded to mysql 5 and discovered the CREATE VIEW command
   we decided to replace our virtual tables with views.

 - for each G we define a view as

 create algorithm=temptable view V as
 SELECT
  some subset of columns in some of the t0...tn or in G
  or some expression thereof
 FROM
  G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
...
left join tn on G.tn=tn.seq

 - the algorithm=temptable was required because without it some
   of our queries (see below) entered the statistics status forever
   already with 11 members
   
 - for the rest the VIEWs work nicely when used standalone and are easier
   for the user ...

 - ... but on the other hand they HIDE the member columns which are
   not explicitly named in CREATE VIEW (where one wants to keep a
   manageable number of columns). Hide means here that their names
   ti.colname cannot be used in SELECT !   

 - so we devised an option by which, on ticking show members also,
   one can also include these ti.colname in the query
   de facto this doubles the joins, because the statement built is

   SELECT 
list of (V.colname and ti.colname with i chosen among 0 and n)
   FROM
( G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
...
left join tn on G.tn=tn.seq
)   left join V  on G.seq=V.seq 

   This statement NOW works (it did not work with e.g. 11 member tables
   before we switched to ALGORITHM=TEMPTABLE in the CREATE VIEW).

   An explain select for a query on such views gives that a view with
   n members with members also