As I'm new on this list and can't find a "policy statement", I assume it is OK to reply to the list in "discussion list" fashion. If instead it is preferred to reply to each sender privately and later I'd post a summary, please let me know and I'll comply.
On Sat, 1 Sep 2007, Shawn Green wrote: > Lucio Chiappetti wrote: > > I have some queries, involving a largish number of JOIN, which are > > [...] very slow or [...] remain in the "statistics" status [forever] > > This involved creating a working table G > > [...] > > The simultaneous access was achieved in our interface by a mechanism > > we called virtual tables, which essentially was > > > > SELECT > > some subset of columns in some of the t0...tn > > or some expression thereof > > FROM > > G left join t0 on G.t0=t0.seq > > left join t1 on G.t1=t1.seq > > ... > > left join tn on G.tn=tn.seq > > > > We refer to the t0...tn as "the member tables" of G. > I normally do not get lost in symbolic descriptions such as yours. > However your description of G and how you build it leaves me in the > dark. > > Can you show me a few sample rows of G (symbolically, if you like) and > describe what is in each column G.t0 to G.tn? What I am curious to know > is what do each of these n object have in common that allows them to > represented as a single tuple on the G table. The explanation was already (too tersely ?) contained in my sentence > > This table has columns named t0 t1 ... tn (containing the pointers > > t0.seq t1.seq ... for counterparts associated and validated according > > to some criteria) plus other service columns Let me make an example. t0 is a table of X-ray sources, it contains sky coordinates (ra,dec), count rates and fluxes in 5 energy bands, detection probabilities etc. In addition it contains a column named "seq" which is just an auto_increment sequential number, and is the unique way to identify a given source (there is a further complication due to the fact some records are redundant detections of the same source, but since they are not used in building the G's you'd better ignore this). t1 might be e.g. a table of optical sources, with sky coordinates, magnitudes, and, of course, its own "seq". There may be different tables of optical sources (t1, t2, t3). Or t4 can be a table of radio sources, with sky coordinates, fluxes, and its "seq". Or t5 can be a table of URLs into external astronomical sites like SIMBAD or NED (if you know them), again with sky coordinates and a "seq". Some of all these tables have their own identifiers, but sometimes these aren't unique, or aren't numeric. So in general these tables MAY have an UNIQUE PRIMARY index which can be either the original identifier, or some combination of original columns (for instance a source identifier and a field identifier) and an unique auxiliary key which is my auto_increment seq (numeric, built at data ingestion). This again is mostly irrelevant to you. The point is that each table has an unique auto_increment seq. A "G" table will simply contain the "seq's" in the "member tables". For ease of use the column names in G will be the table names of the member tables. I indicated them as t0 t1 t2 ... that their actual names are e.g. "nov06", "d1t3", "ukidss", "radio", "simbad" is irrelevant. So a record in G may contain for instance : - its own seq 253719 (do not be worried by the fact the number is large there are lots of gaps for records removed during construction) - the seq in the X-ray table (t0) : 1521 - the seq in an optical table (t1) : 1229 - the seq in another optical table (t2) : 42168 - the seq in the radio table (t3) : null - the seq in an IR table (t4) : 9 Another record with seq 260429 can have the same t0=1521, but e.g. the seq in the IR table t4=11, and all other t1 t2 t3 null. Etc. etc. Essentially G says that X-ray source 1521 can have up to 2 (or 1 or 7 or whatever) potential counterparts, one is optical t2=1229 which is the same as optical t3=42168 and the same as IR t4=9 ; the other is only IR t4=11, etc. etc. All associations are pre-computed via some sort of other (proximity) analysis. > You also mention other "service columns". What kinds of information are > you keeping in those? Information which is irrelevant to the present discussion, except for a marginal point (see below). Like for instance a numeric rank which says that the association 1521/1229/42168/null/9 is preferred, and the association 1521/null/null/null/11 is unlikely, or to be rejected. Or flags produced during the identification. Or the chance probabilities that the association of a source in t0 and t1 or t2 is real considered the distance and the density of objects having a given magnitude. > > We have different versions of G corresponding to different sets > > of member tables and different association criteria. > "Each G" ? Again, that makes the concept of what a G really is more > confusing to me. I understand databases and I know more than a little > about stellar cartography, cosmology, and physics. Please don't hold > back. Each G in the sense of different versions. I may have one G starting from a t0 with X-ray sources in an area of the sky, and another with a t0 in another area or coming from a different analysis. Or even with the same t0 I may have a G whose members are t0 t1 t2 t3 (identification done two years ago with tables available then) and a G whose members are t0 t1 t5 t6 t7 t8 t9 t10 (identification done now, with more tables, or with newer versions of some tables). Again this is irrelevant for MY PROBLEM, except for the following statement : - I have no problems with the SELECT statements I reported above for whatever number of t0...tn member tables. Neither if I issue the statement as such, nor if I encapsulate it in a CREATE VIEW - I start encounter problems if I want to join the VIEW created with the above statement with something else, when n is slighly large (e.g. a G with 11 members), but if I use ALGORITHM=TEMPTABLE the problem goes away, at least for n=26 (my largest G). - I have a problem again for a more complex join, which arises around n=20. Actually it's look like further experimenting shows a solution, but I'm still unsure whether that gives the same results of the old one : such a solution is : replace all LEFT JOIN by STRAIGHT_JOIN Maybe you'd just want to comment on that ? > Please do explain the data problems you are trying to solve in some more > detail as I may be able to help you to design a less cumbersome method > of achieving the same goals. I thank you for your offer. However we do not want to revolutionize all our system and interfaces (Shawn, if you are interested I can give you privately more specific pointers), which has been working well since some years and used inside our consortium, and has just "gone public" for some datasets. The "data problem" is likely to have no or little relation with the astronomical nature of the data, or the background explained above (which howver I hope was of some interest), but can be summarized as follows. Note the problems occur ONLY in the latest steps, which are those of lesser importance for the user (or important for less users) : - there are n tables (all with their "seq" column), which contain several columns, of which some are particularly interesting, and some other are less interesting - these tables are associated via a "G" table which links all the seqs of associated objects. This is precomputed once forever. - the majority of the users will be interested in a small number of interesting columns taken from some of the member tables (or expressions thereof). Howevever we wanted to screen them from entering in a SELECT statement columns explicitly by names like nov06.seq, nov06.fluxb, w1t3.magip or even dist(nov06.ra_corr,nov06.dec_corr,w1t3.ra,w1t3.decl)*3600 (where dist is an UDF) ... and I have even worse expressions. So we originally devised our own java interface to somehow hide the expansion of such aliases, and later embedded the aliases in a CREATE VIEW. The "FROM" of such statements is the one listed above, involving G and a sequence of left joins on t0 to tn. This works nicely and fast in both the old (no view) and new (with view) way. - a minority of users could be interested in accessing also some of the columns in the member tables, which are NOT listed in our non-hidden list or are NOT members of the view. In the old arrangement, we simply had a check box which enabled "view member also" and un-hided the hidden columns in a menu, Also the user could simply type their name (say nov06.snrcd) in the SELECT if he knew the name. Such statement was the SAME involving G and a sequence of left joins on t0 to tn. It is just a matter of listing more columns in the "select" before the "from". However if the new arrangement uses a view V to hide the entire statement, if I WANT USERS TO BE ABLE ALSO TO SEE HIDDEN COLUMNS (if you want THIS is the data problem !!!) I have to explicitly repeat the list of joins. I cannot issue select nov06.snrcd from V because nov06.snrcd was not included in the definition of the view (and there are too many columns in the n members t0...tn to be all named in a viable view). So I generate a statement like that listed at about 2/3 of the post which started the thread SELECT list of (V.colname and ti.colname with i chosen among 0 and n) FROM ( G left join t0 on G.t0=t0.seq left join t1 on G.t1=t1.seq ... left join tn on G.tn=tn.seq ) left join V on G.seq=V.seq Again this works provided one uses ALGORITHM=TEMPTABLE in the CREATE VIEW. Despite the fact some redundant queries are issued. - a real minority of users (me only ? for debugging purposes ? it happened me once that I detected and fixed a typo that way ) could be interested in accessing not only the interesting columns in the view AND the hidden columns in the members, but also columns in a FURTHER TABLE tk (e.g. which is not yet a member, or is a new release of some member which I have to evaluate whether to replace). This involves a more complex join ... which is listed towards the end of the original post and I won't repeat here Such join involves one further view which in turn is a join of the G and a correlation table between tk and the X-ray table t0 (essentially listing all seqs in t0 and tk whose object are closer than a predefined sky distance). It is this latter statement (of rare use, hence my reluctance to revolutionize all the rest) which enters the "statistics state" forever for large number of members. However, as I said, replacing LEFT JOIN by STRAIGHT_JOIN seems to solve it. Now coming to more general issues, I have read (or re-read) most of the manual pages you quoted, but your explanation below is what condenses more effectively what happens. > EXPLAIN SELECT simply stops a normal SELECT statement from actually > performing the data retrieval steps and shows us [...] I used in the past EXPLAIN SELECT mainly to check whether introducing an index (or sometimes a parenthesized order) improved the performance, although why it did it was sort of black magic. > It is during this optimization phase that most of your CPU time is being > used as the engine will work many permutations of joining one table to > another until it reaches a decision about which plan is "less expensive" > that all of the others. So essentially this matches what a lost Google reference said, that in case of many joins an excessive times may be spent checking up to n! combinations unless one somehow "forces the order". What I gather from some hints in the documentation (and my very rough and quick test) is that STRAIGHT_JOIN might be the way to force the order (but will it have some unpleasant side effects or is it safe ?) Thanks again. -- ----------------------------------------------------------------------- Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy) For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html ----------------------------------------------------------------------- () ascii ribbon campaign - against html mail /\ http://arc.pasp.de/ -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]