Re: Result row count *without* buffering all results?
Harrison Fisk wrote:
> Hi Ken,
>
> On Sep 18, 2005, at 3:53 PM, Ken Gieselman wrote:
>> Scott Gifford wrote:
>>> Ken Gieselman [EMAIL PROTECTED] writes:
>>> [...]
>>>> So, here's my question: How can I find out how many rows are being
>>>> returned by a query, without using mysql_stmt_store_result() to
>>>> buffer all the rows back to the client?
>>>
>>> One straightforward way is to replace the SELECT clause with just
>>> SELECT COUNT(*) and execute that query first.
>>>
>>> -ScottG.
>>
>> Yeah, that's my fall-back option -- though the thought of executing
>> the query twice is a bit daunting. Some of the tables run into
>> billions of rows per year (the merge tables, anyhow; the data tables
>> are broken down by month to keep them from becoming totally
>> unmanageable), and a multi-year search can take a while to grind out.
>>
>> Seems to me that by the time the first query execution is done, the
>> server should *know* exactly how many rows are in the result set --
>> just need to find the magic trick that allows the client to query
>> that number, rather than counting for itself as it buffers them.
>
> The problem is that your assumption isn't always correct. The server
> doesn't always know how many rows are in a result set before it starts
> returning rows back to the client.
>
> Imagine if you did a simple SELECT * FROM tbl WHERE unindexed_col = 5;
> statement. The server certainly isn't going to read in and count the
> number of rows before beginning to send the rows back to the client,
> for the same reason that you don't want to do a store_result: it would
> take up too many resources. The only way for the server to know 100%
> is to count the rows as they are being returned to the client. (It
> could in theory do it for some statements, such as where it has to do
> a filesort; however, an API that only sometimes worked, depending on
> the execution plan, wouldn't be very useful.)
>
> Regards,
> Harrison

That makes a fair amount of sense :) I guess I'd just envisioned it as having a list of pointers to matches or something similar internally.
OK, given the limitations, is there an effective method to *estimate* what's likely to come back? I've played a bit with pulling the numbers out of an EXPLAIN SELECT, but it seems that they tend to come in low if the query is using indexes, or high (all rows) if not.

Appreciate the input, guys! Thanks a bunch.

ken

===
  Diplomacy is the weapon of the Civilized Warrior  - Hun, A.T.
  Ken Gieselman                          [EMAIL PROTECTED]
  Endlessknot Communications             http://www.endlessknot.com
===

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
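For what it's worth, one way to sidestep the count entirely is to chunk the scan by an indexed column rather than asking how many rows are coming back. A minimal sketch in Python -- the table and column names follow the examples in this thread, and the helper itself is hypothetical, not anything from the MySQL API:

```python
# Sketch: chunk a large scan by an indexed column (keyset-style ranges)
# instead of counting rows up front. Each chunk's result set stays small
# enough to buffer; no second COUNT(*) pass is needed.

def chunked_queries(table, key, key_min, key_max, step):
    """Yield range-bounded SELECTs covering [key_min, key_max] without overlap."""
    lo = key_min
    while lo <= key_max:
        hi = min(lo + step - 1, key_max)
        yield (f"SELECT c1 FROM {table} "
               f"WHERE {key} BETWEEN {lo} AND {hi}")
        lo = hi + 1

# Example: split 10 granIDs into chunks of 5.
queries = list(chunked_queries("raddata_2004_10", "granID", 147568, 147577, 5))
```

Each generated statement can then be executed and fetched row-by-row; the client never needs the total count, only the key bounds.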
Re: Result row count *without* buffering all results?
Scott Gifford wrote:
> Ken Gieselman [EMAIL PROTECTED] writes:
> [...]
>> So, here's my question: How can I find out how many rows are being
>> returned by a query, without using mysql_stmt_store_result() to buffer
>> all the rows back to the client?
>
> One straightforward way is to replace the SELECT clause with just
> SELECT COUNT(*) and execute that query first.
>
> -ScottG.

Yeah, that's my fall-back option -- though the thought of executing the query twice is a bit daunting. Some of the tables run into billions of rows per year (the merge tables, anyhow; the data tables are broken down by month to keep them from becoming totally unmanageable), and a multi-year search can take a while to grind out.

Seems to me that by the time the first query execution is done, the server should *know* exactly how many rows are in the result set -- just need to find the magic trick that allows the client to query that number, rather than counting for itself as it buffers them.

Thanks!
ken

===
  Diplomacy is the weapon of the Civilized Warrior  - Hun, A.T.
  Ken Gieselman                          [EMAIL PROTECTED]
  Endlessknot Communications             http://www.endlessknot.com
===
Result row count *without* buffering all results?
Hi Folks --

I'm working on a scientific data warehousing setup using MySQL (currently running 4.1.13 on SPARC/Solaris 9), and using the C prepared statements API for pulling numeric data in and out.

So, here's my question: How can I find out how many rows are being returned by a query, without using mysql_stmt_store_result() to buffer all the rows back to the client? The problem with buffering it all back is that result sets are usually gigabytes, sometimes even terabytes -- there's just no way that most of the client systems can stuff all that into memory. I need to find a decent way to break up the result sets into chunks, but doing that is tricky without knowing just how many rows are coming back :)

I've tried playing with SQL_CALC_FOUND_ROWS in the queries, but I don't think I can execute a second query via the same database connection before dealing with the entire result set from the main query. Is there a good way to execute multiple queries in parallel via one connection?

Suggestions, questions, even flames welcome. Thanks in advance!

ken

===
  Diplomacy is the weapon of the Civilized Warrior  - Hun, A.T.
  Ken Gieselman                          [EMAIL PROTECTED]
  Endlessknot Communications             http://www.endlessknot.com
===
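For reference, the usual SQL_CALC_FOUND_ROWS pattern pairs a LIMITed query with a FOUND_ROWS() probe on the same connection -- which, as the post notes, still means finishing (or bounding) the first result set before the second statement can run. A sketch of building the statement pair; the helper is hypothetical and the granID range is illustrative:

```python
# Sketch: the SQL_CALC_FOUND_ROWS / FOUND_ROWS() pairing. The LIMIT bounds
# the first result set so it can be fully consumed quickly, after which
# FOUND_ROWS() reports what the row count would have been without LIMIT.

def counted_chunk(table, where, limit):
    """Return (limited scan statement, row-count probe statement)."""
    scan = (f"SELECT SQL_CALC_FOUND_ROWS * FROM {table} "
            f"WHERE {where} LIMIT {limit}")
    return scan, "SELECT FOUND_ROWS()"

scan, probe = counted_chunk("raddata_2004_10",
                            "granID BETWEEN 147568 AND 147570", 10000)
```

The catch remains exactly the one raised above: the scan's rows must be read (or the scan bounded) before the probe can be issued on that connection.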
Complicated Subquery help
Hiya Folks!

I'm struggling with a complicated subquery issue, and my SQL-fu isn't strong enough to make a solution apparent. Hopefully one of you experts can give me a nudge in the right direction :)

The query utilizes data from just one table, which looks like:

  idX  integer
  idY  tinyint unsigned
  idZ  tinyint
  c1   smallint
  c2   smallint
  c3   smallint
  ... [and so on] ...

What I'd like to do is select a set of the channel data (the c* fields) based on rows where the standard deviation on a given channel is below a certain threshold for the rows near that one, based on the values of the id fields. Getting the standard deviation on a single field for a given row isn't too bad:

  select std(c2330) from radiances
  where idZ between 44 and 46 and idY between 12 and 14 and idX = 7;

That query selects the standard deviation for channel 2330, in the 9 footprints around the current point I'm looking at. Selecting the channels I need to analyze based on that should just use that select inside an IF statement.

The trick is: how do I automate this to iterate over all the id* fields? idX is an auto_increment counter based on observation time; idY and idZ are tied to specific observations within a given set, and are integer counters that vary from 0-150 or so.

Is there a way to do this with one (or more) SQL queries, or would I be better off writing a specific program for doing the selection?

Thanks for the help!
ken

===
  Diplomacy is the weapon of the Civilized Warrior  - Hun, A.T.
  Ken Gieselman                          [EMAIL PROTECTED]
  System Administrator                   http://www.endlessknot.com/~kgieselm
===
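Since MySQL 4.x has no stored procedures or window functions, one practical route is generating the per-window STD() queries from client code and iterating there. A rough sketch -- the window bounds and grid ranges are illustrative assumptions, not from the original post:

```python
# Sketch: generate one STD() query per 3x3 footprint neighborhood, driving
# the iteration over the id* grid from the client rather than inside SQL.
# Column/table names follow the post; the 0-150 grid extent is assumed.

def neighborhood_std_query(channel, idx, idy, idz):
    """STD of one channel over the 3x3 window around (idy, idz) at idX=idx."""
    return (f"SELECT STD({channel}) FROM radiances "
            f"WHERE idX = {idx} "
            f"AND idY BETWEEN {idy - 1} AND {idy + 1} "
            f"AND idZ BETWEEN {idz - 1} AND {idz + 1}")

# One query per interior grid point of a 151x151 set:
queries = [neighborhood_std_query("c2330", 7, y, z)
           for y in range(1, 150) for z in range(1, 150)]
```

Each result can then be compared against the threshold client-side, and only the passing (idX, idY, idZ) triples used in the final channel SELECT.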
Re: Partial Row Reads?
Quoting Tom Cunningham [EMAIL PROTECTED]:

> I think what Harald is saying (if he's not, then I say this): you could
> have an alternative table structure like this -- it should make queries
> much quicker:
>
>   create table raddata_2004_10_ONE (
>     granID integer not null,
>     scanID tinyint unsigned not null,
>     fpID   tinyint not null,
>     c1     float not null,
>     ...
>     c500   float not null
>   )
>
>   create table raddata_2004_10_TWO (
>     granID integer not null,
>     scanID tinyint unsigned not null,
>     fpID   tinyint not null,
>     c501   float not null,
>     ...
>     c1000  float not null
>   )
>
> But maybe with a more logical way of partitioning the columns among the
> different tables.

Yeah, we looked at doing something like this. It would make the tables smaller, though queries would be much more painful. Unfortunately, there's not really any common pattern for access to the various detector values. Each of the people analyzing the data has a different set that they like to work with, depending on the information they're trying to extract. The merge tables would also require MASSIVE numbers of filehandles, since it would add another 5x to the eventual number of tables (all of the tables are aggregated over time via merge tables for the users).

> Or, another option, perhaps you've thought of: have a table like this --
> should make indexing much worse, but querying much better.
>
>   create table raddata_2004_10_TWO (
>     granID integer not null,
>     scanID tinyint unsigned not null,
>     fpID   tinyint not null,
>     cID    smallint not null,
>     cValue float not null,
>     primary key (granID, scanID, fpID, cID),
>     index (granID, cID)
>   )
>
> OLD QUERY:
>   Select c1 from raddata_2004_10 where granID between 147568 and 15
>
> NEW QUERY:
>   Select cValue from raddata_2004_10 where granID between 147568 and 15
>   and cID=1;
>
> (should be v. fast)

Interesting :) It would make reads a lot faster, but I see a couple of downsides.
First, it means that I'm duplicating the 6 bytes worth of ID fields (plus the row/index overhead) a LOT more -- the main problem with this setup is the volume of data that I'm trying to address. With 2500 columns, I get 12150 rows per granule, or just short of 3 million rows a day. With the structure you suggested, I'd get 7.29 billion rows a day... wonder how the mysql engine would handle 235 billion rows per month in a table. Add the extra overhead, and I don't think we could manage the storage requirements (it works out to just under 10TB a year as it is now, with the other tables) -- it works out to 2.625 TB a month for just the one table this way, unfortunately.

> Incidentally, I have a question: when you have a table like this with a
> primary key which has a lot of columns, is there any performance benefit
> to adding a new primary key as an auto-increment column, keeping the old
> primary key as a unique index? I thought maybe there might be some
> addressing / hashing issues which worked out quicker?

Interesting idea. Not sure what the gain would be, at least in this case, however. Most of the queries are based on channel values, or other data like geolocation or time, and are just related to this table by the 3 id fields. I'd be willing to give it a shot, but I'm not sure what an autoincrement field would gain when the queries aren't based on insert-order.

Thanks for the insight! Appreciate all the suggestions that you guys are throwing into the hat!

ken

===
  Diplomacy is the weapon of the Civilized Warrior  - Hun, A.T.
  Ken Gieselman                          [EMAIL PROTECTED]
  System Administrator                   http://www.endlessknot.com/~kgieselm
  Endlessknot Communications             http://www.endlessknot.com
===
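The blow-up described above checks out with quick arithmetic. A sketch -- the 240 granules/day figure is inferred from the "3 million rows a day" quoted earlier, and the 30-day month is an assumption:

```python
# Back-of-the-envelope check of the row-explosion numbers in this thread:
# 12150 rows/granule and 2500 channels are quoted; 240 granules/day is
# inferred from "just short of 3 million rows a day".

rows_per_granule = 12150
granules_per_day = 240
channels = 2500

wide_rows_per_day = rows_per_granule * granules_per_day   # ~2.9 million
tall_rows_per_day = wide_rows_per_day * channels          # 7.29 billion

# Per-row cost in the one-channel-per-row layout: 6 bytes of repeated IDs
# plus 2 bytes cID plus 4 bytes float value = 12 bytes minimum.
bytes_per_month = tall_rows_per_day * 12 * 30
tb_per_month = bytes_per_month / 1e12   # ~2.62 TB, close to the 2.625 quoted
```

So the "2.625 TB a month" figure is consistent with 12 bytes per exploded row before any row or index overhead; the real footprint would be larger still.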
Partial Row Reads?
Hi Folks --

Ran into a couple of performance issues, and I'm looking for some optimization tips :) I'm currently using MySQL 4.1.5-gamma, built from the bitkeeper tree a month or so ago.

I have a table which is roughly 2500 columns by 91 million rows (I get 4 of these a month, from the data we're receiving). The tables are MyISAM, and look roughly like:

  create table raddata_2004_10 (
    granID integer not null,
    scanID tinyint unsigned not null,
    fpID   tinyint not null,
    c1     float not null,
    c2     float not null,
    c3     float not null,
    [BIG SNIP]
    c2499  float not null,
    c2500  float not null,
    primary key (granID, scanID, fpID)
  )

The first issue I'm having is indexing performance. Data arrives in granules, each of which adds just over 12,000 rows to this table. With the keys disabled, inserts take about 50 seconds. With keys enabled, the insert takes 120-125 seconds. I can escape some of the pain by just not promoting a table for the users (adding it to the larger merge tables that the users issue queries against) until the entire month's data has been added, and then enabling the keys. On a table with a full month's data, ALTER TABLE raddata_2004_10 ENABLE KEYS takes about 22 hours to complete. Fortunately, it's a write-once setup for the most part, and most of the work of the database engine is selecting/subsetting the data once it's in. So, is there a faster way to insert/index the data? Would a different table or index type improve performance?

The second issue is query performance. It seems that regardless of which fields are selected, it reads the entire row. Since a monthly table averages 840GB, this takes a while, even on a well-organized query like 'Select c1 from raddata_2004_10 where granID between 147568 and 15' (granID tracks the granules from the start of the mission, and is a calculated value based on the date/time of the data in the file, incrementing over time. The 2500 granules in the example above represent about 1/3 of a month -- 10 days or so).
Even making sure it's using the index to do the query (which it does anyhow, for the above query, according to explain), the current behavior makes it read 29548800 complete rows -- 275.4 GB of data -- even though the desired return is about 1/2500th of that (112.7 MB). All of these tables are fixed-format rows.

Is there a configuration or code tweak that would enable MySQL to read just the requested data pieces, or even a smaller subset of each row? With a fixed row format, each field offset should be a relatively simple calculation within a row.

Any/all suggestions, comments, even flames are welcome :) Thanks in advance!

ken

===
  Diplomacy is the weapon of the Civilized Warrior  - Hun, A.T.
  Ken Gieselman                          [EMAIL PROTECTED]
  System Administrator                   http://www.endlessknot.com/~kgieselm
  Endlessknot Communications             http://www.endlessknot.com
===
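The offset arithmetic hinted at here is easy to illustrate: in a fixed-format row, a column's position is just the sum of the widths of the columns before it. A sketch, assuming the raddata_2004_10 layout from the post and a hypothetical 1-byte per-row header (the real MyISAM record header may differ):

```python
# Sketch: byte offset of a channel column within a fixed-format row.
# Widths follow the table definition: granID int(4), scanID tinyint(1),
# fpID tinyint(1), then 2500 four-byte floats. The 1-byte record header
# is an assumption about the on-disk format, not taken from the manual.

RECORD_HEADER = 1          # assumed per-row flag byte
ID_WIDTHS = [4, 1, 1]      # granID, scanID, fpID
FLOAT_WIDTH = 4

def channel_offset(n):
    """Byte offset of column cN within one fixed-format row."""
    return RECORD_HEADER + sum(ID_WIDTHS) + (n - 1) * FLOAT_WIDTH

# Reading only c1 would touch 4 bytes per row at a fixed offset,
# instead of the full ~10 KB row.
```

This is exactly why the 1/2500th figure above is plausible: the engine could in principle seek to a fixed per-row offset rather than stream whole rows.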
RE: Partial Row Reads?
Quoting [EMAIL PROTECTED]:

> Look at some my.cnf options. You can tell mysql to use keys more often
> than table scans with a var called max_seeks_keys=100 // something like
> that

Definitely. In fact, that's not really the issue at hand, since max_seeks_for_key is already set to 1000 here. Shawn hit the nail right on the head for the problem, below.

> I don't think that he is worried about table scanning, he is worried
> about ROW scanning. Each of his rows is so large (2500*(size of float) +
> 3*(size of tinyint) + some other stuff) that just moving that much data
> around through his machine is consuming too much time.

In the example case I posted earlier, 'Select c1 from raddata_2004_10 where granID between 147568 and 15' resolves to returning one column from each of 29548800 rows. Since the users are actually querying against merge tables of the monthly data sets, the problem is compounded quickly. We are currently working with 2 years of data, with another 3-5 expected in due time, and more if we get lucky with instrument lifetime. Doing full row scans, even for queries where the rows are really limited by the WHERE constraints, is time prohibitive.

> My idea is to create a way to ask the engine to check the value of the
> second field directly from the DISK copy of each table (without moving
> each row into local memory). If it matches, a second seek is performed
> to pull in the field(s) specified in his select. Alternatively, we could
> create some way to ask the engine to only pull those columns through
> memory that participate in the SELECT clause or in one of the other ON
> or WHERE conditions. This way we minimize how much data must be moved
> through memory to resolve one of these rather bulky queries. We could
> potentially create a new threshold value -- say, if you use less than
> 10% of the fields on a table in a query, that would cause this
> optimization to kick in.
> If anyone else has any ideas (and I am sure there are many) on how to
> minimize disk traffic and memory throughput in order to handle his
> rather large dataset, this would be a great time to speak up. I know
> that this is a rather extreme case, but solving this issue may make
> MySQL just that much faster for the rest of us.

Sounds like a great approach. Ideally, with fixed-format rows, any given column should be readily addressable with a couple of quick calculations. Assuming that the indexes provide an offset to the matching row(s), the column offset should be simple addition from there: just the total of the lengths of the previous columns (plus null flags and other formatting stuff, if those are interspersed with the column data in the record). Row fragmentation would confuse the issue a bit more, but according to the manual, fixed-format rows shouldn't be subject to that.

This is definitely an edge case =) One of the reasons we're experimenting with a database solution is that the raw data files are already too much to manage efficiently, particularly when it comes to subsetting the data based on individual location, time, or detector values.

Thanks for all the input! I appreciate it deeply.

ken

===
  Diplomacy is the weapon of the Civilized Warrior  - Hun, A.T.
  Ken Gieselman                          [EMAIL PROTECTED]
  System Administrator                   http://www.endlessknot.com/~kgieselm
  Endlessknot Communications             http://www.endlessknot.com
===
Prepared Statement questions/issues
Hi Folks --

I'm running into some issues with large prepared statements, and would love some feedback about approaches or tweaks that I might have overlooked :)

I'm working on a large-scale database for scientific data -- lots of individual detector values and whatnot that unfortunately need to be independently addressable in queries. Makes it more fun, I guess :) In any case, the table structure I'm working with looks like:

  create table raddata (
    granID integer not null,
    scanID tinyint unsigned not null,
    fpID   tinyint not null,
    c1     smallint,
    c2     smallint,
    c3     smallint,
    [and so on and so forth]
    c2376  smallint,
    c2377  smallint,
    c2378  smallint,
    primary key (fpID, scanID, granID)
  );

Overall, it works out to 2381 fields, with an aggregate length of 5058 bytes per row. The data I'm working with comes from instrument data files, which I've pulled apart and stuffed into C structures that look fairly similar:

  typedef struct {
    int    granID;
    u_char scanID;
    char   fpID;
    short  channels[2378];
  } radRow;

From our instrument data, I get about 3 million of these rows a day, and it's constantly gathering data. Each of my data granules has 12150 rows for this table, and inserts are pretty slow. Using single prepared statements, the row inserts take about 70 seconds from the in-memory structs.

I've tried globbing them together by doing multiple-row inserts (INSERT into raddata values (...), (...)), and have found some interesting limitations. I can do 3 rows worth of insert in one statement (15174 bytes of data; query string with ?'s is 14336 bytes). 4+ rows (20232 bytes of data; query string with ?'s for params is 19100 bytes) causes a SIGBUS, and the ingest app dies in a ball of fire. The table is locked for writes before any inserts and unlocked immediately after, so I don't think it's thread contention for the table. Using the groups of 3 rows per insert is actually a little slower, running 75-80 seconds.
I'm currently running both the ingest app and the mysql server (4.1.3-beta, built from source) on a Sun Ultra 60 running Solaris 9.

Any suggestions as to how to do this more efficiently/faster? Or how to increase the buffer space for the prepared statements, so as to do the inserts in larger groups?

Thanks in advance!
Ken

===
  Diplomacy is the weapon of the Civilized Warrior  - Hun, A.T.
  Ken Gieselman                          [EMAIL PROTECTED]
  Endlessknot Communications             http://www.endlessknot.com
===
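While the SIGBUS is being tracked down, one defensive option is to cap rows-per-statement by counting bound parameters as the multi-row INSERT is built, since whatever limit is being hit here clearly sits between 3 and 4 rows' worth. A sketch -- the builder is hypothetical, but the column count matches the table above:

```python
# Sketch: build a multi-row INSERT placeholder string and count bound
# parameters, so a caller can stop growing the batch before it crosses
# whatever client-side buffer limit is causing the crash described above.

N_COLS = 2381  # granID, scanID, fpID + 2378 channels, per the post

def build_insert(table, n_cols, n_rows):
    """Return (statement with ? placeholders, total bound-parameter count)."""
    row = "(" + ", ".join(["?"] * n_cols) + ")"
    sql = f"INSERT INTO {table} VALUES " + ", ".join([row] * n_rows)
    return sql, n_cols * n_rows

sql3, params3 = build_insert("raddata", N_COLS, 3)   # 7143 parameters: works
sql4, params4 = build_insert("raddata", N_COLS, 4)   # 9524 parameters: crashes
```

Binding 7143 vs. 9524 parameters is the difference between the working and crashing cases, which suggests checking any fixed-size buffers sized near that boundary in the ingest code or client library build.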
count() on multiple similar tables?
Heya folks --

I'm trying to come up with a way to count across multiple tables, and failing miserably. I need a simple way, preferably better than looped queries, of summing the number of rows in multiple tables.

Example: I have multiple tables with the same column layout, due to the amount of data expected to land in each one, broken up by year/month. This is simplified some, but should serve to illustrate the issue.

  create table info_2004_03 (
    itemID    integer auto_increment NOT NULL,
    eventID   integer NOT NULL,
    eventName char(40),
    primary key (itemID)
  );

  create table info_2004_04 (
    itemID    integer auto_increment NOT NULL,
    eventID   integer NOT NULL,
    eventName char(40),
    primary key (itemID)
  );

I need to get a total number of itemIDs over all the info_ tables. I could just query each of the tables individually and add the results, but I'm hoping for a more graceful way of doing it.

Thanks in advance!
ken

===
  Diplomacy is the weapon of the Civilized Warrior  - Hun, A.T.
  Ken Gieselman                          [EMAIL PROTECTED]
  System Administrator                   http://www.endlessknot.com/~ken
  Endlessknot Communications             http://www.endlessknot.com
===
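Besides a MERGE table, one single-statement alternative on MySQL 4.1+ (which supports derived tables) is summing per-table counts through UNION ALL. A sketch of generating that statement; the helper is hypothetical:

```python
# Sketch: one statement that sums COUNT(*) across same-layout monthly
# tables via UNION ALL in a derived table, instead of looping queries
# client-side.

def total_count_sql(tables):
    """Build SELECT SUM(n) over per-table COUNT(*) subqueries."""
    parts = " UNION ALL ".join(f"SELECT COUNT(*) AS n FROM {t}"
                               for t in tables)
    return f"SELECT SUM(n) FROM ({parts}) AS counts"

sql = total_count_sql(["info_2004_03", "info_2004_04"])
```

For MyISAM tables, each COUNT(*) with no WHERE clause is a cheap metadata lookup, so this stays fast even as the list of monthly tables grows.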
Re: count() on multiple similar tables?
Great! Thanks for the quick pointer!

ken

Quoting Diana Soares [EMAIL PROTECTED]:

> You may use MERGE TABLES:
> http://dev.mysql.com/doc/mysql/en/MERGE.html
>
> --
> Diana Soares
>
> On Tue, 2004-05-04 at 10:08, Ken Gieselman wrote:
> [...]

===
  Diplomacy is the weapon of the Civilized Warrior  - Hun, A.T.
  Ken Gieselman                          [EMAIL PROTECTED]
  System Administrator                   http://www.endlessknot.com/~ken
  Endlessknot Communications             http://www.endlessknot.com
===
Reading Binary Data
Hi Folks --

I've been working on this issue for a while and not making much headway, so I thought I'd throw it out to the list and see what wisdom came back :)

Problem: I'm trying to find a way to store LOTS of scientific data (mostly 32-bit integers and floating point numbers) in a database for analysis work. The instrument(s) we're working with generate very large volumes of numeric data, roughly 30GB per day. What I'd like to do is have a method read through the data files produced and store the data in tables, making it easy to subset and index the data. OK, doesn't sound too bad off the bat.

Here's the issue: Is there a way to store data in BINARY numeric formats in the database (without converting to strings first)? In two or three of my tables, rows contain 2500-3000 individual numbers, which would be a logistical nightmare to throw into a query string, not to mention the potential loss of accuracy from rounding errors in floating point numbers. I thought LOAD DATA INFILE was the next solution, but it appears that, when using fixed-row format, it uses the DISPLAY sizes of the fields rather than the byte storage requirements for them.

Anyone else out there dealing with large volumes of numeric data?

Thanks in advance,
Ken

===
  Diplomacy is the weapon of the Civilized Warrior  - Hun, A.T.
  Ken Gieselman                          [EMAIL PROTECTED]
  System Administrator                   http://www.endlessknot.com/~kgieselm
  Endlessknot Communications             http://www.endlessknot.com
===
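One workaround for the fixed-width LOAD DATA issue is converting the binary records to tab-delimited text, which LOAD DATA INFILE accepts by default, using repr() so that a float round-trips through text without the rounding loss worried about above. A sketch with a deliberately tiny, made-up record layout (one int plus four floats -- not the real instrument format):

```python
# Sketch: unpack binary records with the struct module and emit
# tab-delimited lines for LOAD DATA INFILE's default format, sidestepping
# the display-size problem of the fixed-row format entirely.
import struct

REC = struct.Struct("<i4f")   # assumed little-endian: id, c1..c4

def record_to_line(buf):
    """One binary record -> one tab-delimited text line."""
    vals = REC.unpack(buf)
    # repr() keeps enough digits for a float to survive the text round-trip
    return "\t".join(repr(v) for v in vals)

# Values chosen to be exactly representable in 32-bit floats:
raw = REC.pack(42, 1.5, 2.25, -3.0, 0.125)
line = record_to_line(raw)
```

A converter like this can stream a 30GB/day feed file-by-file into text that LOAD DATA INFILE ingests directly, with no per-value SQL string building.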