Re: Partial Row Reads?
Quoting Tom Cunningham <[EMAIL PROTECTED]>:

> I think what Harald is saying (& if he's not, then I say this): you could
> have an alternative table structure like this -- it should make queries
> much quicker:
>
>     create table raddata_2004_10_ONE (
>       granID integer not null,
>       scanID tinyint unsigned not null,
>       fpID   tinyint not null,
>       c1     float not null,
>       ...
>       c500   float not null
>     )
>
>     create table raddata_2004_10_TWO (
>       granID integer not null,
>       scanID tinyint unsigned not null,
>       fpID   tinyint not null,
>       c501   float not null,
>       ...
>       c1000  float not null
>     )
>
> But maybe with a more logical way of partitioning the columns among the
> different tables.

Yeah, we looked at doing something like this. It would make the tables
smaller, though queries would be much more painful. Unfortunately, there's
not really any common pattern of access to the various detector values:
each of the people analyzing the data has a different set that they like
to work with, depending on the information they're trying to extract. The
merge tables would also require MASSIVE numbers of filehandles, since this
would add another 5x to the eventual number of tables (all of the tables
are aggregated over time via merge tables for the users).

> Or, another option you've perhaps thought of: have a table like this --
> it should make indexing much worse, but querying much better:
>
>     create table raddata_2004_10_TWO (
>       granID integer not null,
>       scanID tinyint unsigned not null,
>       fpID   tinyint not null,
>       cID    smallint not null,
>       cValue float not null,
>       primary key (granID, scanID, fpID, cID),
>       index (granID, cID)
>     )
>
> OLD QUERY: Select c1 from raddata_2004_10 where granID between 147568 and 15
> NEW QUERY: Select cValue from raddata_2004_10 where granID between 147568
>            and 15 and cID=1;   (should be v. fast)

Interesting :)  It would make reads a lot faster, but I see a couple of
downsides. First, it means that I'm duplicating the 6 bytes worth of ID
fields (plus the row/index overhead) a LOT more -- and the main problem
with this setup is the volume of data that I'm trying to address. With
2500 columns, I get 12150 rows per granule, or just short of 3 million
rows a day. With the structure you suggested, I'd get 7.29 billion rows a
day... I wonder how the mysql engine would handle 235 billion rows per
month in a table. Add the extra overhead, and I don't think we could
manage the storage requirements (it works out to just under 10TB a year as
it is now, with the other tables) -- this way it works out to 2.625 TB a
month for just the one table, unfortunately.

> ---> Incidentally, I have a question: when you have a table like this
> with a primary key which has a lot of columns, is there any performance
> benefit to adding a new primary key as an auto-increment column, and
> keeping the old primary key as a unique index? I thought maybe there
> might be some addressing / hashing issues which worked out quicker?

Interesting idea. Not sure what the gain would be, at least in this case,
however. Most of the queries are based on channel values, or on other data
like geolocation or time, and are just related to this table by the 3 ID
fields. I'd be willing to give it a shot, but I'm not sure what an
auto-increment field would gain when the queries aren't based on insert
order.

Thanks for the insight! I appreciate all the suggestions that you guys are
throwing into the hat!

ken

===========================================================================
  "Diplomacy is the weapon of the Civilized Warrior"
                                                       - Hun, A.T.
  Ken Gieselman                                    [EMAIL PROTECTED]
  System Administrator          http://www.endlessknot.com/~kgieselm
  Endlessknot Communications              http://www.endlessknot.com
===========================================================================
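To make the merge-table aggregation concrete, here is a minimal sketch of
how monthly MyISAM tables can be rolled up for the users (the member-table
names follow the raddata_YYYY_MM pattern from the thread, but the yearly
grouping and the trimmed column list are assumptions):

    -- Sketch only: the real tables would repeat all 2500 channel columns,
    -- and the MERGE definition must match its members column-for-column.
    create table raddata_2004 (
      granID integer not null,
      scanID tinyint unsigned not null,
      fpID   tinyint not null,
      c1     float not null,
      index (granID)
    ) type=MERGE union=(raddata_2004_09, raddata_2004_10);

Every query against the MERGE table opens all of its member tables, which
is exactly where the filehandle concern above comes from: 5x the tables
means 5x the open files behind each merge.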
Re: Partial Row Reads?
I think what Harald is saying (& if he's not, then I say this): you could
have an alternative table structure like this -- it should make queries
much quicker:

    create table raddata_2004_10_ONE (
      granID integer not null,
      scanID tinyint unsigned not null,
      fpID   tinyint not null,
      c1     float not null,
      ...
      c500   float not null
    )

    create table raddata_2004_10_TWO (
      granID integer not null,
      scanID tinyint unsigned not null,
      fpID   tinyint not null,
      c501   float not null,
      ...
      c1000  float not null
    )

But maybe with a more logical way of partitioning the columns among the
different tables.

Or, another option you've perhaps thought of: have a table like this --
it should make indexing much worse, but querying much better:

    create table raddata_2004_10_TWO (
      granID integer not null,
      scanID tinyint unsigned not null,
      fpID   tinyint not null,
      cID    smallint not null,
      cValue float not null,
      primary key (granID, scanID, fpID, cID),
      index (granID, cID)
    )

OLD QUERY: Select c1 from raddata_2004_10 where granID between 147568 and 15
NEW QUERY: Select cValue from raddata_2004_10 where granID between 147568
           and 15 and cID=1;   (should be v. fast)

---> Incidentally, I have a question: when you have a table like this with
a primary key which has a lot of columns, is there any performance benefit
to adding a new primary key as an auto-increment column, and keeping the
old primary key as a unique index? I thought maybe there might be some
addressing / hashing issues which worked out quicker?

Tom.
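A minimal sketch of the surrogate-key variant Tom asks about, applied to
his cID/cValue layout (the rowID column name is an assumption):

    -- Sketch: short auto-increment primary key, with the old composite
    -- primary key demoted to a unique index.
    create table raddata_2004_10 (
      rowID  bigint unsigned not null auto_increment,
      granID integer not null,
      scanID tinyint unsigned not null,
      fpID   tinyint not null,
      cID    smallint not null,
      cValue float not null,
      primary key (rowID),
      unique key (granID, scanID, fpID, cID)
    );

For MyISAM, every index (primary or not) points straight at a row's offset
in the data file, so the extra auto-increment key mostly just adds a second
index to maintain -- which fits Ken's skepticism in the reply above.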
Re: Partial Row Reads?
In article <[EMAIL PROTECTED]>, Ken Gieselman <[EMAIL PROTECTED]> writes:

> The second issue is query performance. It seems that regardless of which
> fields are selected, it reads the entire row? Since a monthly table
> averages 840GB, this takes a while, even on a well-organized query like
> 'Select c1 from raddata_2004_10 where granID between 147568 and 15'
> (granID tracks the granules from the start of the mission, and is a
> calculated value based on the date/time of the data in the file --
> incrementing over time. The 2500 granules in the example above represent
> about 1/3 of a month -- 10 days or so). Even making sure it's using the
> index to do the query (which it does anyhow, for the above query,
> according to explain), the current behavior makes it read 29548800
> complete rows, which is 275.4 GB of data to read, even though the
> desired return is about 1/2500th of that (112.7 MB).

Is there a kind of access pattern? Do you access c1 often in conjunction
with (e.g.) c2 and c3, but hardly ever with c4 or c5? If so, you might use
vertical partitioning in addition to the horizontal partitioning you're
currently doing.
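A minimal sketch of that vertical split (which channels group together is
an assumption; the point is that co-accessed channels share a narrow
table):

    -- Sketch: c1..c3 are assumed to be read together; the remaining
    -- channels would live in sibling tables with the same key.
    create table raddata_2004_10_grpA (
      granID integer not null,
      scanID tinyint unsigned not null,
      fpID   tinyint not null,
      c1     float not null,
      c2     float not null,
      c3     float not null,
      primary key (granID, scanID, fpID)
    );

    -- Single-group queries scan rows hundreds of times narrower than the
    -- full 2500-column record; cross-group queries join on the shared key
    -- (the range below is an arbitrary example, as is table _grpB):
    select a.c1, b.c4
    from raddata_2004_10_grpA a
    join raddata_2004_10_grpB b using (granID, scanID, fpID)
    where a.granID between 147568 and 147570;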
Re: Partial Row Reads?
Hi,

On Thursday, October 21, 2004, at 04:40 PM, [EMAIL PROTECTED] wrote:

> I don't think that he is worried about table scanning, he is worried
> about ROW scanning. Each of his rows is so large (2500*(size of float) +
> 3*(size of tinyint) + some other stuff) that just moving that much data
> around through his machine is consuming too much time.
>
> If you have a query that does something like this
>
>     SELECT onefield FROM hugetable
>     WHERE indexed_column = 12 AND non_indexed = 6
>
> odds are that you will be doing an indexed search. Say that for his
> indexed_column he gets 24M rows that match its condition out of the 16B
> rows in his table (this is way less than 30%). The next thing the engine
> has to do is to LOAD each of those 24M rows (every single column) into
> memory so that the value of the non_indexed column can be compared to 6.
> In order to perform that second comparison, the memory bus, the hard
> drives, and anything else related to reading records will have to
> transfer 275.4 GB of data (at least once) just so that he can get the
> value of the 1 column he specified in his SELECT statement out of each
> row that matches his two WHERE conditions.
>
> My idea is to create a way to ask the engine to check the value of the
> second field directly from the DISK copy of each table (without moving
> each row into local memory). If it matches, a second seek is performed
> to pull in the field(s) specified in his SELECT. Alternatively, we could
> create some way to ask the engine to pull only those columns through
> memory that participate in the SELECT clause or in one of the ON or
> WHERE conditions. This way we minimize how much data must be moved
> through memory to resolve one of these rather bulky queries. We could
> potentially create a new threshold value -- say, if you use less than
> 10% of the fields of a table in a query -- that would cause this
> optimization to kick in.
>
> If anyone else has any ideas (and I am sure there are many) on how to
> minimize disk traffic and memory throughput in order to handle his
> rather large dataset, this would be a great time to speak up. I know
> that this is a rather extreme case, but solving this issue may make
> MySQL just that much faster for the rest of us.

The problem with the above is that you will increase disk seeks by one per
row per column. A disk seek is much more expensive than a continuous read.
The only case where I could see this optimization being useful is if you
had a single column being read from a fixed-width table (then you could
compute the initial seek offset without adding any extra seeks). With 2
columns (which is your example above), it would be faster to do 1 disk
seek + read the entire row (10k in this case, with 2500 float fields) than
to do 2 disk seeks and read 8 bytes (2 columns). Moving data through CPU
and memory is trivial in cost compared to doing an extra disk seek.

Regards,

Harrison
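Rough numbers behind Harrison's seek argument (the hardware figures are
assumptions, typical of 2004-era disks: ~8 ms per random seek, ~50 MB/s
sequential transfer):

    1 seek + one full 10 KB row:   8 ms + 10 KB / 50 MB/s = 8 + 0.2 = ~8.2 ms
    2 seeks + 8 bytes of columns: 16 ms + ~0 ms                     =  ~16 ms

Per matching row, the "fetch only the two columns" plan takes roughly
twice as long, even though it moves over a thousand times less data --
seeks, not bytes, dominate.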
RE: Partial Row Reads?
Quoting [EMAIL PROTECTED]:

> Look at some my.cnf options. You can tell mysql to use keys more often
> than table scans with a var called max_seeks_keys=100 // something like
> that

Definitely. In fact, that's not really the issue at hand, since
max_seeks_for_key is already set to 1000 here. Shawn hit the nail right on
the head for the problem, below.

> I don't think that he is worried about table scanning, he is worried
> about ROW scanning. Each of his rows is so large (2500*(size of float) +
> 3*(size of tinyint) + some other stuff) that just moving that much data
> around through his machine is consuming too much time.

In the example case I posted earlier, 'Select c1 from raddata_2004_10
where granID between 147568 and 15' resolves to returning one column from
each of 29548800 rows. Since the users are actually querying against merge
tables of the monthly data sets, the problem is compounded quickly. We are
currently working with 2 years of data, with another 3-5 expected in due
time, and more if we get lucky with instrument lifetime. Doing full row
scans, even for queries where the rows are really limited by the WHERE
constraints, is time prohibitive.

> My idea is to create a way to ask the engine to check the value of the
> second field directly from the DISK copy of each table (without moving
> each row into local memory). If it matches, a second seek is performed
> to pull in the field(s) specified in his SELECT. Alternatively, we could
> create some way to ask the engine to pull only those columns through
> memory that participate in the SELECT clause or in one of the ON or
> WHERE conditions. This way we minimize how much data must be moved
> through memory to resolve one of these rather bulky queries. We could
> potentially create a new threshold value -- say, if you use less than
> 10% of the fields of a table in a query -- that would cause this
> optimization to kick in.
>
> If anyone else has any ideas (and I am sure there are many) on how to
> minimize disk traffic and memory throughput in order to handle his
> rather large dataset, this would be a great time to speak up. I know
> that this is a rather extreme case, but solving this issue may make
> MySQL just that much faster for the rest of us.

Sounds like a great approach. Ideally, with fixed-format rows, any given
column should be readily addressable with a couple of quick calculations.
Assuming that the indexes provide an offset to the matching row(s), the
column offset is simple addition from there: just the total of the lengths
of the previous columns (plus null flags and other formatting stuff, if
those are interspersed with the column data in the record). Row
fragmentation would confuse the issue a bit more, but according to the
manual, fixed-format rows shouldn't be subject to that.

This is definitely an edge case =)  One of the reasons we're experimenting
with a database solution is that the raw data files are already too much
to manage efficiently, particularly when it comes to subsetting the data
based on location, time, or detector values.

Thanks for all the input! I appreciate it deeply.

ken

===========================================================================
  "Diplomacy is the weapon of the Civilized Warrior"
                                                       - Hun, A.T.

  Ken Gieselman                                    [EMAIL PROTECTED]
  System Administrator          http://www.endlessknot.com/~kgieselm
  Endlessknot Communications              http://www.endlessknot.com
===========================================================================
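One mechanism that already exists gets partway to what Ken and Shawn
describe: if an index contains every column a query touches, MyISAM
answers the query from the index file alone and never reads the wide data
rows. A sketch for the single-channel query (the index name, and picking
c1 as a hot channel, are assumptions):

    -- Covering index: (granID, c1) holds everything the query needs,
    -- so EXPLAIN shows "Using index" and the 10 KB rows stay on disk.
    alter table raddata_2004_10 add index granID_c1 (granID, c1);

    -- Arbitrary example range:
    select c1 from raddata_2004_10 where granID between 147568 and 147570;

The cost is one extra index (key bytes plus a row pointer per row) for
each channel covered this way, so it only pays off for a handful of
frequently queried channels.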
RE: Partial Row Reads?
I don't think that he is worried about table scanning, he is worried about
ROW scanning. Each of his rows is so large (2500*(size of float) +
3*(size of tinyint) + some other stuff) that just moving that much data
around through his machine is consuming too much time.

If you have a query that does something like this

    SELECT onefield FROM hugetable
    WHERE indexed_column = 12 AND non_indexed = 6

odds are that you will be doing an indexed search. Say that for his
indexed_column he gets 24M rows that match its condition out of the 16B
rows in his table (this is way less than 30%). The next thing the engine
has to do is to LOAD each of those 24M rows (every single column) into
memory so that the value of the non_indexed column can be compared to 6.
In order to perform that second comparison, the memory bus, the hard
drives, and anything else related to reading records will have to transfer
275.4 GB of data (at least once) just so that he can get the value of the
1 column he specified in his SELECT statement out of each row that matches
his two WHERE conditions.

My idea is to create a way to ask the engine to check the value of the
second field directly from the DISK copy of each table (without moving
each row into local memory). If it matches, a second seek is performed to
pull in the field(s) specified in his SELECT. Alternatively, we could
create some way to ask the engine to pull only those columns through
memory that participate in the SELECT clause or in one of the ON or WHERE
conditions. This way we minimize how much data must be moved through
memory to resolve one of these rather bulky queries. We could potentially
create a new threshold value -- say, if you use less than 10% of the
fields of a table in a query -- that would cause this optimization to
kick in.

If anyone else has any ideas (and I am sure there are many) on how to
minimize disk traffic and memory throughput in order to handle his rather
large dataset, this would be a great time to speak up. I know that this is
a rather extreme case, but solving this issue may make MySQL just that
much faster for the rest of us.

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

"Dathan Vance Pattishall" <[EMAIL PROTECTED]> wrote on 10/21/2004
04:00:10 PM:

> > So, is there a faster way to insert/index the data? Would a different
> > table or index type improve performance?
>
> Use LOAD DATA INFILE ... IGNORE ... you might get a better insert speed.
> A different table and a different index / compound index would improve
> performance from an insert to a select perspective.
>
> > the above query, according to explain), the current behavior makes it
> > read 29548800 complete rows, which is 275.4 GB of data to read, even
> > though the desired return is about 1/2500th of that (112.7 MB).
>
> If a range covers more than 30% of the table, a table scan is performed
> instead of an index scan.
>
> > Any/all suggestions, comments, even flames are welcome :) Thanks in
> > advance!
> >
> > ken
>
> Look at some my.cnf options. You can tell mysql to use keys more often
> than table scans with a var called max_seeks_keys=100 // something like
> that
RE: Partial Row Reads?
DVP
--------------------------------------------------
Dathan Vance Pattishall       http://www.friendster.com

> -----Original Message-----
>
> So, is there a faster way to insert/index the data? Would a different
> table or index type improve performance?

Use LOAD DATA INFILE ... IGNORE ... you might get a better insert speed.
A different table and a different index / compound index would improve
performance from an insert to a select perspective.

> above query, according to explain), the current behavior makes it read
> 29548800 complete rows, which is 275.4 GB of data to read, even though
> the desired return is about 1/2500th of that (112.7 MB).

If a range covers more than 30% of the table, a table scan is performed
instead of an index scan.

> Any/all suggestions, comments, even flames are welcome :) Thanks in
> advance!
>
> ken

Look at some my.cnf options. You can tell mysql to use keys more often
than table scans with a var called max_seeks_keys=100 // something like
that
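A minimal sketch of the LOAD DATA suggestion (the file path and field
format are assumptions; IGNORE silently skips rows that would duplicate an
existing unique key):

    -- Sketch: bulk-load one granule file in a single statement instead of
    -- row-by-row INSERTs. For MyISAM, non-unique index maintenance can
    -- also be deferred around the load with DISABLE KEYS / ENABLE KEYS.
    alter table raddata_2004_10 disable keys;

    load data infile '/data/raddata/granule.csv'
      ignore into table raddata_2004_10
      fields terminated by ','
      lines terminated by '\n';

    alter table raddata_2004_10 enable keys;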