Re: Partial Row Reads?

2004-10-25 Thread Tom Cunningham
I think this is what Harald is saying (if he's not, then I'm saying it):

You could have an alternative table structure like this; it should
make queries much quicker:


create table raddata_2004_10_ONE (
 granID   integer not null,
 scanID   tinyint unsigned not null,
 fpID     tinyint not null,
 c1       float not null,
...
 c500     float not null
)

create table raddata_2004_10_TWO (
 granID   integer not null,
 scanID   tinyint unsigned not null,
 fpID     tinyint not null,
 c501     float not null,
...
 c1000    float not null
)

But maybe with a more logical way of partitioning the columns among
the different tables.
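
A query that needs channels from both halves would then join the two
tables on their shared key columns; something like this (the column
picks here are arbitrary, since I don't know how you'd group them):

select a.granID, a.scanID, a.fpID, a.c1, b.c501
from raddata_2004_10_ONE a
join raddata_2004_10_TWO b
  on a.granID = b.granID
 and a.scanID = b.scanID
 and a.fpID   = b.fpID
where a.granID = 147568;   -- or a granID range, as in the queries below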

Or, another option you've perhaps already thought of: a table like this.
It should make indexing much worse, but querying much better.

create table raddata_2004_10_TWO (
 granID   integer not null,
 scanID   tinyint unsigned not null,
 fpID     tinyint not null,
 cID      smallint not null,
 cValue   float not null,
  primary key (granID, scanID, fpID, cID),
  index (granID, cID)
)

OLD QUERY:
Select c1 from raddata_2004_10 where granID between 147568 and 15

NEW QUERY:
Select cValue from raddata_2004_10 where granID between 147568 and
15 and cID=1;
(should be v. fast)
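
And since the channel number is now data, pulling several channels at
once is just a longer IN list; a sketch (the channel numbers are made up
for the example):

select granID, scanID, fpID, cID, cValue
from raddata_2004_10
where granID = 147568            -- or a granID range, as above
  and cID in (1, 7, 42)          -- whichever channels you need
order by granID, scanID, fpID, cID;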

Incidentally, I have a question: when you have a table like this with a
primary key covering a lot of columns, is there any performance benefit
to adding a new primary key as an auto-increment column, and keeping the
old primary key as a unique index? I thought maybe there might be some
addressing / hashing issues which would work out quicker?
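
Something like this is what I have in mind, using the tall table above
as the example; just a sketch, with made-up names (rowID, natural_key),
and I haven't benchmarked whether it actually helps:

alter table raddata_2004_10_TWO
  drop primary key,
  add column rowID bigint unsigned not null auto_increment,
  add primary key (rowID),
  add unique key natural_key (granID, scanID, fpID, cID);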

Tom.

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



Re: Partial Row Reads?

2004-10-25 Thread Ken Gieselman
Quoting Tom Cunningham [EMAIL PROTECTED]:
I think what Harald is saying ( if he's not, then I say this):
You could have an alternative table structure like this: - it should
make queries much quicker:
create table raddata_2004_10_ONE (
 granID   integer not null,
 scanID   tinyint unsigned not null,
 fpID tinyint not null,
 c1   float not null,
...
 c500float not null,
)
create table raddata_2004_10_TWO (
 granID   integer not null,
 scanID   tinyint unsigned not null,
 fpID tinyint not null,
 c501   float not null,
...
 c1000float not null,
)
But maybe with a more logical way of partitioning the columns among
the different tables.
Yeah, we looked at doing something like this.  It would make the tables
smaller, though queries would be much more painful.  Unfortunately,
there's not really any common pattern for access to the various detector
values.  Each of the people analyzing the data has a different set that
they like to work with, depending on the information they're trying to
extract.  The merge tables would also require MASSIVE numbers of
filehandles, since it would add another 5x to the eventual number of
tables (all of the tables are aggregated over time via merge tables for
the users).
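
For reference, each of those aggregate tables is just a MERGE table over
the monthly MyISAM tables, roughly like the sketch below (table names
made up), which is why splitting the columns multiplies the filehandle
count:

create table raddata_all (
  granID   integer not null,
  scanID   tinyint unsigned not null,
  fpID     tinyint not null,
  c1       float not null,
   ... [BIG SNIP] ...
  c2500    float not null,
  index (granID, scanID, fpID)
) type=merge union=(raddata_2004_09, raddata_2004_10) insert_method=no;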

Or, another option, perhaps you've thought of, have a table like this:
should make indexing much worse, but querying much better.
create table raddata_2004_10_TWO (
 granID   integer not null,
 scanID   tinyint unsigned not null,
 fpID tinyint not null,
 cID   smallint not null,
 cValue   float not null,
  primary key (granID, scanID, fpID, cID)
  index (granID, cID)
)
OLD QUERY:
Select c1 from raddata_2004_10 where granID between 147568 and 15
NEW QUERY:
Select cValue from raddata_2004_10 where granID between 147568 and
15 and cID=1;
(should be v. fast)
Interesting :)  It would make reads a lot faster, but I see a couple of
downsides.  First, it means that I'm duplicating the 6 bytes worth of ID
fields (plus the row/index overhead) a LOT more -- the main problem with
this setup is the volume of data that I'm trying to address.  With 2500
columns, I get 12150 rows per granule, or just short of 3 million rows a
day.  With the structure you suggested, I'd get 7.29 billion rows a
day... wonder how the mysql engine would handle 235 billion rows per
month in a table.  Add the extra overhead, and I don't think we could
manage the storage requirements (it works out to just under 10TB a year
as it is now, with the other tables); this one table alone would come to
2.625 TB a month, unfortunately.
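
Back-of-the-envelope, for anyone who wants to check the math (taking
"just short of 3 million rows a day" as roughly 240 granules of 12150
rows each):

-- rows per day in the current wide layout
select 240 * 12150 as rows_per_day;                         -- 2916000
-- one row per channel multiplies that by 2500
select 240 * 12150 * 2500 as tall_rows_per_day;             -- 7290000000
-- 4 + 1 + 1 + 2 + 4 = 12 bytes of column data per tall row
select 235000000000 * 12 / pow(1024, 4) as tb_per_month;    -- roughly 2.56, before row/index overhead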

--- incidentally: I have a question: when you have a table like this
with a primary key which has a lot of columns, is there any
performance benefit to adding a new primary key, as an auto-increment
column,  keeping the old primary key as a unique index? I thought
maybe there might be some addressing / hashing issues which worked out
quicker?
Interesting idea.  Not sure what the gain would be, at least in this
case, however.  Most of the queries are based on channel values, or other
data like geolocation or time, and just related to this table by the 3 id
fields.  I'd be willing to give it a shot, but not sure what an
autoincrement field would gain when the queries aren't based on
insert-order.

Thanks for the insight!  Appreciate all the suggestions that you guys are
throwing into the hat!
ken
===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.
Ken Gieselman  [EMAIL PROTECTED]
System Administrator   http://www.endlessknot.com/~kgieselm
Endlessknot Communications   http://www.endlessknot.com
===

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]


Re: Partial Row Reads?

2004-10-22 Thread Harald Fuchs
In article [EMAIL PROTECTED],
Ken Gieselman [EMAIL PROTECTED] writes:

 The second issue is query performance.  It seems that regardless of
 what fields are selected, it reads the entire row?  Since a monthly
 table averages 840GB, this takes a while, even on a well-organized
 query like 'Select c1 from raddata_2004_10 where granID between
 147568 and 15' (granID tracks the granules from the start of the
 mission, and is a calculated value based on the date/time of the
 data in the file -- incrementing over time.  The 2500 granules in
 the example above represent about 1/3 of a month -- 10 days or so).
 Even making sure it's using the index to do the query (which it does
 anyhow, for the above query, according to explain), the current
 behavior makes it reads 29548800 complete rows, which is 275.4 GB of
 data to read, even though the desired return is about 1/2500th of
 that (112.7 MB).

Is there a kind of access pattern?  Do you access c1 often in
conjunction with (e.g.) c2 and c3, but hardly ever with c4 or c5?  If
so, you might use vertical partitioning in addition to the horizontal
partitioning you're currently doing.


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



Partial Row Reads?

2004-10-21 Thread Ken Gieselman
Hi Folks --

Ran into a couple performance issues, and looking for some optimization tips :)

I'm currently using MySQL 4.1.5-gamma, built from the bitkeeper tree a month or
so ago.  I have a table which is roughly 2500 columns by 91 million rows (I get
4 of these a month, from the data we're receiving).  The tables are MyISAM, and
look roughly like:

create table raddata_2004_10 (
  granID   integer not null,
  scanID   tinyint unsigned not null,
  fpID     tinyint not null,
  c1       float not null,
  c2       float not null,
  c3       float not null,
   [BIG SNIP]
  c2499    float not null,
  c2500    float not null,
  primary key (granID, scanID, fpID)
)

The first issue I'm having is indexing performance.  Rows are added to
the table in data granules, each of which adds just over 12,000 rows into
this table.  With the keys disabled, inserts take about 50 seconds.  With
keys enabled, the insert takes 120-125 seconds.  I can escape some of the
pain by just not promoting a table for the users (adding it to the larger
merge tables that the users issue queries against) until the entire
month's data has been added, and then enabling the keys.  On a table with
a full month's data, ALTER TABLE raddata_2004_10 ENABLE KEYS takes about
22 hours to complete.  Fortunately, it's a write-once setup, for the most
part, and most of the work of the database engine is selecting/subsetting
the data once it's in.

So, is there a faster way to insert/index the data?  Would a different
table or index type improve performance?

The second issue is query performance.  It seems that regardless of what fields
are selected, it reads the entire row?  Since a monthly table averages 840GB,
this takes a while, even on a well-organized query like 'Select c1 from
raddata_2004_10 where granID between 147568 and 15' (granID tracks the
granules from the start of the mission, and is a calculated value based on the
date/time of the data in the file -- incrementing over time.  The 2500 granules
in the example above represent about 1/3 of a month -- 10 days or so).  Even
making sure it's using the index to do the query (which it does anyhow, for the
above query, according to explain), the current behavior makes it read
29548800 complete rows, which is 275.4 GB of data to read, even though the
desired return is about 1/2500th of that (112.7 MB).

All of these tables are fixed-format rows.  Is there a configuration or code
tweak that would enable MySQL to just read the requested data pieces, or even a
smaller subset of each row?  With a fixed row format, each field offset should
be a relatively simple calculation within a row.

Any/all suggestions, comments, even flames are welcome :)  Thanks in advance!

ken
===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.

Ken Gieselman  [EMAIL PROTECTED]
System Administrator   http://www.endlessknot.com/~kgieselm
Endlessknot Communications   http://www.endlessknot.com
===





-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



RE: Partial Row Reads?

2004-10-21 Thread Dathan Vance Pattishall


DVP

Dathan Vance Pattishall http://www.friendster.com


 -Original Message-
 
 So, is there a faster way to insert/index the data?  Would a different
 table or
 index type improve performace?

Use LOAD DATA INFILE ... IGNORE; you might get a better insert speed that
way. A different table and a different index / compound index would
improve performance from both an insert and a select perspective.
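
For the LOAD DATA part, something like this (the path and field
terminator are made up; adjust them to however your granule exports are
laid out):

alter table raddata_2004_10 disable keys;

load data infile '/data/granules/2004_10/granule_147568.txt'
ignore into table raddata_2004_10
fields terminated by ',';

alter table raddata_2004_10 enable keys;   -- or defer this until the month is complete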



 
 above query, according to explain), the current behavior makes it reads
 29548800 complete rows, which is 275.4 GB of data to read, even though the
 desired return is about 1/2500th of that (112.7 MB).

If a range covers more than 30% of the table, a table scan is performed
instead of an index scan.
 

 
 
 Any/all suggestions, comments, even flames are welcoome :)  Thanks in
 advance!
 
 ken

Look at some my.cnf options. You can tell MySQL to prefer keys over
table scans with a variable called max_seeks_for_key=100 // something like that
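
Something like this, in my.cnf or (if the server allows it at runtime)
with SET GLOBAL:

# in my.cnf, under [mysqld]:
#   max_seeks_for_key = 100

-- or on a running server:
set global max_seeks_for_key = 100;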



 ==
 =
   Diplomacy is the weapon of the Civilized Warrior
 - Hun, A.T.
 
 Ken Gieselman
 [EMAIL PROTECTED]
 System Administrator
 http://www.endlessknot.com/~kgieselm
 Endlessknot Communications
 http://www.endlessknot.com
 ==
 =
 
 
 
 
 
 --
 MySQL General Mailing List
 For list archives: http://lists.mysql.com/mysql
 To unsubscribe:
 http://lists.mysql.com/[EMAIL PROTECTED]


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



RE: Partial Row Reads?

2004-10-21 Thread SGreen
I don't think that he is worried about table scanning; he is worried
about ROW scanning.  Each of his rows is so large (2500*(size of float) +
3*(size of tinyint) + some other stuff) that just moving that much data
around through his machine is consuming too much time.

If you have a query that does something like this

SELECT onefield
FROM hugetable
WHERE indexed_column = 12
AND non_indexed = 6

Odds are that you will be doing an indexed search. Say that for his 
indexed_column he gets 24M rows that match its condition out of the 16B 
rows on his table (this is way less than 30%). The next thing the engine 
has to do is to LOAD each of those 24M rows (every single column) into 
memory so that the value of the non_indexed column can be compared to 6. 
In order to perform that second comparison, the memory bus, the hard 
drives, and anything else related to reading records will have to transfer 
275.4 GB of data (at least once) just so that he can get the value from 
the 1 column he specified in his SELECT statement out of each row that 
matches his two WHERE conditions.

My idea is to create a way to ask the engine to check the value of the
second field directly from the DISK copy of each table (without moving
each row into local memory).  If it matches, a second seek is performed
to pull in the field(s) specified in his select.  Alternatively, we could
create some way that we can ask the engine to only pull those columns
through memory that either participate in the SELECT clause or one of the
other ON or WHERE conditions.  This way we minimize how much data must be
moved through memory to resolve one of these rather bulky queries.

We could potentially create a new threshold value, say... if you use less
than 10% of the fields on a table in a query, that would cause this
optimization to kick in.

If anyone else has any ideas (and I am sure there are many) on how to 
minimize disk traffic and memory throughput in order to handle his rather 
large dataset, this would be a great time to speak up. I know that this is 
a rather extreme case but solving this issue may make MySQL just that much 
faster for the rest of us.

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine



Dathan Vance Pattishall [EMAIL PROTECTED] wrote on 10/21/2004 
04:00:10 PM:

 
 
 DVP
 
 Dathan Vance Pattishall http://www.friendster.com
 
 
  -Original Message-
  
  So, is there a faster way to insert/index the data?  Would a different
  table or
  index type improve performace?
 
 Use Load data from infile .. IGNORE ...  u might get a better insert 
speed
 increase. A different table and a different index / compound index would
 improve performance from an insert to a select perspective.
 
 
 
  
  above query, according to explain), the current behavior makes it 
reads
  29548800 complete rows, which is 275.4 GB of data to read, even though 
the
  desired return is about 1/2500th of that (112.7 MB).
 
 If a range covers more then 30% of the table a table scan is performed,
 instead of an index scan.
 
 
  
  
  Any/all suggestions, comments, even flames are welcoome :)  Thanks in
  advance!
  
  ken
 
 Look at some my.cnf options. You can tell mysql to use keys more often 
the
 table scans with a var called max_seeks_keys=100 // something like that
 
 
 
  
==
  =
Diplomacy is the weapon of the Civilized Warrior
  - Hun, A.T.
  
  Ken Gieselman
  [EMAIL PROTECTED]
  System Administrator
  http://www.endlessknot.com/~kgieselm
  Endlessknot Communications
  http://www.endlessknot.com
  
==
  =
  
  
  
  
  
  --
  MySQL General Mailing List
  For list archives: http://lists.mysql.com/mysql
  To unsubscribe:
  http://lists.mysql.com/[EMAIL PROTECTED]
 
 
 -- 
 MySQL General Mailing List
 For list archives: http://lists.mysql.com/mysql
 To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]
 


RE: Partial Row Reads?

2004-10-21 Thread Ken Gieselman
Quoting [EMAIL PROTECTED]:
Look at some my.cnf options. You can tell mysql to use keys more often the
table scans with a var called max_seeks_keys=100 // something like that
Definitely.  In fact, that's not really the issue at hand, since
max_seeks_for_key is already set to 1000 here.  Shawn hit the nail right
on the head for the problem, below.

I don't think that he is worried about table scanning, he is worried about
ROW scanning.  Each of his rows is so large (2500*(size of float) +
3*(size of tinyint) + some other stuff) that just moving that much data
around through his machine is consuming too much time.
In the example case I posted earlier:
'Select c1 from raddata_2004_10 where granID between 147568 and 15'
resolves to returning one column from each of 29548800 rows.  Since the
users are actually querying against merge tables of the monthly data
sets, the problem is compounded quickly.  We are currently working with 2
years of data, with another 3-5 expected in due time, and more if we get
lucky with instrument lifetime.  Doing full row scans, even for queries
where the rows are really limited with the WHERE constraints, is time
prohibitive.

My idea is to create a way to ask the engine check the value of the second
field directly from the DISK copy of each table (without moving each row
into local memory) . If it matches a second seek is performed to pull in
the field(s) specified in his select .  Alternatively we could create some
way that we can ask the engine to only pull those columns through memory
that either participate in the SELECT clause or one of the other ON or
WHERE conditions.  This way we minimize how much data must be moved
through memory to resolve one of these rather bulky queries.
We could potentially create a new threshold value, say... if you use less
than 10% of the fields on a table in a query , that would cause this
optimization to kick in.
If anyone else has any ideas (and I am sure there are many) on how to
minimize disk traffic and memory throughput in order to handle his rather
large dataset, this would be a great time to speak up. I know that this is
a rather extreme case but solving this issue may make MySQL just that much
faster for the rest of us.
Sounds like a great approach.  Ideally, with fixed-format rows, any given
column should be readily addressable with a couple of quick calculations.
Assuming that the indexes provide an offset to the matching row(s), the
column offset should be simple addition from there, just the total of the
lengths of the previous columns (plus null flags and other formatting
stuff, if those are interspersed with the column data in the record).
Row fragmentation would confuse the issue a bit more, but according to
the manual, fixed-format rows shouldn't be subject to that.
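
Rough arithmetic for the wide table, assuming one status byte at the
front of each fixed-length MyISAM record and no null bitmap (every column
is NOT NULL); I haven't dug into the on-disk format to verify the exact
overhead:

-- offset of column cN within a row:
--   1 (record status byte) + 4 (granID) + 1 (scanID) + 1 (fpID) + (N-1) * 4
select 1 + 4 + 1 + 1 + (42 - 1) * 4 as approx_offset_of_c42;   -- 171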

This is definitely an edge case =)  One of the reasons we're
experimenting with a database solution is that the raw data files are
already too much to manage efficiently, particularly when it comes to
subsetting the data based on individual location, time, or detector
values.

Thanks for all the input!  I appreciate it deeply.
ken
===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.
Ken Gieselman  [EMAIL PROTECTED]
System Administrator   http://www.endlessknot.com/~kgieselm
Endlessknot Communications   http://www.endlessknot.com
===

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]


Re: Partial Row Reads?

2004-10-21 Thread Harrison
Hi,
On Thursday, October 21, 2004, at 04:40  PM, [EMAIL PROTECTED] wrote:
I don't think that he is worried about table scanning, he is worried 
about
ROW scanning.  Each of his rows is so large (2500*(size of float) +
3*(size of tinyint) + some other stuff) that just moving that much data
around through his machine is consuming too much time.

If you have a query that does something like this
SELECT onefield
FROM hugetable
WHERE indexed_column = 12
AND non_indexed = 6
Odds are that you will be doing an indexed search. Say that for his
indexed_column he gets 24M rows that match its condition out of the 16B
rows on his table (this is way less than 30%). The next thing the 
engine
has to do is to LOAD each of those 24M rows (every single column) into
memory so that the value of the non_indexed column can be compared to 
6.
In order to perform that second comparison, the memory bus, the hard
drives, and anything else related to reading records will have to 
transfer
275.4 GB of data (at least once) just so that he can get the value from
the 1 column he specified in his SELECT statement out of each row that
matches his two WHERE conditions.

My idea is to create a way to ask the engine check the value of the 
second
field directly from the DISK copy of each table (without moving each 
row
into local memory) . If it matches a second seek is performed to pull 
in
the field(s) specified in his select .  Alternatively we could create 
some
way that we can ask the engine to only pull those columns through 
memory
that either participate in the SELECT clause or one of the other ON or
WHERE conditions.  This way we minimize how much data must be moved
through memory to resolve one of these rather bulky queries.

We could potentially create a new threshold value, say... if you use 
less
than 10% of the fields on a table in a query , that would cause this
optimization to kick in.

If anyone else has any ideas (and I am sure there are many) on how to
minimize disk traffic and memory throughput in order to handle his 
rather
large dataset, this would be a great time to speak up. I know that 
this is
a rather extreme case but solving this issue may make MySQL just that 
much
faster for the rest of us.
The problem with the above is that you will increase disk seeks by one
per row per column.  A disk seek is much more expensive than a continuous
read.  The only case where I could see this optimization being useful is
if you had a single column being read from a fixed-width table (then you
could compute the initial seek offset without adding any extra ones).
With 2 columns (which is your above example), it would be faster to do 1
disk seek + read the entire row (10k in this case, with 2500 float
fields) than to do 2 disk seeks and read 8 bytes (2 columns).  Moving
data through CPU and memory is trivial in cost compared to doing an extra
disk seek.
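
To put rough numbers on it, assuming something like an 8 ms average seek
and 50 MB/s sequential transfer (ballpark figures, not measurements from
Ken's hardware):

-- one seek, then a sequential read of the whole ~10 KB row:
select 8 + 10006 / (50 * 1024 * 1024 / 1000) as ms_one_seek_whole_row;      -- about 8.2
-- two seeks, each followed by a 4-byte read:
select 2 * (8 + 4 / (50 * 1024 * 1024 / 1000)) as ms_two_seeks_two_columns; -- about 16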

Regards,
Harrison
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]