Re: Result row count *without* buffering all results?

2005-09-19 Thread Ken Gieselman

Harrison Fisk wrote:

Hi Ken,

On Sep 18, 2005, at 3:53 PM, Ken Gieselman wrote:


Scott Gifford wrote:


Ken Gieselman [EMAIL PROTECTED] writes:
[...]


So, here's my question: How can I find out how many rows are being
returned by a query, without using mysql_stmt_store_result() to
buffer all the rows back to the client?


One straightforward way is to replace the SELECT clause with just
SELECT COUNT(*) and execute that query first.
-ScottG.



Yeah, that's my fall-back option -- though the thought of executing 
the query twice is a bit daunting.  Some of the tables run into 
billions of rows per year (the merge tables anyhow, the data tables 
are broken down by month to keep them from becoming totally 
unmanageable), and a multi-year search can take a while to grind out.


Seems to me that by the time the first query execution is done, the 
server should *know* exactly how many rows are in the result set -- 
just need to find the magic trick that allows the client to query that 
number, rather than counting for itself as it buffers them.



The problem is that your assumption isn't always correct.  The server 
doesn't always know how many rows are in a result set before it starts 
returning rows back to the client.  Imagine if you did a simple SELECT * 
FROM tbl WHERE unindexed_col = 5; statement.  The server certainly isn't 
going to read in and count all the rows before beginning to send them back 
to the client, for the same reason that you don't want to do a 
store_result: it would take up too many resources.  The only way for the 
server to know for sure is to count the rows as they are being returned to 
the client.  (It could in theory do it for some statements, such as where 
it has to do a filesort, however an API that only sometimes worked 
depending on the execution plan wouldn't be very useful.)


Regards,

Harrison



That makes a fair amount of sense :)  I guess I'd just envisioned it as having a 
list of pointers to matches or something similar internally.  OK, given the 
limitations, is there an effective method to *estimate* what's likely to come 
back?  I've played a bit with pulling the numbers out of an EXPLAIN SELECT... 
but it seems that they tend to come in low, if the query is using indexes, or 
high (all rows) if not.
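In case it helps anyone digging through the archives later, the estimate can also be 
pulled out programmatically.  A rough sketch against an already-open connection (C API, 
error handling trimmed); the value is only the optimizer's guess, so treat it as a hint 
rather than a count:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mysql.h>

unsigned long long explain_row_estimate(MYSQL *conn, const char *select_sql)
{
    char sql[8192];
    unsigned long long estimate = 0;
    unsigned int i, nfields, rows_col;
    MYSQL_RES *res;
    MYSQL_FIELD *fields;
    MYSQL_ROW row;

    snprintf(sql, sizeof(sql), "EXPLAIN %s", select_sql);
    if (mysql_query(conn, sql) != 0)
        return 0;                          /* caller checks mysql_error() */

    res = mysql_store_result(conn);        /* EXPLAIN output is tiny */
    if (res == NULL)
        return 0;

    /* find the column named "rows" instead of hard-coding its position,
       since the EXPLAIN layout varies between server versions */
    nfields  = mysql_num_fields(res);
    fields   = mysql_fetch_fields(res);
    rows_col = nfields;
    for (i = 0; i < nfields; i++)
        if (strcmp(fields[i].name, "rows") == 0) { rows_col = i; break; }

    /* for a single-table SELECT the first EXPLAIN row carries the estimate;
       a join would need the product of the per-table estimates instead */
    row = mysql_fetch_row(res);
    if (row != NULL && rows_col < nfields && row[rows_col] != NULL)
        estimate = strtoull(row[rows_col], NULL, 10);

    mysql_free_result(res);
    return estimate;
}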


Appreciate the input guys! Thanks a bunch.

ken
--
===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.

Ken Gieselman   [EMAIL PROTECTED]
Endlessknot Communications   http://www.endlessknot.com
===




Re: Result row count *without* buffering all results?

2005-09-18 Thread Ken Gieselman

Scott Gifford wrote:

Ken Gieselman [EMAIL PROTECTED] writes:

[...]



So, here's my question: How can I find out how many rows are being
returned by a query, without using mysql_stmt_store_result() to
buffer all the rows back to the client?



One straightforward way is to replace the SELECT clause with just
SELECT COUNT(*) and execute that query first.

-ScottG.



Yeah, that's my fall-back option -- though the thought of executing the query 
twice is a bit daunting.  Some of the tables run into billions of rows per year 
(the merge tables anyhow, the data tables are broken down by month to keep them 
from becoming totally unmanageable), and a multi-year search can take a while to 
grind out.


Seems to me that by the time the first query execution is done, the server 
should *know* exactly how many rows are in the result set -- just need to find 
the magic trick that allows the client to query that number, rather than 
counting for itself as it buffers them.


Thanks!
ken
--
===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.

Ken Gieselman   [EMAIL PROTECTED]
Endlessknot Communications   http://www.endlessknot.com
===




Result row count *without* buffering all results?

2005-09-17 Thread Ken Gieselman

Hi Folks --

I'm working on a scientific data warehousing setup using MySQL (currently 
running 4.1.13 on SPARC/Solaris 9), and using the C prepared statements API for 
pulling numeric data in and out.


So, here's my question: How can I find out how many rows are being returned by a 
query, without using mysql_stmt_store_result() to buffer all the rows back to 
the client?  The problem with buffering it all back is that result sets are 
usually gigabytes, sometimes even terabytes -- there's just no way that most of 
the client systems can stuff that all into memory.  I need to find a decent way 
to break up the result sets into chunks, but doing that is tricky without 
knowing just how many rows are coming back :)
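For reference, the streaming shape I have in mind looks roughly like the sketch below: 
no mysql_stmt_store_result(), rows handled in fixed-size batches, and the exact count 
only known once the fetch loop finishes.  Column and chunk-size choices are placeholders:

#include <string.h>
#include <mysql.h>

#define CHUNK 100000            /* rows handled per batch -- tune to taste */

/* stmt is assumed to be already prepared (e.g. "SELECT c1 FROM ... WHERE
   granID BETWEEN ? AND ?") with its parameters bound */
unsigned long long stream_c1(MYSQL_STMT *stmt)
{
    float c1;
    my_bool c1_is_null;
    MYSQL_BIND out;
    unsigned long long total = 0, in_chunk = 0;

    if (mysql_stmt_execute(stmt) != 0)
        return 0;                       /* check mysql_stmt_error(stmt) */

    memset(&out, 0, sizeof(out));
    out.buffer_type = MYSQL_TYPE_FLOAT;
    out.buffer      = &c1;
    out.is_null     = &c1_is_null;
    if (mysql_stmt_bind_result(stmt, &out) != 0)
        return 0;

    /* no mysql_stmt_store_result(): rows stay server-side and stream in
       one at a time, so the client never holds more than one chunk */
    while (mysql_stmt_fetch(stmt) == 0) {
        /* ... hand c1 to whatever accumulates the current chunk ... */
        total++;
        if (++in_chunk == CHUNK) {
            /* ... flush/emit the finished chunk here ... */
            in_chunk = 0;
        }
    }
    /* ... flush the final partial chunk, if any ... */

    return total;       /* the exact row count, known only after the loop */
}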


I've tried playing with SQL_CALC_FOUND_ROWS in the queries, but I don't think I 
can execute a second query via the same database connection before dealing with 
the entire result set from the main query.  Is there a good way to execute 
multiple queries in parallel via one connection?
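The chunked pattern I've been toying with, for completeness: fetch one LIMIT-bounded 
chunk with SQL_CALC_FOUND_ROWS, drain it, then read FOUND_ROWS() on the same connection.  
The server still has to locate every matching row to produce the count, but only one 
chunk crosses the wire.  Rough sketch using the plain query interface, with made-up bounds:

#include <stdlib.h>
#include <mysql.h>

int first_chunk_with_total(MYSQL *conn, unsigned long long *total)
{
    MYSQL_RES *chunk, *res;
    MYSQL_ROW row;

    if (mysql_query(conn,
            "SELECT SQL_CALC_FOUND_ROWS c1 FROM raddata_2004_10 "
            "WHERE granID BETWEEN 147568 AND 149999 LIMIT 100000") != 0)
        return -1;

    chunk = mysql_store_result(conn);     /* only the LIMITed rows buffer up */
    if (chunk == NULL)
        return -1;
    /* ... process the first chunk ... */
    mysql_free_result(chunk);

    /* the first result set is finished, so the connection is free again */
    if (mysql_query(conn, "SELECT FOUND_ROWS()") != 0)
        return -1;
    res = mysql_store_result(conn);
    if (res == NULL)
        return -1;
    row = mysql_fetch_row(res);
    *total = (row && row[0]) ? strtoull(row[0], NULL, 10) : 0;
    mysql_free_result(res);

    /* later chunks: re-issue the query with LIMIT offset,count as needed */
    return 0;
}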


Suggestions, questions, even flames welcome.  Thanks in advance!
ken
===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.

Ken Gieselman   [EMAIL PROTECTED]
Endlessknot Communications   http://www.endlessknot.com
===




Complicated Subquery help

2005-03-01 Thread Ken Gieselman
Hiya Folks!

I'm struggling with a complicated subquery issue, and my SQL-fu isn't strong
enough to make a solution apparent.  Hopefully one of you experts can give me a
nudge in the right direction :)

The query utilizes data from just one table, which looks like:

idX  integer,
idY  tinyint unsigned,
idZ  tinyint,
c1   smallint,
c2   smallint,
c3   smallint,
 ... [and so on] ...


What I'd like to do is select a set of the channel data (the c* fields) from
rows where the standard deviation of a given channel, computed over the rows
near that one (near in terms of the id field values), is below a certain
threshold.

Getting the standard deviation on a single field for a given row isn't too bad:

select std(c2330) from radiances where idZ between 44 and 46 and idX between 12
and 14 and idY = 7;

That query gives the standard deviation for channel 2330 in the 9
footprints around the current point I'm looking at.  Selecting the channels I
need to analyze based on that should just be a matter of using that select
inside an IF statement.

The trick is, how to automate this to iterate over all the id* fields?  idX is
an auto_increment counter based on observation time, idY and idZ are tied to
specific observations within a given set, and are integer counters that vary
from 0-150 or so.

Is there a way to do this with one (or more) sql queries, or would I be better
off writing a specific program for doing the selection?
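For what it's worth, the closest I've gotten to a single-statement version is a 
self-join plus GROUP BY, sketched below from the C side.  Whether the optimizer 
copes with it at this table size is an open question:

#include <mysql.h>

/* the +/-1 window on idX/idZ, the fixed idY, channel c2330, and the 0.5
   cutoff are all placeholders; with a range condition in the join the
   optimizer can typically use an index on only one side, so this may
   still be slow on a table this size */
int low_variance_footprints(MYSQL *conn)
{
    const char *sql =
        "SELECT a.idX, a.idY, a.idZ, STD(b.c2330) AS neigh_std "
        "FROM radiances a JOIN radiances b "
        "  ON b.idY = a.idY "
        " AND b.idX BETWEEN a.idX - 1 AND a.idX + 1 "
        " AND b.idZ BETWEEN a.idZ - 1 AND a.idZ + 1 "
        "GROUP BY a.idX, a.idY, a.idZ "
        "HAVING neigh_std < 0.5";

    return mysql_query(conn, sql);   /* walk the result set as usual */
}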

Thanks for the help!
ken
===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.

Ken Gieselman  [EMAIL PROTECTED]
System Administrator   http://www.endlessknot.com/~kgieselm
===








Re: Partial Row Reads?

2004-10-25 Thread Ken Gieselman
Quoting Tom Cunningham [EMAIL PROTECTED]:
I think what Harald is saying ( if he's not, then I say this):
You could have an alternative table structure like this: - it should
make queries much quicker:
create table raddata_2004_10_ONE (
 granID   integer not null,
 scanID   tinyint unsigned not null,
 fpID tinyint not null,
 c1   float not null,
...
 c500 float not null,
)
create table raddata_2004_10_TWO (
 granID   integer not null,
 scanID   tinyint unsigned not null,
 fpID tinyint not null,
 c501   float not null,
...
 c1000float not null,
)
But maybe with a more logical way of partitioning the columns among
the different tables.
Yeah, we looked at doing something like this.  It would make the tables 
smaller,
though queries would be much more painful.  Unfortunately, there's not really
any common pattern for access to the various detector values.  Each of the
people analyzing the data has a different set that they like to work with,
depending on the information they're trying to extract.  The merge 
tables would
also require MASSIVE numbers of filehandles, since it would add another 5x to
the eventual number of tables (all of the tables are aggregated over time via
merge tables for the users).

Or, another option, perhaps you've thought of, have a table like this:
should make indexing much worse, but querying much better.
create table raddata_2004_10_TWO (
 granID   integer not null,
 scanID   tinyint unsigned not null,
 fpID tinyint not null,
 cID   smallint not null,
 cValue   float not null,
  primary key (granID, scanID, fpID, cID)
  index (granID, cID)
)
OLD QUERY:
Select c1 from raddata_2004_10 where granID between 147568 and 149999
NEW QUERY:
Select cValue from raddata_2004_10 where granID between 147568 and
149999 and cID=1;
(should be v. fast)
Interesting :)  It would make reads a lot faster, but I see a couple of
downsides.  First, it means that I'm duplicating the 6 bytes worth of 
ID fields
(plus the row/index overhead) a LOT more -- the main problem with this 
setup is
the volume of data that I'm trying to address.  With 2500 columns, I get 12150
rows per granule, or just short of 3 million rows a day.  With the structure
you suggested, I'd get 7.29 billion rows a day... wonder how the mysql engine
would handle 235 Billion rows per month in a table.  Add the extra overhead,
and I don't think we could manage the storage requirements (it works out to
just under 10TB a year as it is now, with the other tables) -- works out to
2.625 TB a month for just the one table this way, unfortunately.

--- incidentally: I have a question: when you have a table like this
with a primary key which has a lot of columns, is there any
performance benefit to adding a new primary key, as an auto-increment
column, and keeping the old primary key as a unique index? I thought
maybe there might be some addressing / hashing issues which worked out
quicker?
Interesting idea.  Not sure what the gain would be, at least in this case,
however.  Most of the queries are based on channel values, or other data like
geolocation or time, and just related to this table by the 3 id fields. 
 I'd be
willing to give it a shot, but not sure what an autoincrement field would gain
when the queries aren't based on insert-order.

Thanks for the insight!  Appreciate all the suggestions that you guys are
throwing into the hat!
ken
===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.
Ken Gieselman  [EMAIL PROTECTED]
System Administrator   http://www.endlessknot.com/~kgieselm
Endlessknot Communications   http://www.endlessknot.com
===



Partial Row Reads?

2004-10-21 Thread Ken Gieselman
Hi Folks --

Ran into a couple performance issues, and looking for some optimization tips :)

I'm currently using MySQL 4.1.5-gamma, built from the bitkeeper tree a month or
so ago.  I have a table which is roughly 2500 columns by 91 million rows (I get
4 of these a month, from the data we're receiving).  The tables are MyISAM, and
look roughly like:

create table raddata_2004_10 (
  granID   integer not null,
  scanID   tinyint unsigned not null,
  fpID tinyint not null,
  c1   float not null,
  c2   float not null,
  c3   float not null,
   [BIG SNIP]
  c2499float not null,
  c2500float not null,
  primary key (granID, scanID, fpID)
)

The first issue I'm having is indexing performance.  Rows arrive in data
granules, each of which adds just over 12,000 rows to this table.  With the
keys disabled, inserting a granule takes about 50 seconds.  With keys
enabled, the insert takes 120-125 seconds.  I can escape some of the pain by
just not promoting a table for the users (adding it to the larger merge
tables that the users issue queries against) until the entire month's data has
been added, and then enabling the keys.  On a table with a full month's data,
ALTER TABLE raddata_2004_10 ENABLE KEYS takes about 22 hours to complete. 
Fortunately, it's a write-once setup, for the most part, and most of the work of
the database engine is selecting/subsetting the data once it's in.

So, is there a faster way to insert/index the data?  Would a different table or
index type improve performance?
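For reference, the load pattern I'm experimenting with is sketched below: keys disabled 
for the whole month, bigger MyISAM buffers, and a single ENABLE KEYS at the end.  The 
buffer sizes are guesses, and whether those variables are settable per-session on a 
given server version should be checked:

#include <mysql.h>

int load_month(MYSQL *conn)
{
    /* bigger buffers help both the bulk inserts and the final key build;
       myisam_max_sort_file_size may also need raising (my.cnf or SET GLOBAL)
       so the rebuild can "repair by sorting" instead of via the key cache */
    if (mysql_query(conn, "SET SESSION bulk_insert_buffer_size = 256*1024*1024") ||
        mysql_query(conn, "SET SESSION myisam_sort_buffer_size = 512*1024*1024") ||
        mysql_query(conn, "ALTER TABLE raddata_2004_10 DISABLE KEYS"))
        return -1;

    /* ... insert every granule for the month here ... */

    /* one key build for the whole month, then promote the table into the
       merge table the users actually query */
    return mysql_query(conn, "ALTER TABLE raddata_2004_10 ENABLE KEYS");
}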

The second issue is query performance.  It seems that regardless of what fields
are selected, it reads the entire row?  Since a monthly table averages 840GB,
this takes a while, even on a well-organized query like 'Select c1 from
raddata_2004_10 where granID between 147568 and 149999' (granID tracks the
granules from the start of the mission, and is a calculated value based on the
date/time of the data in the file -- incrementing over time.  The 2500 granules
in the example above represent about 1/3 of a month -- 10 days or so).  Even
making sure it's using the index to do the query (which it does anyhow, for the
above query, according to explain), the current behavior means it reads
29548800 complete rows, which is 275.4 GB of data to read, even though the
desired return is about 1/2500th of that (112.7 MB).

All of these tables are fixed-format rows.  Is there a configuration or code
tweak that would enable MySQL to just read the requested data pieces, or even a
smaller subset of each row?  With a fixed row format, each field offset should
be a relatively simple calculation within a row.

Any/all suggestions, comments, even flames are welcome :)  Thanks in advance!

ken
===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.

Ken Gieselman  [EMAIL PROTECTED]
System Administrator   http://www.endlessknot.com/~kgieselm
Endlessknot Communications   http://www.endlessknot.com
===








RE: Partial Row Reads?

2004-10-21 Thread Ken Gieselman
Quoting [EMAIL PROTECTED]:
Look at some my.cnf options. You can tell mysql to use keys more often than
table scans with a var called max_seeks_keys=100 // something like that
Definitely.  In fact, that's not really the issue at hand, since
max_seeks_for_key is already set to 1000 here.  Shawn hit the nail 
right on the
head for the problem, below.

I don't think that he is worried about table scanning, he is worried about
ROW scanning.  Each of his rows is so large (2500*(size of float) +
3*(size of tinyint) + some other stuff) that just moving that much data
around through his machine is consuming too much time.
In the example case I posted earlier:
'Select c1 from raddata_2004_10 where granID between 147568 and 149999'
resolves to returning one column from each of 29548800 rows.  Since the users
are actually querying against merge tables of the monthly data sets, the
problem is compounded quickly.  We are currently working with 2 years of data,
with another 3-5 expected in due time, and more if we get lucky with 
instrument
lifetime.  Doing full row scans, even for queries where the rows are really
limited with the WHERE constraints, is time prohibitive.

My idea is to create a way to ask the engine to check the value of the second
field directly from the DISK copy of each table (without moving each row
into local memory).  If it matches, a second seek is performed to pull in
the field(s) specified in his select.  Alternatively we could create some
way that we can ask the engine to only pull those columns through memory
that either participate in the SELECT clause or one of the other ON or
WHERE conditions.  This way we minimize how much data must be moved
through memory to resolve one of these rather bulky queries.
We could potentially create a new threshold value, say... if you use less
than 10% of the fields on a table in a query, that would cause this
optimization to kick in.
If anyone else has any ideas (and I am sure there are many) on how to
minimize disk traffic and memory throughput in order to handle his rather
large dataset, this would be a great time to speak up. I know that this is
a rather extreme case but solving this issue may make MySQL just that much
faster for the rest of us.
Sounds like a great approach.  Ideally, with fixed-format rows, any 
given column
should be readily addressable with a couple of quick calculations.  Assuming
that the indexes provide an offset to the matching row(s), the column offset
should be simple addition from there, just the total of the lengths of the
previous columns (plus null flags and other formatting stuff, if those are
interspersed with the column data in the record).  Row fragmentation would
confuse the issue a bit more, but according to the manual, fixed-format rows
shouldn't be subject to that.
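Just to make the "simple addition" concrete, here's the arithmetic for our layout.  This 
is only an illustration -- real MyISAM fixed-format records also carry a record header 
and null/deleted flags, so the true on-disk offsets are shifted:

#include <stddef.h>

/* layout mirrors the raddata tables */
struct rad_row_layout {
    int            granID;      /* 4 bytes */
    unsigned char  scanID;      /* 1 byte  */
    signed char    fpID;        /* 1 byte  */
    float          c[2500];     /* c1..c2500, 4 bytes each */
};

/* byte offset of channel n (1-based) from the start of a row */
static size_t channel_offset(int n)
{
    return offsetof(struct rad_row_layout, c) + (size_t)(n - 1) * sizeof(float);
}
/* channel_offset(1) is 6 for a packed record; a compiler will normally pad
   the 6-byte header up to 8 to keep the floats aligned */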

This is definitely an edge case =)  One of the reasons we're 
experimenting with
a database solution is that the raw data files are already too much to manage
efficiently, particularly when it comes to subsetting the data based on
individual location, time, or detector values.

Thanks for all the input!  I appreciate it deeply.
ken
===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.
Ken Gieselman  [EMAIL PROTECTED]
System Administrator   http://www.endlessknot.com/~kgieselm
Endlessknot Communications   http://www.endlessknot.com
===



Prepared Statement questions/issues

2004-08-18 Thread Ken Gieselman
Hi Folks --

I'm running into some issues with large prepared statements, and would love some
feedback about approaches or tweaks that I might have overlooked :)

I'm working on a large-scale database for scientific data -- lots of individual
detector values and whatnot, that unfortunately need to be independently
addressable in queries.  Makes it more fun, I guess :)  In any case, the table
structure I'm working with looks like:

create table raddata (
  granIDinteger not null,
  scanIDtinyint unsigned not null,
  fpID  tinyint not null,
  c1smallint,
  c2smallint,
  c3smallint,
   [and so on and so forth]
  c2376 smallint,
  c2377 smallint,
  c2378 smallint,
  primary key (fpID, scanID, granID)
);

Overall, it works out to 2381 fields, with an aggregate length of 5058 bytes per
row.

The data I'm working with comes from instrument data files, which I've pulled
apart and stuffed into C structures that look fairly similar:

typedef struct {
int granID;
u_char  scanID;
charfpID;
short   channels[2378];
} radRow;

From our instrument data, I get about 3 million of these rows a day, and it's
constantly gathering data.  Each of my data granules has 12150 rows for this
table, and inserts are pretty slow.

Using single-row prepared statements, the inserts take about 70 seconds, from
the in-memory structs.  I've tried globbing them together by doing multiple-row
inserts (INSERT into raddata values (...), (...)), and have found some
interesting limitations.  I can do 3 rows' worth of insert in one statement
(15174 bytes of data, query string with ?'s is 14336 bytes).  4+ rows (20232
bytes of data, query string with ?'s for params is 19100 bytes) causes a
SIGBUS, and the ingest app dies in a ball of fire.

The table is locked for writes before any inserts, and unlocked immediately
after, so I don't think it's thread contention for the table.  Using the groups
of 3 rows per insert is actually a little slower, running 75-80 seconds.

I'm currently running both the ingest app and the mysql server (4.1.3-beta,
built from source) on a Sun Ultra 60 running Solaris 9.

Any suggestions as to how to do this more efficiently/faster?  Or how to
increase the buffer space for the prepared statements, so as to do the inserts
in larger groups?
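One guess, for anyone who's hit this: a bind array for 4+ rows is roughly 9,500 
MYSQL_BIND structs, which is getting big for a stack local, and SPARC also raises 
SIGBUS on misaligned buffer pointers.  The sketch below heap-allocates the binds and 
points them straight at the (naturally aligned) struct fields; if it turns out the 
server is rejecting the bigger packets instead, max_allowed_packet on both client 
and server is the other knob.  Names and counts are placeholders:

#include <stdlib.h>
#include <mysql.h>

#define NCHAN        2378
#define COLS_PER_ROW (3 + NCHAN)          /* 2381 '?' markers per row */

typedef struct {
    int            granID;
    unsigned char  scanID;
    char           fpID;
    short          channels[NCHAN];
} radRow;

/* stmt is assumed to be prepared with nrows groups of 2381 parameter markers */
int insert_rows(MYSQL_STMT *stmt, radRow *rows, int nrows)
{
    MYSQL_BIND *bind;
    int r, c, rc;

    bind = calloc((size_t)nrows * COLS_PER_ROW, sizeof(MYSQL_BIND));
    if (bind == NULL)
        return -1;

    for (r = 0; r < nrows; r++) {
        MYSQL_BIND *b = bind + (size_t)r * COLS_PER_ROW;
        b[0].buffer_type = MYSQL_TYPE_LONG;  b[0].buffer = &rows[r].granID;
        b[1].buffer_type = MYSQL_TYPE_TINY;  b[1].buffer = &rows[r].scanID;
        b[1].is_unsigned = 1;
        b[2].buffer_type = MYSQL_TYPE_TINY;  b[2].buffer = &rows[r].fpID;
        for (c = 0; c < NCHAN; c++) {
            b[3 + c].buffer_type = MYSQL_TYPE_SHORT;
            b[3 + c].buffer      = &rows[r].channels[c];
        }
    }

    rc = mysql_stmt_bind_param(stmt, bind);
    if (rc == 0)
        rc = mysql_stmt_execute(stmt);
    free(bind);
    return rc;
}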

Thanks in advance!

Ken
===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.

Ken Gieselman   [EMAIL PROTECTED]
Endlessknot Communications   http://www.endlessknot.com
===








count() on multiple similar tables?

2004-05-04 Thread Ken Gieselman

Heya folks --

Trying to come up with a way to count across multiple tables, and failing
miserably.  I need a simple way, preferably better than looped queries, of
summing the number of rows in multiple tables.

Example:

I have multiple tables with the same column layout, due to the amount of data
expected to land in each one, broken up by year/month.  This is simplified
some, but should serve to illustrate the issue.

create table info_2004_03 (
  itemID   integer auto_increment NOT NULL,
  eventID  integer NOT NULL,
  eventNamechar(40),
  primary key (itemID)
);

create table info_2004_04 (
  itemID   integer auto_increment NOT NULL,
  eventID  integer NOT NULL,
  eventNamechar(40),
  primary key (itemID)
);

I need to get a total number of itemIDs over all the info_ tables.  I could just
query each of the tables individually and add the results, but looking for a
more graceful way of doing it, hopefully.
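The stopgap I'd considered, for reference: a single UNION ALL of per-table counts, 
summed on the client, so it's still one round trip.  Table names and the open 
connection are assumptions:

#include <stdlib.h>
#include <mysql.h>

long long total_items(MYSQL *conn)
{
    MYSQL_RES *res;
    MYSQL_ROW row;
    long long total = 0;

    if (mysql_query(conn,
            "SELECT COUNT(*) FROM info_2004_03 "
            "UNION ALL "
            "SELECT COUNT(*) FROM info_2004_04") != 0)
        return -1;

    res = mysql_store_result(conn);       /* one row per table */
    if (res == NULL)
        return -1;

    while ((row = mysql_fetch_row(res)) != NULL)
        total += strtoll(row[0], NULL, 10);
    mysql_free_result(res);
    return total;
}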

Thanks in advance!

ken
===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.
 
Ken Gieselman   [EMAIL PROTECTED]
System Administratorhttp://www.endlessknot.com/~ken
Endlessknot Communications   http://www.endlessknot.com
===






Re: count() on multiple similar tables?

2004-05-04 Thread Ken Gieselman
Great!  Thanks for the quick pointer!
ken

Quoting Diana Soares [EMAIL PROTECTED]:

 You may use MERGE TABLES:
 
 http://dev.mysql.com/doc/mysql/en/MERGE.html
 
 -- 
 Diana Soares
 
 On Tue, 2004-05-04 at 10:08, Ken Gieselman wrote:
  Heya folks --
  
  Trying to come up with a way to count across multiple tables, and failing
  miserably.  I need a simple way, preferably better than looped queries, of
  summing the number of rows in multiple tables.
  
  Example:
  
  I have multiple tables with the same column layout, due to the amount of
 data
  expected to land in each one, broken up by year/month.  This is simplified
  some, but should serve to illustrate the issue.
  
  create table info_2004_03 (
itemID   integer auto_increment NOT NULL,
eventID  integer NOT NULL,
eventNamechar(40),
primary key (itemID)
  );
  
  create table info_2004_04 (
itemID   integer auto_increment NOT NULL,
eventID  integer NOT NULL,
eventNamechar(40),
primary key (itemID)
  );
  
  I need to get a total number of itemIDs over all the info_ tables.  I could
 just
  query each of the tables individually and add the results, but looking for
 a
  more graceful way of doing it, hopefully.
  
  Thanks in advance!
  
  ken
 
 ===
Diplomacy is the weapon of the Civilized Warrior
  - Hun, A.T.
   
  Ken Gieselman  
 [EMAIL PROTECTED]
  System Administrator   
 http://www.endlessknot.com/~ken
  Endlessknot Communications  
 http://www.endlessknot.com
 
 ===
 
 
 


===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.
 
Ken Gieselman   [EMAIL PROTECTED]
System Administratorhttp://www.endlessknot.com/~ken
Endlessknot Communications   http://www.endlessknot.com
===






Reading Binary Data

2004-03-27 Thread Ken Gieselman
Hi Folks --

Been working on this issue for a while, and not making much headway, so I 
thought I'd throw it out to the list and see what wisdom came back :)

Problem:  I'm trying to find a way to store LOTS of scientific data (mostly
32-bit integers and floating point numbers) in a database for analysis work.
The instrument(s) we're working with generate very large volumes of numeric
data, roughly 30GB per day.

What I'd like to do is have a method that reads through the data files produced
and stores the data in tables, making it easy to subset and index the data.
OK, doesn't sound too bad off the bat.

Here's the issue:  Is there a way to store data in BINARY numeric formats
into the database (without converting to strings first)?  

In two or three of my tables, rows contain 2500-3000 individual numbers, which
would be a logistical nightmare to throw into a query string, not to mention the
potential loss of accuracy from rounding errors in floating point numbers.
I thought that LOAD DATA INFILE was the next solution, but it appears that, when
using fixed-row format, it uses the DISPLAY sizes of the fields rather than the
byte storage requirements for them.
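One route that looks promising (assuming a 4.1 client and server) is the
prepared-statement API, which ships parameters in the binary wire format, so ints and
floats go across as their native 4-byte values with no string conversion or rounding on
the way.  A minimal sketch with placeholder table and column names:

#include <string.h>
#include <mysql.h>

int insert_sample(MYSQL *conn, int obs_id, float value)
{
    MYSQL_STMT *stmt;
    MYSQL_BIND  bind[2];
    const char *sql = "INSERT INTO samples (obsID, val) VALUES (?, ?)";
    int rc = -1;

    stmt = mysql_stmt_init(conn);
    if (stmt == NULL)
        return -1;
    if (mysql_stmt_prepare(stmt, sql, strlen(sql)) != 0)
        goto done;

    memset(bind, 0, sizeof(bind));
    bind[0].buffer_type = MYSQL_TYPE_LONG;   /* 4-byte int, sent as-is   */
    bind[0].buffer      = &obs_id;
    bind[1].buffer_type = MYSQL_TYPE_FLOAT;  /* 4-byte IEEE float, as-is */
    bind[1].buffer      = &value;

    if (mysql_stmt_bind_param(stmt, bind) == 0 &&
        mysql_stmt_execute(stmt) == 0)
        rc = 0;

done:
    mysql_stmt_close(stmt);
    return rc;
}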

Anyone else out there dealing with large volumes of numeric data?

Thanks in advance,
Ken
===
  Diplomacy is the weapon of the Civilized Warrior
- Hun, A.T.

Ken Gieselman  [EMAIL PROTECTED]
System Administrator   http://www.endlessknot.com/~kgieselm
Endlessknot Communications   http://www.endlessknot.com
===
