file i/o operations...

2006-08-25 Thread bruce
Hi...

I'm trying to determine which is the better approach: should an app do a great deal of file I/O, or should it do a great deal of reads/writes to a MySQL db?

My test app will spawn a large number of child processes, 1000s of them running simultaneously, and each child process will create data. That data will ultimately need to be inserted into a db.

Approach 1
---
If I have each child app write to a file, I'm going to take a serious hit on the disk for the file I/O, but I'm pretty sure CentOS/RHEL could handle it. (Although, to be honest, I don't know whether there's a limit on the number of file descriptors the OS allows to be open at the same time.) I'm assuming that number is orders of magnitude larger than the number of simultaneous connections I can have with a db.

I could then have a process/app collect the information from each output file, write it to the db, and delete the output files as required.
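
For illustration, a minimal sketch of this approach in Python follows. It assumes the MySQLdb driver, a made-up page_data(url, body) table, and spool files under /var/spool/myapp; none of those names come from the thread.

# Approach 1 sketch: children spool rows to flat files, a collector bulk-loads them.
import glob, os
import MySQLdb

SPOOL_DIR = "/var/spool/myapp"

def child_write(rows):
    # Each child appends tab-delimited records to its own spool file.
    path = os.path.join(SPOOL_DIR, "spool.%d.txt" % os.getpid())
    f = open(path, "a")
    for url, body in rows:
        # one record per line; tabs/newlines stripped so the load stays clean
        f.write("%s\t%s\n" % (url, body.replace("\t", " ").replace("\n", " ")))
    f.close()

def collector():
    # A single collector bulk-loads each spool file, then deletes it.
    conn = MySQLdb.connect(host="localhost", user="app", passwd="secret",
                           db="crawl", local_infile=1)
    cur = conn.cursor()
    for path in glob.glob(os.path.join(SPOOL_DIR, "spool.*.txt")):
        cur.execute("LOAD DATA LOCAL INFILE %s INTO TABLE page_data "
                    "FIELDS TERMINATED BY '\\t' (url, body)", (path,))
        conn.commit()
        os.remove(path)
    conn.close()

The sketch glosses over how the collector knows a file is complete; in practice the children would write to a temporary name and rename when done.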

Approach 2
--
I could have each child app write to a local db, with each child waiting to get the next open db connection. This is limited, since I'd run into the max connection limit for the db. I'd also have to implement a process to get the information from the local db into the master db.

Approach 3
---
I could have each child app write directly to the db. The problem with this approach is that the db has a maximum number of simultaneous connections, based on system resources. Otherwise this would be the cleanest solution.


So... does anybody have any thoughts/comments on how one can essentially accept 1000s of simultaneous hits with an app?

I've been trying to find out whether there's any kind of distributed parent/child/tiered app, where information/data is more or less collected and received at the node level.

Does anyone know of a way to create a distributed kind of db app, where I can enter information into a db on a given server and the information is essentially pulled into the master server from the child server?



thanks

-bruce


Re: file i/o operations...

2006-08-25 Thread Brent Baisley
Just getting that number of processes running would, I think, be a challenge. A setup I recently worked on runs a few hundred processes per box, and that pretty much maxes out the CPU.


Approach 1, been there, done that. Too messy.

Approach 2, considered it, but you may end up with processes that never connect. You would need a queueing/scheduling mechanism. Essentially you would be trying to do what an OS does: manage resources to make sure every process gets its turn.


Approach 3 is what we currently use. Each process connects to the db, does a bulk insert and then disconnects. We decided to limit each process to blocks of 100; inserting a single record at a time will quickly degrade. This setup actually moved the bottleneck from the database to the processes doing their job. When each process starts, it inserts a record into a table and gets its id. The process then handles the autoincrement value itself: the unique id for each record is the process id plus the increment value.
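
A rough sketch of that pattern in Python, assuming the MySQLdb driver and invented table names (processes, page_data), since the real schema isn't shown in the thread:

# Each worker registers itself once, keeps its own counter, and only holds a
# connection long enough to insert one block of rows.
import MySQLdb

BATCH_SIZE = 100

def register_process():
    # At startup, insert one row into a 'processes' table and keep its id.
    conn = MySQLdb.connect(host="dbhost", user="app", passwd="secret", db="crawl")
    cur = conn.cursor()
    cur.execute("INSERT INTO processes (started) VALUES (NOW())")
    conn.commit()
    proc_id = cur.lastrowid
    conn.close()
    return proc_id

def flush_batch(proc_id, counter, batch):
    # Connect, bulk-insert one block of up to BATCH_SIZE rows, disconnect.
    conn = MySQLdb.connect(host="dbhost", user="app", passwd="secret", db="crawl")
    cur = conn.cursor()
    rows = []
    for url, body in batch:
        counter += 1
        # the unique key is (process id, per-process counter) rather than a
        # single global auto-increment column
        rows.append((proc_id, counter, url, body))
    cur.executemany("INSERT INTO page_data (proc_id, seq, url, body) "
                    "VALUES (%s, %s, %s, %s)", rows)
    conn.commit()
    conn.close()
    return counter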


To really scale, you may want to look into the BLACKHOLE table format. Essentially it's a black hole: nothing is saved, so there really isn't much overhead, but if you set it up to be replicated, a replication log is still generated. An easy setup would be to have multiple tables on a master server, each one replicating a BLACKHOLE table from another server. Then create a MERGE table encompassing the multiple tables for easy querying.

This is the next idea we are pursuing, so it may or may not work.
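
For concreteness, a rough sketch of how that layout might look (table and column names are invented, and the replication link from each crawl box to the central box has to be configured separately and isn't shown):

# On each crawl box: inserts go to a BLACKHOLE table, so nothing is stored
# locally, but the statements still land in the binary log for replication.
# How each box's table maps to its own copy on the central server (e.g. one
# database per box) is glossed over here.
import MySQLdb

NODE_DDL = ("CREATE TABLE page_data (url VARCHAR(255) NOT NULL, body MEDIUMTEXT) "
            "ENGINE=BLACKHOLE")

# On the central box: one real (MyISAM) copy per crawl box, filled by
# replication, plus a MERGE table that unions them for querying.
CENTRAL_DDL = [
    "CREATE TABLE page_data_node1 (url VARCHAR(255) NOT NULL, body MEDIUMTEXT) ENGINE=MyISAM",
    "CREATE TABLE page_data_node2 (url VARCHAR(255) NOT NULL, body MEDIUMTEXT) ENGINE=MyISAM",
    "CREATE TABLE page_data_all (url VARCHAR(255) NOT NULL, body MEDIUMTEXT) "
    "ENGINE=MERGE UNION=(page_data_node1, page_data_node2) INSERT_METHOD=NO",
]

def create_central_tables():
    conn = MySQLdb.connect(host="central", user="app", passwd="secret", db="crawl")
    cur = conn.cursor()
    for ddl in CENTRAL_DDL:
        cur.execute(ddl)
    conn.close()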

Re: file i/o operations...

2006-08-25 Thread William R. Mussatto
A couple of comments:
- Simultaneous connections can be increased, but at some point the user that runs the mysqld process will run out of file handles it can allocate (each table takes 2 or 3); see the sketch after these comments.
- If we are talking about the database server and the test server being the same box, then what are you trying to test? Once you exceed the number of processors on the box, the OS will just queue up the various processes, and that will be the limit of scalability. Unless you overlap real I/O with computation there is not much gain beyond a certain point. When you run out of memory for the processes, it's page-to-disk time (not a pleasant sight).
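
Both limits are easy to inspect; a small sketch, assuming the MySQLdb driver and credentials for a local server:

# Print the OS file-descriptor limit and the MySQL connection/file budgets.
import resource
import MySQLdb

# OS side: how many file descriptors this process/user may hold open.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("fd limit: soft=%d hard=%d" % (soft, hard))

# MySQL side: the connection cap, the server's own open-file budget, and
# the high-water mark of connections actually used so far.
conn = MySQLdb.connect(host="localhost", user="app", passwd="secret")
cur = conn.cursor()
for var in ("max_connections", "open_files_limit"):
    cur.execute("SHOW VARIABLES LIKE %s", (var,))
    print(cur.fetchone())
cur.execute("SHOW GLOBAL STATUS LIKE 'Max_used_connections'")
print(cur.fetchone())
conn.close()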

Not sure what you are testing here.

BTW: please explain the 'black hole table'.

Just my $0.10 worth.

Bill


RE: file i/o operations...

2006-08-25 Thread bruce
Hi Brent,

Here's what I'm playing around with...

I'm writing a fairly limited web parsing/scraping app. Rather than use a sequential process, which is time consuming, I've created and tested a kind of parallel app that quickly spawns a child app for each URL I need to fetch. This can quickly generate 1000s of child processes, each of which is fetching a given page. I know this could easily kill a web server, so the app limits the workload on any one server; however, since the app covers multiple (100s of) sites, it can still have 1000s of pages being fetched at once.
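
A rough sketch of the parallel-fetch side, though using a bounded worker pool rather than literally one process per URL (the URL list and pool size here are placeholders):

# Fetch many pages in parallel with a fixed-size process pool.
from multiprocessing import Pool
from urllib.request import urlopen

def fetch(url):
    # Return (url, body); swallow per-URL failures so one bad site
    # doesn't take down the run.
    try:
        return url, urlopen(url, timeout=30).read()
    except Exception:
        return url, None

if __name__ == "__main__":
    urls = ["http://example.com/page%d" % i for i in range(1000)]
    pool = Pool(processes=200)   # hundreds of workers, not one per URL
    for url, body in pool.imap_unordered(fetch, urls):
        pass  # hand (url, body) off to whichever storage approach is chosen

The per-site throttling mentioned above is not shown; a real crawler would also group URLs by host and rate-limit each group.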

At the same time, I have a network of servers (10-20), each of which is doing the same thing: fetching pages.

So I need to create an architecture/structure to handle this mass of information and slam it into the db as fast as possible.

If I have a single central db, the apps will be waiting way too long to get a connection. If I have a separate db for each server, and have the apps on that server write to the local db, then I'd have to have a process that somehow collects the local db information and writes it to the master db. That's doable, but this solution would also potentially involve waiting, given the max connection limit of the db. A rough sketch of such a forwarder follows.
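
(The sketch assumes the MySQLdb driver and an invented page_data table with an auto-increment id; it pulls batches out of the local db, bulk-inserts them into the master, and deletes what it forwarded.)

# Per-box forwarder: local MySQL -> master MySQL in batches.
import time
import MySQLdb

BATCH = 500

def forward_once(local, master):
    # Move up to BATCH rows from the local db to the master db.
    lcur = local.cursor()
    lcur.execute("SELECT id, url, body FROM page_data ORDER BY id LIMIT %s", (BATCH,))
    rows = lcur.fetchall()
    if not rows:
        return 0
    mcur = master.cursor()
    mcur.executemany("INSERT INTO page_data (url, body) VALUES (%s, %s)",
                     [(url, body) for (_id, url, body) in rows])
    master.commit()
    # only delete rows we know reached the master (ids are monotonic)
    lcur.execute("DELETE FROM page_data WHERE id <= %s", (rows[-1][0],))
    local.commit()
    return len(rows)

def main():
    local = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="crawl")
    master = MySQLdb.connect(host="masterdb", user="app", passwd="secret", db="crawl")
    while True:
        if forward_once(local, master) == 0:
            time.sleep(5)   # nothing queued; back off briefly

if __name__ == "__main__":
    main()

The point of this layout is that each box holds only one connection to the master, no matter how many children are writing locally.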

So this is the dilemma I'm facing.

In searching Google and academic articles, I haven't come across a solution for this kind of issue.

In looking at other crawlers (Lucene/Nutch/etc.), I can't figure out whether these apps have a solution I can use.

The basic problem, as I've stated, boils down to accepting as much data as possible so that this aspect of the whole system isn't the bottleneck.

Yeah, I know, I'm greedy: I'm trying to download all of my required information from a given site in 10-20 minutes, as opposed to hours!

-bruce


