[PERFORM] Mass file imports

2003-07-21 Thread Greg Spiegelberg
Hello,

I'm hunting for some advice on loading 50,000+ files, each less than
32KB, into a PostgreSQL 7.3.2 database.  The table is simple:
create table files (
  id    int8 not null primary key,
  file  text not null,
  size  int8 not null,
  uid   int  not null,
  raw   oid
);
The script (currently bash) pulls a TAR file out of a queue, unpacks it
to a large ramdisk mounted with noatime and performs a battery of tests
on the files included in the TAR file.  For each file in the TAR it will
add the following to a SQL file:
update files set raw=lo_import('/path/to/file/from/tar')
  where file='/path/to/file/from/tar';

This file begins with BEGIN; and ends with END; and is fed to Postgres
via a "psql -f sqlfile" command.  This part of the process can take
anywhere from 30 to over 90 minutes depending on the number of files
included in the TAR file.
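As a sketch, the generated SQL file described above has this shape (the
paths are placeholders standing in for the unpacked TAR contents):

```sql
-- Sketch of the generated batch file: a single transaction wrapping
-- one UPDATE per unpacked file.  Paths are illustrative placeholders.
BEGIN;
update files set raw=lo_import('/ramdisk/tar/file0001')
  where file='/ramdisk/tar/file0001';
update files set raw=lo_import('/ramdisk/tar/file0002')
  where file='/ramdisk/tar/file0002';
-- ... one UPDATE per file in the TAR ...
END;
```

Note that lo_import() here runs server-side, so each file is read by the
backend from the ramdisk path, and the whole batch commits (or rolls
back) as one unit.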
System is a RedHat 7.3 running a current 2.4.20 RedHat kernel and
  dual PIII 1.4GHz
  2GB of memory
  512MB ramdisk (mounted noatime)
  mirrored internal SCSI160 10k rpm drives for OS and swap
  1 PCI 66MHz 64bit QLA2300
  1 Gbit SAN with several RAID5 LUNs on a Hitachi 9910
All filesystems are ext3.

Any thoughts?

Greg

--
Greg Spiegelberg
 Sr. Product Development Engineer
 Cranel, Incorporated.
 Phone: 614.318.4314
 Fax:   614.431.8388
 Email: [EMAIL PROTECTED]
Cranel. Technology. Integrity. Focus.


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] opinion on RAID choice

2003-09-02 Thread Greg Spiegelberg
Ron Johnson wrote:
On Tue, 2003-09-02 at 11:14, Andrew Sullivan wrote:

On Thu, Aug 28, 2003 at 03:26:14PM -0600, scott.marlowe wrote:

My experience has been that once you get past 6 disks, RAID5 is faster 
than RAID1+0.
Also depends on your filesystem and volume manager.  As near as I can
tell, you do _not_ want to use RAID 5 with Veritas.


Why should Veritas care?  Or is it that Veritas has a high overhead
of small block writes?


I agree with Scott, however only when it's hardware RAID 5, and only
with certain hardware implementations of it.  A Sun A1000 RAID 5 is not
equal to a Sun T3.  Putting disk technologies aside, the A1000 array's
XOR function is in software whereas the T3's is implemented in hardware.
Additionally, most external hardware-based RAID systems have some
form of battery backup to ensure all data is written.
Veritas Volume Manager, and even Linux, HP-UX, and AIX LVM, work just
fine for slicing & dicing but not for stitching LUNs together.  IMHO,
if you have the $$ for VxVM, buy a hardware-based RAID solution as well
and let it do the work.
Greg



[PERFORM] Compare rows

2003-10-08 Thread Greg Spiegelberg
All,

Anyone have any suggestions on how to efficiently compare
rows in the same table?  This table has 637 columns to be
compared and 642 total columns.
TIA,
Greg


Re: [PERFORM] Compare rows

2003-10-08 Thread Greg Spiegelberg
Josh Berkus wrote:
Greg,


Anyone have any suggestions on how to efficiently compare
rows in the same table?  This table has 637 columns to be
compared and 642 total columns.


637 columns?   Are you sure that's normalized?   It's hard for me to conceive 
of a circumstance where that many columns would be necessary.

If this isn't a catastrophic normalization problem (which it sounds like), 
then you will probably still need to work through procedural normalization 
code, as SQL simply doesn't offer any way around naming all the columns by 
hand.   Perhaps you could describe the problem in more detail?

The data represents metrics at a point in time on a system for
network, disk, memory, bus, controller, and so-on.  Rx, Tx, errors,
speed, and whatever else can be gathered.
We arrived at this one 642-column table after testing the whole
process, from data gathering through methods of temporarily storing
and then loading into the database.  Initially, 37+ tables were in use,
but the one big-un has saved us over 3.4 minutes.
The reason for my initial question is this: we save changes only.
In other words, if system S has row T1 for day D1, and on day D2
we gather an identical row T1 (excluding our time column), we don't
want to save it again.
That said, if the 3.4 minutes gets burned during the changes-only
comparison, we may look at reverting to separate tables.  There
are only 1,700 to 3,000 rows on average per load.
Oh, PostgreSQL 7.3.3, PHP 4.3.1, RedHat 7.3, kernel 2.4.20-18.7smp,
2x1.4GHz PIII, 2GB memory, and a 1 Gb/s SAN with Hitachi 9910 LUNs.
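One hedged sketch of the changes-only test above, expressed as a set
difference: assuming a hypothetical staging table metrics_new holding
the freshly gathered rows, and illustrative columns c1..c3 standing in
for the 637 compared columns, EXCEPT keeps only the staged rows that
differ from what is already stored (the time column is deliberately
left out of the comparison, as described):

```sql
-- Hypothetical sketch: insert only changed rows via a set difference.
-- metrics, metrics_new, system_id, and c1..c3 are illustrative
-- stand-ins, not the real 642-column schema.  The excluded time
-- column would be filled in on insert.
INSERT INTO metrics (system_id, c1, c2, c3)
SELECT system_id, c1, c2, c3
  FROM metrics_new
EXCEPT
SELECT system_id, c1, c2, c3
  FROM metrics;
```

The column lists still have to be spelled out once, but generating them
from the catalog (or from the loader script) avoids maintaining 637
names by hand in the SQL itself.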
Greg
