Re: [PERFORM] SSD + RAID

2010-03-03 Thread Ron Mayer
Greg Smith wrote:
> Bruce Momjian wrote:
>> I always assumed SCSI disks had a write-through cache and therefore
>> didn't need a drive cache flush comment.

Some do.  Some SCSI disks have write-back caches.

Some have both(!) - a write-back cache but the user can explicitly
send write-through requests.

Microsoft explains it well (IMHO) here:
http://msdn.microsoft.com/en-us/library/aa508863.aspx
  "For example, suppose that the target is a SCSI device with
   a write-back cache. If the device supports write-through
   requests, the initiator can bypass the write cache by
   setting the force unit access (FUA) bit in the command
   descriptor block (CDB) of the write command."
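To make the FUA mechanism concrete, here is a sketch of building a WRITE(10) CDB with the FUA bit set. The byte layout (opcode 0x2A, FUA as bit 3 of byte 1) follows the usual SBC description, but treat the details as something to verify against the spec rather than a reference implementation:

```python
import struct

WRITE_10 = 0x2A   # WRITE(10) opcode
FUA_BIT = 0x08    # force unit access: bit 3 of byte 1 in the CDB

def build_write10_cdb(lba: int, num_blocks: int, fua: bool = False) -> bytes:
    """Build a 10-byte WRITE(10) command descriptor block."""
    flags = FUA_BIT if fua else 0x00
    # opcode, flags, 4-byte big-endian LBA, group number,
    # 2-byte transfer length, control byte
    return struct.pack(">BBIBHB", WRITE_10, flags, lba, 0, num_blocks, 0)

cdb = build_write10_cdb(lba=1234, num_blocks=8, fua=True)
assert len(cdb) == 10
assert cdb[1] & FUA_BIT   # this write must bypass the write-back cache
```

With FUA set, a well-behaved target must not acknowledge the write until the data is on stable media, even though its cache remains in write-back mode for ordinary writes.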

> this perception, which I've recently come to believe isn't actually
> correct anymore.  ... I'm starting to think this is what
> we've all been observing rather than a write-through cache

I think what we've been observing is that guys with SCSI drives
are more likely to either
 (a) have battery-backed RAID controllers that ensure writes succeed,
or
 (b) have other decent RAID controllers that understand details
 like that FUA bit to send write-through requests even if
 a SCSI device has a write-back cache.

In contrast, most guys with PATA drives are probably running
software RAID (if any) with a RAID stack (older LVM and MD)
known to lose the cache flushing commands.


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2010-03-02 Thread Pierre C



I always assumed SCSI disks had a write-through cache and therefore
didn't need a drive cache flush comment.


Maximum performance can only be reached with a write-back cache, so the  
drive can reorder and cluster writes according to the real-time position  
of the heads and platter rotation.


The problem is not the write cache itself; it is that, for your data to be  
safe, the "flush cache" or "barrier" command must get all the way from  
the application / filesystem down to the hardware, passing through an  
unspecified number of software/firmware/hardware layers, any of which may :

- not specify whether they honor or ignore flush/barrier commands, and which  
ones
- not specify whether they will reorder writes, ignoring barriers/flushes
- have been written by people who are not aware of such issues
- have been written by companies who are perfectly aware of such issues  
but chose to ignore them to look good in benchmarks
- have some incompatibilities that result in broken behaviour
- have bugs

As far as I'm concerned, a configuration that doesn't properly respect the  
commands needed for data integrity is broken.


The sad truth is that, given a software/hardware IO stack, there's no way  
to be sure, and testing is difficult if it is possible at all. Some cache  
flushes might be ignored under some circumstances.


For this to change, you don't need a hardware change, but a mentality  
change.


Flash filesystem developers use flash simulators which measure wear  
leveling, etc.


We'd need a virtual box with a simulated virtual harddrive which is able  
to check this.
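As a sketch of what such a checking harness could look like, here is a toy simulated drive (hypothetical code, not an existing tool) with a volatile write-back cache, where a simulated power loss drops anything that was never flushed:

```python
# Toy model: a simulated drive that checks whether "flush" really made
# acknowledged writes durable before a power loss.

class SimulatedDrive:
    def __init__(self):
        self.cache = {}      # volatile write-back cache: lba -> data
        self.platter = {}    # durable storage: lba -> data

    def write(self, lba, data):
        self.cache[lba] = data           # acknowledged, but only cached

    def flush(self):
        self.platter.update(self.cache)  # barrier: persist cached writes
        self.cache.clear()

    def power_loss(self):
        self.cache.clear()               # volatile cache contents are gone

    def read_after_restart(self, lba):
        return self.platter.get(lba)

# A correct stack passes the barrier through before claiming durability:
drive = SimulatedDrive()
drive.write(0, b"commit record")
drive.flush()
drive.power_loss()
assert drive.read_after_restart(0) == b"commit record"

# A layer that silently drops the flush loses the "durable" write:
broken = SimulatedDrive()
broken.write(0, b"commit record")
# ... flush dropped somewhere in the stack ...
broken.power_loss()
assert broken.read_after_restart(0) is None
```

A real harness would sit under a virtual machine's disk emulation and record the exact stream of writes and flushes the guest stack actually issued.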


What a mess.




Re: [PERFORM] SSD + RAID

2010-03-01 Thread Greg Smith

Bruce Momjian wrote:

I always assumed SCSI disks had a write-through cache and therefore
didn't need a drive cache flush comment.
  


There's more detail on all this mess at 
http://wiki.postgresql.org/wiki/SCSI_vs._IDE/SATA_Disks and it includes 
this perception, which I've recently come to believe isn't actually 
correct anymore.  Like the IDE crowd, it looks like one day somebody 
said "hey, we lose every write heavy benchmark badly because we only 
have a write-through cache", and that principle got lost along the 
wayside.  What has been true, and I'm starting to think this is what 
we've all been observing rather than a write-through cache, is that the 
proper cache flushing commands have been there in working form for so 
much longer that it's more likely your SCSI driver and drive do the 
right thing if the filesystem asks them to.  SCSI SYNCHRONIZE CACHE has 
a much longer and prouder history than IDE's FLUSH_CACHE and SATA's 
FLUSH_CACHE_EXT.


It's also worth noting that many current SAS drives, the current SCSI 
incarnation, are basically SATA drives with a bridge chipset stuck onto 
them, or with just the interface board swapped out.  This is one reason why 
top-end SAS capacities lag behind consumer SATA drives.  They use the 
consumers as beta testers to get the really fundamental firmware issues 
sorted out, and once things are stable they start stamping out the 
version with the SAS interface instead.  (Note that there's a parallel 
manufacturing approach that makes much smaller SAS drives, the 2.5" 
server models or those at higher RPMs, that doesn't go through this 
path.  Those are also the really expensive models, due to economy of 
scale issues).  The idea that these would have fundamentally different 
write cache behavior doesn't really follow from that development model.


At this point, there are only two common differences between "consumer" 
and "enterprise" hard drives of the same size and RPM when there are 
directly matching ones:


1) You might get SAS instead of SATA as the interface, which provides 
the more mature command set I was talking about above--and therefore may 
give you a sane write-back cache with proper flushing, which is all the 
database really expects.


2) The timeouts when there's a read/write problem are tuned down in the 
enterprise version, to be more compatible with RAID setups where you 
want to push the drive off-line when this happens rather than presuming 
you can fix it.  Consumers would prefer that the drive spend a lot of 
time doing heroics to try and save their sole copy of the apparently 
missing data.


You might get a slightly higher grade of parts if you're lucky too; I 
wouldn't count on it though.  That seems to be saved for the high RPM or 
smaller size drives only.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] SSD + RAID

2010-03-01 Thread Bruce Momjian
Greg Smith wrote:
> Ron Mayer wrote:
> > Linux apparently sends FLUSH_CACHE commands to IDE drives in the
> > exact same places it sends SYNCHRONIZE CACHE commands to SCSI
> > drives[2].
> >   [2] http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114
> >   
> 
> Well, that's old enough to not even be completely right anymore about 
> SATA disks and kernels.  It's FLUSH_CACHE_EXT that's been added to ATA-6 
> to do the right thing on modern drives and that gets used nowadays, and 
> that doesn't necessarily do so on most of the SSDs out there; all of 
> which Bruce's recent doc additions now talk about correctly.
> 
> There's this one specific area we know about that the most popular 
> systems tend to get really wrong all the time; that's got the 
> appropriate warning now with the right magic keywords that people can 
> look into it more if motivated.  While it would be nice to get super 
> thorough and document everything, I think there's already more docs in 
> there than this project would prefer to have to maintain in this area.
> 
> Are we going to get into IDE, SATA, SCSI, SAS, FC, and iSCSI?  If the 
> idea is to be complete that's where this would go.  I don't know that 
> the documentation needs to address every possible way every possible 
> filesystem can be flushed. 

The bottom line is that the reason we have so much detailed
documentation about this is that mostly only database folks care about
such issues, so we end up having to research and document this
ourselves --- I don't see any alternatives.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do



Re: [PERFORM] SSD + RAID

2010-03-01 Thread Bruce Momjian
Ron Mayer wrote:
> Bruce Momjian wrote:
> > Greg Smith wrote:
> >> Bruce Momjian wrote:
> >>> I have added documentation about the ATAPI drive flush command, and the
> >>   
> >> If one of us goes back into that section one day to edit again it might 
> >> be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6 command 
> >> that a drive needs to support properly.  I wouldn't bother with another 
> >> doc edit commit just for that specific part though, pretty obscure.
> > 
> > That setting name was not easy to find so I added it to the
> > documentation.
> 
> If we're spelling out specific IDE commands, it might be worth
> noting that the corresponding SCSI command is "SYNCHRONIZE CACHE"[1].
> 
> 
> Linux apparently sends FLUSH_CACHE commands to IDE drives in the
> exact same places it sends SYNCHRONIZE CACHE commands to SCSI
> drives[2].
> 
> It seems that the same file systems, SW raid layers,
> virtualization platforms, and kernels that have a problem
> sending FLUSH CACHE commands to SATA drives have the exact
> same problems sending SYNCHRONIZE CACHE commands to SCSI drives.
> With the exact same effect of not getting writes all the way
> through disk caches.

I always assumed SCSI disks had a write-through cache and therefore
didn't need a drive cache flush comment.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do



Re: [PERFORM] SSD + RAID

2010-02-27 Thread Greg Smith

Ron Mayer wrote:

Linux apparently sends FLUSH_CACHE commands to IDE drives in the
exact same places it sends SYNCHRONIZE CACHE commands to SCSI
drives[2].
  [2] http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114
  


Well, that's old enough to not even be completely right anymore about 
SATA disks and kernels.  It's FLUSH_CACHE_EXT that's been added to ATA-6 
to do the right thing on modern drives and that gets used nowadays, and 
that doesn't necessarily do so on most of the SSDs out there; all of 
which Bruce's recent doc additions now talk about correctly.


There's this one specific area we know about that the most popular 
systems tend to get really wrong all the time; that's got the 
appropriate warning now with the right magic keywords that people can 
look into it more if motivated.  While it would be nice to get super 
thorough and document everything, I think there's already more docs in 
there than this project would prefer to have to maintain in this area.


Are we going to get into IDE, SATA, SCSI, SAS, FC, and iSCSI?  If the 
idea is to be complete that's where this would go.  I don't know that 
the documentation needs to address every possible way every possible 
filesystem can be flushed. 


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] SSD + RAID

2010-02-27 Thread Ron Mayer
Bruce Momjian wrote:
> Greg Smith wrote:
>> Bruce Momjian wrote:
>>> I have added documentation about the ATAPI drive flush command, and the
>>   
>> If one of us goes back into that section one day to edit again it might 
>> be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6 command 
>> that a drive needs to support properly.  I wouldn't bother with another 
>> doc edit commit just for that specific part though, pretty obscure.
> 
> That setting name was not easy to find so I added it to the
> documentation.

If we're spelling out specific IDE commands, it might be worth
noting that the corresponding SCSI command is "SYNCHRONIZE CACHE"[1].


Linux apparently sends FLUSH_CACHE commands to IDE drives in the
exact same places it sends SYNCHRONIZE CACHE commands to SCSI
drives[2].

It seems that the same file systems, SW raid layers,
virtualization platforms, and kernels that have a problem
sending FLUSH CACHE commands to SATA drives have the exact
same problems sending SYNCHRONIZE CACHE commands to SCSI drives.
With the exact same effect of not getting writes all the way
through disk caches.

No?


[1] http://linux.die.net/man/8/sg_sync
[2] http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114



Re: [PERFORM] SSD + RAID

2010-02-27 Thread Bruce Momjian
Greg Smith wrote:
> Bruce Momjian wrote:
> > I have added documentation about the ATAPI drive flush command, and the
> > typical SSD behavior.
> >   
> 
> If one of us goes back into that section one day to edit again it might 
> be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6 command 
> that a drive needs to support properly.  I wouldn't bother with another 
> doc edit commit just for that specific part though, pretty obscure.

That setting name was not easy to find so I added it to the
documentation.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do



Re: [PERFORM] SSD + RAID

2010-02-27 Thread Greg Smith

Bruce Momjian wrote:

I have added documentation about the ATAPI drive flush command, and the
typical SSD behavior.
  


If one of us goes back into that section one day to edit again it might 
be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6 command 
that a drive needs to support properly.  I wouldn't bother with another 
doc edit commit just for that specific part though, pretty obscure.


I find it kind of funny how many discussions run in parallel about even 
really detailed technical implementation details around the world.  For 
example, doesn't 
http://www.mail-archive.com/zfs-disc...@opensolaris.org/msg30585.html 
look exactly like the exchange between myself and Arjen the other day, 
referencing the same AnandTech page?


Could be worse; one of us could be the poor sap at 
http://opensolaris.org/jive/thread.jspa;jsessionid=41B679C30D136C059E1BB7C06CA7DCE0?messageID=397730 
who installed Windows XP, VirtualBox for Windows, an OpenSolaris VM 
inside of it, and then was shocked that cache flushes didn't make their 
way all the way through that chain and had his 10TB ZFS pool corrupted 
as a result.  Hurray for virtualization!


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] SSD + RAID

2010-02-26 Thread Bruce Momjian

I have added documentation about the ATAPI drive flush command, and the
typical SSD behavior.

---

Greg Smith wrote:
> Ron Mayer wrote:
> > Bruce Momjian wrote:
> >   
> >> Agreed, though I thought the problem was that SSDs lie about their
> >> cache flush like SATA drives do, or is there something I am missing?
> >> 
> >
> > There's exactly one case I can find[1] where this century's IDE
> > drives lied more than any other drive with a cache:
> 
> Ron is correct that the problem of mainstream SATA drives accepting the 
> cache flush command but not actually doing anything with it is long gone 
> at this point.  If you have a regular SATA drive, it almost certainly 
> supports proper cache flushing.  And if your whole software/storage 
> stack understands all that, you should not end up with corrupted data 
> just because there's a volatile write cache in there.
> 
> But the point of this whole testing exercise coming back into vogue 
> again is that SSDs have returned this negligent behavior to the 
> mainstream again.  See 
> http://opensolaris.org/jive/thread.jspa?threadID=121424 for a discussion 
> of this in a ZFS context just last month.  There are many documented 
> cases of Intel SSDs that will fake a cache flush, such that the only way 
> to get good reliable writes is to totally disable their writes 
> caches--at which point performance is so bad you might as well have 
> gotten a RAID10 setup instead (and longevity is toast too).
> 
> This whole area remains a disaster area and extreme distrust of all the 
> SSD storage vendors is advisable at this point.  Basically, if I don't 
> see the capacitor responsible for flushing outstanding writes, and get a 
> clear description from the manufacturer how the cached writes are going 
> to be handled in the event of a power failure, at this point I have to 
> assume the answer is "badly and your data will be eaten".  And the 
> prices for SSDs that meet that requirement are still quite steep.  I 
> keep hoping somebody will address this market at something lower than 
> the standard "enterprise" prices.  The upcoming SandForce designs seem 
> to have thought this through correctly:  
> http://www.anandtech.com/storage/showdoc.aspx?i=3702&p=6  But the 
> product's not out to the general public yet (just like the Seagate units 
> that claim to have capacitor backups--I heard a rumor those are also 
> Sandforce designs actually, so they may be the only ones doing this 
> right and aiming at a lower price).
> 
> -- 
> Greg Smith  2ndQuadrant US  Baltimore, MD
> PostgreSQL Training, Services and Support
> g...@2ndquadrant.com   www.2ndQuadrant.us
> 

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.62
diff -c -c -r1.62 wal.sgml
*** doc/src/sgml/wal.sgml	20 Feb 2010 18:28:37 -	1.62
--- doc/src/sgml/wal.sgml	27 Feb 2010 01:37:03 -
***
*** 59,66 
 same concerns about data loss exist for write-back drive caches as
 exist for disk controller caches.  Consumer-grade IDE and SATA drives are
 particularly likely to have write-back caches that will not survive a
!power failure.  Many solid-state drives also have volatile write-back
!caches.  To check write caching on Linux use
 hdparm -I;  it is enabled if there is a * next
 to Write cache; hdparm -W to turn off
 write caching.  On FreeBSD use
--- 59,69 
 same concerns about data loss exist for write-back drive caches as
 exist for disk controller caches.  Consumer-grade IDE and SATA drives are
 particularly likely to have write-back caches that will not survive a
!power failure, though ATAPI-6 introduced a drive cache
!flush command that some file systems use, e.g. ZFS.
!Many solid-state drives also have volatile write-back
!caches, and many do not honor cache flush commands by default.
!To check write caching on Linux use
 hdparm -I;  it is enabled if there is a * next
 to Write cache; hdparm -W to turn off
 write caching.  On FreeBSD use



Re: [PERFORM] SSD + RAID

2010-02-24 Thread Dave Crooke
It's always possible to rebuild into a consistent configuration by assigning
a precedence order; for parity RAID, the data drives take precedence over
parity drives, and for RAID-1 sets it assigns an arbitrary master.

You *should* never lose a whole stripe ... for example, RAID-5 updates do
"read old data / parity, write new data, write new parity" ... there is no
need to touch any other data disks, so they will be preserved through the
rebuild. Similarly, if only one block is being updated there is no need to
update the entire stripe.
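The small-write update above can be sketched as pure XOR arithmetic (a minimal illustration, independent of any particular RAID implementation):

```python
def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """RAID-5 small-write parity update: P' = P ^ D_old ^ D_new.

    Only the target data block and the parity block are touched;
    the other data disks in the stripe are left alone.
    """
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

# Stripe of three data blocks plus parity:
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\x0f"
parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))

# Overwrite d1; the new parity needs only d1_old, d1_new, and the old parity:
d1_new = b"\xff\x00"
parity_new = update_parity(parity, d1, d1_new)

# The updated parity matches a full recomputation over the stripe:
assert parity_new == bytes(a ^ b ^ c for a, b, c in zip(d0, d1_new, d2))
```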

David - what caused /dev/md to decide to take an array offline?

Cheers
Dave

On Tue, Feb 23, 2010 at 3:22 PM,  wrote:

> On Tue, 23 Feb 2010, Aidan Van Dyk wrote:
>
>  * da...@lang.hm  [100223 15:05]:
>>
>>  However, one thing that you do not get protection against with software
>>> raid is the potential for the writes to hit some drives but not others.
>>> If this happens the software raid cannot know what the correct contents
>>> of the raid stripe are, and so you could lose everything in that stripe
>>> (including contents of other files that are not being modified that
>>> happened to be in the wrong place on the array)
>>>
>>
>> That's for stripe-based raid.  Mirror sets like raid-1 should give you
>> either the old data, or the new data, both acceptable responses since
>> the fsync/barrier hasn't "completed".
>>
>> Or have I missed another subtle interaction?
>>
>
> one problem is that when the system comes back up and attempts to check the
> raid array, it is not going to know which drive has valid data. I don't know
> exactly what it does in that situation, but this type of error in other
> conditions causes the system to take the array offline.
>
>
> David Lang
>
>


Re: [PERFORM] SSD + RAID

2010-02-23 Thread Mark Mielke

On 02/23/2010 04:22 PM, da...@lang.hm wrote:

On Tue, 23 Feb 2010, Aidan Van Dyk wrote:


* da...@lang.hm  [100223 15:05]:


However, one thing that you do not get protection against with software
raid is the potential for the writes to hit some drives but not others.
If this happens the software raid cannot know what the correct contents
of the raid stripe are, and so you could lose everything in that 
stripe

(including contents of other files that are not being modified that
happened to be in the wrong place on the array)


That's for stripe-based raid.  Mirror sets like raid-1 should give you
either the old data, or the new data, both acceptable responses since
the fsync/barrier hasn't "completed".

Or have I missed another subtle interaction?


one problem is that when the system comes back up and attempts to 
check the raid array, it is not going to know which drive has valid 
data. I don't know exactly what it does in that situation, but this 
type of error in other conditions causes the system to take the array 
offline.


I think the real concern here is that depending on how the data is read 
later - and depending on which disks it reads from - it could read 
*either* old or new, at any time in the future. I.e. it reads "new" from 
disk 1 the first time, and then an hour later it reads "old" from disk 2.


I think this concern might be invalid for a properly running system, 
though. When a RAID array is not cleanly shut down, the RAID array 
should run in "degraded" mode until it can be sure that the data is 
consistent. In this case, it should pick one drive, and call it the 
"live" one, and then rebuild the other from the "live" one. Until it is 
re-built, it should only satisfy reads from the "live" one, or parts of 
the "rebuilding" one that are known to be clean.


I use mdadm software RAID, and all of my reading (including some of its 
source code) and experience (shutting down the box uncleanly) tells me, 
it is working properly. In fact, the "rebuild" process can get quite 
ANNOYING as the whole system becomes much slower during rebuild, and 
rebuild of large partitions can take hours to complete.


For mdadm, there is a not-so-well-known "write-intent bitmap" 
capability. Once enabled, mdadm will embed a small bitmap (128 bits?) 
into the partition, and each bit will indicate a section of the 
partition. Before writing to a section, it will mark that section as 
dirty using this bitmap. It will leave this bit set for some time after 
the partition is "clean" (lazy clear). The effect of this, is that at 
any point in time, only certain sections of the drive are dirty, and on 
recovery, it is a lot cheaper to only rebuild the dirty sections. It 
works really well.
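A toy model of the write-intent bitmap idea (illustrating the concept only; mdadm's real on-disk format and bitmap size differ):

```python
class WriteIntentBitmap:
    """Toy model of a write-intent bitmap: one dirty bit per array section."""

    def __init__(self, num_sections: int):
        self.dirty = [False] * num_sections

    def before_write(self, section: int):
        self.dirty[section] = True   # the bit is persisted BEFORE the data write

    def lazy_clear(self, section: int):
        self.dirty[section] = False  # cleared some time after writes settle

    def sections_to_resync(self):
        # After an unclean shutdown, only dirty sections need rebuilding.
        return [i for i, d in enumerate(self.dirty) if d]

bmp = WriteIntentBitmap(num_sections=1024)
bmp.before_write(7)
bmp.before_write(512)
bmp.lazy_clear(7)            # this section has been clean for a while
# Crash happens here; on recovery only section 512 must be resynced:
assert bmp.sections_to_resync() == [512]
```

The lazy clearing is what keeps the bitmap cheap: repeated writes to a hot section pay the extra bitmap update only once in a while, not on every write.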


So, I don't think this has to be a problem. There are solutions, and any 
solution that claims to be complete should offer these sorts of 
capabilities.


Cheers,
mark




Re: [PERFORM] SSD + RAID

2010-02-23 Thread david

On Tue, 23 Feb 2010, Aidan Van Dyk wrote:


* da...@lang.hm  [100223 15:05]:


However, one thing that you do not get protection against with software
raid is the potential for the writes to hit some drives but not others.
If this happens the software raid cannot know what the correct contents
of the raid stripe are, and so you could lose everything in that stripe
(including contents of other files that are not being modified that
happened to be in the wrong place on the array)


That's for stripe-based raid.  Mirror sets like raid-1 should give you
either the old data, or the new data, both acceptable responses since
the fsync/barrier hasn't "completed".

Or have I missed another subtle interaction?


one problem is that when the system comes back up and attempts to check 
the raid array, it is not going to know which drive has valid data. I 
don't know exactly what it does in that situation, but this type of error 
in other conditions causes the system to take the array offline.


David Lang



Re: [PERFORM] SSD + RAID

2010-02-23 Thread Aidan Van Dyk
* da...@lang.hm  [100223 15:05]:

> However, one thing that you do not get protection against with software  
> raid is the potential for the writes to hit some drives but not others. 
> If this happens the software raid cannot know what the correct contents 
> of the raid stripe are, and so you could lose everything in that stripe  
> (including contents of other files that are not being modified that  
> happened to be in the wrong place on the array)

That's for stripe-based raid.  Mirror sets like raid-1 should give you
either the old data, or the new data, both acceptable responses since
the fsync/barrier hasn't "completed".

Or have I missed another subtle interaction?

a.

-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a slave.




Re: [PERFORM] SSD + RAID

2010-02-23 Thread david

On Tue, 23 Feb 2010, da...@lang.hm wrote:


On Mon, 22 Feb 2010, Ron Mayer wrote:



Also worth noting - Linux's software raid stuff (MD and LVM)
need to handle this right as well - and last I checked (sometime
last year) the default setups didn't.



I think I saw some stuff in the last few months on this issue on the kernel 
mailing list. you may want to doublecheck this when 2.6.33 gets released 
(probably this week)


to clarify further (after getting more sleep ;-)

I believe that the linux software raid always did the right thing if you 
did an fsync/fdatasync. However, barriers that filesystems attempted to use 
to avoid the need for a hard fsync used to be silently ignored. I believe 
these are now honored (in at least some configurations)
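For reference, the fsync path mentioned above looks like this from an application's point of view (plain POSIX usage via Python's standard library; nothing raid-specific, and the file name is just an example):

```python
import os
import tempfile

def durable_write(path: str, payload: bytes) -> None:
    """Write payload and ask the kernel to push it through to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)   # returns only after the kernel has issued the flush
    finally:
        os.close(fd)

path = os.path.join(tempfile.gettempdir(), "commit.log")
durable_write(path, b"transaction 42 committed\n")
with open(path, "rb") as f:
    assert f.read() == b"transaction 42 committed\n"
```

Whether that fsync actually reaches the platters is exactly the question of this thread: every layer below the kernel still has to pass the flush along.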


However, one thing that you do not get protection against with software 
raid is the potential for the writes to hit some drives but not others. If 
this happens the software raid cannot know what the correct contents of 
the raid stripe are, and so you could lose everything in that stripe 
(including contents of other files that are not being modified that 
happened to be in the wrong place on the array)


If you have critical data, you _really_ want to use a raid controller with 
battery backup so that if you lose power you have a chance of eventually 
completing the write.


David Lang



Re: [PERFORM] SSD + RAID

2010-02-23 Thread Scott Carey

On Feb 23, 2010, at 3:49 AM, Pierre C wrote:
> Now I wonder about something. SSDs use wear-leveling which means the  
> information about which block was written where must be kept somewhere.  
> Which means this information must be updated. I wonder how crash-safe and  
> how atomic these updates are, in the face of a power loss.  This is just  
> like a filesystem. You've been talking only about data, but the block  
> layout information (metadata) is subject to the same concerns. If the  
> drive says it's written, not only the data must have been written, but  
> also the information needed to locate that data...
> 
> Therefore I think the yank-the-power-cord test should be done with random  
> writes happening on an aged and mostly-full SSD... and afterwards, I'd be  
> interested to know if not only the last txn really committed, but if some  
> random parts of other stuff weren't "wear-leveled" into oblivion at the  
> power loss...
> 

A couple years ago I postulated that SSD's could do random writes fast if they 
remapped blocks.  Microsoft's SSD whitepaper at the time hinted at this too.
Persisting the remap data is not hard.  It goes in the same location as the 
data, or a separate area that can be written to linearly.

Each block may contain its LBA and a transaction ID or other atomic count.  Or 
another block can have that info.  When the SSD
powers up, it can build its table of LBA -> block by looking at that data and 
inverting it and keeping the highest transaction ID for duplicate LBA claims.

Although SSD's have to ERASE data in a large block at a time (256K to 2M 
typically), they can write linearly to an erased block in much smaller chunks.
Thus, to commit a write, either:
Data, LBA tag, and txID in same block (may require oddly sized blocks).
or
Data written to one block (not committed yet), then LBA tag and txID written 
elsewhere (which commits the write).  Since its all copy on write, partial 
writes can't happen.
If a block is being moved or compressed when power fails data should never be 
lost since the old data still exists, the new version just didn't commit.  But 
new data that is being written may not be committed yet in the case of a power 
failure unless other measures are taken.
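The power-up rebuild Scott describes can be sketched as inverting per-block (LBA, txID) tags into an LBA -> physical-block table, keeping the highest transaction ID per LBA (a toy model, not any vendor's actual firmware):

```python
def rebuild_lba_table(blocks):
    """Rebuild the LBA -> physical-block map from per-block (lba, tx_id) tags.

    `blocks` maps physical block number -> (lba, tx_id).  For duplicate LBA
    claims, the write with the highest transaction ID wins; the losers are
    stale copies that were superseded but never erased.
    """
    table = {}   # lba -> (tx_id, physical_block)
    for phys, (lba, tx_id) in blocks.items():
        if lba not in table or tx_id > table[lba][0]:
            table[lba] = (tx_id, phys)
    return {lba: phys for lba, (tx_id, phys) in table.items()}

# LBA 5 was rewritten: physical block 2 (tx 7) supersedes block 0 (tx 3).
flash = {0: (5, 3), 1: (9, 4), 2: (5, 7)}
assert rebuild_lba_table(flash) == {5: 2, 9: 1}
```

Because the losing copy is still intact, a power failure during the rewrite leaves the drive with either the old committed version or the new one, never a torn block.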

> 
> 
> 
> 
> 




Re: [PERFORM] SSD + RAID

2010-02-23 Thread Nikolas Everett
On Tue, Feb 23, 2010 at 6:49 AM, Pierre C  wrote:

>  Note that's power draw per bit.  dram is usually much more densely
>> packed (it can be with fewer transistors per cell) so the individual
>> chips for each may have similar power draws while the dram will be 10
>> times as densely packed as the sram.
>>
>
> Differences between SRAM and DRAM :
>
> [lots of informative stuff]
>

I've been slowly reading the paper at
http://people.redhat.com/drepper/cpumemory.pdf  which has a big section on
SRAM vs DRAM with nice pretty pictures. While not strictly relevant, it's been
illuminating and I wanted to share.


Re: [PERFORM] SSD + RAID

2010-02-23 Thread Pierre C

Note that's power draw per bit.  dram is usually much more densely
packed (it can be with fewer transistors per cell) so the individual
chips for each may have similar power draws while the dram will be 10
times as densely packed as the sram.


Differences between SRAM and DRAM:

- price per byte (DRAM much cheaper)

- silicon area per byte (DRAM much smaller)

- random access latency
   SRAM = fast, uniform, and predictable, usually 0/1 cycles
   DRAM = "a few" up to "a lot" of cycles depending on chip type,
   which page/row/column you want to access, whether it's R or W,
   whether the page is already open, etc

In fact, DRAM is the new hard disk. SRAM is used mostly when low latency is  
needed (caches, etc).


- ease of use:
   SRAM is very easy to use: address, data, read, write, clock.
   SDRAM needs a smart controller.
   SRAM is easier to instantiate on a silicon chip.

- power draw
   When used at high speeds, SRAM isn't power-saving at all; it's used for  
speed.

   However when not used, the power draw is really negligible.

While it is true that you can recover *some* data out of an SRAM/DRAM chip  
that hasn't been powered for a few seconds, you can't really trust that  
data. It's only a forensics tool.


Most DRAM now (especially laptop DRAM) includes special power-saving modes  
which only keep the data retention logic (refresh, etc) powered, but not  
the rest of the chip (internal caches, IO buffers, etc). Laptops, PDAs,  
etc all use this feature in suspend-to-RAM mode. In this mode, the power  
draw is higher than SRAM, but still pretty minimal, so a laptop can stay  
in suspend-to-RAM mode for days.


Anyway, the SRAM vs DRAM question isn't really relevant to the debate about SSD  
data integrity. You can back up both with a small battery or ultra-cap.


What is also important is that the entire SSD chipset must have been  
designed with this in mind: it must detect power loss, and correctly  
react to it, and especially not reset itself or do funny stuff to the  
memory when the power comes back. Which means at least some parts of the  
chipset must stay powered to keep their state.


Now I wonder about something. SSDs use wear-leveling which means the  
information about which block was written where must be kept somewhere.  
Which means this information must be updated. I wonder how crash-safe and  
how atomic these updates are, in the face of a power loss.  This is just  
like a filesystem. You've been talking only about data, but the block  
layout information (metadata) is subject to the same concerns. If the  
drive says it's written, not only the data must have been written, but  
also the information needed to locate that data...


Therefore I think the yank-the-power-cord test should be done with random  
writes happening on an aged and mostly-full SSD... and afterwards, I'd be  
interested to know if not only the last txn really committed, but if some  
random parts of other stuff weren't "wear-leveled" into oblivion at the  
power loss...
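
The yank-the-power-cord test can at least be modeled in software to show what the pass criterion is. This is a pure simulation under assumed semantics (append-only records, recovery by highest txID), not a real hardware test: a power cut truncates the record stream at an arbitrary point, and afterwards every LBA must read back as some value that was actually written to it at some time, never garbage. On a real aged, mostly-full SSD, the interesting failures are exactly the ones where this property breaks.

```python
# Toy model of the pull-the-plug test: committed writes are append-only
# (lba, tx, data) records; a power cut truncates the stream.  Recovery
# keeps the highest tx per LBA, so every surviving LBA must hold some
# previously written value -- never a torn or fabricated one.
import random

def recover(records):
    state = {}
    for lba, tx, data in records:
        if lba not in state or tx > state[lba][0]:
            state[lba] = (tx, data)
    return {lba: data for lba, (tx, data) in state.items()}

random.seed(0)
history = {}                  # lba -> set of all values ever written to it
journal = []
for tx in range(1, 200):
    lba = random.randrange(8)
    data = "v%d" % tx
    journal.append((lba, tx, data))
    history.setdefault(lba, set()).add(data)

cut = random.randrange(len(journal))      # the "power cord yank"
survived = recover(journal[:cut])
assert all(v in history[lba] for lba, v in survived.items())
print("no LBA recovered to a value that was never written")
```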









Re: [PERFORM] SSD + RAID

2010-02-23 Thread david

On Mon, 22 Feb 2010, Ron Mayer wrote:



Also worth noting - Linux's software raid stuff (MD and LVM)
need to handle this right as well - and last I checked (sometime
last year) the default setups didn't.



I think I saw some stuff in the last few months on this issue on the 
kernel mailing list. you may want to doublecheck this when 2.6.33 gets 
released (probably this week)


David Lang



Re: [PERFORM] SSD + RAID

2010-02-22 Thread Scott Marlowe
On Mon, Feb 22, 2010 at 7:21 PM, Scott Marlowe  wrote:
> On Mon, Feb 22, 2010 at 6:39 PM, Greg Smith  wrote:
>> Mark Mielke wrote:
>>>
>>> I had read the above when posted, and then looked up SRAM. SRAM seems to
>>> suggest it will hold the data even after power loss, but only for a period
>>> of time. As long as power can restore within a few minutes, it seemed like
>>> this would be ok?
>>
>> The normal type of RAM everyone uses is DRAM, which requires constant
>> "refresh" cycles to keep it working and is pretty power hungry as a result.
>>  Power gone, data gone an instant later.
>
> Actually, oddly enough, per bit stored dram is much lower power usage
> than sram, because it only has something like 2 transistors per bit,
> while sram needs something like 4 or 5 (it's been a couple decades
> since I took the classes on each).  Even with the constant refresh,
> dram has a lower power draw than sram.

Note that's power draw per bit.  dram is usually much more densely
packed (it can be with fewer transistors per cell) so the individual
chips for each may have similar power draws while the dram will be 10
times as densely packed as the sram.



Re: [PERFORM] SSD + RAID

2010-02-22 Thread Scott Marlowe
On Mon, Feb 22, 2010 at 6:39 PM, Greg Smith  wrote:
> Mark Mielke wrote:
>>
>> I had read the above when posted, and then looked up SRAM. SRAM seems to
>> suggest it will hold the data even after power loss, but only for a period
>> of time. As long as power can restore within a few minutes, it seemed like
>> this would be ok?
>
> The normal type of RAM everyone uses is DRAM, which requires constant
> "refresh" cycles to keep it working and is pretty power hungry as a result.
>  Power gone, data gone an instant later.

Actually, oddly enough, per bit stored dram is much lower power usage
than sram, because it only has something like 2 transistors per bit,
while sram needs something like 4 or 5 (it's been a couple decades
since I took the classes on each).  Even with the constant refresh,
dram has a lower power draw than sram.



Re: [PERFORM] SSD + RAID

2010-02-22 Thread Greg Smith

Mark Mielke wrote:
I had read the above when posted, and then looked up SRAM. SRAM seems 
to suggest it will hold the data even after power loss, but only for a 
period of time. As long as power can restore within a few minutes, it 
seemed like this would be ok?


The normal type of RAM everyone uses is DRAM, which requires constant 
"refresh" cycles to keep it working and is pretty power hungry as a 
result.  Power gone, data gone an instant later.


There is also Non-volatile SRAM that includes an integrated battery ( 
http://www.maxim-ic.com/quick_view2.cfm/qv_pk/2648 is a typical 
example), and that sort of design can be used to build the sort of 
battery-backed caches that RAID controllers provide.  If Intel's drives 
were built using a NV-SRAM implementation, I'd be a happy owner of one 
instead of a constant critic of their drives.


But regular old SRAM is still completely volatile and loses its contents 
very quickly after power fails.  I doubt you'd even get minutes of time 
before it's gone.  The ease with which data-loss failures on these Intel 
drives continue to be reproduced in the field says their design isn't 
anywhere near good enough to be considered non-volatile.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] SSD + RAID

2010-02-22 Thread Greg Smith

Ron Mayer wrote:

I know less about other file systems.  Apparently the NTFS guys
are aware of such stuff - but don't know what kinds of fsync equivalent
you'd need to make it happen.
  


It's actually pretty straightforward--better than ext3.  Windows with 
NTFS has been perfectly aware how to do write-through on drives that 
support it when you execute _commit for some time: 
http://msdn.microsoft.com/en-us/library/17618685(VS.80).aspx


If you switch the postgresql.conf setting to fsync_writethrough on 
Windows, it will execute _commit where it would execute fsync on other 
platforms, and that pushes through the drive's caches as it should 
(unlike fsync in many cases).  More about this at 
http://archives.postgresql.org/pgsql-hackers/2005-08/msg00227.php and 
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm (which 
also covers OS X).
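
The write-then-push-through sequence behind fsync_writethrough can be sketched portably. One relevant detail (worth verifying for your Python version): CPython implements os.fsync() via _commit() on Windows, the same call PostgreSQL issues for fsync_writethrough, so the sketch below exercises that path there. Whether the drive's own cache is also flushed still depends on the OS and driver honoring write-through, as discussed above.

```python
# Minimal sketch: flush application buffers, then ask the OS to push
# the data toward stable storage.  On Windows, CPython's os.fsync()
# calls _commit(); elsewhere it calls fsync().
import os, tempfile

def durable_write(path, payload):
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()              # drain Python's own buffering
        os.fsync(f.fileno())   # _commit() on Windows, fsync() elsewhere

path = os.path.join(tempfile.mkdtemp(), "wal_segment")
durable_write(path, b"commit record")
with open(path, "rb") as f:
    assert f.read() == b"commit record"
print("written and fsynced")
```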


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] SSD + RAID

2010-02-22 Thread Mark Mielke

On 02/22/2010 08:04 PM, Greg Smith wrote:

Arjen van der Meijden wrote:

That's weird. Intel's SSD's didn't have a write cache afaik:
"I asked Intel about this and it turns out that the DRAM on the Intel 
drive isn't used for user data because of the risk of data loss, 
instead it is used as memory by the Intel SATA/flash controller for 
deciding exactly where to write data (I'm assuming for the wear 
leveling/reliability algorithms)."

http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=10


Read further down:

"Despite the presence of the external DRAM, both the Intel controller 
and the JMicron rely on internal buffers to cache accesses to the 
SSD...Intel's controller has a 256KB SRAM on-die."


That's the problematic part:  the Intel controllers have a volatile 
256KB write cache stored deep inside the SSD controller, and issuing a 
standard SATA write cache flush command doesn't seem to clear it.  
Makes the drives troublesome for database use.


I had read the above when posted, and then looked up SRAM. SRAM seems to 
suggest it will hold the data even after power loss, but only for a 
period of time. As long as power can restore within a few minutes, it 
seemed like this would be ok?


I can understand a SSD might do unexpected things when it loses power 
all of a sudden. It will probably try to group writes to fill a 
single block (and those blocks vary in size but are normally way 
larger than those of a normal spinning disk, they are values like 256 
or 512KB) and it might lose that "waiting until a full block can be 
written"-data or perhaps it just couldn't complete a full block-write 
due to the power failure.
Although that behavior isn't really what you want, it would be 
incorrect to blame write caching for the behavior if the device 
doesn't even have a write cache ;)


If you write data and that write call returns before the data hits 
disk, it's a write cache, period.  And if that write cache loses its 
contents if power is lost, it's a volatile write cache that can cause 
database corruption.  The fact that the one on the Intel devices is 
very small, basically just dealing with the block chunking behavior 
you describe, doesn't change either of those facts.




The SRAM seems to suggest that it does not necessarily lose its contents 
if power is lost - it just doesn't say how long you have to plug it back 
in. Isn't this similar to a battery-backed cache or capacitor-backed cache?


I'd love to have a better guarantee - but is SRAM really such a bad model?

Cheers,
mark




Re: [PERFORM] SSD + RAID

2010-02-22 Thread Greg Smith

Arjen van der Meijden wrote:

That's weird. Intel's SSD's didn't have a write cache afaik:
"I asked Intel about this and it turns out that the DRAM on the Intel 
drive isn't used for user data because of the risk of data loss, 
instead it is used as memory by the Intel SATA/flash controller for 
deciding exactly where to write data (I'm assuming for the wear 
leveling/reliability algorithms)."

http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=10


Read further down:

"Despite the presence of the external DRAM, both the Intel controller 
and the JMicron rely on internal buffers to cache accesses to the 
SSD...Intel's controller has a 256KB SRAM on-die."


That's the problematic part:  the Intel controllers have a volatile 
256KB write cache stored deep inside the SSD controller, and issuing a 
standard SATA write cache flush command doesn't seem to clear it.  Makes 
the drives troublesome for database use.


I can understand a SSD might do unexpected things when it loses power 
all of a sudden. It will probably try to group writes to fill a single 
block (and those blocks vary in size but are normally way larger than 
those of a normal spinning disk, they are values like 256 or 512KB) 
and it might lose that "waiting until a full block can be 
written"-data or perhaps it just couldn't complete a full block-write 
due to the power failure.
Although that behavior isn't really what you want, it would be 
incorrect to blame write caching for the behavior if the device 
doesn't even have a write cache ;)


If you write data and that write call returns before the data hits disk, 
it's a write cache, period.  And if that write cache loses its contents 
if power is lost, it's a volatile write cache that can cause database 
corruption.  The fact that the one on the Intel devices is very small, 
basically just dealing with the block chunking behavior you describe, 
doesn't change either of those facts.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] SSD + RAID

2010-02-22 Thread Ron Mayer
Bruce Momjian wrote:
> Greg Smith wrote:
>>  If you have a regular SATA drive, it almost certainly 
>> supports proper cache flushing
> 
> OK, but I have a few questions.  Is a write to the drive and a cache
> flush command the same?

I believe they're different as of ATAPI-6 from 2001.

> Which file systems implement both?

Seems ZFS and recent ext4 have thought these interactions out
thoroughly.   Find a slow ext4 that people complain about, and
that's the one doing it right :-).

Ext3 has some particularly odd annoyances where it flushes and waits
for certain writes (ones involving inode changes) but doesn't bother
to flush others (just data changes).   As far as I can tell, with
ext3 you need userspace utilities to make sure flushes occur when
you need them.  At one point I was tempted to try to put such
userspace hacks into postgres.

I know less about other file systems.  Apparently the NTFS guys
are aware of such stuff - but I don't know what kinds of fsync equivalent
you'd need to make it happen.

Also worth noting - Linux's software raid stuff (MD and LVM)
need to handle this right as well - and last I checked (sometime
last year) the default setups didn't.

>  I thought a
> write to the drive was always assumed to flush it to the platters,
> assuming the drive's cache is set to write-through.

Apparently somewhere around here:
http://www.t10.org/t13/project/d1410r3a-ATA-ATAPI-6.pdf
they were separated in the IDE world.



Re: [PERFORM] SSD + RAID

2010-02-22 Thread Bruce Momjian
Ron Mayer wrote:
> Bruce Momjian wrote:
> > Agreed, though I thought the problem was that SSDs lie about their
> > cache flush like SATA drives do, or is there something I am missing?
> 
> There's exactly one case I can find[1] where this century's IDE
> drives lied more than any other drive with a cache:
> 
>   Under 120GB Maxtor drives from late 2003 to early 2004.
> 
> and it's apparently been worked around for years.
> 
> Those drives claimed to support the "FLUSH_CACHE_EXT" feature (IDE
> command 0xEA), but did not support sending 48-bit commands which
> was needed to send the cache flushing command.
> 
> And for that case a workaround for Linux was quickly identified by
> checking for *both* the support for 48-bit commands and support for the
> flush cache extension[2].
> 
> 
> Beyond those 2004 drive + 2003 kernel systems, I think most of the rest
> of such reports have been various misfeatures in some of Linux's
> filesystems (like EXT3 that only wants to send drives cache-flushing
> commands when inode change[3]) and linux software raid misfeatures
> 
> ...and ISTM those would affect SSDs the same way they'd affect SATA drives.

I think the point is not that drives lie about their write-back and
write-through behavior, but rather that many SATA/IDE drives default to
write-back, and not write-through, and many administrators and file
systems are not aware of this behavior.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +



Re: [PERFORM] SSD + RAID

2010-02-22 Thread Bruce Momjian
Greg Smith wrote:
> Ron Mayer wrote:
> > Bruce Momjian wrote:
> >   
> >> Agreed, though I thought the problem was that SSDs lie about their
> >> cache flush like SATA drives do, or is there something I am missing?
> >> 
> >
> > There's exactly one case I can find[1] where this century's IDE
> > drives lied more than any other drive with a cache:
> 
> Ron is correct that the problem of mainstream SATA drives accepting the 
> cache flush command but not actually doing anything with it is long gone 
> at this point.  If you have a regular SATA drive, it almost certainly 
> supports proper cache flushing.  And if your whole software/storage 
> stacks understands all that, you should not end up with corrupted data 
> just because there's a volatile write cache in there.

OK, but I have a few questions.  Is a write to the drive and a cache
flush command the same?  Which file systems implement both?  I thought a
write to the drive was always assumed to flush it to the platters,
assuming the drive's cache is set to write-through.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +



Re: [PERFORM] SSD + RAID

2010-02-21 Thread Arjen van der Meijden

On 22-2-2010 6:39 Greg Smith wrote:

But the point of this whole testing exercise coming back into vogue
again is that SSDs have returned this negligent behavior to the
mainstream again. See
http://opensolaris.org/jive/thread.jspa?threadID=121424 for a discussion
of this in a ZFS context just last month. There are many documented
cases of Intel SSDs that will fake a cache flush, such that the only way
to get good reliable writes is to totally disable their write
caches--at which point performance is so bad you might as well have
gotten a RAID10 setup instead (and longevity is toast too).


That's weird. Intel's SSD's didn't have a write cache afaik:
"I asked Intel about this and it turns out that the DRAM on the Intel 
drive isn't used for user data because of the risk of data loss, instead 
it is used as memory by the Intel SATA/flash controller for deciding 
exactly where to write data (I'm assuming for the wear 
leveling/reliability algorithms)."

http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=10

But that is the old version, perhaps the second generation does have a 
bit of write caching.


I can understand a SSD might do unexpected things when it loses power 
all of a sudden. It will probably try to group writes to fill a single 
block (and those blocks vary in size but are normally way larger than 
those of a normal spinning disk, they are values like 256 or 512KB) and 
it might lose that "waiting until a full block can be written"-data or 
perhaps it just couldn't complete a full block-write due to the power 
failure.
Although that behavior isn't really what you want, it would be incorrect 
to blame write caching for the behavior if the device doesn't even have 
a write cache ;)


Best regards,

Arjen




Re: [PERFORM] SSD + RAID

2010-02-21 Thread Greg Smith

Ron Mayer wrote:

Bruce Momjian wrote:
  

Agreed, though I thought the problem was that SSDs lie about their
cache flush like SATA drives do, or is there something I am missing?



There's exactly one case I can find[1] where this century's IDE
drives lied more than any other drive with a cache:


Ron is correct that the problem of mainstream SATA drives accepting the 
cache flush command but not actually doing anything with it is long gone 
at this point.  If you have a regular SATA drive, it almost certainly 
supports proper cache flushing.  And if your whole software/storage 
stacks understands all that, you should not end up with corrupted data 
just because there's a volatile write cache in there.


But the point of this whole testing exercise coming back into vogue 
again is that SSDs have returned this negligent behavior to the 
mainstream again.  See 
http://opensolaris.org/jive/thread.jspa?threadID=121424 for a discussion 
of this in a ZFS context just last month.  There are many documented 
cases of Intel SSDs that will fake a cache flush, such that the only way 
to get good reliable writes is to totally disable their write 
caches--at which point performance is so bad you might as well have 
gotten a RAID10 setup instead (and longevity is toast too).


This whole area remains a disaster, and extreme distrust of all the 
SSD storage vendors is advisable at this point.  Basically, if I don't 
see the capacitor responsible for flushing outstanding writes, and get a 
clear description from the manufacturer how the cached writes are going 
to be handled in the event of a power failure, at this point I have to 
assume the answer is "badly and your data will be eaten".  And the 
prices for SSDs that meet that requirement are still quite steep.  I 
keep hoping somebody will address this market at something lower than 
the standard "enterprise" prices.  The upcoming SandForce designs seem 
to have thought this through correctly:  
http://www.anandtech.com/storage/showdoc.aspx?i=3702&p=6  But the 
product's not out to the general public yet (just like the Seagate units 
that claim to have capacitor backups--I heard a rumor those are also 
Sandforce designs actually, so they may be the only ones doing this 
right and aiming at a lower price).


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us



Re: [PERFORM] SSD + RAID

2010-02-21 Thread Ron Mayer
Bruce Momjian wrote:
> Agreed, though I thought the problem was that SSDs lie about their
> cache flush like SATA drives do, or is there something I am missing?

There's exactly one case I can find[1] where this century's IDE
drives lied more than any other drive with a cache:

  Under 120GB Maxtor drives from late 2003 to early 2004.

and it's apparently been worked around for years.

Those drives claimed to support the "FLUSH_CACHE_EXT" feature (IDE
command 0xEA), but did not support sending 48-bit commands which
was needed to send the cache flushing command.

And for that case a workaround for Linux was quickly identified by
checking for *both* the support for 48-bit commands and support for the
flush cache extension[2].
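
The workaround's logic is a simple bit check against the IDENTIFY DEVICE data. The bit positions below are my reading of ATA/ATAPI-6 (word 83: bit 14 set and bit 15 clear mark the word as valid, bit 10 = 48-bit Address feature set, bit 13 = FLUSH CACHE EXT supported) and should be verified against the spec before being relied on; the function name is made up for illustration.

```python
# Only trust FLUSH CACHE EXT (a 48-bit command) when the drive also
# reports the 48-bit Address feature set -- the check that screened
# out the buggy Maxtors.  Bit numbers per ATA/ATAPI-6, word 83.

def can_use_flush_cache_ext(word83):
    valid = bool(word83 & 0x4000) and not (word83 & 0x8000)
    lba48 = bool(word83 & (1 << 10))       # 48-bit Address feature set
    flush_ext = bool(word83 & (1 << 13))   # FLUSH CACHE EXT supported
    return valid and lba48 and flush_ext

# Healthy drive: valid marker plus both feature bits.
assert can_use_flush_cache_ext(0x4000 | (1 << 10) | (1 << 13))
# The buggy Maxtors: claim FLUSH_CACHE_EXT but not 48-bit commands.
assert not can_use_flush_cache_ext(0x4000 | (1 << 13))
```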


Beyond those 2004 drive + 2003 kernel systems, I think most of the rest
of such reports have been various misfeatures in some of Linux's
filesystems (like EXT3, which only wants to send drives cache-flushing
commands when inodes change[3]) and Linux software RAID misfeatures

...and ISTM those would affect SSDs the same way they'd affect SATA drives.


[1] http://lkml.org/lkml/2004/5/12/132
[2] http://lkml.org/lkml/2004/5/12/200
[3] http://www.mail-archive.com/linux-ker...@vger.kernel.org/msg272253.html





Re: [PERFORM] SSD + RAID

2010-02-21 Thread Bruce Momjian
Scott Carey wrote:
> On Feb 20, 2010, at 3:19 PM, Bruce Momjian wrote:
> 
> > Dan Langille wrote:
> >> -BEGIN PGP SIGNED MESSAGE-
> >> Hash: SHA1
> >>
> >> Bruce Momjian wrote:
> >>> Matthew Wakeling wrote:
>  On Fri, 13 Nov 2009, Greg Smith wrote:
> > In order for a drive to work reliably for database use such as for
> > PostgreSQL, it cannot have a volatile write cache.  You either need a 
> > write
> > cache with a battery backup (and a UPS doesn't count), or to turn the 
> > cache
> > off.  The SSD performance figures you've been looking at are with the 
> > drive's
> > write cache turned on, which means they're completely fictitious and
> > exaggerated upwards for your purposes.  In the real world, that will 
> > result
> > in database corruption after a crash one day.
>  Seagate are claiming to be on the ball with this one.
> 
>  http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/
> >>>
> >>> I have updated our documentation to mention that even SSD drives often
> >>> have volatile write-back caches.  Patch attached and applied.
> >>
> >> Hmmm.  That got me thinking: consider ZFS and HDD with volatile cache.
> >> Do the characteristics of ZFS avoid this issue entirely?
> >
> > No, I don't think so.  ZFS only avoids partial page writes.  ZFS still
> > assumes something sent to the drive is permanent or it would have no way
> > to operate.
> >
> 
> ZFS is write-back cache aware, and safe provided the drive's
> cache flushing and write barrier related commands work.  It will
> flush data in 'transaction groups' and flush the drive write
> caches at the end of those transactions.  Since its copy on
> write, it can ensure that all the changes in the transaction
> group appear on disk, or all are lost.  This all works so long
> as the cache flush commands do.

Agreed, though I thought the problem was that SSDs lie about their
cache flush like SATA drives do, or is there something I am missing?

--
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +



Re: [PERFORM] SSD + RAID

2010-02-21 Thread Scott Carey
On Feb 20, 2010, at 3:19 PM, Bruce Momjian wrote:

> Dan Langille wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA1
>> 
>> Bruce Momjian wrote:
>>> Matthew Wakeling wrote:
 On Fri, 13 Nov 2009, Greg Smith wrote:
> In order for a drive to work reliably for database use such as for 
> PostgreSQL, it cannot have a volatile write cache.  You either need a 
> write 
> cache with a battery backup (and a UPS doesn't count), or to turn the 
> cache 
> off.  The SSD performance figures you've been looking at are with the 
> drive's 
> write cache turned on, which means they're completely fictitious and 
> exaggerated upwards for your purposes.  In the real world, that will 
> result 
> in database corruption after a crash one day.
 Seagate are claiming to be on the ball with this one.
 
 http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/
>>> 
>>> I have updated our documentation to mention that even SSD drives often
>>> have volatile write-back caches.  Patch attached and applied.
>> 
>> Hmmm.  That got me thinking: consider ZFS and HDD with volatile cache.
>> Do the characteristics of ZFS avoid this issue entirely?
> 
> No, I don't think so.  ZFS only avoids partial page writes.  ZFS still
> assumes something sent to the drive is permanent or it would have no way
> to operate.
> 

ZFS is write-back cache aware, and safe provided the drive's cache flushing and 
write barrier related commands work.  It will flush data in 'transaction 
groups' and flush the drive write caches at the end of those transactions.  
Since it's copy-on-write, it can ensure that all the changes in the transaction 
group appear on disk, or all are lost.  This all works so long as the cache 
flush commands do.






Re: [PERFORM] SSD + RAID

2010-02-20 Thread Bruce Momjian
Dan Langille wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Bruce Momjian wrote:
> > Matthew Wakeling wrote:
> >> On Fri, 13 Nov 2009, Greg Smith wrote:
> >>> In order for a drive to work reliably for database use such as for 
> >>> PostgreSQL, it cannot have a volatile write cache.  You either need a 
> >>> write 
> >>> cache with a battery backup (and a UPS doesn't count), or to turn the 
> >>> cache 
> >>> off.  The SSD performance figures you've been looking at are with the 
> >>> drive's 
> >>> write cache turned on, which means they're completely fictitious and 
> >>> exaggerated upwards for your purposes.  In the real world, that will 
> >>> result 
> >>> in database corruption after a crash one day.
> >> Seagate are claiming to be on the ball with this one.
> >>
> >> http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/
> > 
> > I have updated our documentation to mention that even SSD drives often
> > have volatile write-back caches.  Patch attached and applied.
> 
> Hmmm.  That got me thinking: consider ZFS and HDD with volatile cache.
> Do the characteristics of ZFS avoid this issue entirely?

No, I don't think so.  ZFS only avoids partial page writes.  ZFS still
assumes something sent to the drive is permanent or it would have no way
to operate.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +



Re: [PERFORM] SSD + RAID

2010-02-20 Thread Dan Langille
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Bruce Momjian wrote:
> Matthew Wakeling wrote:
>> On Fri, 13 Nov 2009, Greg Smith wrote:
>>> In order for a drive to work reliably for database use such as for 
>>> PostgreSQL, it cannot have a volatile write cache.  You either need a write 
>>> cache with a battery backup (and a UPS doesn't count), or to turn the cache 
>>> off.  The SSD performance figures you've been looking at are with the 
>>> drive's 
>>> write cache turned on, which means they're completely fictitious and 
>>> exaggerated upwards for your purposes.  In the real world, that will result 
>>> in database corruption after a crash one day.
>> Seagate are claiming to be on the ball with this one.
>>
>> http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/
> 
> I have updated our documentation to mention that even SSD drives often
> have volatile write-back caches.  Patch attached and applied.

Hmmm.  That got me thinking: consider ZFS and HDD with volatile cache.
Do the characteristics of ZFS avoid this issue entirely?

- --
Dan Langille

BSDCan - The Technical BSD Conference : http://www.bsdcan.org/
PGCon  - The PostgreSQL Conference: http://www.pgcon.org/



Re: [PERFORM] SSD + RAID

2010-02-20 Thread Bruce Momjian
Matthew Wakeling wrote:
> On Fri, 13 Nov 2009, Greg Smith wrote:
> > In order for a drive to work reliably for database use such as for 
> > PostgreSQL, it cannot have a volatile write cache.  You either need a write 
> > cache with a battery backup (and a UPS doesn't count), or to turn the cache 
> > off.  The SSD performance figures you've been looking at are with the 
> > drive's 
> > write cache turned on, which means they're completely fictitious and 
> > exaggerated upwards for your purposes.  In the real world, that will result 
> > in database corruption after a crash one day.
> 
> Seagate are claiming to be on the ball with this one.
> 
> http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/

I have updated our documentation to mention that even SSD drives often
have volatile write-back caches.  Patch attached and applied.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.61
diff -c -c -r1.61 wal.sgml
*** doc/src/sgml/wal.sgml	3 Feb 2010 17:25:06 -	1.61
--- doc/src/sgml/wal.sgml	20 Feb 2010 18:26:40 -
***************
*** 59,65 ****
 same concerns about data loss exist for write-back drive caches as
 exist for disk controller caches.  Consumer-grade IDE and SATA drives are
 particularly likely to have write-back caches that will not survive a
!power failure.  To check write caching on Linux use
 hdparm -I;  it is enabled if there is a * next
 to Write cache; hdparm -W to turn off
 write caching.  On FreeBSD use
--- 59,66 ----
 same concerns about data loss exist for write-back drive caches as
 exist for disk controller caches.  Consumer-grade IDE and SATA drives are
 particularly likely to have write-back caches that will not survive a
!power failure.  Many solid-state drives also have volatile write-back
!caches.  To check write caching on Linux use
 hdparm -I;  it is enabled if there is a * next
 to Write cache; hdparm -W to turn off
 write caching.  On FreeBSD use



Re: [PERFORM] SSD + RAID

2009-12-08 Thread Matthew Wakeling

On Fri, 13 Nov 2009, Greg Smith wrote:
In order for a drive to work reliably for database use such as for 
PostgreSQL, it cannot have a volatile write cache.  You either need a write 
cache with a battery backup (and a UPS doesn't count), or to turn the cache 
off.  The SSD performance figures you've been looking at are with the drive's 
write cache turned on, which means they're completely fictitious and 
exaggerated upwards for your purposes.  In the real world, that will result 
in database corruption after a crash one day.


Seagate are claiming to be on the ball with this one.

http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/

Matthew

--
The third years are wandering about all worried at the moment because they
have to hand in their final projects. Please be sympathetic to them, say
things like "ha-ha-ha", but in a sympathetic tone of voice 
   -- Computer Science Lecturer




Re: [PERFORM] SSD + RAID

2009-12-03 Thread Scott Carey

On 11/19/09 1:04 PM, "Greg Smith"  wrote:

> That won't help.  Once the checkpoint is done, the problem isn't just
> that the WAL segments are recycled.  The server isn't going to use them
> even if they were there.  The reason why you can erase/recycle them is
> that you're doing so *after* writing out a checkpoint record that says
> you don't have to ever look at them again.  What you'd actually have to
> do is hack the server code to insert that delay after every fsync--there
> are none that you can cheat on and not introduce a corruption
> possibility.  The whole WAL/recovery mechanism in PostgreSQL doesn't
> make a lot of assumptions about what the underlying disk has to actually
> do beyond the fsync requirement; the flip side to that robustness is
> that it's the one you can't ever violate safely.

Yeah, I guess it's not so easy.  Having the system "hold" one extra
checkpoint's worth of segments and then, during recovery, always replay that
previous one plus the current might work, but I don't know if that could
cause corruption.  I assume replaying a log twice won't, so replaying the N-1
checkpoint, then the current one, might work.  If so that would be a cool
feature -- so long as the N-2 checkpoint is no longer in the OS or I/O
hardware caches when checkpoint N completes, you're safe!  It's probably more
complicated though, especially with respect to things like MVCC on DDL
changes.

> Right.  It's not used like the write-cache on a regular hard drive,
> where they're buffering 8MB-32MB worth of writes just to keep seek
> overhead down.  It's there primarily to allow combining writes into
> large chunks, to better match the block size of the underlying SSD flash
> cells (128K).  Having enough space for two full cells allows spooling
> out the flash write to a whole block while continuing to buffer the next
> one.
> 
> This is why turning the cache off can tank performance so badly--you're
> going to be writing a whole 128K block no matter what if it's forced to
> disk without caching, even if it's just to write an 8K page to it.

As others mentioned, flash must erase a whole block at once, but it can
write sequentially to a block in much smaller chunks.  I believe that MLC
and SLC differ a bit here; SLC can write smaller subsections of the erase
block.

A little old but still very useful:
http://research.microsoft.com/apps/pubs/?id=63596
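A back-of-the-envelope sketch of that penalty, assuming the 128K erase-block
and 8K page sizes discussed in this thread:

```python
# Assumed sizes from the thread: 128K flash erase block, 8K database page.
ERASE_BLOCK = 128 * 1024
PAGE = 8 * 1024

# With the write cache off, a single 8K page write can force a program of
# a whole erase block, so small writes run at ~1/16 of streaming speed.
slowdown = ERASE_BLOCK // PAGE
print(slowdown)  # 16
```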

> That's only going to reach 1/16 of the usual write speed on single page
> writes.  And that's why you should also be concerned at whether
> disabling the write cache impacts the drive longevity, lots of small
> writes going out in small chunks is going to wear flash out much faster
> than if the drive is allowed to wait until it's got a full sized block
> to write every time.

This is still a concern, since even if the SLC cells are technically capable
of writing sequentially in smaller chunks, with the write cache off they may
not do so.  

> 
> The fact that the cache is so small is also why it's harder to catch the
> drive doing the wrong thing here.  The plug test is pretty sensitive to
> a problem when you've got megabytes worth of cached writes that are
> spooling to disk at spinning hard drive speeds.  The window for loss on
> a SSD with no seek overhead and only a moderate number of KB worth of
> cached data is much, much smaller.  Doesn't mean it's gone though.  It's
> a shame that the design wasn't improved just a little bit; a cheap
> capacitor and blocking new writes once the incoming power dropped is all
> it would take to make these much more reliable for database use.  But
> that would raise the price, and not really help anybody but the small
> subset of the market that cares about durable writes.

Yup.  There are manufacturers who claim no data loss on power failure,
hopefully these become more common.
http://www.wdc.com/en/products/ssd/technology.asp?id=1

I still contend it's a lot safer than a hard drive.  I have not seen one
fail yet (out of about 150 heavy-use drive-years on X25-Ms).  Any system
that does not have a battery-backed write cache will be faster and safer with
an SSD, with write cache on, than with hard drives with write cache on.

BBU caching is not fail-safe either, batteries wear out, cards die or
malfunction.
If you need the maximum data integrity, you will probably go with a
battery-backed cache raid setup with or without SSDs.  If you don't go that
route SSD's seem like the best option.  The 'middle ground' of software raid
with hard drives with their write caches off doesn't seem useful to me at
all.  I can't think of one use case that isn't better served by a slightly
cheaper array of disks with a hardware bbu card (if the data is important or
data size is large) OR a set of SSD's (if performance is more important than
data safety). 

>> 4: Yet another solution:  The drives DO adhere to write barriers properly.
>> A filesystem that used these in the process of fsync() would be fine too.
>> So XFS without LVM or MD (or the newer ve

Re: [PERFORM] SSD + RAID

2009-11-30 Thread Bruce Momjian
Ron Mayer wrote:
> Bruce Momjian wrote:
> > Greg Smith wrote:
> >> Bruce Momjian wrote:
> >>> I thought our only problem was testing the I/O subsystem --- I never
> >>> suspected the file system might lie too.  That email indicates that a
> >>> large percentage of our install base is running on unreliable file
> >>> systems --- why have I not heard about this before?
> >>>   
> >> The reason why it 
> >> doesn't bite more people is that most Linux systems don't turn on write 
> >> barrier support by default, and there's a number of situations that can 
> >> disable barriers even if you did try to enable them.  It's still pretty 
> >> unusual to have a working system with barriers turned on nowadays; I 
> >> really doubt it's "a large percentage of our install base".
> > 
> > Ah, so it is only when write barriers are enabled, and they are not
> > enabled by default --- OK, that makes sense.
> 
> The test program I linked up-thread shows that fsync does nothing
> unless the inode's touched on an out-of-the-box Ubuntu 9.10 using
> ext3 on a straight from Dell system.
> 
> Surely that's a common config, no?

Yea, this certainly suggests that the problem is wide-spread.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PERFORM] SSD + RAID

2009-11-30 Thread Ron Mayer
Bruce Momjian wrote:
> Greg Smith wrote:
>> Bruce Momjian wrote:
>>> I thought our only problem was testing the I/O subsystem --- I never
>>> suspected the file system might lie too.  That email indicates that a
>>> large percentage of our install base is running on unreliable file
>>> systems --- why have I not heard about this before?
>>>   
>> The reason why it 
>> doesn't bite more people is that most Linux systems don't turn on write 
>> barrier support by default, and there's a number of situations that can 
>> disable barriers even if you did try to enable them.  It's still pretty 
>> unusual to have a working system with barriers turned on nowadays; I 
>> really doubt it's "a large percentage of our install base".
> 
> Ah, so it is only when write barriers are enabled, and they are not
> enabled by default --- OK, that makes sense.

The test program I linked up-thread shows that fsync does nothing
unless the inode's touched on an out-of-the-box Ubuntu 9.10 using
ext3 on a straight from Dell system.

Surely that's a common config, no?

If I uncomment the fchmod lines below I can see that even with ext3
and write caches enabled on my drives it does indeed wait.
Note that EXT4 doesn't show the problem on the same system.

Here's a slightly modified test program that's a bit easier to run.
If you run the program and it exits right away, your system isn't
waiting for platters to spin.


/*
** based on http://article.gmane.org/gmane.linux.file-systems/21373
** http://thread.gmane.org/gmane.linux.kernel/646040
** If this program returns instantly, the fsync() lied.
** If it takes a second or so, fsync() probably works.
** On ext3 and drives that cache writes, you probably need
** to uncomment the fchmod's to make fsync work right.
*/
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc,char *argv[]) {
  if (argc<2) {
    printf("usage: fs <filename>\n");
    exit(1);
  }
  int fd = open (argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
  int i;
  for (i=0;i<100;i++) {
    char byte;
    pwrite (fd, &byte, 1, 0);
    // fchmod (fd, 0644); fchmod (fd, 0664);
    fsync (fd);
  }
  return 0;
}

r...@ron-desktop:/tmp$ /usr/bin/time ./a.out foo
0.00user 0.00system 0:00.01elapsed 21%CPU (0avgtext+0avgdata 0maxresident)k





Re: [PERFORM] SSD + RAID

2009-11-30 Thread Ron Mayer
Bruce Momjian wrote:
>> For example, ext3 fsync() will issue write barrier commands
>> if the inode was modified; but not if the inode wasn't.
>>
>> See test program here:
>> http://www.mail-archive.com/linux-ker...@vger.kernel.org/msg272253.html
>> and read two paragraphs further to see how touching
>> the inode makes ext3 fsync behave differently.
> 
> I thought our only problem was testing the I/O subsystem --- I never
> suspected the file system might lie too.  That email indicates that a
> large percentage of our install base is running on unreliable file
> systems --- why have I not heard about this before?  

It came up a on these lists a few times in the past.  Here's one example.
http://archives.postgresql.org/pgsql-performance/2008-08/msg00159.php

As far as I can tell, most of the threads ended with people still
suspecting lying hard drives.  But to the best of my ability I can't
find any drives that actually lie when sent the commands to flush
their caches.  But various combinations of ext3 & linux MD that
decide not to send IDE FLUSH_CACHE_EXT (nor the similiar
SCSI SYNCHRONIZE CACHE command) under various situations.

I wonder if there are enough ext3 users out there that postgres should
touch the inodes before doing an fsync.
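A minimal sketch of that workaround (hypothetical; it mirrors the fchmod
trick in the C test program earlier in this thread, and the extra chmod
calls are harmless metadata no-ops on filesystems that don't need them):

```python
import os, tempfile

def durable_write(path, data):
    """Write and fsync, dirtying the inode first so that ext3's fsync
    issues a real barrier (the fchmod trick from this thread; a no-op
    metadata change on other filesystems)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
    try:
        os.write(fd, data)
        os.fchmod(fd, 0o644)   # touch the inode...
        os.fchmod(fd, 0o664)   # ...and restore the mode
        os.fsync(fd)           # now flushes the drive cache on ext3
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "wal-test")
durable_write(path, b"commit record")
print(open(path, "rb").read().decode())  # prints "commit record"
```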

> Do the write barriers allow data loss but prevent data inconsistency?  

If I understand right, data inconsistency could occur too.  One
aspect of the write barriers is flushing a hard drive's caches.

> It sound like they are effectively running with synchronous_commit = off.

And with the (mythical?) hard drive with lying caches.





Re: [PERFORM] SSD + RAID

2009-11-30 Thread Bruce Momjian
Greg Smith wrote:
> Bruce Momjian wrote:
> > I thought our only problem was testing the I/O subsystem --- I never
> > suspected the file system might lie too.  That email indicates that a
> > large percentage of our install base is running on unreliable file
> > systems --- why have I not heard about this before?  Do the write
> > barriers allow data loss but prevent data inconsistency?  It sounds like
> > they are effectively running with synchronous_commit = off.
> >   
> You might occasionally catch me ranting here that Linux write barriers 
> are not a useful solution at all for PostgreSQL, and that you must turn 
> the disk write cache off rather than expect the barrier implementation 
> to do the right thing.  This sort of buginess is why.  The reason why it 
> doesn't bite more people is that most Linux systems don't turn on write 
> barrier support by default, and there's a number of situations that can 
> disable barriers even if you did try to enable them.  It's still pretty 
> unusual to have a working system with barriers turned on nowadays; I 
> really doubt it's "a large percentage of our install base".

Ah, so it is only when write barriers are enabled, and they are not
enabled by default --- OK, that makes sense.

> I've started keeping most of my notes about where ext3 is vulnerable to 
> issues in Wikipedia, specifically
> http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal ; I just 
> updated that section to point out the specific issue Ron pointed out.  
> Maybe we should point people toward that in the docs, I try to keep that 
> article correct.

Yes, good idea.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PERFORM] SSD + RAID

2009-11-29 Thread Greg Smith

Bruce Momjian wrote:

I thought our only problem was testing the I/O subsystem --- I never
suspected the file system might lie too.  That email indicates that a
large percentage of our install base is running on unreliable file
systems --- why have I not heard about this before?  Do the write
barriers allow data loss but prevent data inconsistency?  It sounds like
they are effectively running with synchronous_commit = off.
  
You might occasionally catch me ranting here that Linux write barriers 
are not a useful solution at all for PostgreSQL, and that you must turn 
the disk write cache off rather than expect the barrier implementation 
to do the right thing.  This sort of buginess is why.  The reason why it 
doesn't bite more people is that most Linux systems don't turn on write 
barrier support by default, and there's a number of situations that can 
disable barriers even if you did try to enable them.  It's still pretty 
unusual to have a working system with barriers turned on nowadays; I 
really doubt it's "a large percentage of our install base".


I've started keeping most of my notes about where ext3 is vulnerable to 
issues in Wikipedia, specifically
http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal ; I just 
updated that section to point out the specific issue Ron pointed out.  
Maybe we should point people toward that in the docs, I try to keep that 
article correct.


--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com




Re: [PERFORM] SSD + RAID

2009-11-29 Thread Bruce Momjian
Ron Mayer wrote:
> Bruce Momjian wrote:
> > Greg Smith wrote:
> >> A good test program that is a bit better at introducing and detecting 
> >> the write cache issue is described at 
> >> http://brad.livejournal.com/2116715.html
> > 
> > Wow, I had not seen that tool before.  I have added a link to it from
> > our documentation, and also added a mention of our src/tools/fsync test
> > tool to our docs.
> 
> One challenge with many of these test programs is that some
> filesystems (ext3 is one) will flush drive caches on fsync()
> *sometimes*, but not always.  If your test program happens to do
> a sequence of commands that makes an fsync() actually flush a
> disk's caches, it might mislead you if your actual application
> has a different series of system calls.
> 
> For example, ext3 fsync() will issue write barrier commands
> if the inode was modified; but not if the inode wasn't.
> 
> See test program here:
> http://www.mail-archive.com/linux-ker...@vger.kernel.org/msg272253.html
> and read two paragraphs further to see how touching
> the inode makes ext3 fsync behave differently.

I thought our only problem was testing the I/O subsystem --- I never
suspected the file system might lie too.  That email indicates that a
large percentage of our install base is running on unreliable file
systems --- why have I not heard about this before?  Do the write
barriers allow data loss but prevent data inconsistency?  It sounds like
they are effectively running with synchronous_commit = off.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PERFORM] SSD + RAID

2009-11-29 Thread Ron Mayer
Bruce Momjian wrote:
> Greg Smith wrote:
>> A good test program that is a bit better at introducing and detecting 
>> the write cache issue is described at 
>> http://brad.livejournal.com/2116715.html
> 
> Wow, I had not seen that tool before.  I have added a link to it from
> our documentation, and also added a mention of our src/tools/fsync test
> tool to our docs.

One challenge with many of these test programs is that some
filesystems (ext3 is one) will flush drive caches on fsync()
*sometimes*, but not always.  If your test program happens to do
a sequence of commands that makes an fsync() actually flush a
disk's caches, it might mislead you if your actual application
has a different series of system calls.

For example, ext3 fsync() will issue write barrier commands
if the inode was modified; but not if the inode wasn't.

See test program here:
http://www.mail-archive.com/linux-ker...@vger.kernel.org/msg272253.html
and read two paragraphs further to see how touching
the inode makes ext3 fsync behave differently.






Re: [PERFORM] SSD + RAID

2009-11-28 Thread Bruce Momjian
Greg Smith wrote:
> Merlin Moncure wrote:
> > I am right now talking to someone on postgresql irc who is measuring
> > 15k iops from x25-e and no data loss following power plug test.
> The funny thing about Murphy is that he doesn't visit when things are 
> quiet.  It's quite possible the window for data loss on the drive is 
> very small.  Maybe you only see it one out of 10 pulls with a very 
> aggressive database-oriented write test.  Whatever the odd conditions 
> are, you can be sure you'll see them when there's a bad outage in actual 
> production though.
> 
> A good test program that is a bit better at introducing and detecting 
> the write cache issue is described at 
> http://brad.livejournal.com/2116715.html

Wow, I had not seen that tool before.  I have added a link to it from
our documentation, and also added a mention of our src/tools/fsync test
tool to our docs.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.233
diff -c -c -r1.233 config.sgml
*** doc/src/sgml/config.sgml	13 Nov 2009 22:43:39 -	1.233
--- doc/src/sgml/config.sgml	28 Nov 2009 16:12:46 -
***************
*** 1432,1437 ****
--- 1432,1439 ----
  The default is the first method in the above list that is supported
  by the platform.
  The open_* options also use O_DIRECT if available.
+ The utility src/tools/fsync in the PostgreSQL source tree
+ can do performance testing of various fsync methods.
  This parameter can only be set in the postgresql.conf
  file or on the server command line.
 
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.59
diff -c -c -r1.59 wal.sgml
*** doc/src/sgml/wal.sgml	9 Apr 2009 16:20:50 -	1.59
--- doc/src/sgml/wal.sgml	28 Nov 2009 16:12:57 -
***************
*** 86,91 ****
--- 86,93 ----
 ensure data integrity.  Avoid disk controllers that have non-battery-backed
 write caches.  At the drive level, disable write-back caching if the
 drive cannot guarantee the data will be written before shutdown.
+    You can test for reliable I/O subsystem behavior using <ulink url="http://brad.livejournal.com/2116715.html">diskchecker.pl</ulink>.

  




Re: [PERFORM] SSD + RAID

2009-11-21 Thread Merlin Moncure
On Fri, Nov 20, 2009 at 7:27 PM, Greg Smith  wrote:
> Richard Neill wrote:
>>
>> The key issue for short,fast transactions seems to be
>> how fast an fdatasync() call can run, forcing the commit to disk, and
>> allowing the transaction to return to userspace.
>> Attached is a short C program which may be of use.
>
> Right.  I call this the "commit rate" of the storage, and on traditional
> spinning disks it's slightly below the rotation speed of the media (i.e.
> 7200RPM = 120 commits/second).    If you've got a battery-backed cache in
> front of standard disks, you can easily clear 10K commits/second.


...until you overflow the cache.  Battery-backed cache does not break
the laws of physics... it just provides a higher burst rate (plus
whatever advantages can be gained by peeking into the write queue and
re-arranging/grouping writes).  I learned the hard way that how your
RAID controller behaves in overflow situations can cause catastrophic
performance degradation...

merlin



Re: [PERFORM] SSD + RAID

2009-11-20 Thread Greg Smith

Richard Neill wrote:

The key issue for short,fast transactions seems to be
how fast an fdatasync() call can run, forcing the commit to disk, and
allowing the transaction to return to userspace.
Attached is a short C program which may be of use.
Right.  I call this the "commit rate" of the storage, and on traditional 
spinning disks it's slightly below the rotation speed of the media (i.e. 
7200RPM = 120 commits/second).If you've got a battery-backed cache 
in front of standard disks, you can easily clear 10K commits/second.
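The arithmetic behind that commit-rate ceiling is simply one synchronous
commit per platter revolution (a sketch that ignores command overhead and
write-cache effects):

```python
def max_commit_rate(rpm):
    """Rough upper bound on synchronous commits/second for one spinning
    disk with its write cache off: at best one commit per revolution."""
    return rpm / 60.0

print(max_commit_rate(7200))   # 120.0
print(max_commit_rate(15000))  # 250.0
```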


I normally test that out with sysbench, because I use that for some 
other tests anyway:


sysbench --test=fileio --file-fsync-freq=1 --file-num=1 
--file-total-size=16384 --file-test-mode=rndwr run | grep "Requests/sec"


--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com




Re: [PERFORM] SSD + RAID

2009-11-20 Thread Richard Neill

Axel Rau wrote:


On 13 Nov 2009, at 14:57, Laszlo Nagy wrote:

I was thinking about ARECA 1320 with 2GB memory + BBU. Unfortunately, 
I cannot find information about using ARECA cards with SSD drives.
They told me: currently not supported, but they have positive customer 
reports. No date yet for implementation of the TRIM command in firmware.

...
My other option is to buy two SLC SSD drives and use RAID1. It would 
cost about the same, but has less redundancy and less capacity. Which 
is the faster? 8-10 MLC disks in RAID 6 with a good caching 
controller, or two SLC disks in RAID1?


Despite my other problems, I've found that the Intel X25-Es work
remarkably well. The key issue for short,fast transactions seems to be
how fast an fdatasync() call can run, forcing the commit to disk, and
allowing the transaction to return to userspace.
With all the caches off, the intel X25-E beat a standard disk by a
factor of about 10.
Attached is a short C program which may be of use.


For what it's worth, we have actually got a pretty decent (and
redundant) setup using a RAIS array of RAID1.


[primary server]

SSD }
    }  RAID1  ---}  DRBD --- /var/lib/postgresql
SSD }            }
                 }
                 }  gigE
                 }
[secondary server]
                 }
SSD }            }
    }  RAID1  ---}
SSD }



The servers connect back-to-back with a dedicated Gigabit ethernet
cable, and DRBD is running in protocol B.

We can pull the power out of 1 server, and be using the next within 30
seconds, and with no dataloss.


Richard



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

#define NUM_ITER 1024

int main ( int argc, char **argv ) {
	const char data[] = "Liberate";
	size_t data_len = strlen ( data );
	const char *filename;
	int fd; 
	unsigned int i;

	if ( argc != 2 ) {
		fprintf ( stderr, "Syntax: %s output_file\n", argv[0] );
		exit ( 1 );
	}
	filename = argv[1];
	fd = open ( filename, ( O_WRONLY | O_CREAT | O_EXCL ), 0666 );
	if ( fd < 0 ) {
		fprintf ( stderr, "Could not create \"%s\": %s\n",
			  filename, strerror ( errno ) );
		exit ( 1 );
	}

	for ( i = 0 ; i < NUM_ITER ; i++ ) {
		if ( write ( fd, data, data_len ) != data_len ) {
			fprintf ( stderr, "Could not write: %s\n",
  strerror ( errno ) );
			exit ( 1 );
		}
		if ( fdatasync ( fd ) != 0 ) {
			fprintf ( stderr, "Could not fdatasync: %s\n",
  strerror ( errno ) );
			exit ( 1 );
		}
	}
	return 0;
}




Re: [PERFORM] SSD + RAID

2009-11-20 Thread Jeff Janes
On Wed, Nov 18, 2009 at 8:24 PM, Tom Lane  wrote:
> Scott Carey  writes:
>> For your database DATA disks, leaving the write cache on is 100% acceptable,
>> even with power loss, and without a RAID controller.  And even in high write
>> environments.
>
> Really?  How hard have you tested that configuration?
>
>> That is what the XLOG is for, isn't it?
>
> Once we have fsync'd a data change, we discard the relevant XLOG
> entries.  If the disk hasn't actually put the data on stable storage
> before it claims the fsync is done, you're screwed.
>
> XLOG only exists to centralize the writes that have to happen before
> a transaction can be reported committed (in particular, to avoid a
> lot of random-access writes at commit).  It doesn't make any
> fundamental change in the rules of the game: a disk that lies about
> write complete will still burn you.
>
> In a zero-seek-cost environment I suspect that XLOG wouldn't actually
> be all that useful.

You would still need it to guard against partial page writes, unless
we have some guarantee that those can't happen.

And once your transaction has scattered its transaction id into
various xmin and xmax over many tables, you need an atomic, durable
repository to decide if that id has or has not committed.  Maybe clog
fsynced on commit would serve this purpose?

Jeff



Re: [PERFORM] SSD + RAID

2009-11-20 Thread Matthew Wakeling

On Thu, 19 Nov 2009, Greg Smith wrote:
This is why turning the cache off can tank performance so badly--you're going 
to be writing a whole 128K block no matter what if it's forced to disk without 
caching, even if it's just to write an 8K page to it.


Theoretically, this does not need to be the case. Now, I don't know what 
the Intel drives actually do, but remember that for flash, it is the 
*erase* cycle that has to be done in large blocks. Writing itself can be 
done in small blocks, to previously erased sites.


The technology for combining small writes into sequential writes has been 
around for 17 years or so in 
http://portal.acm.org/citation.cfm?id=146943&dl= so there really isn't any 
excuse for modern flash drives not giving really fast small writes.


Matthew

--
for a in past present future; do
  for b in clients employers associates relatives neighbours pets; do
  echo "The opinions here in no way reflect the opinions of my $a $b."
done; done



Re: [PERFORM] SSD + RAID

2009-11-20 Thread Axel Rau


Am 13.11.2009 um 14:57 schrieb Laszlo Nagy:

I was thinking about ARECA 1320 with 2GB memory + BBU.  
Unfortunately, I cannot find information about using ARECA cards  
with SSD drives.
They told me: currently not supported, but they have positive customer  
reports. No date yet for implementation of the TRIM command in firmware.

...
My other option is to buy two SLC SSD drives and use RAID1. It would  
cost about the same, but has less redundancy and less capacity.  
Which is faster: 8-10 MLC disks in RAID 6 with a good caching  
controller, or two SLC disks in RAID1?

I just went the MLC path with X25-Ms mainly to save energy.
The freshly assembled box has one SSD for WAL and one RAID 0 with four  
SSDs as table space.
Everything runs smoothly on an Areca 1222 with BBU, which turned all  
write caches off.

OS is FreeBSD 8.0. I aligned all partitions on 1 MB boundaries.
Next week I will install 8.4.1 and run pgbench for pull-the-plug-testing.


I would like to get some advice from the list for testing the SSDs!

Axel
---
axel@chaos1.de  PGP-Key:29E99DD6  +49 151 2300 9283  computing @  
chaos claudius



Re: [PERFORM] SSD + RAID

2009-11-19 Thread Scott Marlowe
On Thu, Nov 19, 2009 at 2:39 PM, Merlin Moncure  wrote:
> On Thu, Nov 19, 2009 at 4:10 PM, Greg Smith  wrote:
>> You can use pgbench to either get interesting peak read results, or peak
>> write ones, but it's not real useful for things in between.  The standard
>> test basically turns into a huge stack of writes to a single table, and the
>> select-only one is interesting to gauge either cached or uncached read speed
>> (depending on the scale).  It's not very useful for getting a feel for how
>> something with a mixed read/write workload does though, which is unfortunate
>> because I think that scenario is much more common than what it does test.
>
> all true, but it's pretty easy to rig custom (-f) commands for
> virtually any test you want.

My primary use of pgbench is to exercise a machine as a part of
acceptance testing.  After using it to do power plug pulls, I run it
for a week or two to exercise the drive array and controller mainly.
If a machine runs smoothly for a week with a load factor of 20 or 30,
and the amount of updates that pgbench generates doesn't overwhelm it,
I'm pretty happy.



Re: [PERFORM] SSD + RAID

2009-11-19 Thread Merlin Moncure
On Thu, Nov 19, 2009 at 4:10 PM, Greg Smith  wrote:
> You can use pgbench to either get interesting peak read results, or peak
> write ones, but it's not real useful for things in between.  The standard
> test basically turns into a huge stack of writes to a single table, and the
> select-only one is interesting to gauge either cached or uncached read speed
> (depending on the scale).  It's not very useful for getting a feel for how
> something with a mixed read/write workload does though, which is unfortunate
> because I think that scenario is much more common than what it does test.

all true, but it's pretty easy to rig custom (-f) commands for
virtually any test you want.

merlin



Re: [PERFORM] SSD + RAID

2009-11-19 Thread Greg Smith

Scott Marlowe wrote:

On Thu, Nov 19, 2009 at 10:01 AM, Merlin Moncure  wrote:
  

pgbench is actually a pretty awesome i/o tester assuming you have big
enough scaling factor

Seeing as how pgbench only goes to a scaling factor of 4000, are there any
plans on enlarging that number?
  
I'm doing pgbench tests now on a system large enough for this limit to 
matter, so I'm probably going to have to fix that for 8.5 just to 
complete my own work.


You can use pgbench to either get interesting peak read results, or peak 
write ones, but it's not real useful for things in between.  The 
standard test basically turns into a huge stack of writes to a single 
table, and the select-only one is interesting to gauge either cached or 
uncached read speed (depending on the scale).  It's not very useful for 
getting a feel for how something with a mixed read/write workload does 
though, which is unfortunate because I think that scenario is much more 
common than what it does test.


--
Greg Smith   2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com




Re: [PERFORM] SSD + RAID

2009-11-19 Thread Greg Smith

Scott Carey wrote:

Have PG wait a half second (configurable) after the checkpoint fsync()
completes before deleting/ overwriting any WAL segments.  This would be a
trivial "feature" to add to a postgres release, I think.  Actually, it
already exists!  Turn on log archiving, and have the script that it runs after 
a checkpoint sleep().
  
That won't help.  Once the checkpoint is done, the problem isn't just 
that the WAL segments are recycled.  The server isn't going to use them 
even if they were there.  The reason why you can erase/recycle them is 
that you're doing so *after* writing out a checkpoint record that says 
you don't have to ever look at them again.  What you'd actually have to 
do is hack the server code to insert that delay after every fsync--there 
are none that you can cheat on and not introduce a corruption 
possibility.  The whole WAL/recovery mechanism in PostgreSQL doesn't 
make a lot of assumptions about what the underlying disk has to actually 
do beyond the fsync requirement; the flip side to that robustness is 
that it's the one you can't ever violate safely.

BTW, the information I have seen indicates that the write cache is 256K on
the Intel drives, the 32MB/64MB of other RAM is working memory for the drive
block mapping / wear leveling algorithms (tracking 160GB of 4k blocks takes
space).
  
Right.  It's not used like the write-cache on a regular hard drive, 
where they're buffering 8MB-32MB worth of writes just to keep seek 
overhead down.  It's there primarily to allow combining writes into 
large chunks, to better match the block size of the underlying SSD flash 
cells (128K).  Having enough space for two full cells allows spooling 
out the flash write to a whole block while continuing to buffer the next 
one.


This is why turning the cache off can tank performance so badly--you're 
going to be writing a whole 128K block no matter what if it's forced to 
disk without caching, even if it's just to write an 8K page to it.  
That's only going to reach 1/16 of the usual write speed on single page 
writes.  And that's why you should also be concerned at whether 
disabling the write cache impacts the drive longevity, lots of small 
writes going out in small chunks is going to wear flash out much faster 
than if the drive is allowed to wait until it's got a full sized block 
to write every time.


The fact that the cache is so small is also why it's harder to catch the 
drive doing the wrong thing here.  The plug test is pretty sensitive to 
a problem when you've got megabytes worth of cached writes that are 
spooling to disk at spinning hard drive speeds.  The window for loss on 
a SSD with no seek overhead and only a moderate number of KB worth of 
cached data is much, much smaller.  Doesn't mean it's gone though.  It's 
a shame that the design wasn't improved just a little bit; a cheap 
capacitor and blocking new writes once the incoming power dropped is all 
it would take to make these much more reliable for database use.  But 
that would raise the price, and not really help anybody but the small 
subset of the market that cares about durable writes.

4: Yet another solution:  The drives DO adhere to write barriers properly.
A filesystem that used these in the process of fsync() would be fine too.
So XFS without LVM or MD (or the newer versions of those that don't ignore
barriers) would work too.
  
If I really trusted anything beyond the very basics of the filesystem to 
really work well on Linux, this whole issue would be moot for most of 
the production deployments I do.  Ideally, fsync would just push out the 
minimum of what's needed, it would call the appropriate write cache 
flush mechanism the way the barrier implementation does when that all 
works, life would be good.  Alternately, you might even switch to using 
O_SYNC writes instead, which on a good filesystem implementation are 
both accelerated and safe compared to write/fsync (I've seen that work 
as expected on Veritas VxFS, for example). 

Meanwhile, in the actual world we live in, patches that make writes more 
durable by default are dropped by the Linux community because they tank 
performance for too many types of loads, I'm frightened to turn on 
O_SYNC at all on ext3 because of reports of corruption on the lists 
here, fsync does way more work than it needs to, and the way the 
filesystem and block drivers have been separated makes it difficult to 
do any sort of device write cache control from userland.  This is why I 
try to use the simplest, best tested approach out there whenever possible.


--
Greg Smith   2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com




Re: [PERFORM] SSD + RAID

2009-11-19 Thread Brad Nicholson
On Thu, 2009-11-19 at 19:01 +0100, Anton Rommerskirchen wrote:
> Am Donnerstag, 19. November 2009 13:29:56 schrieb Craig Ringer:
> > On 19/11/2009 12:22 PM, Scott Carey wrote:
> > > 3:  Have PG wait a half second (configurable) after the checkpoint
> > > fsync() completes before deleting/ overwriting any WAL segments.  This
> > > would be a trivial "feature" to add to a postgres release, I think.
> >
> > How does that help? It doesn't provide any guarantee that the data has
> > hit main storage - it could lurk in SSD cache for hours.
> >
> > > 4: Yet another solution:  The drives DO adhere to write barriers
> > > properly. A filesystem that used these in the process of fsync() would be
> > > fine too. So XFS without LVM or MD (or the newer versions of those that
> > > don't ignore barriers) would work too.
> >
> > *if* the WAL is also on the SSD.
> >
> > If the WAL is on a separate drive, the write barriers do you no good,
> > because they won't ensure that the data hits the main drive storage
> > before the WAL recycling hits the WAL disk storage. The two drives
> > operate independently and the write barriers don't interact.
> >
> > You'd need some kind of inter-drive write barrier.
> >
> > --
> > Craig Ringer
> 
> 
> Hello!
> 
> As I understand this:
> SSD performance is great, but caching is the problem.
> 
> Questions:
> 
> 1. What about conventional disks with 32/64 MB caches? How do they handle the 
> plug test if their caches are on?

If they aren't battery backed, they can lose data.  This is not specific
to SSD.

> 2. What about using a separate power supply for the disks? Is it possible to 
> write back the cache after switching the SATA drive to another machine/controller?

Not sure.  I only use devices with battery backed caches or no cache.  I
would be concerned however about the drive not flushing itself and still
running out of power.

> 3. What about making a statement about a lacking enterprise feature (i.e. an 
> emergency-battery-equipped SSD) and submitting this to the producers?

The producers aren't making Enterprise products, they are using caches
to accelerate the speeds of consumer products to make their drives more
appealing to consumers.  They aren't going to slow them down to make
them more reliable, especially when the core consumer doesn't know about
this issue, and is even less likely to understand it if explained.

They may stamp the word Enterprise on them, but it's nothing more than
marketing.

> I found that one of them (OCZ) seems to handle suggestions from customers (see 
> the write speed discussions on the Vertex, for example)
> 
> and another (Intel) seems to handle serious problems with its disks by 
> rewriting and sometimes redesigning its products - if you tell them and the 
> market dictates a reaction (see the performance degradation before the 1.11 
> firmware).
> 
> Perhaps it's time to act and not only complain about the fact.

Or, you could just buy higher quality equipment that was designed with
this in mind.

There is nothing unique to SSD here IMHO.  I wouldn't run my production
grade databases on consumer grade HDD, I wouldn't run them on consumer
grade SSD either.


-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.





Re: [PERFORM] SSD + RAID

2009-11-19 Thread Anton Rommerskirchen
Am Donnerstag, 19. November 2009 13:29:56 schrieb Craig Ringer:
> On 19/11/2009 12:22 PM, Scott Carey wrote:
> > 3:  Have PG wait a half second (configurable) after the checkpoint
> > fsync() completes before deleting/ overwriting any WAL segments.  This
> > would be a trivial "feature" to add to a postgres release, I think.
>
> How does that help? It doesn't provide any guarantee that the data has
> hit main storage - it could lurk in SSD cache for hours.
>
> > 4: Yet another solution:  The drives DO adhere to write barriers
> > properly. A filesystem that used these in the process of fsync() would be
> > fine too. So XFS without LVM or MD (or the newer versions of those that
> > don't ignore barriers) would work too.
>
> *if* the WAL is also on the SSD.
>
> If the WAL is on a separate drive, the write barriers do you no good,
> because they won't ensure that the data hits the main drive storage
> before the WAL recycling hits the WAL disk storage. The two drives
> operate independently and the write barriers don't interact.
>
> You'd need some kind of inter-drive write barrier.
>
> --
> Craig Ringer


Hello!

As I understand this:
SSD performance is great, but caching is the problem.

Questions:

1. What about conventional disks with 32/64 MB caches? How do they handle the 
plug test if their caches are on?

2. What about using a separate power supply for the disks? Is it possible to 
write back the cache after switching the SATA drive to another machine/controller?

3. What about making a statement about a lacking enterprise feature (i.e. an 
emergency-battery-equipped SSD) and submitting this to the producers?

I found that one of them (OCZ) seems to handle suggestions from customers (see 
the write speed discussions on the Vertex, for example)

and another (Intel) seems to handle serious problems with its disks by 
rewriting and sometimes redesigning its products - if you tell them and the 
market dictates a reaction (see the performance degradation before the 1.11 
firmware).

Perhaps it's time to act and not only complain about the fact.

(BTW: I found funny bonnie++ numbers for my Intel 160 GB Postville and my 
Samsung PB22 after using the Samsung for approx. 3 months now ... my 
conclusion: NOT all SSDs are equal ...)

best regards 

anton

-- 

ATRSoft GmbH
Bivetsweg 12
D 41542 Dormagen
Deutschland
Tel .: +49(0)2182 8339951
Mobil: +49(0)172 3490817

Geschäftsführer Anton Rommerskirchen

Köln HRB 44927
STNR 122/5701 - 2030
USTID DE213791450



Re: [PERFORM] SSD + RAID

2009-11-19 Thread Scott Marlowe
On Thu, Nov 19, 2009 at 10:01 AM, Merlin Moncure  wrote:
> On Wed, Nov 18, 2009 at 11:39 PM, Scott Carey  wrote:
>> Well, that is sort of true for all benchmarks, but I do find that bonnie++
>> is the worst of the bunch.  I consider it relatively useless compared to
>> fio.  It's just not a great benchmark for server-type load and I find it
>> lacking in the ability to simulate real applications.
>
> I agree.   My biggest gripe with bonnie actually is that 99% of the
> time is spent measuring in sequential tests which is not that
> important in the database world.  Dedicated wal volume uses ostensibly
> sequential io, but it's fairly difficult to outrun a dedicated wal
> volume even if it's on a vanilla sata drive.
>
> pgbench is actually a pretty awesome i/o tester assuming you have big
> enough scaling factor, because:
> a) it's much closer to the environment you will actually run in
> b) you get to see what effect i/o-affecting options have on the load
> c) you have broad array of options regarding what gets done (select
> only, -f, etc)
> d) once you build the test database, you can do multiple runs without
> rebuilding it

Seeing as how pgbench only goes to a scaling factor of 4000, are there any
plans on enlarging that number?



Re: [PERFORM] SSD + RAID

2009-11-19 Thread Merlin Moncure
On Wed, Nov 18, 2009 at 11:39 PM, Scott Carey  wrote:
> Well, that is sort of true for all benchmarks, but I do find that bonnie++
> is the worst of the bunch.  I consider it relatively useless compared to
> fio.  It's just not a great benchmark for server-type load and I find it
> lacking in the ability to simulate real applications.

I agree.   My biggest gripe with bonnie actually is that 99% of the
time is spent measuring in sequential tests which is not that
important in the database world.  Dedicated wal volume uses ostensibly
sequential io, but it's fairly difficult to outrun a dedicated wal
volume even if it's on a vanilla sata drive.

pgbench is actually a pretty awesome i/o tester assuming you have big
enough scaling factor, because:
a) it's much closer to the environment you will actually run in
b) you get to see what effect i/o-affecting options have on the load
c) you have broad array of options regarding what gets done (select
only, -f, etc)
d) once you build the test database, you can do multiple runs without
rebuilding it

merlin




Re: [PERFORM] SSD + RAID

2009-11-19 Thread Greg Smith

Scott Carey wrote:

Moral of the story:  Nothing is 100% safe, so sometimes a small bit of KNOWN
risk is perfectly fine.  There is always UNKNOWN risk.  If one risks losing
256K of cached data on an SSD if you're really unlucky with timing, how
dangerous is that versus the chance that the raid card or other hardware
barfs and takes out your whole WAL?
  
I think the point of the paranoia in this thread is that if you're 
introducing a component with a known risk in it, you're really asking 
for trouble because (as you point out) it's hard enough to keep a system 
running just through the unexpected ones that shouldn't have happened at 
all.  No need to make that even harder by introducing something that is 
*known* to fail under some conditions.


--
Greg Smith   2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com




Re: [PERFORM] SSD + RAID

2009-11-19 Thread Karl Denninger
Greg Smith wrote:
> Scott Carey wrote:
>> For your database DATA disks, leaving the write cache on is 100%
>> acceptable,
>> even with power loss, and without a RAID controller.  And even in
>> high write
>> environments.
>>
>> That is what the XLOG is for, isn't it?  That is where this behavior is
>> critical.  But that has completely different performance requirements
>> and
>> need not be on the same volume, array, or drive.
>>   
> At checkpoint time, writes to the main data files are done that are
> followed by fsync calls to make sure those blocks have been written to
> disk.  Those writes have exactly the same consistency requirements as
> the more frequent pg_xlog writes.  If the drive ACKs the write, but
> it's not on physical disk yet, it's possible for the checkpoint to
> finish and the underlying pg_xlog segments needed to recover from a
> crash at that point to be deleted.  The end of the checkpoint can wipe
> out many WAL segments, presuming they're not needed anymore because
> the data blocks they were intended to fix during recovery are now
> guaranteed to be on disk.
Guys, read that again.

IF THE DISK OR DRIVER ACK'S A FSYNC CALL THE WAL ENTRY IS LIKELY GONE,
AND YOU ARE SCREWED IF THE DATA IS NOT REALLY ON THE DISK.

-- Karl


Re: [PERFORM] SSD + RAID

2009-11-19 Thread Greg Smith

Scott Carey wrote:

For your database DATA disks, leaving the write cache on is 100% acceptable,
even with power loss, and without a RAID controller.  And even in high write
environments.

That is what the XLOG is for, isn't it?  That is where this behavior is
critical.  But that has completely different performance requirements and
need not be on the same volume, array, or drive.
  
At checkpoint time, writes to the main data files are done that are 
followed by fsync calls to make sure those blocks have been written to 
disk.  Those writes have exactly the same consistency requirements as 
the more frequent pg_xlog writes.  If the drive ACKs the write, but it's 
not on physical disk yet, it's possible for the checkpoint to finish and 
the underlying pg_xlog segments needed to recover from a crash at that 
point to be deleted.  The end of the checkpoint can wipe out many WAL 
segments, presuming they're not needed anymore because the data blocks 
they were intended to fix during recovery are now guaranteed to be on disk.


--
Greg Smith   2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com




Re: [PERFORM] SSD + RAID

2009-11-19 Thread Craig Ringer
On 19/11/2009 12:22 PM, Scott Carey wrote:

> 3:  Have PG wait a half second (configurable) after the checkpoint fsync()
> completes before deleting/ overwriting any WAL segments.  This would be a
> trivial "feature" to add to a postgres release, I think.

How does that help? It doesn't provide any guarantee that the data has
hit main storage - it could lurk in SSD cache for hours.

> 4: Yet another solution:  The drives DO adhere to write barriers properly.
> A filesystem that used these in the process of fsync() would be fine too.
> So XFS without LVM or MD (or the newer versions of those that don't ignore
> barriers) would work too.

*if* the WAL is also on the SSD.

If the WAL is on a separate drive, the write barriers do you no good,
because they won't ensure that the data hits the main drive storage
before the WAL recycling hits the WAL disk storage. The two drives
operate independently and the write barriers don't interact.

You'd need some kind of inter-drive write barrier.

--
Craig Ringer



Re: [PERFORM] SSD + RAID

2009-11-18 Thread Scott Carey

On 11/17/09 10:58 PM, "da...@lang.hm"  wrote:
> 
> keep in mind that bonnie++ isn't always going to reflect your real
> performance.
> 
> I have run tests on some workloads that were definitely I/O limited where
> bonnie++ results that differed by a factor of 10x made no measurable
> difference in the application performance, so I can easily believe in
> cases where bonnie++ numbers would not change but application performance
> could be drastically different.
> 

Well, that is sort of true for all benchmarks, but I do find that bonnie++
is the worst of the bunch.  I consider it relatively useless compared to
fio.  It's just not a great benchmark for server-type load and I find it
lacking in the ability to simulate real applications.


> as always it can depend heavily on your workload. you really do need to
> figure out how to get your hands on one for your own testing.
> 
> David Lang
> 
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
> 




Re: [PERFORM] SSD + RAID

2009-11-18 Thread Scott Carey

On 11/17/09 10:51 AM, "Greg Smith"  wrote:

> Merlin Moncure wrote:
>> I am right now talking to someone on postgresql irc who is measuring
>> 15k iops from x25-e and no data loss following power plug test.
> The funny thing about Murphy is that he doesn't visit when things are
> quiet.  It's quite possible the window for data loss on the drive is
> very small.  Maybe you only see it one out of 10 pulls with a very
> aggressive database-oriented write test.  Whatever the odd conditions
> are, you can be sure you'll see them when there's a bad outage in actual
> production though.

Yes, but there is nothing foolproof.  Murphy visited me recently, and the
RAID card with BBU cache that the WAL logs were on crapped out.  Data was
fine.

Had to fix up the system without any WAL logs.  Luckily, out of 10TB, only
200GB or so of it could have been in the process of being written to (yay!
partitioning by date!), and we could restore just that part rather than
initiating a full restore.
Then there was fun times in single user mode to fix corrupted system tables
(about half the system indexes were dead, and the statistics table was
corrupt, but that could be truncated safely).

Its all fine now with all data validated.

Moral of the story:  Nothing is 100% safe, so sometimes a small bit of KNOWN
risk is perfectly fine.  There is always UNKNOWN risk.  If one risks losing
256K of cached data on an SSD if you're really unlucky with timing, how
dangerous is that versus the chance that the raid card or other hardware
barfs and takes out your whole WAL?

Nothing is safe enough to avoid a full DR plan of action.  The individual
tradeoffs are very application and data dependent.


> 
> A good test program that is a bit better at introducing and detecting
> the write cache issue is described at
> http://brad.livejournal.com/2116715.html
> 
> --
> Greg Smith   2ndQuadrant   Baltimore, MD
> PostgreSQL Training, Services and Support
> g...@2ndquadrant.com  www.2ndQuadrant.com
> 
> 
> 




Re: [PERFORM] SSD + RAID

2009-11-18 Thread Tom Lane
Scott Carey  writes:
> For your database DATA disks, leaving the write cache on is 100% acceptable,
> even with power loss, and without a RAID controller.  And even in high write
> environments.

Really?  How hard have you tested that configuration?

> That is what the XLOG is for, isn't it?

Once we have fsync'd a data change, we discard the relevant XLOG
entries.  If the disk hasn't actually put the data on stable storage
before it claims the fsync is done, you're screwed.

XLOG only exists to centralize the writes that have to happen before
a transaction can be reported committed (in particular, to avoid a
lot of random-access writes at commit).  It doesn't make any
fundamental change in the rules of the game: a disk that lies about
write complete will still burn you.

In a zero-seek-cost environment I suspect that XLOG wouldn't actually
be all that useful.  I gather from what's been said earlier that SSDs
don't fully eliminate random-access penalties, though.

regards, tom lane



Re: [PERFORM] SSD + RAID

2009-11-18 Thread Scott Carey

On 11/15/09 12:46 AM, "Craig Ringer"  wrote:
> Possible fixes for this are:
> 
> - Don't let the drive lie about cache flush operations, ie disable write
> buffering.
> 
> - Give Pg some way to find out, from the drive, when particular write
> operations have actually hit disk. AFAIK there's no such mechanism at
> present, and I don't think the drives are even capable of reporting this
> data. If they were, Pg would have to be capable of applying entries from
> the WAL "sparsely" to account for the way the drive's write cache
> commits changes out-of-order, and Pg would have to maintain a map of
> committed / uncommitted WAL records. Pg would need another map of
> tablespace blocks to WAL records to know, when a drive write cache
> commit notice came in, what record in what WAL archive was affected.
> It'd also require Pg to keep WAL archives for unbounded and possibly
> long periods of time, making disk space management for WAL much harder.
> So - "not easy" is a bit of an understatement here.

3:  Have PG wait a half second (configurable) after the checkpoint fsync()
completes before deleting/ overwriting any WAL segments.  This would be a
trivial "feature" to add to a postgres release, I think.  Actually, it
already exists!

Turn on log archiving, and have the script that it runs after a checkpoint
sleep().

BTW, the information I have seen indicates that the write cache is 256K on
the Intel drives, the 32MB/64MB of other RAM is working memory for the drive
block mapping / wear leveling algorithms (tracking 160GB of 4k blocks takes
space).

4: Yet another solution:  The drives DO adhere to write barriers properly.
A filesystem that used these in the process of fsync() would be fine too.
So XFS without LVM or MD (or the newer versions of those that don't ignore
barriers) would work too.

So, I think it may not be necessary to turn off write caching for the
non-xlog disks.

> 
> You still need to turn off write caching.
> 
> --
> Craig Ringer
> 
> 


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-18 Thread Scott Carey



On 11/13/09 10:21 AM, "Karl Denninger"  wrote:

> 
> One caution for those thinking of doing this - the incremental
> improvement of this setup on PostGresql in WRITE SIGNIFICANT environment
> isn't NEARLY as impressive.  Indeed the performance in THAT case for
> many workloads may only be 20 or 30% faster than even "reasonably
> pedestrian" rotating media in a high-performance (lots of spindles and
> thus stripes) configuration and it's more expensive (by a lot.)  If you
> step up to the fast SAS drives on the rotating side there's little
> argument for the SSD at all (again, assuming you don't intend to "cheat"
> and risk data loss.)

For your database DATA disks, leaving the write cache on is 100% acceptable,
even with power loss, and without a RAID controller.  And even in high write
environments.

That is what the XLOG is for, isn't it?  That is where this behavior is
critical.  But that has completely different performance requirements and
need not be on the same volume, array, or drive.

> 
> Know your application and benchmark it.
> 
> -- Karl
> 




Re: [PERFORM] SSD + RAID

2009-11-18 Thread Kenny Gorman

I found a bit of time to play with this.

I started up a test with 20 concurrent processes all inserting into  
the same table and committing after each insert.  The db was achieving  
about 5000 inserts per second, and I kept it running for about 10  
minutes.  The host was doing about 5MB/s of Physical I/O to the Fusion  
IO drive. I set checkpoint segments very small (10).  I observed the  
following message in the log: checkpoints are occurring too frequently  
(16 seconds apart).  Then I pulled the cord.  On reboot I noticed that  
Fusion IO replayed its log, then the filesystem (vxfs) did the same.
Then I started up the DB and observed it perform auto-recovery:


Nov 18 14:33:53 frutestdb002 postgres[5667]: [6-1] 2009-11-18 14:33:53
PST LOG:  database system was not properly shut down; automatic
recovery in progress
Nov 18 14:33:53 frutestdb002 postgres[5667]: [7-1] 2009-11-18 14:33:53
PST LOG:  redo starts at 2A/55F9D478
Nov 18 14:33:54 frutestdb002 postgres[5667]: [8-1] 2009-11-18 14:33:54
PST LOG:  record with zero length at 2A/56692F38
Nov 18 14:33:54 frutestdb002 postgres[5667]: [9-1] 2009-11-18 14:33:54
PST LOG:  redo done at 2A/56692F08
Nov 18 14:33:54 frutestdb002 postgres[5667]: [10-1] 2009-11-18
14:33:54 PST LOG:  database system is ready


Thanks
Kenny

On Nov 13, 2009, at 1:35 PM, Kenny Gorman wrote:

The FusionIO products are a little different.  They are card based  
vs trying to emulate a traditional disk.  In terms of volatility,  
they have an on-board capacitor that allows power to be supplied  
until all writes drain.  They do not have a cache in front of them  
like a disk-type SSD might.   I don't sell these things, I am just a  
fan.  I verified all this with the Fusion IO techs before I  
replied.  Perhaps older versions didn't have this functionality?  I  
am not sure.  I have already done some cold power off tests w/o  
problems, but I could up the workload a bit and retest.  I will do a  
couple of 'pull the cable' tests on monday or tuesday and report  
back how it goes.


Re the performance #'s...  Here is my post:

http://www.kennygorman.com/wordpress/?p=398

-kg


>In order for a drive to work reliably for database use such as for
>PostgreSQL, it cannot have a volatile write cache.  You either need a
>write cache with a battery backup (and a UPS doesn't count), or to turn
>the cache off.  The SSD performance figures you've been looking at are
>with the drive's write cache turned on, which means they're completely
>fictitious and exaggerated upwards for your purposes.  In the real
>world, that will result in database corruption after a crash one day.
>No one on the drive benchmarking side of the industry seems to have
>picked up on this, so you can't use any of those figures.  I'm not even
>sure right now whether drives like Intel's will even meet their lifetime
>expectations if they aren't allowed to use their internal volatile write
>cache.
>
>Here's two links you should read and then reconsider your whole design:
>
>http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/
>http://petereisentraut.blogspot.com/2009/07/solid-state-drive-benchmarks-and-write.html
>
>I can't even imagine how bad the situation would be if you decide to
>wander down the "use a bunch of really cheap SSD drives" path; these
>things are barely usable for databases with Intel's hardware.  The needs
>of people who want to throw SSD in a laptop and those of the enterprise
>database market are really different, and if you believe doom
>forecasting like the comments at
>http://blogs.sun.com/BestPerf/entry/oracle_peoplesoft_payroll_sun_sparc
>that gap is widening, not shrinking.







Re: [PERFORM] SSD + RAID

2009-11-17 Thread david

On Wed, 18 Nov 2009, Greg Smith wrote:


Merlin Moncure wrote:
But what's up with the 400 iops measured from bonnie++? 
I don't know really.  SSD writes are really sensitive to block size and the 
ability to chunk writes into larger chunks, so it may be that Peter has just 
found the worst-case behavior and everybody else is seeing something better 
than that.


When the reports I get back from people I believe are competent--Vadim, 
Peter--show worst-case results that are lucky to beat RAID10, I feel I have 
to dismiss the higher values reported by people who haven't been so careful. 
And that's just about everybody else, which leaves me quite suspicious of the 
true value of the drives.  The whole thing really sets off my vendor hype 
reflex, and short of someone loaning me a drive to test I'm not sure how to 
get past that.  The Intel drives are still just a bit too expensive to buy 
one on a whim, such that I'll just toss it if the drive doesn't live up to 
expectations.


keep in mind that bonnie++ isn't always going to reflect your real 
performance.


I have run tests on some workloads that were definitely I/O limited where 
bonnie++ results that differed by a factor of 10x made no measurable 
difference in the application performance, so I can easily believe in 
cases where bonnie++ numbers would not change but application performance 
could be drastically different.


as always it can depend heavily on your workload. you really do need to 
figure out how to get your hands on one for your own testing.


David Lang



Re: [PERFORM] SSD + RAID

2009-11-17 Thread Greg Smith

Merlin Moncure wrote:
But what's up with the 400 iops measured from bonnie++?  
I don't know really.  SSD writes are really sensitive to block size and 
the ability to chunk writes into larger chunks, so it may be that Peter 
has just found the worst-case behavior and everybody else is seeing 
something better than that.


When the reports I get back from people I believe are competent--Vadim, 
Peter--show worst-case results that are lucky to beat RAID10, I feel I 
have to dismiss the higher values reported by people who haven't been so 
careful.  And that's just about everybody else, which leaves me quite 
suspicious of the true value of the drives.  The whole thing really sets 
off my vendor hype reflex, and short of someone loaning me a drive to 
test I'm not sure how to get past that.  The Intel drives are still just 
a bit too expensive to buy one on a whim, such that I'll just toss it if 
the drive doesn't live up to expectations.


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com




Re: [PERFORM] SSD + RAID

2009-11-17 Thread Mark Mielke

On 11/17/2009 01:51 PM, Greg Smith wrote:

Merlin Moncure wrote:

I am right now talking to someone on postgresql irc who is measuring
15k iops from x25-e and no data loss following power plug test.
The funny thing about Murphy is that he doesn't visit when things are 
quiet.  It's quite possible the window for data loss on the drive is 
very small.  Maybe you only see it one out of 10 pulls with a very 
aggressive database-oriented write test.  Whatever the odd conditions 
are, you can be sure you'll see them when there's a bad outage in 
actual production though.


A good test program that is a bit better at introducing and detecting 
the write cache issue is described at 
http://brad.livejournal.com/2116715.html




I've been following this thread with great interest in your results... 
Please continue to share...


For write cache issues - is it possible that the reduced power 
utilization of SSD allows for a capacitor to complete all scheduled 
writes, even with a large cache? Is it this particular drive you are 
suggesting that is known to be insufficient or is it really the 
technology or maturity of the technology?


Cheers,
mark

--
Mark Mielke




Re: [PERFORM] SSD + RAID

2009-11-17 Thread Merlin Moncure
On Tue, Nov 17, 2009 at 1:51 PM, Greg Smith  wrote:
> Merlin Moncure wrote:
>>
>> I am right now talking to someone on postgresql irc who is measuring
>> 15k iops from x25-e and no data loss following power plug test.
>
> The funny thing about Murphy is that he doesn't visit when things are quiet.
>  It's quite possible the window for data loss on the drive is very small.
>  Maybe you only see it one out of 10 pulls with a very aggressive
> database-oriented write test.  Whatever the odd conditions are, you can be
> sure you'll see them when there's a bad outage in actual production though.
>
> A good test program that is a bit better at introducing and detecting the
> write cache issue is described at http://brad.livejournal.com/2116715.html

Sure, not disputing that...I don't have one to test myself, so I can't
vouch for the data being safe.  But what's up with the 400 iops
measured from bonnie++?  That's an order of magnitude slower than any
other published benchmark on the 'net, and I'm dying to get a little
clarification here.

merlin



Re: [PERFORM] SSD + RAID

2009-11-17 Thread Greg Smith

Merlin Moncure wrote:

I am right now talking to someone on postgresql irc who is measuring
15k iops from x25-e and no data loss following power plug test.
The funny thing about Murphy is that he doesn't visit when things are 
quiet.  It's quite possible the window for data loss on the drive is 
very small.  Maybe you only see it one out of 10 pulls with a very 
aggressive database-oriented write test.  Whatever the odd conditions 
are, you can be sure you'll see them when there's a bad outage in actual 
production though.


A good test program that is a bit better at introducing and detecting 
the write cache issue is described at 
http://brad.livejournal.com/2116715.html
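
The idea behind that test program can be sketched in a few lines: write sequence-numbered blocks, fsync() after each one, record the highest block the drive acknowledged, then pull the plug and check that every acknowledged block actually survived. The tool at that link (diskchecker.pl) reports each acknowledgement to a second machine, since the local record dies with the power; this single-host sketch (function names are mine, not from the tool) just returns it:

```python
import os
import struct

BLOCK = 512

def write_acknowledged(path, count):
    """Write `count` sequence-numbered blocks, fsync() after each, and
    return the highest sequence number the drive acknowledged as durable.
    After a power pull, every block up to that number must be intact."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    acked = -1
    try:
        for seq in range(count):
            os.pwrite(fd, struct.pack("<I", seq) * (BLOCK // 4), seq * BLOCK)
            os.fsync(fd)   # the drive just promised this block is on stable storage
            acked = seq    # diskchecker.pl reports this over the network instead
    finally:
        os.close(fd)
    return acked

def verify(path, acked):
    """After rebooting, return the acknowledged blocks that did NOT survive.
    A non-empty result means the drive lied about fsync()."""
    lost = []
    with open(path, "rb") as f:
        for seq in range(acked + 1):
            if f.read(BLOCK) != struct.pack("<I", seq) * (BLOCK // 4):
                lost.append(seq)
    return lost
```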


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com




Re: [PERFORM] SSD + RAID

2009-11-17 Thread Peter Eisentraut
On tis, 2009-11-17 at 11:36 -0500, Merlin Moncure wrote:
> I am right now talking to someone on postgresql irc who is measuring
> 15k iops from x25-e and no data loss following power plug test.  I am
> becoming increasingly suspicious that peter's results are not
> representative: given that 90% of bonnie++ seeks are read only, the
> math doesn't add up, and they contradict broadly published tests on
> the internet.  Has anybody independently verified the results?

Notably, between my two blog posts and this email thread, there have
been claims of

400
1800
4000
7000
14000
15000
35000

IOPS (of some kind).

That alone should be cause of concern.




Re: [PERFORM] SSD + RAID

2009-11-17 Thread Scott Marlowe
On Tue, Nov 17, 2009 at 9:54 AM, Brad Nicholson
 wrote:
> On Tue, 2009-11-17 at 11:36 -0500, Merlin Moncure wrote:
>> 2009/11/13 Greg Smith :
>> > As far as what real-world apps have that profile, I like SSDs for small to
>> > medium web applications that have to be responsive, where the user shows up
>> > and wants their randomly distributed and uncached data with minimal 
>> > latency.
>> > SSDs can also be used effectively as second-tier targeted storage for 
>> > things
>> > that have a performance-critical but small and random bit as part of a
>> > larger design that doesn't have those characteristics; putting indexes on
>> > SSD can work out well for example (and there the write durability stuff
>> > isn't quite as critical, as you can always drop an index and rebuild if it
>> > gets corrupted).
>>
>> I am right now talking to someone on postgresql irc who is measuring
>> 15k iops from x25-e and no data loss following power plug test.  I am
>> becoming increasingly suspicious that peter's results are not
>> representative: given that 90% of bonnie++ seeks are read only, the
>> math doesn't add up, and they contradict broadly published tests on
>> the internet.  Has anybody independently verified the results?
>
> How many times have they run the plug test?  I've read other reports of
> people (not on Postgres) losing data on this drive with the write cache
> on.

When I run the plug test it's on a pgbench that's as big as possible
(~4000) and I remove memory if there's a lot in the server so the
memory is smaller than the db.  I run 100+ concurrent and I set
checkpoint timeout to 30 minutes, and make a lot of checkpoint
segments (100 or so), and set completion target to 0.  Then after
about 1/2 checkpoint timeout has passed, I issue a checkpoint from the
command line, take a deep breath and pull the cord.
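
In postgresql.conf terms, that recipe amounts to something like this (values illustrative; checkpoint_segments is the 8.x-era knob):

```
checkpoint_timeout = 30min           # then issue a manual CHECKPOINT ~15min in
checkpoint_segments = 100            # plenty of WAL between checkpoints
checkpoint_completion_target = 0.0   # push checkpoint writes out immediately
```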



Re: [PERFORM] SSD + RAID

2009-11-17 Thread Brad Nicholson
On Tue, 2009-11-17 at 11:36 -0500, Merlin Moncure wrote:
> 2009/11/13 Greg Smith :
> > As far as what real-world apps have that profile, I like SSDs for small to
> > medium web applications that have to be responsive, where the user shows up
> > and wants their randomly distributed and uncached data with minimal latency.
> > SSDs can also be used effectively as second-tier targeted storage for things
> > that have a performance-critical but small and random bit as part of a
> > larger design that doesn't have those characteristics; putting indexes on
> > SSD can work out well for example (and there the write durability stuff
> > isn't quite as critical, as you can always drop an index and rebuild if it
> > gets corrupted).
> 
> I am right now talking to someone on postgresql irc who is measuring
> 15k iops from x25-e and no data loss following power plug test.  I am
> becoming increasingly suspicious that peter's results are not
> representative: given that 90% of bonnie++ seeks are read only, the
> math doesn't add up, and they contradict broadly published tests on
> the internet.  Has anybody independently verified the results?

How many times have they run the plug test?  I've read other reports of
people (not on Postgres) losing data on this drive with the write cache
on.

-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.





Re: [PERFORM] SSD + RAID

2009-11-17 Thread Merlin Moncure
2009/11/13 Greg Smith :
> As far as what real-world apps have that profile, I like SSDs for small to
> medium web applications that have to be responsive, where the user shows up
> and wants their randomly distributed and uncached data with minimal latency.
> SSDs can also be used effectively as second-tier targeted storage for things
> that have a performance-critical but small and random bit as part of a
> larger design that doesn't have those characteristics; putting indexes on
> SSD can work out well for example (and there the write durability stuff
> isn't quite as critical, as you can always drop an index and rebuild if it
> gets corrupted).

I am right now talking to someone on postgresql irc who is measuring
15k iops from x25-e and no data loss following power plug test.  I am
becoming increasingly suspicious that peter's results are not
representative: given that 90% of bonnie++ seeks are read only, the
math doesn't add up, and they contradict broadly published tests on
the internet.  Has anybody independently verified the results?

merlin



Re: [PERFORM] SSD + RAID

2009-11-15 Thread Heikki Linnakangas
Craig James wrote:
> I've wondered whether this would work for a read-mostly application: Buy
> a big RAM machine, like 64GB, with a crappy little single disk.  Build
> the database, then make a really big RAM disk, big enough to hold the DB
> and the WAL.  Then build a duplicate DB on another machine with a decent
> disk (maybe a 4-disk RAID10), and turn on WAL logging.
> 
> The system would be blazingly fast, and you'd just have to be sure
> before you shut it off to shut down Postgres and copy the RAM files back
> to the regular disk.  And if you didn't, you could always recover from
> the backup.  Since it's a read-mostly system, the WAL logging bandwidth
> wouldn't be too high, so even a modest machine would be able to keep up.

Should work, but I don't see any advantage over attaching the RAID array
directly to the 1st machine with the RAM and turning synchronous_commit=off.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] SSD + RAID

2009-11-15 Thread Craig James

I've wondered whether this would work for a read-mostly application: Buy a big 
RAM machine, like 64GB, with a crappy little single disk.  Build the database, 
then make a really big RAM disk, big enough to hold the DB and the WAL.  Then 
build a duplicate DB on another machine with a decent disk (maybe a 4-disk 
RAID10), and turn on WAL logging.

The system would be blazingly fast, and you'd just have to be sure before you 
shut it off to shut down Postgres and copy the RAM files back to the regular 
disk.  And if you didn't, you could always recover from the backup.  Since it's 
a read-mostly system, the WAL logging bandwidth wouldn't be too high, so even a 
modest machine would be able to keep up.

Any thoughts?

Craig



Re: [PERFORM] SSD + RAID

2009-11-15 Thread Laszlo Nagy



- Pg doesn't know the erase block sizes or positions. It can't group
writes up by erase block except by hoping that, within a given file,
writing in page order will get the blocks to the disk in roughly
erase-block order. So your write caching isn't going to do anywhere near
as good a job as the SSD's can.
  

Okay, I see. We cannot query erase block size from an SSD drive. :-(

I don't think that any SSD drive has more than some
megabytes of write cache.



The big, lots-of-$$ ones have HUGE battery backed caches for exactly
this reason.
  

Heh, this is why they are so expensive. :-)

The same amount of write cache could easily be
implemented in OS memory, and then Pg would always know what hit the disk.



Really? How does Pg know what order the SSD writes things out from its
cache?
  
I got the point. We cannot implement an efficient write cache without 
much more knowledge about how that particular drive works.


So... the only solution that works well is to have much more RAM for 
read cache, and much more RAM for write cache inside the RAID controller 
(with BBU).


Thank you,

  Laszlo




Re: [PERFORM] SSD + RAID

2009-11-15 Thread Craig Ringer
On 15/11/2009 2:05 PM, Laszlo Nagy wrote:
> 
>> A change has been written to the WAL and fsync()'d, so Pg knows it's hit
>> disk. It can now safely apply the change to the tables themselves, and
>> does so, calling fsync() to tell the drive containing the tables to
>> commit those changes to disk.
>>
>> The drive lies, returning success for the fsync when it's just cached
>> the data in volatile memory. Pg carries on, shortly deleting the WAL
>> archive the changes were recorded in or recycling it and overwriting it
>> with new change data. The SSD is still merrily buffering data to write
>> cache, and hasn't got around to writing your particular change yet.
>>   
> All right. I believe you. In the current Pg implementation, I need to
> turn off the disk cache.

That's certainly my understanding. I've been wrong many times before :S

> #1. user wants to change something, resulting in a write_to_disk(data) call
> #2. data is written into the WAL and fsync()-ed
> #3. at this point the write_to_disk(data) call CAN RETURN, the user can
> continue his work (the WAL is already written, changes cannot be lost)
> #4. Pg can continue writing data onto the disk, and fsync() it.
> #5. Then WAL archive data can be deleted.
> 
> Now maybe I'm wrong, but between #3 and #5, the data to be written is
> kept in memory. This is basically a write cache, implemented in OS
> memory. We could really handle it like a write cache. E.g. everything
> would remain the same, except that we add some latency. We can wait some
> time after the last modification of a given block, and then write it out.

I don't know enough about the whole affair to give you a good
explanation ( I tried, and it just showed me how much I didn't know )
but here are a few issues:

- Pg doesn't know the erase block sizes or positions. It can't group
writes up by erase block except by hoping that, within a given file,
writing in page order will get the blocks to the disk in roughly
erase-block order. So your write caching isn't going to do anywhere near
as good a job as the SSD's can.

- The only way to make this help the SSD out much would be to use a LOT
of RAM for write cache and maintain a LOT of WAL archives. That's RAM
not being used for caching read data. The large number of WAL archives
means incredibly long WAL replay times after a crash.

- You still need a reliable way to tell the SSD "really flush your cache
now" after you've flushed the changes from your huge chunks of WAL files
and are getting ready to recycle them.

I was thinking that write ordering would be an issue too, as some
changes in the WAL would hit main disk before others that were earlier
in the WAL. However, I don't think that matters if full_page_writes are
on. If you replay from the start, you'll reapply some changes with older
versions, but they'll be corrected again by a later WAL record. So
ordering during WAL replay shouldn't be a problem. On the other hand,
the INCREDIBLY long WAL replay times during recovery would be a nightmare.
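
That replay-ordering argument can be made concrete with a toy model (entirely hypothetical, just to illustrate why full-page images make replay order-insensitive): each WAL record carries a whole page image, so a record may briefly reinstall an older page version during replay, but every later record for that page overwrites it, and the final state is the same as a clean replay.

```python
def replay(wal, pages):
    """Apply full-page-image WAL records in LSN order onto `pages`.
    Each record replaces the whole page, so replay is idempotent:
    a stale page reinstalled by an early record is overwritten by
    every later record touching the same page."""
    for lsn, page_no, image in sorted(wal):
        pages[page_no] = image
    return pages

# Three records: page 0 becomes A1, page 1 becomes B1, page 0 becomes A2.
wal = [(1, 0, "A1"), (2, 1, "B1"), (3, 0, "A2")]

# Crash scenario: the drive's cache persisted record 3's page but not
# record 2's -- a "hole": page 0 is already new, page 1 is still old.
after_crash = {0: "A2", 1: "B0"}
recovered = replay(wal, dict(after_crash))
```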

> I don't think that any SSD drive has more than some
> megabytes of write cache.

The big, lots-of-$$ ones have HUGE battery backed caches for exactly
this reason.

> The same amount of write cache could easily be
> implemented in OS memory, and then Pg would always know what hit the disk.

Really? How does Pg know what order the SSD writes things out from its
cache?

--
Craig Ringer



Re: [PERFORM] SSD + RAID

2009-11-15 Thread Laszlo Nagy



A change has been written to the WAL and fsync()'d, so Pg knows it's hit
disk. It can now safely apply the change to the tables themselves, and
does so, calling fsync() to tell the drive containing the tables to
commit those changes to disk.

The drive lies, returning success for the fsync when it's just cached
the data in volatile memory. Pg carries on, shortly deleting the WAL
archive the changes were recorded in or recycling it and overwriting it
with new change data. The SSD is still merrily buffering data to write
cache, and hasn't got around to writing your particular change yet.
  
All right. I believe you. In the current Pg implementation, I need to 
turn off the disk cache.


But I would like to ask some theoretical questions. It is just an 
idea from me, and probably I'm wrong.

Here is a scenario:

#1. user wants to change something, resulting in a write_to_disk(data) call
#2. data is written into the WAL and fsync()-ed
#3. at this point the write_to_disk(data) call CAN RETURN, the user can 
continue his work (the WAL is already written, changes cannot be lost)

#4. Pg can continue writing data onto the disk, and fsync() it.
#5. Then WAL archive data can be deleted.

Now maybe I'm wrong, but between #3 and #5, the data to be written is 
kept in memory. This is basically a write cache, implemented in OS 
memory. We could really handle it like a write cache. E.g. everything 
would remain the same, except that we add some latency. We can wait some 
time after the last modification of a given block, and then write it out.


Is it possible to do? If so, then we can turn off the write cache for 
all drives, except the one holding the WAL. And still write speed would 
remain the same. I don't think that any SSD drive has more than some 
megabytes of write cache. The same amount of write cache could easily be 
implemented in OS memory, and then Pg would always know what hit the disk.
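
The proposal sketched in the last two paragraphs -- buffer dirty blocks in application memory, flush a block only once it has been quiet for a while, and then know exactly what is on disk -- could look roughly like this (a thought experiment with invented names, not how Pg actually works):

```python
import time

class DelayedWriteCache:
    """Toy model of the idea: hold dirty blocks in RAM and write a block
    out only after `delay` seconds have passed since its last modification.
    `on_disk` then reflects exactly what is known to be on stable storage."""

    def __init__(self, delay=0.5):
        self.delay = delay
        self.dirty = {}    # block_no -> (data, time of last modification)
        self.on_disk = {}  # blocks known to have been flushed

    def write(self, block_no, data, now=None):
        # Repeated writes to the same block just reset its quiet timer,
        # which is where the coalescing benefit comes from.
        stamp = now if now is not None else time.monotonic()
        self.dirty[block_no] = (data, stamp)

    def flush_quiet_blocks(self, now=None):
        """Flush blocks untouched for `delay` seconds; return their numbers.
        A real implementation would pwrite() and fsync() here."""
        now = now if now is not None else time.monotonic()
        flushed = []
        for block_no, (data, mtime) in list(self.dirty.items()):
            if now - mtime >= self.delay:
                self.on_disk[block_no] = data
                del self.dirty[block_no]
                flushed.append(block_no)
        return flushed
```

The `now` parameter exists only so the behavior can be exercised deterministically; as the thread goes on to point out, the scheme still can't reorder writes by erase block the way the SSD's own cache can.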


Thanks,

  Laci




Re: [PERFORM] SSD + RAID

2009-11-15 Thread Craig Ringer
On 15/11/2009 11:57 AM, Laszlo Nagy wrote:

> Ok, I'm getting confused here. There is the WAL, which is written
> sequentially. If the WAL is not corrupted, then it can be replayed on
> next database startup. Please somebody enlighten me! In my mind, fsync
> is only needed for the WAL. If I could configure postgresql to put the
> WAL on a real hard drive that has BBU and write cache, then I cannot
> lose data. Meanwhile, product table data could be placed on the SSD
> drive, and I should be able to turn on the write cache safely. Am I wrong?

A change has been written to the WAL and fsync()'d, so Pg knows it's hit
disk. It can now safely apply the change to the tables themselves, and
does so, calling fsync() to tell the drive containing the tables to
commit those changes to disk.

The drive lies, returning success for the fsync when it's just cached
the data in volatile memory. Pg carries on, shortly deleting the WAL
archive the changes were recorded in or recycling it and overwriting it
with new change data. The SSD is still merrily buffering data to write
cache, and hasn't got around to writing your particular change yet.

The machine loses power.

Oops! A hole just appeared in history. A WAL replay won't re-apply the
changes that the database guaranteed had hit disk, but the changes never
made it onto the main database storage.

Possible fixes for this are:

- Don't let the drive lie about cache flush operations, ie disable write
buffering.

- Give Pg some way to find out, from the drive, when particular write
operations have actually hit disk. AFAIK there's no such mechanism at
present, and I don't think the drives are even capable of reporting this
data. If they were, Pg would have to be capable of applying entries from
the WAL "sparsely" to account for the way the drive's write cache
commits changes out-of-order, and Pg would have to maintain a map of
committed / uncommitted WAL records. Pg would need another map of
tablespace blocks to WAL records to know, when a drive write cache
commit notice came in, what record in what WAL archive was affected.
It'd also require Pg to keep WAL archives for unbounded and possibly
long periods of time, making disk space management for WAL much harder.
So - "not easy" is a bit of an understatement here.

You still need to turn off write caching.

--
Craig Ringer




Re: [PERFORM] SSD + RAID

2009-11-14 Thread Laszlo Nagy




   * I could buy two X25-E drives and have 32GB disk space, and some
 redundancy. This would cost about $1600, not counting the RAID
 controller. It is on the edge.
This was the solution I went with (4 drives in a raid 10 actually). 
Not a cheap solution, but the performance is amazing.


I've came across this article:

http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/ 



It's from a Linux MySQL user so it's a bit confusing but it looks like 
he has some reservations about performance vs reliability of the Intel 
drives - apparently they have their own write cache and when it's 
disabled performance drops sharply.
Ok, I'm getting confused here. There is the WAL, which is written 
sequentially. If the WAL is not corrupted, then it can be replayed on 
next database startup. Please somebody enlighten me! In my mind, fsync 
is only needed for the WAL. If I could configure postgresql to put the 
WAL on a real hard drive that has BBU and write cache, then I cannot 
lose data. Meanwhile, product table data could be placed on the SSD 
drive, and I should be able to turn on the write cache safely. Am I wrong?


 L




Re: [PERFORM] SSD + RAID

2009-11-14 Thread Laszlo Nagy

Robert Haas wrote:

2009/11/14 Laszlo Nagy :
  

32GB is for one table only. This server runs other applications, and you
need to leave space for sort memory, shared buffers etc. Buying 128GB memory
would solve the problem, maybe... but it is too expensive. And it is not
safe. Power out -> data loss.

I'm sorry, I thought he was talking about keeping the database in memory 
with fsync=off. Now I see he was only talking about the OS disk cache.


My server has 24GB RAM, and I cannot easily expand it unless I throw out 
some 2GB modules and buy more 4GB or 8GB modules. But... buying 4x8GB 
ECC RAM (+throwing out 4x2GB RAM) is a lot more expensive than buying 
some 64GB SSD drives. 95% of the table in question is never modified, 
only read (mostly with index scans); only 5% is actively updated.


This is why I think using an SSD would be effective in my case.

Sorry for the confusion.

 L




Re: [PERFORM] SSD + RAID

2009-11-14 Thread Merlin Moncure
On Sat, Nov 14, 2009 at 8:47 AM, Heikki Linnakangas
 wrote:
> Merlin Moncure wrote:
>> On Sat, Nov 14, 2009 at 6:17 AM, Heikki Linnakangas
>>  wrote:
 lots of ram doesn't help you if:
 *) your database gets written to a lot and you have high performance
 requirements
>>> When all the (hot) data is cached, all writes are sequential writes to
>>> the WAL, with the occasional flushing of the data pages at checkpoint.
>>> The sequential write bandwidth of SSDs and HDDs is roughly the same.
>>>
>>> I presume the fsync latency is a lot higher with HDDs, so if you're
>>> running a lot of small write transactions, and don't want to risk losing
>>> any recently committed transactions by setting synchronous_commit=off,
>>> the usual solution is to get a RAID controller with a battery-backed up
>>> cache. With a BBU cache, the fsync latency should be in the same
>>> ballpark as with SSDs.
>>
>> BBU raid controllers might only give better burst performance.  If you
>> are writing data randomly all over the volume, the cache will overflow
>> and performance will degrade.
>
> We're discussing a scenario where all the data fits in RAM. That's what
> the large amount of RAM is for. The only thing that's being written to
> disk is the WAL, which is sequential, and the occasional flush of data
> pages from the buffer cache at checkpoints, which doesn't happen often
> and will be spread over a period of time.

We are basically in agreement, but regardless of the effectiveness of
your WAL implementation, RAID controller, etc., if you have to write
data to what approximates random locations on a disk-based volume in a
sustained manner, you must eventually degrade to whatever the drive
can handle, plus whatever efficiency the checkpointer and OS can gain
by grouping writes together.  Extra RAM mainly helps because it can
shave precious IOPS off the read side so you can use them for writing.

merlin



Re: [PERFORM] SSD + RAID

2009-11-14 Thread Robert Haas
2009/11/14 Laszlo Nagy :
> 32GB is for one table only. This server runs other applications, and you
> need to leave space for sort memory, shared buffers etc. Buying 128GB memory
> would solve the problem, maybe... but it is too expensive. And it is not
> safe. Power out -> data loss.

Huh?

...Robert



Re: [PERFORM] SSD + RAID

2009-11-14 Thread Laszlo Nagy

Heikki Linnakangas wrote:
> Laszlo Nagy wrote:
>>    * I need at least 32GB disk space. So DRAM based SSD is not a real
>>      option. I would have to buy 8x4GB memory, costs a fortune. And
>>      then it would still not have redundancy.
>
> At 32GB database size, I'd seriously consider just buying a server with
> a regular hard drive or a small RAID array for redundancy, and stuffing
> 16 or 32 GB of RAM into it to ensure everything is cached. That's tried
> and tested technology.

32GB is for one table only. This server runs other applications, and you 
need to leave space for sort memory, shared buffers etc. Buying 128GB 
memory would solve the problem, maybe... but it is too expensive. And it 
is not safe. Power out -> data loss.

> I don't know how you came to the 32 GB figure, but keep in mind that
> administration is a lot easier if you have plenty of extra disk space
> for things like backups, dumps+restore, temporary files, upgrades etc.

This disk space would be dedicated to a smaller tablespace, holding one 
or two bigger tables with index scans. Of course I would never use an 
SSD disk for storing database backups. That would be a waste of money.
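The tablespace split described above is plain SQL in PostgreSQL. A sketch, with a hypothetical mount point, and using the "product" table mentioned earlier in the thread as the example:

```sql
-- The directory must exist and be owned by the postgres user
CREATE TABLESPACE ssd LOCATION '/mnt/ssd/pgdata';

-- Move the hot, mostly-read table onto the SSD tablespace
ALTER TABLE product SET TABLESPACE ssd;
```

Indexes can be moved the same way with ALTER INDEX ... SET TABLESPACE, which keeps the random-read-heavy objects on the SSD while everything else stays on conventional storage.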



 L




Re: [PERFORM] SSD + RAID

2009-11-14 Thread Heikki Linnakangas
Merlin Moncure wrote:
> On Sat, Nov 14, 2009 at 6:17 AM, Heikki Linnakangas
>  wrote:
>>> lots of ram doesn't help you if:
>>> *) your database gets written to a lot and you have high performance
>>> requirements
>> When all the (hot) data is cached, all writes are sequential writes to
>> the WAL, with the occasional flushing of the data pages at checkpoint.
>> The sequential write bandwidth of SSDs and HDDs is roughly the same.
>>
>> I presume the fsync latency is a lot higher with HDDs, so if you're
>> running a lot of small write transactions, and don't want to risk losing
>> any recently committed transactions by setting synchronous_commit=off,
>> the usual solution is to get a RAID controller with a battery-backed up
>> cache. With a BBU cache, the fsync latency should be in the same
> ballpark as with SSDs.
> 
> BBU raid controllers might only give better burst performance.  If you
> are writing data randomly all over the volume, the cache will overflow
> and performance will degrade.

We're discussing a scenario where all the data fits in RAM. That's what
the large amount of RAM is for. The only thing that's being written to
disk is the WAL, which is sequential, and the occasional flush of data
pages from the buffer cache at checkpoints, which doesn't happen often
and will be spread over a period of time.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] SSD + RAID

2009-11-14 Thread Merlin Moncure
On Sat, Nov 14, 2009 at 6:17 AM, Heikki Linnakangas
 wrote:
>> lots of ram doesn't help you if:
>> *) your database gets written to a lot and you have high performance
>> requirements
>
> When all the (hot) data is cached, all writes are sequential writes to
> the WAL, with the occasional flushing of the data pages at checkpoint.
> The sequential write bandwidth of SSDs and HDDs is roughly the same.
>
> I presume the fsync latency is a lot higher with HDDs, so if you're
> running a lot of small write transactions, and don't want to risk losing
> any recently committed transactions by setting synchronous_commit=off,
> the usual solution is to get a RAID controller with a battery-backed up
> cache. With a BBU cache, the fsync latency should be in the same
> ballpark as with SSDs.

BBU raid controllers might only give better burst performance.  If you
are writing data randomly all over the volume, the cache will overflow
and performance will degrade.  Raid controllers degrade in different
fashions, at least one (perc 5) halted ALL access to the volume and
spun out the cache (a bug, IMO).

>> *) your data is important
>
> Huh? The data is safely on the hard disk in case of a crash. The RAM is
> just for caching.

I was alluding to not being able to lose any transactions... in that
case you can only fsync synchronously.  You are then bound by the write
capabilities of the volume; RAM only buffers reads.

merlin



Re: [PERFORM] SSD + RAID

2009-11-14 Thread Heikki Linnakangas
Merlin Moncure wrote:
> 2009/11/13 Heikki Linnakangas :
>> Laszlo Nagy wrote:
>>>* I need at least 32GB disk space. So DRAM based SSD is not a real
>>>  option. I would have to buy 8x4GB memory, costs a fortune. And
>>>  then it would still not have redundancy.
>> At 32GB database size, I'd seriously consider just buying a server with
>> a regular hard drive or a small RAID array for redundancy, and stuffing
>> 16 or 32 GB of RAM into it to ensure everything is cached. That's tried
>> and tested technology.
> 
> lots of ram doesn't help you if:
> *) your database gets written to a lot and you have high performance
> requirements

When all the (hot) data is cached, all writes are sequential writes to
the WAL, with the occasional flushing of the data pages at checkpoint.
The sequential write bandwidth of SSDs and HDDs is roughly the same.

I presume the fsync latency is a lot higher with HDDs, so if you're
running a lot of small write transactions, and don't want to risk losing
any recently committed transactions by setting synchronous_commit=off,
the usual solution is to get a RAID controller with a battery-backed up
cache. With a BBU cache, the fsync latency should be in the same
ballpark as with SSDs.

> *) your data is important

Huh? The data is safely on the hard disk in case of a crash. The RAM is
just for caching.
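The fsync latency difference under discussion is easy to check empirically. Below is a rough, hypothetical sketch - not a replacement for purpose-built benchmark tools - that times repeated small write+fsync cycles, which is roughly what the WAL costs per commit:

```python
import os
import tempfile
import time


def fsync_latency(path, iters=50, blocksize=8192):
    """Average the cost of a small write followed by fsync, which is
    roughly what each synchronous WAL commit costs PostgreSQL."""
    buf = b"\0" * blocksize
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(iters):
            os.write(fd, buf)
            os.fsync(fd)  # force the drive (or BBU cache) to acknowledge
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / iters


if __name__ == "__main__":
    # Point this at the filesystem you want to test; /tmp may be
    # memory-backed on some systems, which would give meaningless numbers.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
    try:
        print("avg write+fsync: %.3f ms" % (fsync_latency(path) * 1000.0))
    finally:
        os.unlink(path)
```

On a bare 7200 rpm disk the result should be on the order of the rotational period (several ms); behind a BBU cache, or an SSD with its volatile write cache enabled, it will be far lower - which is exactly the trade-off being discussed.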

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] SSD + RAID

2009-11-14 Thread Ivan Voras

Lists wrote:
> Laszlo Nagy wrote:
>> Hello,
>>
>> I'm about to buy SSD drive(s) for a database. For decision making, I
>> used this tech report:
>>
>> http://techreport.com/articles.x/16255/9
>> http://techreport.com/articles.x/16255/10
>>
>> Here are my concerns:
>>
>>    * I need at least 32GB disk space. So DRAM based SSD is not a real
>>      option. I would have to buy 8x4GB memory, costs a fortune. And
>>      then it would still not have redundancy.
>>    * I could buy two X25-E drives and have 32GB disk space, and some
>>      redundancy. This would cost about $1600, not counting the RAID
>>      controller. It is on the edge.
>
> This was the solution I went with (4 drives in a raid 10 actually). Not
> a cheap solution, but the performance is amazing.

I came across this article:

http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/

It's from a Linux MySQL user so it's a bit confusing, but it looks like 
he has some reservations about performance vs. reliability of the Intel 
drives - apparently they have their own write cache, and when it's 
disabled performance drops sharply.





Re: [PERFORM] SSD + RAID

2009-11-14 Thread Lists

Laszlo Nagy wrote:
> Hello,
>
> I'm about to buy SSD drive(s) for a database. For decision making, I
> used this tech report:
>
> http://techreport.com/articles.x/16255/9
> http://techreport.com/articles.x/16255/10
>
> Here are my concerns:
>
>    * I need at least 32GB disk space. So DRAM based SSD is not a real
>      option. I would have to buy 8x4GB memory, costs a fortune. And
>      then it would still not have redundancy.
>    * I could buy two X25-E drives and have 32GB disk space, and some
>      redundancy. This would cost about $1600, not counting the RAID
>      controller. It is on the edge.

This was the solution I went with (4 drives in a raid 10 actually). Not 
a cheap solution, but the performance is amazing.

>    * I could also buy many cheaper MLC SSD drives. They cost about
>      $140. So even with 10 drives, I'm at $1400. I could put them in
>      RAID6, have much more disk space (256GB), high redundancy and
>      POSSIBLY good read/write speed. Of course then I need to buy a
>      good RAID controller.
>
> My question is about the last option. Are there any good RAID cards
> that are optimized (or can be optimized) for SSD drives? Do any of you
> have experience in using many cheaper SSD drives? Is it a bad idea?
>
> Thank you,
>
>    Laszlo







Re: [PERFORM] SSD + RAID

2009-11-13 Thread Kenny Gorman
The FusionIO products are a little different.  They are card-based, rather than 
trying to emulate a traditional disk.  In terms of volatility, they have an on-board 
capacitor that supplies power until all writes drain.  They do not 
have a cache in front of them like a disk-type SSD might.  I don't sell these 
things, I am just a fan.  I verified all this with the Fusion IO techs before I 
replied.  Perhaps older versions didn't have this functionality?  I am not 
sure.  I have already done some cold power-off tests without problems, but I could 
up the workload a bit and retest.  I will do a couple of 'pull the cable' tests 
on Monday or Tuesday and report back how it goes.

Re the performance #'s...  Here is my post:

http://www.kennygorman.com/wordpress/?p=398

-kg

 
>In order for a drive to work reliably for database use such as for 
>PostgreSQL, it cannot have a volatile write cache.  You either need a 
>write cache with a battery backup (and a UPS doesn't count), or to turn 
>the cache off.  The SSD performance figures you've been looking at are 
>with the drive's write cache turned on, which means they're completely 
>fictitious and exaggerated upwards for your purposes.  In the real 
>world, that will result in database corruption after a crash one day.  
>No one on the drive benchmarking side of the industry seems to have 
>picked up on this, so you can't use any of those figures.  I'm not even 
>sure right now whether drives like Intel's will even meet their lifetime 
>expectations if they aren't allowed to use their internal volatile write 
>cache.
>
>Here's two links you should read and then reconsider your whole design: 
>
>http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/
>http://petereisentraut.blogspot.com/2009/07/solid-state-drive-benchmarks-and-write.html
>
>I can't even imagine how bad the situation would be if you decide to 
>wander down the "use a bunch of really cheap SSD drives" path; these 
>things are barely usable for databases with Intel's hardware.  The needs 
>of people who want to throw SSD in a laptop and those of the enterprise 
>database market are really different, and if you believe doom 
>forecasting like the comments at 
>http://blogs.sun.com/BestPerf/entry/oracle_peoplesoft_payroll_sun_sparc 
>that gap is widening, not shrinking.
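Turning a drive's volatile write cache off, as the quoted advice suggests, is normally done from the OS. A sketch with hypothetical device names (the right device and whether the setting survives a power cycle are drive-dependent):

```shell
# SATA/PATA drives: disable the on-drive write cache
hdparm -W0 /dev/sda

# SCSI/SAS drives: clear the Write Cache Enable (WCE) mode-page bit instead
sdparm --clear=WCE /dev/sda

# Confirm the current setting
hdparm -W /dev/sda
```

With the cache off, the performance figures from consumer benchmarks no longer apply, which is the point being made about the published SSD numbers.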



Re: [PERFORM] SSD + RAID

2009-11-13 Thread Greg Smith

Fernando Hevia wrote:

Shouldn't their write performance be more than a trade-off for fsync?
  
Not if you have sequential writes that are regularly fsync'd--which is 
exactly how the WAL writes things out in PostgreSQL.  I think there's a 
potential for SSD to reach a point where they can give good performance 
even with their write caches turned off.  But it will require a more 
robust software stack, like filesystems that really implement the write 
barrier concept effectively for this use-case, for that to happen.


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com




Re: [PERFORM] SSD + RAID

2009-11-13 Thread Merlin Moncure
2009/11/13 Greg Smith :
> As far as what real-world apps have that profile, I like SSDs for small to
> medium web applications that have to be responsive, where the user shows up
> and wants their randomly distributed and uncached data with minimal latency.
> SSDs can also be used effectively as second-tier targeted storage for things
> that have a performance-critical but small and random bit as part of a
> larger design that doesn't have those characteristics; putting indexes on
> SSD can work out well for example (and there the write durability stuff
> isn't quite as critical, as you can always drop an index and rebuild if it
> gets corrupted).


Here's a bonnie++ result for Intel showing 14k seeks:
http://www.wlug.org.nz/HarddiskBenchmarks

bonnie++ only writes data back 10% of the time.  Why is Peter's
benchmark showing only 400 seeks? Is this all attributable to write
barrier? I'm not sure I'm buying that...
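For anyone wanting to reproduce such numbers, a bonnie++ run of this kind is typically invoked along these lines (directory, size, and user are hypothetical; the file size should be well above RAM so seeks aren't served from the OS cache):

```shell
# -d: target directory, -s: total file size, -n 0: skip the
# file-creation tests, -u: run as this user instead of root
bonnie++ -d /mnt/ssd -s 16G -n 0 -u postgres
```

Whether write barriers are honored by the filesystem and device under test will heavily affect the seek/rewrite figures, which may explain part of the discrepancy.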

merlin



Re: [PERFORM] SSD + RAID

2009-11-13 Thread Greg Smith

Brad Nicholson wrote:
Out of curiosity, what are those narrow use cases where you think 
SSD's are the correct technology?

Dave Crooke did a good summary already, I see things like this:

* You need to have a read-heavy app that's bigger than RAM, but not too 
big so it can still fit on SSD
* You need reads to be dominated by random-access and uncached lookups, 
so that system RAM used as a buffer cache doesn't help you much.
* Writes have to be low to moderate, as the true write speed is much 
lower for database use than you'd expect from benchmarks derived from 
other apps.  And it's better if writes are biased toward adding data 
rather than changing existing pages


As far as what real-world apps have that profile, I like SSDs for small 
to medium web applications that have to be responsive, where the user 
shows up and wants their randomly distributed and uncached data with 
minimal latency. 

SSDs can also be used effectively as second-tier targeted storage for 
things that have a performance-critical but small and random bit as part 
of a larger design that doesn't have those characteristics; putting 
indexes on SSD can work out well for example (and there the write 
durability stuff isn't quite as critical, as you can always drop an 
index and rebuild if it gets corrupted).


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com




Re: [PERFORM] SSD + RAID

2009-11-13 Thread Fernando Hevia
 

> -Mensaje original-
> Laszlo Nagy
> 
> My question is about the last option. Are there any good RAID 
> cards that are optimized (or can be optimized) for SSD 
> drives? Do any of you have experience in using many cheaper 
> SSD drives? Is it a bad idea?
> 
> Thank you,
> 
>Laszlo
> 

I've never had an SSD to try yet, but I wonder: could software RAID + fsync on
SSD drives be regarded as a sound solution?
Shouldn't their write performance be more than a trade-off for fsync?

You could benchmark this setup yourself before purchasing a RAID card.
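For such a benchmark, a minimal Linux software-RAID setup might look like the following sketch (device names are hypothetical, and creating the array destroys any data on them):

```shell
# Mirror two SSDs with md, then put a filesystem on the array
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/ssd
```

Running pgbench or a raw fsync test against /mnt/ssd then exercises the fsync path through the md layer, which is exactly where cache-flush handling has historically been suspect.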



