Re: hypothetical question about data storage

2013-07-26 Thread Johan De Meersman
Hey Chris,

I'm afraid that this is not what databases are for, and the first thing you'll 
likely run into is the number of concurrent connections.

This is typically something you should really tackle from a systems 
perspective. Seek times are dramatically improved on SSD or similar storage - 
think FusionIO cards, but there are also a couple of vendors (Violin comes to 
mind) who provide full-blown SSD SANs.

If you prefer staying with spinning disks, you could still improve seek times 
by focusing on the inner cylinders and potentially by using variable sector 
formatting. Again, there are SANs that do this for you.

Another minor trick is to turn off access timestamp updates when you mount the 
filesystem (noatime).

Also benchmark different filesystems; there are major differences between 
them. I've heard XFS recommended, but I've never needed to benchmark for seek 
times myself. We're using IBM's commercial GPFS here, which is good with 
enormous numbers of huge files (we're a media farm), but I'm not sure how it'd 
fare with smaller files.

Hope that helps,
Johan

- Original Message -
 From: Chris Knipe sav...@savage.za.org
 To: mysql@lists.mysql.com
 Sent: Thursday, 25 July, 2013 11:53:53 PM
 Subject: hypothetical question about data storage
 
 Hi all,
 
 We run a VERY IO-intensive file application service.  Currently, our
 problem is that our disk spindles are being completely killed by the SEEK
 times of the hard drives (NOT physical read/write speeds).
 
 We have a directory structure where the files are stored based on the MD5
 checksum of the file name, i.e.
 /0/00/000/44533779fce5cf3497f87de1d060
 The majority of these files are between 256K and 800K, with the ODD
 exception (say less than 15%) being more than 1M but no more than 5M in
 size.  The content of the files is pure text (MIME encoded).
 
 We believe that storing these files in an InnoDB table may actually give
 us better performance:
 - There is one large file being read/written, instead of BILLIONS of
 small files
 - We can split the structure so that each directory (4096 in total) sits
 in its own database
 - We can move the databases as load increases, which means that we can
 potentially run 2 physical database servers, each with 2048 databases
 - It's easy to move / migrate the data thanks to MySQL and replication -
 the same can be said for redundancy of the data
 
 We are more than likely looking at BLOB columns of course, and we need to
 read/write from the DB in excess of 100 Mbit/s.
 
 Would the experts consider something like this as being feasible?  Is it
 worth it to go down this avenue, or are we just going to run into different
 problems?  If we are facing different problems, what can we possibly expect
 to go wrong here?
 
 Many thanks, and I look forward to any input.
 

-- 
Unhappiness is discouraged and will be corrected with kitten pictures.




Re: hypothetical question about data storage

2013-07-26 Thread Chris Knipe
Hi All,

Thanks for the responses, and I do concur.  I was taking a stab in the
dark, so to speak.

We are currently working with our hosting providers and will be
introducing a number of small iSCSI SANs to split the storage
structure over a multitude of disks...   This is something that needs
to be addressed from a systems perspective rather than an
architectural one.

SSDs (or FusionIO and the like) are unfortunately still way too expensive
for the capacity that we require (a good couple of TB) - so mechanical
disks it will need to be.  However, with the use of SANs, we hope to be
able to go from 4 to over 64 spindles whilst still being able to share
the storage and have redundancy.

Many thanks for the input and feedback...

--
C





-- 

Regards,
Chris Knipe




RE: hypothetical question about data storage

2013-07-26 Thread Rick James
Count the disk hits

If you have a filesystem directory, consider that it is designed to handle 
small numbers of files per directory.  Consider that there is a limited cache 
for directories, etc.  Plus there is the inode (vnode, whatever) storage for 
each file.  I don't know the details, and they vary wildly with the filesystem 
(ext, xfs, zfs, etc.).

Looking at InnoDB...

Let's say you have a billion rows in a single table, and you need to fetch one 
row by the PRIMARY KEY, and it is an MD5 (SHA-1, UUID, etc.).  Such a key is 
_very_ random.

A billion rows would need about 5 levels of BTree.  The top levels would 
quickly all be cached.  (100K blocks * 16KB = 1.6GB.)  If the leaf nodes add up 
to 200GB, that is probably bigger than your innodb_buffer_pool_size.  In that 
case, a _random_ fetch is likely to be a cache miss.
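
One way to sanity-check that (a sketch only; the schema name 'filestore' below 
is hypothetical, substitute your own) is to compare the on-disk sizes reported 
by information_schema with the buffer pool:

-- Rough sizing check; schema name is an example.
SELECT table_name,
       ROUND(data_length  / 1024 / 1024 / 1024, 1) AS data_gb,   -- clustered index (rows + PK)
       ROUND(index_length / 1024 / 1024 / 1024, 1) AS index_gb   -- secondary indexes
FROM   information_schema.TABLES
WHERE  table_schema = 'filestore';

SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;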

A cache miss is about 100ms on normal rotating media; perhaps 10ms on SSDs.  
This limits your reads to 10 (or 100) per second.

If you have big BLOBs in the table, then it gets messier.  InnoDB does not put 
more than 8K of a row in the actual 16KB block.  The rest is stored in one or 
more other blocks.  So, it is likely to take an extra disk hit (200ms/20ms).
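
How much of the BLOB stays in-page depends on the row format.  A hedged sketch 
(ROW_FORMAT=DYNAMIC needs the Barracuda file format on the 5.5/5.6-era servers 
in question, and the table name here is made up):

-- Sketch only: with DYNAMIC, an over-long BLOB is stored entirely off-page
-- (just a 20-byte pointer stays in the row); with COMPACT, a 768-byte prefix
-- stays in the 16KB page and the remainder goes to overflow pages.
SET GLOBAL innodb_file_format    = 'Barracuda';
SET GLOBAL innodb_file_per_table = ON;

CREATE TABLE blob_overflow_demo (
  md5  CHAR(32) NOT NULL PRIMARY KEY,
  body LONGBLOB NOT NULL
) ENGINE=InnoDB ROW_FORMAT=DYNAMIC;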

If your data size is 100 times as big as your buffer pool, then it becomes 
likely that the next level of the BTree won't be fully cacheable.  Now 
300ms/30ms.

I think it is likely that the small number of disk hits for InnoDB is better 
than the many disk hits for traversing a directory tree (with large 
directories) in the filesystem.  I vote for InnoDB over the directory tree.

Yes, you will have seeks.

No, adding more RAM won't help much.  Here's an argument:
Suppose your data is 20 times as big as the buffer pool and you are doing 
random fetches (MD5, etc.).  Then 1/20 of fetches are cached; 95% cache miss.  
Estimated time: 0.95 * 100ms = 95ms.
Now you double your RAM.  1/10 cached; 90% cache miss; 90ms average.  Not much 
improvement over 95ms.


RE: hypothetical question about data storage

2013-07-26 Thread Johan De Meersman
Your argument against the FS assumes that you don't know the exact filename 
(directory traversals), but your argument for InnoDB assumes that you do (index 
lookup). Apples and oranges.

Besides, the venerable ext2 handled up to a couple of tens of thousands of 
files per directory smoothly when listing; things have only improved since 
then. "Small numbers" is a very relative concept.


Re: hypothetical question about data storage

2013-07-26 Thread Chris Knipe
Well, that information I can provide.

As mentioned, we use an MD5 (hex) checksum to track the files.  In
terms of the tables, I would definitely consider the MD5 checksum as a
PK (CHAR(32) due to the hex nature), a BLOB for the data, and then
there will also be a DATETIME column to indicate when the file was
last accessed.  We already use MySQL with 4096 InnoDB tables in a
single database to track the timestamp when the file was last accessed
(noatime on the file system) - and it's working remarkably well.  But
I do understand that adding the BLOB will most certainly change things,
because much more data needs to be moved around (memory, disk and
networking layers).
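
Roughly, each of those tables would then look something like this (an
illustrative sketch only - names and types are not final):

-- Sketch, not actual DDL; one of the 4096 shard tables.
CREATE TABLE files_000 (
  md5           CHAR(32) NOT NULL,   -- hex MD5 of the file name
  content       LONGBLOB NOT NULL,   -- MIME-encoded text, mostly 256K-800K
  last_accessed DATETIME NOT NULL,   -- replaces the noatime bookkeeping
  PRIMARY KEY (md5)
) ENGINE=InnoDB;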

Currently the server (32GB RAM, dedicated 2 x quad-core Xeon) is
pretty much idling in terms of load, doing approximately 100
transactions per second with less than 50 out of a maximum of 500
configured connections (binlogs and all). Note however that this is
just inserting new records and updating the last-accessed timestamp
on the records.  Sometimes there are large delete transactions
running as well to remove expired files (START TRANSACTION; DELETE
FROM ...; COMMIT).

We store the files in 4096 unique directories; the files are
structured very simply as follows:
0/00/000/000a242bf...
1/10/10a/10aa342...
f/ff/ffa/ffa234234...

i.e. 1st char / 1st + 2nd chars / 1st + 2nd + 3rd chars / filename,
the filename being the hex MD5 checksum.
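
For illustration (a sketch only, not our production code), the whole path can
be derived from the checksum alone, e.g. straight in SQL:

-- Build the storage path from a hex MD5 (illustrative only).
SET @h = MD5('a');                      -- '0cc175b9c0f1b6a831c399e269772661'
SELECT CONCAT_WS('/',
                 SUBSTRING(@h, 1, 1),   -- 1st char
                 SUBSTRING(@h, 1, 2),   -- 1st + 2nd chars
                 SUBSTRING(@h, 1, 3),   -- 1st + 2nd + 3rd chars
                 @h) AS path;
-- -> 0/0c/0cc/0cc175b9c0f1b6a831c399e269772661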

Currently there is about 3TB worth of data; this -will- grow easily
to 10 times that size over time (200GB to 300GB per day).  Per 1TB of
data, we are basically looking at 1M files (or records), split on
average at 243 files per directory (or table).  Projections at 30TB
would indicate +- 32M files (or records), split on average at 8K per
directory or table.  Personally, I don't think it will be worth our
while to go over the 64TB mark, which means 64M records with 16K
records per table.

In terms of scaling, if we use two physical MySQL servers, we're
looking at 2048 tables per server; four servers would be 1024 tables,
and 8 servers 512 tables per server.  From an application point of
view it's relatively easy to determine which database server, database
name, and table to query, using the same principles that we are using
to store the files in the associated directory structure.
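
As a purely illustrative sketch (server, database, and table names here are
made up), the leading hex characters of the checksum pick the shard:

-- Hash 'a3f9...' -> server chosen by the first char, table by the first three.
SELECT content
FROM   files_a3f
WHERE  md5 = 'a3f9...';              -- full 32-char hex MD5 in practice

UPDATE files_a3f
SET    last_accessed = NOW()
WHERE  md5 = 'a3f9...';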

The issue that we have identified is caused by seek time - hundreds of
clients simultaneously each searching for a single file.  The only real
way to explain this is to run 100 concurrent instances of bonnie++
doing random reads/writes... Your disk utilization and disk latency
essentially go through the roof, resulting in IO wait and insanely
high load averages (we've seen it spike to over 150 on an 8-core Xeon,
at which point the application (at a 40 load average already) stops
processing requests to prevent the server from crashing).

We are currently busy deploying a small SAN (iSCSI) to test changes
to the underlying file system, but one part of me believes it won't
help much, whilst the other half is extremely optimistic...  With the
SANs we're also splitting the structure so that each SAN only caters
for a certain number of parent directories; we're doing 2 SANs with
2048 directory sets per SAN.




Re: hypothetical question about data storage

2013-07-26 Thread hsv
 2013/07/27 00:58 +0200, Chris Knipe 
I would definitely consider the MD5 checksum as a
PK (CHAR(32) due to the hex nature),

Well, not that it greatly matters, but you could convert it to BINARY(16).
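
For instance (an illustrative sketch, reusing the hypothetical table layout
sketched earlier in the thread):

-- Store the checksum as BINARY(16) instead of CHAR(32): half the key size.
CREATE TABLE files_000_bin (
  md5           BINARY(16) NOT NULL,   -- UNHEX()ed MD5
  content       LONGBLOB   NOT NULL,
  last_accessed DATETIME   NOT NULL,
  PRIMARY KEY (md5)
) ENGINE=InnoDB;

-- Convert on the way in and out:
INSERT INTO files_000_bin (md5, content, last_accessed)
VALUES (UNHEX('0cc175b9c0f1b6a831c399e269772661'), 'file body here', NOW());

SELECT HEX(md5), content
FROM   files_000_bin
WHERE  md5 = UNHEX('0cc175b9c0f1b6a831c399e269772661');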


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql