RE: hypothetical question about data storage

2013-07-30 Thread Johan De Meersman
Rick James  wrote:
>
>When writing a random block, RAID-5 does not need to touch all the
>drives, only the data drive being written and the one holding parity.
>Suitable XORs will update the parity correctly.  So, a write hits 2
>drives, whether you have RAID-5 or -10.

Only if the old data and parity happen to be in the cache; otherwise a read is 
required first. We ran those tests when we last added storage to our media farm.

On the other hand, I'm not too familiar with the actual algorithms used, so our 
findings might be vendor-dependent.




-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.




Re: hypothetical question about data storage

2013-07-29 Thread Manuel Arostegui
2013/7/30 Rick James 

> Elevator...  If the RAID _controller_ does the Elevator stuff, any OS
> optimizations are wasted.
> And there have been benchmarks backing that up.  (Sorry, don't have any
> links handy.)
>
> RAID 5/10 ...  The testing I have done shows very little difference.
>  However, you can slant the conclusion by picking one versus the other of:
> "For a given amount of disk space... RAID-X is better than Y."
> "For a given number of drives... RAID-Y is better than X."


In the tests I have done with RAID5 vs RAID10, the difference is huge, at least
in our clusters with heavy writes.
We usually build RAIDs over 4 or 8 SAS disks (15k rpm).
The performance of each type of RAID needs to be tested for your concrete
scenario; you can find lots of benchmarks out there, but you need to test
your own workload to be sure what works best for you. As Rick said, with
BBUs, disk schedulers, write-back/write-through configuration, etc., things can
change.

Our latest tests with SSDs show no difference, so for the new servers
with SSDs we're going with RAID5, as you get more disk space :-)

Just my 2 cents!
Manuel.


-- 
Manuel Aróstegui
Systems Team
tuenti.com


Re: hypothetical question about data storage

2013-07-29 Thread Carsten Pedersen

On 30-07-2013 01:16, Rick James wrote:

Elevator...  If the RAID _controller_ does the Elevator stuff, any OS
optimizations are wasted. And there have been benchmarks backing that
up.  (Sorry, don't have any links handy.)

RAID 5/10 ...  The testing I have done shows very little difference.


...right up to the day one of the disks fails, and you thought you could 
just plug in a new spindle and let the system take care of the rest...


http://www.miracleas.com/BAARF/
http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt

/ Carsten




RE: hypothetical question about data storage

2013-07-29 Thread Rick James
Elevator...  If the RAID _controller_ does the Elevator stuff, any OS 
optimizations are wasted.
And there have been benchmarks backing that up.  (Sorry, don't have any links 
handy.)

RAID 5/10 ...  The testing I have done shows very little difference.  However, 
you can slant the conclusion by picking one versus the other of:
"For a given amount of disk space... RAID-X is better than Y."
"For a given number of drives... RAID-Y is better than X."

When writing a random block, RAID-5 does not need to touch all the drives, only 
the data drive being written and the one holding parity.  Suitable XORs will 
update the parity correctly.  So, a write hits 2 drives, whether you have 
RAID-5 or -10.
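
A toy sketch of that XOR update (pure Python, hypothetical 4-data + 1-parity
stripe; not any particular controller's implementation):

    # raid5_xor.py - model of a RAID-5 partial-stripe (read-modify-write) update.
    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    # What is already on disk: 4 data chunks and their parity.
    data = [bytes([i]) * 4096 for i in range(4)]
    parity = data[0]
    for chunk in data[1:]:
        parity = xor_blocks(parity, chunk)        # parity = d0 ^ d1 ^ d2 ^ d3

    # Overwrite chunk 2 without touching chunks 0, 1 and 3:
    new_chunk = bytes([0xAB]) * 4096
    new_parity = xor_blocks(xor_blocks(parity, data[2]), new_chunk)

    # Only two drives get written: the data drive and the parity drive.
    # (The old data and old parity still have to be read first unless the
    # controller already has them cached.)
    check = xor_blocks(xor_blocks(xor_blocks(data[0], data[1]), new_chunk), data[3])
    assert new_parity == check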

Some people make the chunk size 64KB (etc.), not 512B.  With the controller 
involved, there is not necessarily any benefit to a large vs. small chunk size.  
Writes are delayed until it is optimal to flush them.  This leads to large 
streaming writes to each drive, regardless of chunk size (when writing a large 
stream).

A heavily used InnoDB system will be writing random 16KB blocks.

(I have no insight into RAID-6.)

> -Original Message-
> From: Johan De Meersman [mailto:vegiv...@tuxera.be]
> Sent: Monday, July 29, 2013 3:38 PM
> To: Rick James; will...@techservsys.com; mysql@lists.mysql.com
> Subject: RE: hypothetical question about data storage
> 
> Rick James  wrote:
> >
> >For MySQL + RAID, a Linux elevator strategy of 'deadline' or 'noop' is
> >optimal.  (The default, 'cfq', is not as good.)
> 
> I should look into those again at some point. Do you have a brief word as
> to why they're better?
> 
> 
> >A RAID controller with multiple drives striped (and optionally
> >parity-checked) (RAID-5, -10) and with a BBU (Battery Backed Write
> >Cache) is excellent for I/O.
> 
> Very true. 10 is traditionally considered better - it's certainly faster -
> but 5 is of course cheaper :-)
> 
> I'd like to add that 4+1 is the optimal configuration for RAID5, as that
> makes for a stripe of 2KB, assuming 512B sectors of course. You then pick
> an fs that supports blocks of that size, which means that no write will
> ever need to perform a read first to calculate the parity.
> 
> 
> 
> 
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.


RE: hypothetical question about data storage

2013-07-29 Thread Johan De Meersman
Rick James  wrote:
>
>For MySQL + RAID, a Linux elevator strategy of 'deadline' or 'noop' is
>optimal.  (The default, 'cfq', is not as good.)

I should look into those again at some point. Do you have a brief word as to 
why they're better?


>A RAID controller with multiple drives striped (and optionally
>parity-checked) (RAID-5, -10) and with a BBU (Battery Backed Write
>Cache) is excellent for I/O.

Very true. 10 is traditionally considered better - it's certainly faster - but 
5 is of course cheaper :-) 

I'd like to add that 4+1 is the optimal configuration for RAID5, as that makes 
for a stripe of 2KB, assuming 512B sectors of course. You then pick an fs that 
supports blocks of that size, which means that no write will ever need to 
perform a read first to calculate the parity.
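
The arithmetic behind that, as a quick sketch (my own illustration of the
claim, assuming the classic 512-byte sectors; 4K-sector drives change the
numbers):

    # raid5_stripe.py - why 4+1 RAID-5 with 512B chunks gives a 2KB full stripe.
    data_disks   = 4        # 4 data disks + 1 parity disk
    chunk_bytes  = 512      # one 512B sector per data disk, per the example above
    stripe_bytes = data_disks * chunk_bytes
    print(stripe_bytes)     # 2048 -> with a 2KB filesystem block size, every block
                            # write covers a full stripe, so parity can be computed
                            # from the new data alone, with no read-before-write.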




-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.




RE: hypothetical question about data storage

2013-07-29 Thread Rick James
Most RAID controllers will happily do Elevator stuff like you mentioned.
So will Linux.

For MySQL + RAID, a Linux elevator strategy of 'deadline' or 'noop' is optimal. 
 (The default, 'cfq', is not as good.)
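
For reference, a minimal sketch of checking and switching the elevator
(assumes a device called sda and root privileges; the usual way is a shell
echo into sysfs or the elevator= kernel boot parameter):

    # io_scheduler.py - show and set the Linux I/O scheduler for one block device.
    DEV = "sda"                                   # adjust to your environment
    path = "/sys/block/%s/queue/scheduler" % DEV

    with open(path) as f:
        print(f.read().strip())                   # e.g. "noop [deadline] cfq"

    with open(path, "w") as f:                    # needs root
        f.write("deadline")                       # or "noop" behind a BBU controller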

A RAID controller with multiple drives striped (and optionally parity-checked) 
(RAID-5, -10) and with a BBU (Battery Backed Write Cache) is excellent for I/O.

I don't know about "chronologically later".  InnoDB "does the right thing", as 
long as the OS does not cheat on fsync, etc.

> 1/10/10A/10aa342
Only 16 subdirectories per directory?  I would expect 256 to be more efficient 
overall, because it means fewer levels: scanning 256 entries is probably less 
costly than descending an extra level.  (Yeah, again, I can't _prove_ it in 
_your_ environment.)
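
A quick sketch of the two fan-outs being compared (hypothetical layout, using
hex prefixes of the hash; not the OP's actual code):

    # fanout.py - 16-way, 3-level layout vs. 256-way, 2-level layout.
    import hashlib

    name = "example-article-id"
    h = hashlib.md5(name.encode()).hexdigest()

    path_16  = "/".join([h[0], h[:2], h[:3], h])   # 16 subdirs per level, 3 levels
    path_256 = "/".join([h[:2], h[:4], h])         # 256 subdirs per level, 2 levels

    print(path_16)    # <c1>/<c1c2>/<c1c2c3>/<full hash>
    print(path_256)   # <c1c2>/<c1c2c3c4>/<full hash>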

4K tables on a single machine -- that is beginning to get into 'big' in 
reference to ulimit, table_open_cache, etc.  That is, if you went much past 
that, you would be getting into new areas of inefficiency.

I do not like splitting a database "table" into multiple tables, except by 
PARTITIONing.  PARTITIONing would also provide an 'instantaneous' way of purging 
old data.  (DROP PARTITION + REORGANIZE PARTITION)
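
A sketch of what that purge could look like (table, partition and connection
names are hypothetical examples; assumes mysql-connector-python and RANGE
partitioning on TO_DAYS of a date column):

    # purge_partition.py - drop the oldest partition, pre-create the next one.
    import mysql.connector

    cnx = mysql.connector.connect(host="db1", user="app",
                                  password="secret", database="filestore")
    cur = cnx.cursor()

    # Instant purge of old rows - no row-by-row DELETE.
    cur.execute("ALTER TABLE blobs DROP PARTITION p2013_06")

    # Split the catch-all partition so the next month gets its own partition.
    cur.execute("""
        ALTER TABLE blobs REORGANIZE PARTITION pmax INTO (
            PARTITION p2013_08 VALUES LESS THAN (TO_DAYS('2013-09-01')),
            PARTITION pmax     VALUES LESS THAN MAXVALUE
        )""")

    cnx.close()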

Almost always (again, no proof for your case), a single table is more efficient 
than many tables.  This applies to PARTITIONing, too, but there can be 
other gains from using PARTITIONing.

InnoDB has a 64TB limit per PARTITION.

> -Original Message-
> From: william drescher [mailto:will...@techservsys.com]
> Sent: Saturday, July 27, 2013 4:32 AM
> To: mysql@lists.mysql.com
> Subject: Re: hypothetical question about data storage
> 
> On 7/26/2013 6:58 PM, Chris Knipe wrote:
> > The issue that we have identified is caused by seek time - hundreds of
> > clients simultaneously searching for a single file.  The only real way
> > to explain this is to run 100 concurrent instances of bonnie++ doing
> > random read/writes... Your disk utilization and disk latency
> > essentially goes through the roof resulting in IO wait and insanely
> high load averages (we've seen it spike to over 150 on an 8-core Xeon -
> > at which time the application (at a 40 load average already) stops
> > processing requests to prevent the server crashing).
> 
> back in the day (many years ago) when I worked for IBM we had disk
> controllers that would queue and sort pending reads so that the heads
> would seek from low tracks across the disk to high tracks and then back to
> low. This resulted in very low seek _averages_.
> The controller was smart enough to make sure that if a write occurred,
> chronologically later reads got the right data, even if it had not been
> physically written to disk yet.
> 
> Is there such a controller available now?
> 
> bill
> 
> 





Re: hypothetical question about data storage

2013-07-27 Thread william drescher

On 7/26/2013 6:58 PM, Chris Knipe wrote:

The issue that we have identified is caused by seek time - hundreds of
clients simultaneously searching for a single file.  The only real way
to explain this is to run 100 concurrent instances of bonnie++ doing
random read/writes... Your disk utilization and disk latency
essentially goes through the roof resulting in IO wait and insanely
high load averages (we've seen it spike to over 150 on an 8-core Xeon -
at which time the application (at a 40 load average already) stops
processing requests to prevent the server crashing).


back in the day (many years ago) when I worked for IBM we had 
disk controllers that would queue and sort pending reads so that 
the heads would seek from low tracks across the disk to high 
tracks and then back to low. This resulted in very low seek 
_averages_.
The controller was smart enough to make sure that if a write 
occurred, chronologically later reads got the right data, even if 
it had not been physically written to disk yet.
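
For illustration, a toy sketch of that elevator (SCAN) ordering - not any
particular controller's firmware:

    # elevator.py - order pending requests so the head sweeps up, then back down.
    def elevator_order(pending, head, going_up=True):
        above = sorted(t for t in pending if t >= head)
        below = sorted((t for t in pending if t < head), reverse=True)
        return above + below if going_up else below + above

    print(elevator_order([731, 12, 402, 88, 560], head=300))
    # -> [402, 560, 731, 88, 12]: ascending to the high tracks, then back down,
    # instead of seeking back and forth in arrival order.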


Is there such a controller available now?

bill





Re: hypothetical question about data storage

2013-07-26 Thread hsv
 2013/07/27 00:58 +0200, Chris Knipe 
I would definitely consider the md5 checksum as a
PK (char(32) due to the hex nature), 

Well, not that it greatly matters, but you could convert it to BINARY(16).
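
For example (a quick sketch; UNHEX()/HEX() are the MySQL side of the same
conversion):

    # md5_binary.py - store the 32-char hex MD5 as 16 raw bytes (BINARY(16)).
    import hashlib

    hex_md5 = hashlib.md5(b"some file name").hexdigest()   # 32 hex characters
    raw     = bytes.fromhex(hex_md5)                        # 16 bytes
    assert len(hex_md5) == 32 and len(raw) == 16

    # SQL equivalent:
    #   INSERT ... VALUES (UNHEX(%s), ...)       -- hex string in, 16 bytes stored
    #   SELECT HEX(id) ... WHERE id = UNHEX(%s)  -- convert back for display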





Re: hypothetical question about data storage

2013-07-26 Thread Chris Knipe
>>If you have big BLOBs in the table, then it gets messier.  InnoDB does
>>not put more than 8K of a row in the actual 16KB block.  The rest is
>>stored in another block(s).  So, it is likely to take an extra disk hit
>>(200ms/20ms).
>>
>>If your data size is 100 times as big as your buffer pool, then it
>>becomes likely that the next level of the BTree won't be fully
>>cacheable.  Now 300ms/30ms.
>>
>>I think it is likely that the small number of disk hits for InnoDB is
>>better than the many disk hits for traversing a directory tree (with
>>large directories) in the filesystem.  I vote for InnoDB over the
>>directory tree.
>>
>>Yes, you will have seeks.
>>
>>No, adding more RAM won't help much.  Here's an argument:
>>Suppose your data is 20 times as big as the buffer pool and you are
>>doing random fetches (MD5, etc).  Then 1/20 of fetches are cached; 95%
>>cache miss.  Estimated time: 0.95 * 100ms = 95ms.
>>Now you double your RAM.  1/10 cached -> 90% cache miss -> 90ms average
>>-> Not much improvement over 95.
>>
>>> -Original Message-
>>> From: ckn...@savage.za.org [mailto:ckn...@savage.za.org] On Behalf Of
>>> Chris Knipe
>>> Sent: Friday, July 26, 2013 12:30 AM
>>> To: Johan De Meersman
>>> Cc: mysql
>>> Subject: Re: hypothetical question about data storage
>>>
>>> Hi All,
>>>
>>> Thanks for the responses, and I do concur.  I was taking a stab in
>>the
>>> dark so to speak.
>>>
>>> We are working with our hosting providers currently and will be
>>> introducing a multitude of small iSCSI SANs to split the storage
>>> structure over a multitude of disks...   This is something that needs
>>> to be addressed from a systems perspective rather than an
>>architectural
>>> one.
>>>
>>> SSD (or Fusion and the like) are unfortunately still way too expensive
>>for
>>> the capacity that we require (good couple of TBs) - so mechanical
>>disks it
>>> would need to be.  However, with the use of SANs as we hope, we
>>should be
>>> able to go up from 4 to over 64 spindles whilst still being able to
>>share
>>> the storage and have redundancy.
>>>
>>> Many thanks for the inputs and feedbacks...
>>>
>>> --
>>> C
>>>
>>>
>>> On Fri, Jul 26, 2013 at 9:23 AM, Johan De Meersman
>>
>>> wrote:
>>> > Hey Chris,
>>> >
>>> > I'm afraid that this is not what databases are for, and the first
>>thing
>>> you'll likely run into is amount of concurrent connections.
>>> >
>>> > This is typically something you should really tackle from a systems
>>> perspective. Seek times are dramatically improved on SSD or similar
>>> storage - think FusionIO cards, but there's also a couple of vendors
>>> (Violin comes to mind) who provide full-blown SSD SANs.
>>> >
>>> > If you prefer staying with spinning disks, you could still improve
>>the
>>> seeks by focusing on the inner cylinders and potentially by using
>>variable
>>> sector formatting. Again, there's SANs that do this for you.
>>> >
>>> > Another minor trick is to turn off access timestamp updates when
>>you
>>> mount the filesystem (noatime).
>>> >
>>> > Also benchmark different filesystems, there's major differences
>>between
>>> them. I've heard XFS being recommended, but I've never needed to
>>benchmark
>>> for seek times myself. We're using IBM's commercial GPFS here, which
>>is
>>> good with enormous amounts of huge files (media farm here), not sure
>>how
>>> it'd fare with smaller files.
>>> >
>>> > Hope that helps,
>>> > Johan
>>> >
>>> > - Original Message -
>>> >> From: "Chris Knipe" 
>>> >> To: mysql@lists.mysql.com
>>> >> Sent: Thursday, 25 July, 2013 11:53:53 PM
>>> >> Subject: hypothetical question about data storage
>>> >>
>>> >> Hi all,
>>> >>
>>> >> We run an VERY io intensive file application service.  Currently,
>>our
>>> >> problem is that our disk spindles are being completely killed due
>>to
>>> >> insufficient SEEK time on the hard drives (NOT physical read/write
>>> >> speeds).
>>> >>
>>> >> We have an directory st

RE: hypothetical question about data storage

2013-07-26 Thread Johan De Meersman
Your argument against FS assumes that you don't know the exact filename 
(directory traversals), but your argument for InnoDB assumes that you do (index 
lookup). Apples and oranges.

Besides, the venerable ext2 handled up to a couple of tens of thousands of 
files per directory smoothly when listing; things have only improved since 
then. "Small amounts" is a very relative concept.

Rick James  wrote:
>"Count the disk hits"
>
>If you have a filesystem directory, consider that it is designed to
>handle small numbers of files per directory.  Consider that there is a
>limited cache for directories, etc.  Plus there is the inode (vnode,
>whatever) storage for each file.  I don't know the details (and it
>varies wildly with "filesystem" (ext, xfs, zfs, etc)).
>
>Looking at InnoDB...
>
>Let's say you have a billion rows in a single table, and you need to
>fetch one row by the PRIMARY KEY, and it is an MD5 (SHA-1, UUID, etc.). 
>Such a key is _very_ random.
>
>A billion rows would need about 5 levels of BTree.  The top levels
>would quickly all be cached.  (100K blocks * 16KB = 1.6GB.)  If the
>leaf nodes add up to 200GB, that is probably bigger than your
>innodb_buffer_pool_size.  In that case, a _random_ fetch is likely to
>be a cache miss.
>
>A cache miss is about 100ms on normal rotating-media; perhaps 10ms on
>SSDs.  This limits your reads to 10 (or 100) per second.
>
>If you have big BLOBs in the table, then it gets messier.  InnoDB does
>not put more than 8K of a row in the actual 16KB block.  The rest is
>stored in another block(s).  So, it is likely to take an extra disk hit
>(200ms/20ms).
>
>If your data size is 100 times as big as your buffer pool, then it
>becomes likely that the next level of the BTree won't be fully
>cacheable.  Now 300ms/30ms.
>
>I think it is likely that the small number of disk hits for InnoDB is
>better than the many disk hits for traversing a directory tree (with
>large directories) in the filesystem.  I vote for InnoDB over the
>directory tree.
>
>Yes, you will have seeks.
>   
>No, adding more RAM won't help much.  Here's an argument:
>Suppose your data is 20 times as big as the buffer pool and you are
>doing random fetches (MD5, etc).  Then 1/20 of fetches are cached; 95%
>cache miss.  Estimated time: 0.95 * 100ms = 95ms.
>Now you double your RAM.  1/10 cached -> 90% cache miss -> 90ms average
>-> Not much improvement over 95.
>
>> -Original Message-----
>> From: ckn...@savage.za.org [mailto:ckn...@savage.za.org] On Behalf Of
>> Chris Knipe
>> Sent: Friday, July 26, 2013 12:30 AM
>> To: Johan De Meersman
>> Cc: mysql
>> Subject: Re: hypothetical question about data storage
>> 
>> Hi All,
>> 
>> Thanks for the responses, and I do concur.  I was taking a stab in
>the
>> dark so to speak.
>> 
>> We are working with our hosting providers currently and will be
>> introducing a multitude of small iSCSI SANs to split the storage
>> structure over a multitude of disks...   This is something that needs
>> to be addressed from a systems perspective rather than an
>architectural
>> one.
>> 
>> SSD (or Fusion and the like) are unfortunately still way too expensive
>for
>> the capacity that we require (good couple of TBs) - so mechanical
>disks it
>> would need to be.  However, with the use of SANs as we hope, we
>should be
>> able to go up from 4 to over 64 spindles whilst still being able to
>share
>> the storage and have redundancy.
>> 
>> Many thanks for the inputs and feedbacks...
>> 
>> --
>> C
>> 
>> 
>> On Fri, Jul 26, 2013 at 9:23 AM, Johan De Meersman
>
>> wrote:
>> > Hey Chris,
>> >
>> > I'm afraid that this is not what databases are for, and the first
>thing
>> you'll likely run into is amount of concurrent connections.
>> >
>> > This is typically something you should really tackle from a systems
>> perspective. Seek times are dramatically improved on SSD or similar
>> storage - think FusionIO cards, but there's also a couple of vendors
>> (Violin comes to mind) who provide full-blown SSD SANs.
>> >
>> > If you prefer staying with spinning disks, you could still improve
>the
>> seeks by focusing on the inner cylinders and potentially by using
>variable
>> sector formatting. Again, there's SANs that do this for you.
>> >
>> > Another minor trick is to turn off access timestamp updates when
>you
>> mount the filesystem (noatime).
>> >
>> > Also benchmark different filesystems, there

RE: hypothetical question about data storage

2013-07-26 Thread Rick James
"Count the disk hits"

If you have a filesystem directory, consider that it is designed to handle 
small numbers of files per directory.  Consider that there is a limited cache 
for directories, etc.  Plus there is the inode (vnode, whatever) storage for 
each file.  I don't know the details (and it varies wildly with "filesystem" 
(ext, xfs, zfs, etc)).

Looking at InnoDB...

Let's say you have a billion rows in a single table, and you need to fetch one 
row by the PRIMARY KEY, and it is an MD5 (SHA-1, UUID, etc.).  Such a key is 
_very_ random.

A billion rows would need about 5 levels of BTree.  The top levels would 
quickly all be cached.  (100K blocks * 16KB = 1.6GB.)  If the leaf nodes add up 
to 200GB, that is probably bigger than your innodb_buffer_pool_size.  In that 
case, a _random_ fetch is likely to be a cache miss.

A cache miss is about 100ms on normal rotating-media; perhaps 10ms on SSDs.  
This limits your reads to 10 (or 100) per second.

If you have big BLOBs in the table, then it gets messier.  InnoDB does not put 
more than 8K of a row in the actual 16KB block.  The rest is stored in another 
block(s).  So, it is likely to take an extra disk hit (200ms/20ms).

If your data size is 100 times as big as your buffer pool, then it becomes 
likely that the next level of the BTree won't be fully cacheable.  Now 
300ms/30ms.

I think it is likely that the small number of disk hits for InnoDB is better 
than the many disk hits for traversing a directory tree (with large 
directories) in the filesystem.  I vote for InnoDB over the directory tree.

Yes, you will have seeks.

No, adding more RAM won't help much.  Here's an argument:
Suppose your data is 20 times as big as the buffer pool and you are doing 
random fetches (MD5, etc).  Then 1/20 of fetches are cached; 95% cache miss.  
Estimated time: 0.95 * 100ms = 95ms.
Now you double your RAM.  1/10 cached -> 90% cache miss -> 90ms average -> Not 
much improvement over 95.
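
The same back-of-the-envelope maths as a sketch you can re-run with your own
sizes (the 100ms/10ms penalties are the figures used in this thread, not
measurements):

    # cache_miss_estimate.py - expected fetch time for uniformly random PK lookups.
    def expected_fetch_ms(data_gb, buffer_pool_gb, miss_penalty_ms=100.0):
        hit_ratio = min(1.0, buffer_pool_gb / data_gb)   # fraction of data in RAM
        return (1.0 - hit_ratio) * miss_penalty_ms

    print(expected_fetch_ms(200, 10))                        # 1/20 cached -> 95.0
    print(expected_fetch_ms(200, 20))                        # double the RAM -> 90.0
    print(expected_fetch_ms(200, 10, miss_penalty_ms=10.0))  # same data on SSD -> 9.5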

> -Original Message-
> From: ckn...@savage.za.org [mailto:ckn...@savage.za.org] On Behalf Of
> Chris Knipe
> Sent: Friday, July 26, 2013 12:30 AM
> To: Johan De Meersman
> Cc: mysql
> Subject: Re: hypothetical question about data storage
> 
> Hi All,
> 
> Thanks for the responses, and I do concur.  I was taking a stab in the
> dark so to speak.
> 
> We are working with our hosting providers currently and will be
> introducing a multitude of small iSCSI SANs to split the storage
> structure over a multitude of disks...   This is something that needs
> to be addressed from a systems perspective rather than an architectural
> one.
> 
> SSD (or Fusion and the like) are unfortunately still way too expensive for
> the capacity that we require (good couple of TBs) - so mechanical disks it
> would need to be.  However, with the use of SANs as we hope, we should be
> able to go up from 4 to over 64 spindles whilst still being able to share
> the storage and have redundancy.
> 
> Many thanks for the inputs and feedbacks...
> 
> --
> C
> 
> 
> On Fri, Jul 26, 2013 at 9:23 AM, Johan De Meersman 
> wrote:
> > Hey Chris,
> >
> > I'm afraid that this is not what databases are for, and the first thing
> you'll likely run into is amount of concurrent connections.
> >
> > This is typically something you should really tackle from a systems
> perspective. Seek times are dramatically improved on SSD or similar
> storage - think FusionIO cards, but there's also a couple of vendors
> (Violin comes to mind) who provide full-blown SSD SANs.
> >
> > If you prefer staying with spinning disks, you could still improve the
> seeks by focusing on the inner cylinders and potentially by using variable
> sector formatting. Again, there's SANs that do this for you.
> >
> > Another minor trick is to turn off access timestamp updates when you
> mount the filesystem (noatime).
> >
> > Also benchmark different filesystems, there's major differences between
> them. I've heard XFS being recommended, but I've never needed to benchmark
> for seek times myself. We're using IBM's commercial GPFS here, which is
> good with enormous amounts of huge files (media farm here), not sure how
> it'd fare with smaller files.
> >
> > Hope that helps,
> > Johan
> >
> > - Original Message -
> >> From: "Chris Knipe" 
> >> To: mysql@lists.mysql.com
> >> Sent: Thursday, 25 July, 2013 11:53:53 PM
> >> Subject: hypothetical question about data storage
> >>
> >> Hi all,
> >>
> >> We run an VERY io intensive file application service.  Currently, our
> >> problem is that our disk spindles are being comp

Re: hypothetical question about data storage

2013-07-26 Thread Chris Knipe
Hi All,

Thanks for the responses, and I do concur.  I was taking a stab in the
dark so to speak.

We are working with our hosting providers currently and will be
introducing a multitude of small iSCSI SANs to split the storage
structure over a multitude of disks...   This is something that needs
to be addressed from a systems perspective rather than an
architectural one.

SSD (or Fusion and the like) are unfortunately still way too expensive
for the capacity that we require (a good couple of TBs) - so mechanical
disks it would need to be.  However, with the use of SANs as we hope,
we should be able to go up from 4 to over 64 spindles whilst still
being able to share the storage and have redundancy.

Many thanks for the inputs and feedbacks...

--
C


On Fri, Jul 26, 2013 at 9:23 AM, Johan De Meersman  wrote:
> Hey Chris,
>
> I'm afraid that this is not what databases are for, and the first thing 
> you'll likely run into is amount of concurrent connections.
>
> This is typically something you should really tackle from a systems 
> perspective. Seek times are dramatically improved on SSD or similar storage - 
> think FusionIO cards, but there's also a couple of vendors (Violin comes to 
> mind) who provide full-blown SSD SANs.
>
> If you prefer staying with spinning disks, you could still improve the seeks 
> by focusing on the inner cylinders and potentially by using variable sector 
> formatting. Again, there's SANs that do this for you.
>
> Another minor trick is to turn off access timestamp updates when you mount 
> the filesystem (noatime).
>
> Also benchmark different filesystems, there's major differences between them. 
> I've heard XFS being recommended, but I've never needed to benchmark for seek 
> times myself. We're using IBM's commercial GPFS here, which is good with 
> enormous amounts of huge files (media farm here), not sure how it'd fare with 
> smaller files.
>
> Hope that helps,
> Johan
>
> - Original Message -
>> From: "Chris Knipe" 
>> To: mysql@lists.mysql.com
>> Sent: Thursday, 25 July, 2013 11:53:53 PM
>> Subject: hypothetical question about data storage
>>
>> Hi all,
>>
>> We run an VERY io intensive file application service.  Currently, our
>> problem is that our disk spindles are being completely killed due to
>> insufficient SEEK time on the hard drives (NOT physical read/write
>> speeds).
>>
>> We have an directory structure where the files are stored based on
>> the MD5
>> checksum of the file name, i.e.
>> /0/00/000/44533779fce5cf3497f87de1d060
>> The majority of these files, are between 256K and 800K with the ODD
>> exception (say less than 15%) being more than 1M but no more than 5M
>> in
>> size.  The content of the files are pure text (MIME Encoded).
>>
>> We believe that storing these files into an InnoDB table, may
>> actually give
>> us better performance:
>> - There is one large file that is being read/written, instead of
>> BILLIONS of
>> small files
>> - We can split the structure so that each directory (4096 in total)
>> sit's on
>> their own database
>> - We can move the databases as load increases, which means that we
>> can
>> potentially run 2 physical database servers, each with 2048 databases
>> each)
>> - It's easy to move / migrate the data due to mysql and replication -
>> same
>> can be said for redundancy of the data
>>
>> We are more than likely looking at BLOB columns of course, and we
>> need to
>> read/write from the DB in excess of 100mbit/s
>>
>> Would the experts consider something like this as being feasible?  Is
>> it
>> worth it to go down this avenue, or are we just going to run into
>> different
>> problems?  If we are facing different problems, what can we possibly
>> expect
>> to go wrong here?
>>
>> Many thanks, and I look forward to any input.
>>
>
> --
> Unhappiness is discouraged and will be corrected with kitten pictures.



-- 

Regards,
Chris Knipe




Re: hypothetical question about data storage

2013-07-26 Thread Johan De Meersman
Hey Chris,

I'm afraid that this is not what databases are for, and the first thing you'll 
likely run into is the number of concurrent connections.

This is typically something you should really tackle from a systems 
perspective. Seek times are dramatically improved on SSD or similar storage - 
think FusionIO cards, but there's also a couple of vendors (Violin comes to 
mind) who provide full-blown SSD SANs.

If you prefer staying with spinning disks, you could still improve the seeks by 
focusing on the inner cylinders and potentially by using variable sector 
formatting. Again, there's SANs that do this for you.

Another minor trick is to turn off access timestamp updates when you mount the 
filesystem (noatime).

Also benchmark different filesystems; there are major differences between them. 
I've heard XFS being recommended, but I've never needed to benchmark for seek 
times myself. We're using IBM's commercial GPFS here, which is good with 
enormous amounts of huge files (media farm here), not sure how it'd fare with 
smaller files.

Hope that helps,
Johan

- Original Message -
> From: "Chris Knipe" 
> To: mysql@lists.mysql.com
> Sent: Thursday, 25 July, 2013 11:53:53 PM
> Subject: hypothetical question about data storage
> 
> Hi all,
> 
> We run an VERY io intensive file application service.  Currently, our
> problem is that our disk spindles are being completely killed due to
> insufficient SEEK time on the hard drives (NOT physical read/write
> speeds).
> 
> We have an directory structure where the files are stored based on
> the MD5
> checksum of the file name, i.e.
> /0/00/000/44533779fce5cf3497f87de1d060
> The majority of these files, are between 256K and 800K with the ODD
> exception (say less than 15%) being more than 1M but no more than 5M
> in
> size.  The content of the files are pure text (MIME Encoded).
> 
> We believe that storing these files into an InnoDB table, may
> actually give
> us better performance:
> - There is one large file that is being read/written, instead of
> BILLIONS of
> small files
> - We can split the structure so that each directory (4096 in total)
> sit's on
> their own database
> - We can move the databases as load increases, which means that we
> can
> potentially run 2 physical database servers, each with 2048 databases
> each)
> - It's easy to move / migrate the data due to mysql and replication -
> same
> can be said for redundancy of the data
> 
> We are more than likely looking at BLOB columns of course, and we
> need to
> read/write from the DB in excess of 100mbit/s
> 
> Would the experts consider something like this as being feasible?  Is
> it
> worth it to go down this avenue, or are we just going to run into
> different
> problems?  If we are facing different problems, what can we possibly
> expect
> to go wrong here?
> 
> Many thanks, and I look forward to any input.
> 

-- 
Unhappiness is discouraged and will be corrected with kitten pictures.




Re: hypothetical question about data storage

2013-07-25 Thread Vahric Muhtaryan
Hi,
Sorry, but MySQL is not the right tool for this - use Riak instead of MySQL.
Riak is key/value based; all keys are kept in memory, and a single seek is
enough to fetch a value.
Consider using Riak.
VM

On 7/26/13 12:53 AM, "Chris Knipe"  wrote:

>Hi all,
>
>We run an VERY io intensive file application service.  Currently, our
>problem is that our disk spindles are being completely killed due to
>insufficient SEEK time on the hard drives (NOT physical read/write
>speeds).
>
>We have an directory structure where the files are stored based on the MD5
>checksum of the file name, i.e. /0/00/000/44533779fce5cf3497f87de1d060
>The majority of these files, are between 256K and 800K with the ODD
>exception (say less than 15%) being more than 1M but no more than 5M in
>size.  The content of the files are pure text (MIME Encoded).
>
>We believe that storing these files into an InnoDB table, may actually
>give
>us better performance:
>- There is one large file that is being read/written, instead of BILLIONS
>of
>small files
>- We can split the structure so that each directory (4096 in total) sit's
>on
>their own database
>- We can move the databases as load increases, which means that we can
>potentially run 2 physical database servers, each with 2048 databases
>each)
>- It's easy to move / migrate the data due to mysql and replication - same
>can be said for redundancy of the data
>
>We are more than likely looking at BLOB columns of course, and we need to
>read/write from the DB in excess of 100mbit/s
>
>Would the experts consider something like this as being feasible?  Is it
>worth it to go down this avenue, or are we just going to run into
>different
>problems?  If we are facing different problems, what can we possibly
>expect
>to go wrong here?
>
>Many thanks, and I look forward to any input.
>
>--
>Chris.
>
>
>
>






hypothetical question about data storage

2013-07-25 Thread Chris Knipe
Hi all,

We run a VERY IO-intensive file application service.  Currently, our
problem is that our disk spindles are being completely killed by SEEK time
on the hard drives (NOT physical read/write speeds).

We have a directory structure where the files are stored based on the MD5
checksum of the file name, i.e. /0/00/000/44533779fce5cf3497f87de1d060
The majority of these files are between 256K and 800K, with the ODD
exception (say less than 15%) being more than 1M but no more than 5M in
size.  The content of the files is pure text (MIME encoded).

We believe that storing these files in an InnoDB table may actually give
us better performance:
- There is one large file that is being read/written, instead of BILLIONS of
small files
- We can split the structure so that each directory (4096 in total) sits on
its own database (see the sketch after this list)
- We can move the databases as load increases, which means that we can
potentially run 2 physical database servers, each with 2048 databases
- It's easy to move / migrate the data thanks to MySQL replication - the same
can be said for redundancy of the data
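
A sketch of how the hash could drive both the on-disk path and the 4096-way
database split (my own illustration of the scheme described above, not the
actual application code):

    # shard_layout.py - derive the storage path and a 0..4095 shard from a name.
    import hashlib

    def locate(file_name):
        h = hashlib.md5(file_name.encode()).hexdigest()
        path  = "/%s/%s/%s/%s" % (h[0], h[:2], h[:3], h)   # filesystem layout
        shard = int(h[:3], 16)                             # 16^3 = 4096 databases
        return path, "filestore_%04d" % shard

    path, db = locate("some-article-id")
    print(path)   # /x/xx/xxx/<full md5>
    print(db)     # filestore_0000 .. filestore_4095, 2048 per physical server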

We are more than likely looking at BLOB columns, of course, and we need to
read/write from the DB in excess of 100 Mbit/s

Would the experts consider something like this as being feasible?  Is it
worth it to go down this avenue, or are we just going to run into different
problems?  If we are facing different problems, what can we possibly expect
to go wrong here?

Many thanks, and I look forward to any input.

--
Chris.


