Re: Very slow LVM performance
Arcady Genkin put forth on 7/12/2010 10:49 PM:
> After dealing with all the idiosyncrasies of iSCSI and software RAID
> under Linux I am a bit skeptical whether what we are building is going
> to actually be better than a black-box fiber-attached RAID solution,
> but it surely is cheaper and more expandable.

I share your skepticism. Cheaper in initial acquisition cost, yes, but maybe
not in long-term reliability and serviceability. Have you performed manual
catastrophic iSCSI target node failure tests yet, and monitored the
node/disk/array reconstruction process to verify it all works as expected
without user interruption? This is always the main concern with homegrown
storage systems of this nature, and where "black box" solutions typically
prove themselves more cost effective (at least in user good will $$) than
home-brew solutions.

I myself am a fan of Nexsan storage arrays. They offer some of the least
expensive and most feature-rich and performant FC and iSCSI arrays on the
market. Given what you've built, it would appear the SATABeast would fit
your needs: 42 SATA drives in a 4U chassis, dual controllers with 4 x 4Gb FC
ports and 4 x 1GbE iSCSI ports, 600MB/s sustained per controller, 1.2GB/s
with both controllers, up to 4GB read/write battery-backed cache per
controller, and web management/SNMP/email alerts via a 10/100 management
ethernet port. The web management interface is particularly nice, making it
almost too easy to configure and manage arrays and LUN assignments.

http://www.nexsan.com/satabeast.php

One of these will run somewhere between $20-40k depending on disk
qty/size/rpm and whether you want/need both controllers. They also offer an
SAS version with 15krpm drives at higher cost. I've installed a couple of
the single-controller SATABeast models and the discontinued SATABlade model.
They've performed flawlessly, no drive failures to date. Last I checked,
Nexsan still uses only Hitachi (formerly IBM) UltraStar drives. Good
product/solution all around.
If you end up in the market for a "black box" storage solution after all,
I'd recommend you start your search with Nexsan. I'm not selling here, just
a very happy customer.

-- 
Stan

-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/4c3ce8e8.1010...@hardwarefreak.com
Re: Very slow LVM performance
On Mon, Jul 12, 2010 at 22:28, Stan Hoeppner wrote:
> I'm curious as to why you're (apparently) wasting 2/3 of your storage for
> redundancy. Have you considered a straight RAID 10 across those 30
> disks/LUNs?

This is a very good question. And the answer is: because Linux's MD does not
implement RAID10 the way we expected (as you have found out for yourself).
We started out thinking exactly that: we'd have a RAID10 stripe with
cardinality of 3, instead of the multi-layered MD design. But for us it's
important to have full control over which physical disks form the triplets
(see below for discussion); MD's so-called RAID10 only guarantees that there
will be exactly N copies of each chunk on N different drives, but makes no
promise as to *which* drives.

The reason the drive assignment is important to us is that we can achieve
more data redundancy if we form each triplet from iSCSI disks that live on
different iSCSI targets (hosts). Suppose that you have six iSCSI target
hosts h0 through h5, and each of them has five disks d0 through d4. If you
form the first triplet as (h0:d0, h1:d0, h2:d0), and so forth until (h3:d4,
h4:d4, h5:d4), then if any iSCSI host goes down for whatever reason, all
triplets still stay up and are still redundant, only running on two copies
instead of three. Linux's RAID10 implementation did not allow us to do
this, so we had to layer: first creating RAID1 (or RAID10 with n=3)
triplets, then striping them together in a higher layer.

> I'm also curious as to why you're running software RAID at all given the
> fact that pretty much every iSCSI target is itself an array controller
> with built in hardware RAID. Can you tell us a little bit about your
> iSCSI target devices?

Our boss wanted us to only use commodity hardware to build this solution,
so we don't employ any fancy RAID controllers - all drives are connected to
on-board SATA ports.
Staying away from the "black box" implementations as much as possible was
also part of the wish list. After dealing with all the idiosyncrasies of
iSCSI and software RAID under Linux I am a bit skeptical whether what we
are building is going to actually be better than a black-box fiber-attached
RAID solution, but it surely is cheaper and more expandable.

-- 
Arcady Genkin

Archive: http://lists.debian.org/aanlktinwfqgcutlg6rm0c5pmgtgzwnvfxc-kbzsdq...@mail.gmail.com
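The host/disk assignment Arcady describes (each triplet built from the same
disk slot on three different target hosts) can be written down as a tiny
shell function. This is my own sketch of one consistent reading of the
scheme, not something from the thread; the hN:dM strings are just labels
for "host N, disk M".

```shell
# Sketch of the layout described above: triplets 0-4 draw disk dN from
# hosts h0-h2, triplets 5-9 draw disk dN from hosts h3-h5, so no host
# contributes two members to the same triplet. Losing any one host then
# degrades every affected triplet by one copy instead of killing one.
triplet_members() {  # arg: triplet number 0-9
    n=$1
    d=$((n % 5))              # which disk slot on each host
    base=$(( (n / 5) * 3 ))   # first host of the 3-host group
    echo "h$base:d$d h$((base+1)):d$d h$((base+2)):d$d"
}

triplet_members 0   # h0:d0 h1:d0 h2:d0
triplet_members 9   # h3:d4 h4:d4 h5:d4
```

The point of spelling it out is that the mapping is a pure function of the
triplet number, so the mdadm --create commands for all ten triplets can be
generated mechanically instead of typed by hand.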
Re: Very slow LVM performance
On 07/12/2010 06:26 PM, Stan Hoeppner wrote:
> Now, you can argue what RAID 10 is from now until you are blue in the
> face, and the list is tired of hearing it. But that won't change the
> industry definition of RAID 10. It's been well documented for over 15
> years and won't be changing any time soon.

Your lack of understanding of the content and subject matter is rather
unfortunate.

-- 
. O . O . O . . O O . . . O .
. . O . O O O . O . O O . . O
O O O . O . . O O O O . O O O
Re: Very slow LVM performance
On Mon, Jul 12, 2010 at 20:06, Stan Hoeppner wrote:
> I had the same reaction Mike. Turns out mdadm actually performs RAID 1E
> with 3 disks when you specify RAID 10. I'm not sure what, if any, benefit
> RAID 1E yields here--almost nobody uses it.

The people who are surprised to see us do RAID10 over three devices
probably overlooked that we do RAID10 with cardinality of 3, which, in
combination with "--layout=n3", is almost an equivalent of creating a
three-way RAID1 mirror. I'm saying "almost" because it's equivalent in as
much as each of the three disks is an exact copy of the others, but the
difference is in performance.

We found out empirically (and then confirmed by reading a number of posts
on the 'net) that MD does not implement RAID1 in, let's say, the most
desirable way. In particular, it does not make use of the data redundancy
for reading when you have only one process doing the reading. In other
words, if you have a three-way RAID1 mirror and only one reader process, MD
reads from only one of the disks, so you get no performance benefit from
the mirror. If you have more than one large read, or more than one process
reading, then MD does the right thing and uses the disks in what seems to
be a round-robin fashion (I may be wrong about this). When we tried using
RAID10 with n=3 instead of RAID1, we saw much better performance. And we
verified that all three disks are bit-for-bit exact copies.

> I just hope the OP gets prompt and concise drive failure information the
> instant one goes down, and has a tested array rebuild procedure in place.
> Rebuilding a failed drive in this kind of setup may get a bit hairy.

Actually, it's the other way around, because you get quite a bit of
redundancy from the three-way mirroring. You are still redundant if you
lose just one drive, and we are planning to have about four global hot
spares standing by in case a drive fails.
-- 
Arcady Genkin

Archive: http://lists.debian.org/aanlktil-nzmsyi8ubqnblrbhqkvdr4angpgsvi9uh...@mail.gmail.com
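The single-reader behavior Arcady describes can be checked with a quick
experiment along these lines. This is only a sketch, not from the thread:
the device name and sizes are illustrative, and it needs a real md mirror
(and root) to run against.

```shell
# One sequential reader: md RAID1 serves the whole stream from a single
# mirror leg, so throughput is roughly one spindle's worth.
dd if=/dev/md0 of=/dev/null bs=1M count=4096

# Three concurrent readers at different offsets: md can direct each
# stream to a different leg, so aggregate throughput should rise.
for skip in 0 4096 8192; do
    dd if=/dev/md0 of=/dev/null bs=1M count=4096 skip=$skip &
done
wait
```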
Re: Very slow LVM performance
Arcady Genkin put forth on 7/12/2010 12:45 PM:
> I just tried to use LVM for striping the RAID1 triplets together
> (instead of MD). Using the following three commands to create the
> logical volume, I get 550 MB/s sequential read speed, which is quite a
> bit faster than before, but is still 10% slower than what a plain MD
> RAID0 stripe can do with the same disks (612 MB/s).
>
> pvcreate /dev/md{0,5,1,6,2,7,3,8,4,9}
> vgcreate vg0 /dev/md{0,5,1,6,2,7,3,8,4,9}
> lvcreate -i 10 -I 1024 -l 102390 vg0
>
> test4:~# dd of=/dev/null bs=8K count=2500000 if=/dev/vg0/lvol0
> 2500000+0 records in
> 2500000+0 records out
> 20480000000 bytes (20 GB) copied, 37.2381 s, 550 MB/s
>
> I would still like to know why LVM on top of RAID0 performs so poorly
> in our case.

I'm curious as to why you're (apparently) wasting 2/3 of your storage for
redundancy. Have you considered a straight RAID 10 across those 30
disks/LUNs? Performance should be enhanced by about 50% or more over your
current setup (assuming you're not hitting your ethernet b/w limits
currently), and you'd only be losing half your storage to fault tolerance
instead of 2/3rds of it. RAID 10 has the highest fault tolerance of all
standard RAID levels and higher performance than anything but a straight
stripe.

I'm guessing lvm wouldn't have any problems atop a straight mdadm RAID 10
across those 30 disks. I'm also guessing the previous lvm problem you had
was probably due to running it atop nested mdadm RAID devices. Straight
mdadm RAID 10 doesn't create or use nested devices.

I'm also curious as to why you're running software RAID at all given the
fact that pretty much every iSCSI target is itself an array controller with
built in hardware RAID. Can you tell us a little bit about your iSCSI
target devices?

-- 
Stan

Archive: http://lists.debian.org/4c3bcf4a.6090...@hardwarefreak.com
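For reference, Stan's "straight RAID 10 across those 30 disks" suggestion
would look something like the following. This is my sketch, not from the
thread: the device names are placeholders, and it obviously needs the 30
iSCSI disks logged in before it can run.

```shell
# One flat md RAID 10 over all 30 iSCSI disks with two copies per chunk
# (the default "near" n2 layout): half the raw capacity is usable,
# versus one third with the 3-way triplets, and there is no nesting.
mdadm --create /dev/md10 -v --level=raid10 --layout=n2 --chunk=1024 \
      --raid-devices=30 /dev/sd[b-z] /dev/sda[a-e]

# LVM then sits on a single, non-nested PV:
pvcreate /dev/md10
vgcreate vg0 /dev/md10
```

The tradeoff against the thread's design is control: md chooses which two
drives hold each chunk's copies, so one cannot guarantee that the copies
land on different iSCSI target hosts.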
Re: Very slow LVM performance
Aaron Toponce put forth on 7/12/2010 6:56 PM:
> The argument is not whether Linux software RAID 10 is standard or not,
> but the requirement of the number of disks that Linux software RAID
> supports. In this case, it supports 2+ disks, regardless what its
> "effectiveness" is.

Yes, it is the argument. The argument is about ensuring _accurate_
information is presented here for the benefit of others who will go
searching for this information. The _accurate_ information is that Linux
software md RAID 10 on anything less than 4 disks, or using the md RAID 10
"F2" layout on any number of disks, is not standard RAID 10. That is a very
important distinction to make, and that's the reason I'm making it. That's
what the current "argument" is about.

I made the statement that you can't run RAID 10 on 3 disks, and I, and the
list, were told that the information I presented was "incorrect". It wasn't
incorrect at all. The information presented in rebuttal to it is what was
incorrect. I'm setting the record straight.

Now, you can argue what RAID 10 is from now until you are blue in the face,
and the list is tired of hearing it. But that won't change the industry
definition of RAID 10. It's been well documented for over 15 years and
won't be changing any time soon.

-- 
Stan

Archive: http://lists.debian.org/4c3bb29b.20...@hardwarefreak.com
Re: Very slow LVM performance
Roger Leigh put forth on 7/12/2010 5:45 PM:
> Have a closer look at lvcreate(8). The last arguments are:
>
>   [-Z|--zero y|n] VolumeGroupName [PhysicalVolumePath[:PE[-PE]]...]

Good catch. As I said, I've never used it before, so I wasn't exactly sure
how it all fits. It seemed logical, when he went from testing the mdadm
device to the lvm volume and lost almost exactly 10x, that a striping issue
wrt lvm might be in play.

> AFAICT the striping options are entirely pointless when layered on
> RAID, and could be responsible for the performance issues if it
> can have a negative impact (such as thrashing the disks if you
> tell it to write multiple stripes to a single disc).

I would have thought so as well, but didn't understand the exact function
of -i at the time. I thought it was more like the xfs "-d sw=" switch.

From another post it looks like the OP is making some good progress,
although there are still some minor questions unanswered.

-- 
Stan

Archive: http://lists.debian.org/4c3bb03a.3050...@hardwarefreak.com
Re: Very slow LVM performance
Mike Bird put forth on 7/12/2010 4:00 PM:
> On Mon July 12 2010 12:45:57 Arcady Genkin wrote:
>> Creating the ten 3-way RAID1 triplets - for N in 0 through 9:
>> mdadm --create /dev/mdN -v --raid-devices=3 --level=raid10 \
>>   --layout=n3 --metadata=0 --bitmap=internal --bitmap-chunk=2048 \
>>   --chunk=1024 /dev/sdX /dev/sdY /dev/sdZ
>
> RAID 10 with three devices?

I had the same reaction Mike. Turns out mdadm actually performs RAID 1E
with 3 disks when you specify RAID 10. I'm not sure what, if any, benefit
RAID 1E yields here--almost nobody uses it. RAID 0 over (10 * RAID 1E) over
6 iSCSI targets isn't something I've ever seen anyone do. Not saying it's
bad, just...unique.

I just hope the OP gets prompt and concise drive failure information the
instant one goes down, and has a tested array rebuild procedure in place.
Rebuilding a failed drive in this kind of setup may get a bit hairy.

-- 
Stan

Archive: http://lists.debian.org/4c3badff.9050...@hardwarefreak.com
Re: Very slow LVM performance
On 7/12/2010 5:52 PM, Stan Hoeppner wrote:
> Aaron Toponce put forth on 7/12/2010 5:16 PM:
>> On 7/12/2010 4:13 PM, Stan Hoeppner wrote:
>>> Is that a typo, or are you turning those 3 disk mdadm sets into RAID10
>>> as shown above, instead of the 3-way mirror sets you stated previously?
>>> RAID 10 requires a minimum of 4 disks, you have 3. Something isn't
>>> right here...
>>
>> Incorrect. The Linux RAID implementation can do level 10 across 3 disks.
>> In fact, it can even do it across 2 disks.
>
> Only throw the bold "incorrect" or "correct" statements around when you
> really know the subject material. You don't. Linux md RAID 10 is not
> standard RAID 10 when used on 2 and 3 drives. When used on 3 drives it's
> actually RAID 1E, and on two drives it's the same as RAID 1. Another
> Wikipedia article linked within the one you quoted demonstrates this.
> Note the page title "Non-standard_RAID_levels".

The argument is not whether Linux software RAID 10 is standard or not, but
the requirement of the number of disks that Linux software RAID supports.
In this case, it supports 2+ disks, regardless of what its "effectiveness"
is. Try again.
Re: Very slow LVM performance
Aaron Toponce put forth on 7/12/2010 5:16 PM:
> On 7/12/2010 4:13 PM, Stan Hoeppner wrote:
>> Is that a typo, or are you turning those 3 disk mdadm sets into RAID10
>> as shown above, instead of the 3-way mirror sets you stated previously?
>> RAID 10 requires a minimum of 4 disks, you have 3. Something isn't
>> right here...
>
> Incorrect. The Linux RAID implementation can do level 10 across 3 disks.
> In fact, it can even do it across 2 disks.

Only throw the bold "incorrect" or "correct" statements around when you
really know the subject material. You don't. Linux md RAID 10 is not
standard RAID 10 when used on 2 and 3 drives. When used on 3 drives it's
actually RAID 1E, and on two drives it's the same as RAID 1. Another
Wikipedia article linked within the one you quoted demonstrates this. Note
the page title "Non-standard_RAID_levels".

http://en.wikipedia.org/wiki/Non-standard_RAID_levels

  Linux MD RAID 10

  The Linux kernel software RAID driver (called md, for "multiple device")
  can be used to build a classic RAID 1+0 array, but also (since version
  2.6.9) as a single level[4] with some interesting extensions[5]. The
  standard "near" layout, where each chunk is repeated n times in a k-way
  stripe array, is equivalent to the standard RAID-10 arrangement, but it
  does not require that n divide k. For example, an n2 layout on 2, 3 and
  4 drives would look like:

    2 drives     3 drives       4 drives
    --------     ----------     --------------
    A1  A1       A1  A1  A2     A1  A1  A2  A2
    A2  A2       A2  A3  A3     A3  A3  A4  A4
    A3  A3       A4  A4  A5     A5  A5  A6  A6
    A4  A4       A5  A6  A6     A7  A7  A8  A8
    ..  ..       ..  ..  ..     ..  ..  ..  ..

  *The 4-drive example is identical to a standard RAID-1+0 array, while
  the 3-drive example is a software implementation of RAID-1E. The 2-drive
  example is equivalent to RAID 1.*

-- 
Stan

Archive: http://lists.debian.org/4c3baabe.2080...@hardwarefreak.com
Re: Very slow LVM performance
On Mon July 12 2010 15:16:47 Aaron Toponce wrote:
> Incorrect. The Linux RAID implementation can do level 10 across 3 disks.
> In fact, it can even do it across 2 disks.
>
> http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10

Thanks, I learned something new today. Now I guess the question is, does
LVM understand the performance implications of 10 RAID-1E PV's, or would
the OP be better off assigning his 30 devices as 15 RAID-1 PV's?

--Mike Bird

Archive: http://lists.debian.org/201007121559.05684.mgb-deb...@yosemite.net
Re: Very slow LVM performance
On Mon, Jul 12, 2010 at 05:13:16PM -0500, Stan Hoeppner wrote:
> Arcady Genkin put forth on 7/12/2010 11:52 AM:
> > On Mon, Jul 12, 2010 at 02:05, Stan Hoeppner wrote:
> >
> >> lvcreate -i 10 -I [stripe_size] -l 102389 vg0
> >>
> >> I believe you're losing 10x performance because you have a 10 "disk"
> >> mdadm stripe but you didn't inform lvcreate about this fact.
> >
> > I believe that the -i and -I options are for using *LVM* to do the
> > striping, am I wrong?
>
> If this were the case, lvcreate would require the set of physical or
> pseudo (mdadm) device IDs to stripe across wouldn't it? There are no
> options in lvcreate to specify physical or pseudo devices. The only
> input to lvcreate is a volume group ID. Therefore, lvcreate is ignorant
> of the physical devices underlying it, is it not?

Have a closer look at lvcreate(8). The last arguments are:

  [-Z|--zero y|n] VolumeGroupName [PhysicalVolumePath[:PE[-PE]]...]

So after the VG, you can explicitly specify the exact PVs (and even PE
ranges) within that VG to stripe across with the -i/-I options. I'm unsure
why one would necessarily /want/ to do that.

I run LVM on top of md RAID1. Here, I have a single PV on top of the RAID
array, and I can't see that adding additional striping on top of that would
benefit performance in any way. I can only assume it makes sense if you
/don't/ have underlying RAID and want to tell LVM to stripe over multiple
PVs on different physical discs, which /would/ have some performance impact
since you spread the I/O over multiple discs.

AFAICT the striping options are entirely pointless when layered on RAID,
and could be responsible for the performance issues if they can have a
negative impact (such as thrashing the disks if you tell LVM to write
multiple stripes to a single disc).

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux       http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?
         http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
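To illustrate the lvcreate(8) synopsis Roger quotes: the trailing
PhysicalVolumePath arguments pin an LV, and any -i/-I striping, to
particular devices within the VG. The commands below are my own
hypothetical examples (LV names, sizes, and devices are made up), not
something run in the thread.

```shell
# Stripe a 4GB LV across exactly two named PVs of vg0, 1MB stripe size.
lvcreate -i 2 -I 1024 -L 4G -n lvstriped vg0 /dev/md0 /dev/md1

# Place an LV on specific physical extents of one PV
# (here PEs 0-99 of /dev/md0), using the PV:PE-PE syntax.
lvcreate -l 100 -n lvpinned vg0 /dev/md0:0-99
```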
Re: Very slow LVM performance
On 7/12/2010 4:13 PM, Stan Hoeppner wrote:
> Is that a typo, or are you turning those 3 disk mdadm sets into RAID10
> as shown above, instead of the 3-way mirror sets you stated previously?
> RAID 10 requires a minimum of 4 disks, you have 3. Something isn't right
> here...

Incorrect. The Linux RAID implementation can do level 10 across 3 disks.
In fact, it can even do it across 2 disks.

http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10
Re: Very slow LVM performance
Arcady Genkin put forth on 7/12/2010 11:52 AM:
> On Mon, Jul 12, 2010 at 02:05, Stan Hoeppner wrote:
>
>> lvcreate -i 10 -I [stripe_size] -l 102389 vg0
>>
>> I believe you're losing 10x performance because you have a 10 "disk"
>> mdadm stripe but you didn't inform lvcreate about this fact.
>
> I believe that the -i and -I options are for using *LVM* to do the
> striping, am I wrong?

If this were the case, lvcreate would require the set of physical or pseudo
(mdadm) device IDs to stripe across, wouldn't it? There are no options in
lvcreate to specify physical or pseudo devices. The only input to lvcreate
is a volume group ID. Therefore, lvcreate is ignorant of the physical
devices underlying it, is it not?

> In our case (when LVM sits on top of one RAID0 MD stripe) the option -i
> does not seem to make sense:
>
> test4:~# lvcreate -i 10 -I 1024 -l 102380 vg0
> Number of stripes (10) must not exceed number of physical volumes (1)

It makes sense once you accept the fact that lvcreate is ignorant of the
underlying disk device count/configuration. Once you accept that fact, you
will realize the -i option is what allows one to educate lvcreate that
there are, in your case, 10 devices underlying it which one desires to
stripe data across. I believe the -i option exists merely to educate
lvcreate about the underlying device structure.

> My understanding is that LVM should be agnostic of what's underlying it
> as the physical storage, so it should treat the MD stripe as one large
> disk, and thus let the MD device handle the load balancing (which it
> seems to be doing fine).

If lvcreate is agnostic of the underlying structure, why does it have
stripe width and stripe size options at all? As a parallel example of this,
filesystems such as XFS are ignorant of underlying disk structure as well.
mkfs.xfs has no fewer than 4 sub-options to optimize its performance atop
RAID stripes.
One of its options, sw, specifies stripe width, which is the number of
physical or logical devices in the RAID stripe. In your case, if you use
xfs, this would be "-d sw=10". These options in lvcreate serve the same
function as those in mkfs.xfs, which is to optimize their performance atop
a RAID stripe.

> Besides, the speed we are getting from the LVM volume is less than half
> of what an individual component of the RAID10 stripe can do. Even if we
> assume that LVM somehow manages to distribute its data so that it always
> hits only one physical disk (a disk triplet in our case), there would
> still be the question of why it is doing it *that* slow. It's 57 MB/s vs
> the 134 MB/s that an individual triplet can do:

Forget comparing performance to one of your single mdadm mirror sets.
What's key here, and why I suggested "lvcreate -i 10 .." to begin with, is
the fact that your lvm performance is almost exactly 10 times lower than
the underlying mdadm device, which has exactly 10 physical stripes. Isn't
that more than just a bit coincidental? The 10x drop only occurs when
talking to the lvm device. Put on your Sherlock Holmes hat for a minute.

> We are using a chunk size of 1024 (i.e. 1MB) with the MD devices. For
> the record, we used the following commands to create the md devices:
>
> For N in 0 through 9:
> mdadm --create /dev/mdN -v --raid-devices=3 --level=raid10 \
>   --layout=n3 --metadata=0 --bitmap=internal --bitmap-chunk=2048 \
>   --chunk=1024 /dev/sdX /dev/sdY /dev/sdZ

Is that a typo, or are you turning those 3 disk mdadm sets into RAID10 as
shown above, instead of the 3-way mirror sets you stated previously? RAID
10 requires a minimum of 4 disks, you have 3. Something isn't right here...

> Then the big stripe:
> mdadm --create /dev/md10 -v --raid-devices=10 --level=stripe \
>   --metadata=1.0 --chunk=1024 /dev/md{0,5,1,6,2,7,3,8,4,9}

And I'm pretty sure this is the stripe lvcreate needs to know about to fix
the 10x performance drop issue.
Create a new lvm test volume with the lvcreate options I've mentioned, and
see how it performs against the current 400GB test volume that's running
slow.

-- 
Stan

Archive: http://lists.debian.org/4c3b937c.1080...@hardwarefreak.com
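As a concrete version of the mkfs.xfs analogy Stan draws above: the
filesystem is told the RAID geometry explicitly so its allocations align
with the stripe. This is a sketch using the thread's 10-device, 1MB-chunk
stripe, not a command anyone in the thread actually ran.

```shell
# su = stripe unit (the md chunk size), sw = stripe width (the number
# of data devices in the stripe). XFS then aligns allocation groups and
# large writes to 10 x 1MB stripe boundaries.
mkfs.xfs -d su=1024k,sw=10 /dev/vg0/lvol0
```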
Re: Very slow LVM performance
On Mon July 12 2010 12:45:57 Arcady Genkin wrote:
> Creating the ten 3-way RAID1 triplets - for N in 0 through 9:
> mdadm --create /dev/mdN -v --raid-devices=3 --level=raid10 \
>   --layout=n3 --metadata=0 --bitmap=internal --bitmap-chunk=2048 \
>   --chunk=1024 /dev/sdX /dev/sdY /dev/sdZ

RAID 10 with three devices?

--Mike Bird

Archive: http://lists.debian.org/201007121400.42233.mgb-deb...@yosemite.net
Re: Very slow LVM performance
On 7/12/2010 1:45 PM, Arcady Genkin wrote:
> Creating the ten 3-way RAID1 triplets - for N in 0 through 9:
> mdadm --create /dev/mdN -v --raid-devices=3 --level=raid10 \
>   --layout=n3 --metadata=0 --bitmap=internal --bitmap-chunk=2048 \
>   --chunk=1024 /dev/sdX /dev/sdY /dev/sdZ
>
> Then the big stripe:
> mdadm --create /dev/md10 -v --raid-devices=10 --level=stripe \
>   --metadata=1.0 --chunk=1024 /dev/md{0,5,1,6,2,7,3,8,4,9}

I must admit that I haven't seen a software RAID implementation where you
create multiple devices from the same set of disks, then stripe across
those devices. As such, when using LVM, I'm not exactly sure how the kernel
will handle that, mostly whether it will see the appropriate amount of
disk, and what physical extents it will use to place the data. So for me,
this is uncharted territory. But your commands look sound.

I might suggest changing the default PE size from 4MB to 1MB. That might
help. Worth testing anyway. The PE size can be changed with 'vgcreate -s
1M'.

However, do you really want --bitmap with your mdadm command? I understand
the benefits, but using 'internal' does come with a performance hit.

> From the man page to 'lvcreate' it seems that the -c option sets the
> chunk size for something snapshot-related, so it should have no bearing
> in our performance testing, which involved no snapshots. Am I misreading
> the man page?

Ah yes, you are correct. I should probably pull up the man page before
replying. :)
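Aaron's two suggestions above would look roughly like this as commands.
This is only a sketch assuming the devices from the thread exist; it was
not run by anyone in the discussion.

```shell
# Smaller physical extents: 1MB instead of the 4MB default.
vgcreate -s 1M vg0 /dev/md10

# To measure what the internal write-intent bitmap costs, it can be
# removed from a running triplet and re-added after the test:
mdadm --grow /dev/md0 --bitmap=none
mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=2048
```

The bitmap tradeoff is the usual one: extra metadata writes on every dirty
region in exchange for fast resync after an unclean shutdown, which matters
quite a bit for large arrays reached over iSCSI.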
Re: Very slow LVM performance
On Mon, Jul 12, 2010 at 14:54, Aaron Toponce wrote:
> Can you provide the commands from start to finish when building the
> volume?
>
> fdisk ...
> mdadm ...
> pvcreate ...
> vgcreate ...
> lvcreate ...

Hi, Aaron,

I already provided all of the above commands in earlier messages (except
for fdisk, since we are giving the entire disks to MD, not partitions).
I'll repeat them here for your convenience.

Creating the ten 3-way RAID1 triplets - for N in 0 through 9:

mdadm --create /dev/mdN -v --raid-devices=3 --level=raid10 \
  --layout=n3 --metadata=0 --bitmap=internal --bitmap-chunk=2048 \
  --chunk=1024 /dev/sdX /dev/sdY /dev/sdZ

Then the big stripe:

mdadm --create /dev/md10 -v --raid-devices=10 --level=stripe \
  --metadata=1.0 --chunk=1024 /dev/md{0,5,1,6,2,7,3,8,4,9}

Then the LVM business:

pvcreate /dev/md10
vgcreate vg0 /dev/md10
lvcreate -l 102389 vg0

Note that the file system is not being created on top of LVM at this point,
and I ran the test by simply dd-ing /dev/vg0/lvol0.

> My experience has been that LVM will introduce about a 1-2% performance
> hit compared to not using it

This is what we were expecting; it's encouraging.

> On a side note, I've never seen any reason to increase or decrease the
> chunk size with software RAID. However, you may want to match your chunk
> size with '-c' for 'lvcreate'.

We have tested a variety of chunk sizes (from 64K to 4MB) with bonnie++ and
found that 1MB chunks worked the best for our usage, which is a general
purpose NFS server, so it's mainly small random reads. In this scenario
it's best to tune the chunk size to increase the probability that a small
read from the stripe will result in only one read from the disk. If the
chunk size is too small, then a 1KB read has a pretty high chance of being
fragmented between two chunks and thus requiring two I/Os to service
instead of one (and, thus, most likely two drive head seeks instead of just
one). Modern commodity drives can do only about 100-120 seeks per second.
But this is a side note for your side note. :))

From the man page to 'lvcreate' it seems that the -c option sets the chunk
size for something snapshot-related, so it should have no bearing in our
performance testing, which involved no snapshots. Am I misreading the man
page?

Thanks!

-- 
Arcady Genkin

Archive: http://lists.debian.org/aanlktild4umo3vaq7h2fokbsnt8xl2fyi-8vtnpfm...@mail.gmail.com
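Arcady's chunk-size reasoning can be put into rough numbers: for a randomly
placed R-byte read, the chance of straddling a chunk boundary (and costing
a second I/O, likely a second seek) is about (R-1)/chunk_size. The model
and the little function below are mine, not from the thread.

```shell
# Approximate probability that a randomly aligned R-byte read crosses a
# chunk boundary, assuming start offsets uniform within a chunk.
cross_prob() {  # args: read_bytes chunk_bytes
    awk -v r="$1" -v c="$2" 'BEGIN { printf "%.4f\n", (r - 1) / c }'
}

cross_prob 1024 65536     # 64KB chunks: roughly 1.6% of 1KB reads split
cross_prob 1024 1048576   # 1MB chunks: roughly 0.1% split
```

At ~100-120 seeks/s per spindle, shaving split reads from 1.6% to 0.1% is
a small but free win for a small-random-read NFS workload, which is
consistent with the bonnie++ result reported above.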
Re: Very slow LVM performance
On 7/12/2010 11:45 AM, Arcady Genkin wrote:
> I would still like to know why LVM on top of RAID0 performs so poorly
> in our case.

Can you provide the commands from start to finish when building the volume?

fdisk ...
mdadm ...
pvcreate ...
vgcreate ...
lvcreate ...
etc.

My experience has been that LVM will introduce about a 1-2% performance hit
compared to not using it, in many different situations, whether it be on
top of software/hardware RAID or on plain disks/partitions. So, I'm curious
what command-line options you're passing to each of your commands, how you
partitioned/built your disks, and so forth. That might help troubleshoot
why you're seeing such a hit.

On a side note, I've never seen any reason to increase or decrease the
chunk size with software RAID. However, you may want to match your chunk
size with '-c' for 'lvcreate'.
Re: Very slow LVM performance
I just tried to use LVM for striping the RAID1 triplets together (instead
of MD). Using the following three commands to create the logical volume, I
get 550 MB/s sequential read speed, which is quite a bit faster than
before, but is still 10% slower than what a plain MD RAID0 stripe can do
with the same disks (612 MB/s).

pvcreate /dev/md{0,5,1,6,2,7,3,8,4,9}
vgcreate vg0 /dev/md{0,5,1,6,2,7,3,8,4,9}
lvcreate -i 10 -I 1024 -l 102390 vg0

test4:~# dd of=/dev/null bs=8K count=2500000 if=/dev/vg0/lvol0
2500000+0 records in
2500000+0 records out
20480000000 bytes (20 GB) copied, 37.2381 s, 550 MB/s

I would still like to know why LVM on top of RAID0 performs so poorly in
our case.

-- 
Arcady Genkin

Archive: http://lists.debian.org/aanlktilcdxiuexhnmb7jf9cxz9k_2tkvi_2qsjtld...@mail.gmail.com
Re: Very slow LVM performance
On Mon, Jul 12, 2010 at 02:05, Stan Hoeppner wrote:
> lvcreate -i 10 -I [stripe_size] -l 102389 vg0
>
> I believe you're losing 10x performance because you have a 10 "disk"
> mdadm stripe but you didn't inform lvcreate about this fact.

Hi, Stan:

I believe that the -i and -I options are for using *LVM* to do the
striping, am I wrong? In our case (when LVM sits on top of one RAID0 MD
stripe) the option -i does not seem to make sense:

test4:~# lvcreate -i 10 -I 1024 -l 102380 vg0
Number of stripes (10) must not exceed number of physical volumes (1)

My understanding is that LVM should be agnostic of what's underlying it as
the physical storage, so it should treat the MD stripe as one large disk,
and thus let the MD device handle the load balancing (which it seems to be
doing fine).

Besides, the speed we are getting from the LVM volume is less than half of
what an individual component of the RAID10 stripe can do. Even if we assume
that LVM somehow manages to distribute its data so that it always hits only
one physical disk (a disk triplet in our case), there would still be the
question of why it is doing it *that* slow. It's 57 MB/s vs the 134 MB/s
that an individual triplet can do:

test4:~# dd of=/dev/null bs=8K count=2500000 if=/dev/md0
2500000+0 records in
2500000+0 records out
20480000000 bytes (20 GB) copied, 153.084 s, 134 MB/s

> If you specified a chunk size when you created the mdadm RAID 0 stripe,
> then use that chunk size for the lvcreate stripe_size. Again, if
> performance is still lacking, recreate with whatever chunk size you
> specified in mdadm and multiply that by 10.

We are using a chunk size of 1024 (i.e. 1MB) with the MD devices.
For the record, we used the following commands to create the md devices.
For N in 0 through 9:

  mdadm --create /dev/mdN -v --raid-devices=3 --level=raid10 \
    --layout=n3 --metadata=0 --bitmap=internal --bitmap-chunk=2048 \
    --chunk=1024 /dev/sdX /dev/sdY /dev/sdZ

Then the big stripe:

  mdadm --create /dev/md10 -v --raid-devices=10 --level=stripe \
    --metadata=1.0 --chunk=1024 /dev/md{0,5,1,6,2,7,3,8,4,9}

Thanks,
--
Arcady Genkin

--
Archive: http://lists.debian.org/aanlktilk5for3gq2w9kvajfe7vgzvqmagyjjbkvfl...@mail.gmail.com
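[Editor's note: the "for N in 0 through 9" step above can be spelled out as an actual shell loop. The three member-disk names per triplet are hypothetical placeholders (the real /dev/sdX mapping is site-specific), so the sketch below only echoes the commands instead of running them:]

```shell
# Print (do not run) the ten per-triplet mdadm commands from above.
# DISK0..DISK2 are hypothetical placeholders for each triplet's three
# iSCSI-backed member disks -- substitute the real /dev/sdX names.
for N in $(seq 0 9); do
    echo mdadm --create /dev/md$N -v --raid-devices=3 --level=raid10 \
        --layout=n3 --metadata=0 --bitmap=internal --bitmap-chunk=2048 \
        --chunk=1024 /dev/DISK0 /dev/DISK1 /dev/DISK2
done
```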
Re: Very slow LVM performance
Arcady Genkin put forth on 7/11/2010 10:46 PM:
> lvcreate -l 102389 vg0

Should be:

  lvcreate -i 10 -I [stripe_size] -l 102389 vg0

I believe you're losing 10x performance because you have a 10 "disk" mdadm
stripe but you didn't inform lvcreate about this fact. Delete the vg, and
then recreate the vg with the above command line, specifying 64 for the
stripe size (the mdadm default). If performance is still lacking, recreate
it again with 640 for the stripe size. (I'm not exactly sure of the
relationship between mdadm chunk size and lvm stripe size--it's either
equal, or it's mdadm stripe width * mdadm chunk size.)

If you specified a chunk size when you created the mdadm RAID 0 stripe,
then use that chunk size for the lvcreate stripe_size. Again, if
performance is still lacking, recreate with whatever chunk size you
specified in mdadm and multiply that by 10.

Hope this helps. Let us know.
--
Stan

--
Archive: http://lists.debian.org/4c3ab09a.4090...@hardwarefreak.com
Very slow LVM performance
I'm seeing a 10-fold performance hit when using an LVM2 logical volume
that sits on top of a RAID0 stripe. Using dd to read directly from the
stripe (i.e. a large sequential read) I get speeds over 600 MB/s. Reading
from the logical volume using the same method gives only around 57 MB/s.
I am new to LVM, and I need it for the snapshots. Would anyone suggest
where to start looking for the problem?

The server runs the amd64 version of Lenny. Most packages (including
lvm2) are stock from Lenny, but we had to upgrade the kernel to the one
from lenny-backports (2.6.32).

There are ten RAID1 triplets: md0 through md9 (that's 30 physical disks
arranged into ten 3-way mirrors), connected over iSCSI from six targets.
The ten triplets are then striped together into a RAID0 stripe /dev/md10.
I don't think we have any issues with the MD layers, because each of them
seems to perform fairly well; it's when we add LVM into the soup that the
speeds start getting slow.

test4:~# uname -a
Linux test4 2.6.32-bpo.4-amd64 #1 SMP Thu Apr 8 10:20:24 UTC 2010 x86_64 GNU/Linux

test4:~# dd of=/dev/null bs=8K count=2500000 if=/dev/md10
2500000+0 records in
2500000+0 records out
20480000000 bytes (20 GB) copied, 33.4619 s, 612 MB/s

test4:~# dd of=/dev/null bs=8K count=2500000 if=/dev/vg0/lvol0
2500000+0 records in
2500000+0 records out
20480000000 bytes (20 GB) copied, 354.951 s, 57.7 MB/s

I used the following commands to create the volume group:

  pvcreate /dev/md10
  vgcreate vg0 /dev/md10
  lvcreate -l 102389 vg0

Here's what LVM reports of its devices:

test4:~# pvdisplay
  --- Physical volume ---
  PV Name               /dev/md10
  VG Name               vg0
  PV Size               399.96 GB / not usable 4.00 MB
  Allocatable           yes (but full)
  PE Size (KByte)       4096
  Total PE              102389
  Free PE               0
  Allocated PE          102389
  PV UUID               ocIGdd-cqcy-GNQl-jxRo-FHmW-THMi-fqofbd

test4:~# vgdisplay
  --- Volume group ---
  VG Name               vg0
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  2
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               399.96 GB
  PE Size               4.00 MB
  Total PE              102389
  Alloc PE / Size       102389 / 399.96 GB
  Free PE / Size        0 / 0
  VG UUID               o2TeAm-gPmZ-VvJc-OSfU-quvW-OB3a-y1pQaB

test4:~# lvdisplay
  --- Logical volume ---
  LV Name                /dev/vg0/lvol0
  VG Name                vg0
  LV UUID                Q3nA6w-0jgw-ImWY-IYJK-kvMJ-aybW-GAdoOs
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                399.96 GB
  Current LE             102389
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:0

Many thanks in advance for any pointers!
--
Arcady Genkin

--
Archive: http://lists.debian.org/aanlktiksmhwitdv1_iji72tak_1irx9dxpj2mccah...@mail.gmail.com
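[Editor's note: one detail in the lvdisplay output above is worth a look, though no one in the thread raises it: the LV's read-ahead is 256 sectors (128 KiB), which is tiny compared with the stripe geometry (10 triplets * 1 MiB chunk = 10 MiB per full stripe), so sequential reads through the LV may never keep all ten triplets busy. This is a guess, not an established diagnosis, but comparing and raising the LV's read-ahead with blockdev would be a cheap experiment. The arithmetic:]

```shell
# Compare the LV's read-ahead window with one full stripe of md10.
# 256 sectors comes from the lvdisplay output above; 10 disks * 1 MiB
# chunk comes from the mdadm commands earlier in the thread.
ra_bytes=$((256 * 512))               # 131072 bytes = 128 KiB
stripe_bytes=$((10 * 1024 * 1024))    # 10485760 bytes = 10 MiB
echo "LV read-ahead: $ra_bytes bytes"
echo "full stripe:   $stripe_bytes bytes"
# If this is the bottleneck, something like the following (not run here)
# would match them: blockdev --setra 20480 /dev/vg0/lvol0
```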