On 6/14/2012 8:02 AM, Ramon Hofer wrote:

> AF drives are Advanced Format drives with more than 512 bytes per
> sector right?
Correct. Advanced Format is the industry-wide name chosen for drives
that have 4096B physical sectors but present 512B sectors at the
interface level, doing the translation internally, "transparently".

> I don't trust anybody ;-)

Good for you! :)

> Here's what I was referring to:
> http://www.mythtv.org/docs/mythtv-HOWTO-3.html
> JFS is the absolute best at
> deletion, so you may want to try it if XFS gives you problems.

Interesting. Let's see:

~$ time dd if=/dev/zero of=myth-test bs=8192 count=512000
512000+0 records in
512000+0 records out
4194304000 bytes (4.2 GB) copied, 50.1455 s, 83.6 MB/s

real    0m50.167s
user    0m1.560s
sys     0m43.915s

-rw-r--r-- 1 root root 4.0G Jun 15 04:52 myth-test

~$ echo 3 > /proc/sys/vm/drop_caches
~$ time rm myth-test; sync

real    0m0.027s
user    0m0.000s
sys     0m0.004s

XFS and the kernel block layer required 4ms to perform the 4GB file
delete. The disk access required 23ms. What does this say about the
JFS claim? I simply don't get the "if XFS gives you problems" bit.
The author was obviously nothing close to a filesystem expert.

> I additionally found a forum post from four years ago where someone
> states that xfs has problems with interrupted power supply:
> http://www.linuxquestions.org/questions/linux-general-1/xfs-or-jfs-685745/#post3352854

"I found a forum post from 4 years ago"

Myths, lies, and fairy tales. There was an XFS bug related to power
failure that was fixed over a year before this forum post was made.
Note that nobody in that thread posts anything from the authoritative
source, as I do here:

http://www.xfs.org/index.php/XFS_FAQ#Q:_Why_do_I_see_binary_NULLS_in_some_files_after_recovery_when_I_unplugged_the_power.3F

> "I only advise XFS if you have any means to guarantee uninterrupted
> power supply. It's not the most resistant fs when it comes to power
> outages."

I advise using a computer only if you have a UPS, no matter what
filesystem you use.
It's incredible that this guy would make such a statement instead of
promoting the use of UPS devices. Abrupt power loss, or worse, the
voltage "bumping" which often accompanies brownout conditions, is not
good for any computer equipment, especially PSUs and mechanical hard
drives, regardless of what filesystem one uses.

The only data lost due to power failure is in-flight write data. The
vast majority of that is going to be due to the Linux buffer cache.
No matter what FS you use, if you're writing, especially a large
file, when power dies the write has failed and you've lost that file.
EXT3 was a bit more "resilient" to power loss because of a bug, not a
design goal. The same bug caused horrible performance with some
workloads because of the excessive hard coded syncs.

> I usually don't have blackouts. At least none so long that the PC
> turns off. But I don't have a UPS.

Get one. Best investment you'll ever make computer-wise. For your
Norco, we'll assume all 20 bays are filled for sizing purposes. One
of these should be large enough to run your server and your desktop:

http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=BR900G-GR&total_watts=200
http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=BR900GI&total_watts=200

(Sorry if I misguessed your native language as German instead of
French or Italian.) I listed both units as I don't know which power
plug configuration you need.

If these UPSes seem expensive, consider the fact that they may
continue working for 20+ years. I bought my home office APC
SU1400RMNET used in 2003 for US $250 ($1000+ new) after it had been
in corporate service for 3 years on lease. It's at least 12 years old
and I've been running it for 9 years continuously. I've replaced the
batteries ($80) twice, about every 4 years. Buying this unit used, at
a steal of a price, is one of the best investments I ever made. I
expect it to last at least another 8 years, if not more.
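For sizing, a rough back-of-envelope estimate can be sketched. The
per-drive and base wattages below are my illustrative assumptions,
not figures from APC or from this thread:

```shell
# Rough UPS load estimate for a fully populated 20-bay chassis. The
# ~8W per idle 7.2k SATA drive and ~120W for board/CPU/fans are
# assumed figures for illustration only; measure your own system.
DRIVES=20
W_PER_DRIVE=8
BASE_W=120
TOTAL_W=$((DRIVES * W_PER_DRIVE + BASE_W))
echo "estimated steady load: ${TOTAL_W}W"   # prints: estimated steady load: 280W
```

Keep in mind spin-up draw is far higher than idle, so the UPS (and
PSU) must also tolerate the startup surge unless staggered spin-up is
used.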
> I will get better performance if I have the correct parameters.

Yes.

>> 2. If device is a single level md striped array, AGs=16, unless the
>> device size is > 16TB. In that case AGs=device_size/1TB.
>
> A single level md striped array is any linux raid containing disks.
> Like my raid5.

I use "single level" simply to differentiate from a nested array,
which is multi-level.

> In contrast would be my linear raid containing one or more raids?

This is called a "nested" array. The term comes from "nested loop" in
programming.

> Ok, the chunk (=stripe)

chunk = "strip", not "stripe"

"Chunk" and "strip" are two words for the same thing. Linux md uses
the term "chunk". LSI and other hardware vendors use the term
"strip". They describe the amount of data written to an individual
array disk during a striped write operation. A stripe is equal to all
of the chunks/strips added together. E.g. a 16 disk RAID10 has 8
stripe spindles (the other 8 are mirrors). Each spindle has a
chunk/strip size of 64KB. 8*64KB = 512KB. So the "stripe" size is
512KB.

> size is already set 128 kB when creating the
> raid5 with the command you provided earlier:
>
> ~$ mdadm -C /dev/md1 -c 128 -n4 -l5 /dev/sd[abcd]
>
> Then the mkfs.xfs parameters are adapted to this.

Correct. If you were just doing a single level RAID5 array, and not
nesting it into a linear array, mkfs.xfs would read the md RAID5
parameters and do all of this stuff automatically. It doesn't if you
nest a linear array on top, as we have.

> I'll try not to make you angry :-)

I'm not Bruce Banner, so don't worry. ;)

> Ok, cool!
> Probably some time I will understand how to choose chunk sizes. In
> the meantime I will just be happy with the number you provided :-)

For your target workloads, finding the "perfect" chunk size isn't
critical. What is critical is aligning XFS to the array geometry, and
the array to the AF disk geometry, which is, again, why I recommended
using bare disks, no partitions.

> Btw: I wasn't clear about mythtv.
> For the recordings I don't use the raid. I have another disk just
> for it. Everyone recommends to not use raids for the recordings.
> But to be honest I don't remember the reason anymore :-(

I've never used MythTV, but it probably has to do with the fact that
most MythTV users have 3-4 slow green SATA drives on mobo SATA ports
using md RAID5 with the default CFQ elevator. Not a great combo for
doing multiple concurrent read/write A/V streams. Using a $300-400
USD 4-8 port RAID controller with 512MB write cache, 4-8 enterprise
7.2k SATA drives in RAID5, and the noop or deadline elevator allows
one to do multiple streams easily. So does using twice as many 7.2k
drives in software RAID10 with deadline. Both are far more expensive
than simply adding one standalone drive for recording.

>> On top of that, because your agcount is way too
>> high, XFS will continue creating new dirs and files in the original
>> RAID5 array until it fills up. At that point it will write all new
>> stuff to the second RAID5.

I should have been more clear above. Directories and files would be
written to AGs on *both* RAID5s until the first one filled up, then
everything would go to AGs on the 2nd RAID5. Above it sounds like the
2nd RAID5 wouldn't be used until the first one filled up, and that's
not the case.

> The xfs seems really intelligent. So it spreads the load if it can
> but it won't copy everything around when a new disk or in my case
> raid5 is added?

Correct. But it's not "spreading the load". It's simply distributing
new directory creation across all available AGs in a round robin
fashion. When you grow the XFS, it creates new AGs on the new disk
device. After that it simply does what it always does, distributing
new directory creation across all AGs until some AGs fill up. This
behavior is more static than adaptive, so it's not really all that
intelligent. The design is definitely intelligent, though, and it's
one of the primary reasons XFS has such great parallel performance.
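Pulling the geometry discussion together, here is a hedged sketch of
how the chunk size, stripe width, and the AG-count rule quoted
earlier combine into an mkfs.xfs invocation. The device name, device
size, and the exact mkfs command line are illustrative assumptions,
not the actual commands from this thread:

```shell
# Sketch: derive XFS alignment from a 4-disk md RAID5 with a 128KB
# chunk, plus the quoted AG rule (16 AGs, or 1 AG per TB once the
# device exceeds 16TB). All figures here are examples.
CHUNK_KB=128
DISKS=4
DATA_SPINDLES=$((DISKS - 1))             # RAID5: one disk's worth is parity
STRIPE_KB=$((CHUNK_KB * DATA_SPINDLES))  # full stripe = sum of chunks
DEVICE_TB=12                             # hypothetical device size
if [ "$DEVICE_TB" -gt 16 ]; then AGCOUNT=$DEVICE_TB; else AGCOUNT=16; fi
echo "su=${CHUNK_KB}k sw=${DATA_SPINDLES} stripe=${STRIPE_KB}KB agcount=${AGCOUNT}"
# prints: su=128k sw=3 stripe=384KB agcount=16
#
# These would be passed along the lines of:
#   mkfs.xfs -d su=${CHUNK_KB}k,sw=${DATA_SPINDLES},agcount=${AGCOUNT} /dev/md0
# and after extending the linear array, xfs_growfs on the mount point
# would create new AGs on the added space.
```

Passing su/sw/agcount explicitly matters here precisely because, as
noted above, mkfs.xfs cannot read the RAID5 geometry through the
nested linear array.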
> But I thought that a green drive lives at least as long as a normal

With the first series of WD Green drives this wasn't the case. They
had a much higher failure rate. Newer generations are probably much
better. And all of the manufacturers are adding smart power
management features to most of their consumer drive lines.

> drive or even longer because it *should* wear less because it's
> more often asleep.

The problem is what is called "thermal cycling". When the spindle
motor is spinning the platters at 5-7K RPM and then shuts down for 30
seconds or more, and then spins up again, the bearings expand and
shrink, expand and shrink, very slightly, fractions of a millimeter.
But this is enough to cause premature bearing wobble, which affects
head flying height, and thus causes problems with reads/writes,
yielding sector errors (bad blocks). This excess bearing wear over
time can cause the drive to fail prematurely if the heads begin
impacting the platter surface, which is common when bearings develop
sufficient wobble.

Long before many people on this list were born, systems managers
discovered that drives lasted much longer if left running 24x7x365,
which eliminated thermal cycling. It's better for drives to run "hot"
all the time than to power them down over night and up the next day.
15 years ago, constant running would extend drive life by up to 5
years. With the tighter tolerances of today's drives you may not gain
that much. I leave all of my drives running and disable all power
saving features on all my systems. I had a pair of 9GB Seagate
Barracuda SCSI drives that were still running strong after 14 years
of continuous 7.2k RPM service when I decommissioned the machine.
They probably won't spin up now that they've been in storage for many
years.

> If all of this would have been true then I would be willing to pay
> the price of less performance and higher raid problem rate.
Throttling an idle CPU down to half its normal frequency saves more
electricity than spinning down your hard drives, until you have 10 or
more, and that depends on which CPU you have. If it's a 130W Intel
burner, it'll be more like 15 drives.

> But I believe you that the disks don't live as long as normal
> drives. So everything is different and I won't buy green drives
> again :-)

I'd "play it by ear". These problems may have been worked out on the
newer "green" drives. Bearings can be built to survive this more
rapid thermal cycling; those on vehicle wheels do it daily. Once they
get the bearings right, these drives should last just as long.

> Maybe I will solder a flash light for the LAN LEDs in the front of
> the case too :-D

Just look at the LEDs on the switch it's plugged into. If the switch
is on the other side of the room, buy a mini switch and set it on
top. $10 for a 10/100 and $20 for a GbE switch. USD.

> I never was frustrated because of your help. If I was unhappy it
> was only because of my missing knowledge and luck.

Well, your luck should have changed for the better. You've got all
good quality gear now and it should continue to work well together,
barring future bugs introduced in the kernel.

> If you weren't here to suggest things and help me I would have
> ended up with a case that I couldn't use in the worst case. Or one
> that eats my data (because of the Supermicro AOC-SASLP-MV8
> controllers I initially had).

That controller has caused so many problems for Linux users I cannot
believe SM hasn't put a big warning on their site, or simply stopped
selling it, replacing it with something that works. Almost all of
their gear is simply awesome. This one board gives SM a black eye.

> In the end I'm very happy and proud of my system. Of course I show
> it to my friends and they are jealous for sure :-)

That's great. :)

> So thanks very much again and please let me know how I can buy you
> a beer or two!

As always, you're welcome.
And sure, feel free to donate to my beer fund. ;)

-- 
Stan

-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/4fdb73d5.90...@hardwarefreak.com