Re: Questions on nilfs_cleanerd
These sound similar to the questions/concerns I raised a while back.

1) Does the daemon read/write the entire drive to look for dead blocks to clean?

Yes - sequentially (and then it rolls over to the beginning again).

2) What if there aren't any dead blocks to clean and the free space on the drive is still less than 10% (the default min_clean_segments in the conf file) - does the daemon still process the drive? If so, how do I change the cleaning interval so that it doesn't process the drive as often?

The only worthwhile suggestion I heard is to set the minimum history retention period (the FS is continuously snapshotting) to 1 day. That way you can guarantee the churn rate will never exceed the capacity of the disk per day. Not ideal, but at least it puts some kind of hard limit on how quickly it'll wear your flash - at the expense of making the problem of non-deterministic free space a little worse.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
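For reference, the retention and threshold knobs discussed above live in /etc/nilfs_cleanerd.conf. A sketch of the 1-day setting (the values and percentage syntax are illustrative - check nilfs_cleanerd.conf(5) for what your version of nilfs-utils supports):

```
# /etc/nilfs_cleanerd.conf (excerpt, illustrative values)
protection_period   86400   # keep history for 1 day (seconds)
min_clean_segments  10%     # start cleaning below this much free space
max_clean_segments  20%     # stop cleaning above this much free space
```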
Re: Garbage Collection Method
On 01/27/2012 06:47 PM, Christian Smith wrote:
On Fri, Jan 27, 2012 at 04:26:23PM +, Gordan Bobic wrote:

Christian, many thanks for your reply.

1) Does it scan blocks from the tail of the file system forward sequentially?

Yes.

2) Does it reclaim blocks regardless of how dirty they are? Or does it reclaim in order of maximum dirtiness first, in order to reduce churn (and flash wear when used on flash media)?

The former.

3) What happens when it encounters a block that isn't dirty? Does it skip it and reclaim the next dirty block, leaving a "hole"? Or does it reclaim everything up to a reclaimable block to make the free space contiguous?

It is cleaned regardless. Free space appears to always be contiguous.

Hmm, so the GC causes completely unnecessary flash wear. That's really bad for the most advantageous use-case of nilfs2. :(

I work around it by setting my protection period to about 1 day, so I know that the whole device will not be written more than once per day. Even with 3,000 P/E cycle flash, that's eight years of use.

Hmm... So the GC is smart enough to stop reclaiming if the next block to be checked has a timestamp that is recent enough?

I find the biggest advantage of NILFS is avoiding the random small writes that so quickly wear cheap flash out. Even with the GC, I'd wager NILFS still beats ext3 (say) at avoiding write amplification due to its more sequential write nature, not to mention the performance gains as a result. Random writes are so slow because each random write might be doing a full block erase, which is also why their write amplification is so bad in the first place. But hey, they're cheap and designed for camera-like write patterns (writing big files in long contiguous chunks).

Except they are also the main envisaged storage medium for things like ARM machines, most of which are capable of replacing an average desktop box if only they didn't lack SATA for a proper SSD.
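The eight-year figure above follows from simple arithmetic; a sketch using the numbers assumed in the message (3,000 P/E cycles, whole device rewritten at most once per day):

```shell
# With a 1-day protection period, the cleaner rewrites the whole device
# at most once per day, so device life in days = rated P/E cycles.
pe_cycles=3000
days=$pe_cycles                 # one full-device write per day
echo "$(( days / 365 )) years"  # -> roughly 8 years
```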
4) Assuming this isn't already how it works, how difficult would it be to modify the reclaim policy (along with associated book-keeping requirements) to reclaim blocks in dirtiest-block-first order?

5) If a suitable book-keeping bitmap was in place for 4), could this not be used for accurate df reporting?

Not being a NILFS developer, I can't answer either of these in detail. However, as I understand it, the filesystem driver does not depend on the current cleaning policy, and can skip cleaning specific blocks should those blocks be sufficiently clean. Segments need not be written sequentially, as each segment contains a pointer to the next segment that will be written - hence why lssu always lists two segments as active (the current segment and the next segment to be written). It's just that the current GC cleans all segments sequentially. It's easier to just cycle through the segments in a circular fashion.

I see, so the sub-optimal reclaim and unnecessary churn are purely down to the userspace GC daemon? Is there scope for having a bitmap or a counter in each allocation unit to show how many dirty blocks there are in it? Such a bitmap would require 1MB of space for every 32GB of storage (assuming 1 bit per 4KB block). This would make it possible to tell at a glance which block is dirtiest and thus should be reclaimed next, while at the same time stopping unnecessary churn.

Is 1 bit enough? At what point do you turn the bit on? Half-dead segment? I can't see 1 bit being useful enough to make the overhead worthwhile. Also, we're not just talking about live current data. There is also snapshot- and checkpoint-visible data to consider. Not easy to represent with a bitmap.

I'm talking about 1 bit per 4KB block, hence 1MB per 32GB. Since the smallest write size is always going to be 1 block (4KB), there is no need to track smaller units. It also means that a single 4KB block is either clean or dirty, and nothing in between.
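The 1MB-per-32GB figure checks out; a sketch of the arithmetic, assuming 4KB blocks and 1 bit per block as stated above:

```shell
bytes=$(( 32 * 1024 * 1024 * 1024 ))  # 32 GiB in bytes
blocks=$(( bytes / 4096 ))            # 8388608 four-KiB blocks
bitmap=$(( blocks / 8 ))              # 1 bit per block, in bytes
echo "$(( bitmap / 1024 / 1024 )) MiB"  # -> 1 MiB
```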
What would be useful is to be able to select the write segment into which the cleaner will write live data. That way, the system could maintain two log "heads", one for active hot data and one for inactive cold data. All cleaning would then be done to the cold head, and all new writes to the hot head, on the assumption that a new write will either be temporary (and hence discarded sooner rather than later) or not be updated for some time (and hence cleaned to a cold segment by the cleaner), with the hope that we'll end up with a bimodal distribution of clean and dirty data. The cleaner can then concentrate on cleaning hot segments, with the occasional clean of cold segments.

I don't think distinguishing between hot and cold data is all that useful. Ultimately, the optimal solution would be to reclaim the AUs in dirtiest-first order. The other throttling provisions (not reclaiming until free space drops below a threshold) should do enough to stop premature flash wear.
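The dirtiest-first policy argued for here amounts to a sort over per-segment dead-block counts; a toy sketch with invented segment numbers and counts:

```shell
# "segnum dead_blocks" pairs (invented numbers); pick the two segments
# a dirtiest-first cleaner would reclaim first.
printf '%s\n' '0 12' '1 2040' '2 511' '3 1999' |
  sort -k2,2nr | head -n 2
# -> 1 2040
#    3 1999
```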
Re: Odd problem starting nilfs_cleanerd due to an eMMC misbehaviour
Christian Smith wrote:
On Thu, Jan 26, 2012 at 05:52:03PM +0400, Paul Fertser wrote:

Hi, I'm using nilfs2 for the root filesystem on an ARM-based netbook (Toshiba AC100) with Debian hardfloat. The custom kernel is based on 3.0.8 and nilfs-tools is 2.1.0-1 from the Debian repository. I wanted to try the threaded I/O test from the Phoronix test suite, and somehow it happened that during the test the garbage collecting daemon failed and never came back. So I got the filesystem 100% full, and after I noticed it I tried running the daemon manually. It didn't start even after a reboot. Surprisingly, the eMMC error went away on its own after fully powering off the whole device, and after that the daemon started to work properly. I'm not sure what conclusion might be made from this, but I'd still appreciate any comments, especially suggestions on what to do if the error didn't "recover".

Remember, SD cards contain their own embedded controller to do the block mapping between LBA and flash blocks. There may even be an ARM-based controller in the SD card. Under the stress of a benchmark, the firmware probably just got itself into a bit of a state and needed a hard reset to recover. What brand of SD card is it? Most SD cards are designed for low-stress, low-speed I/O in devices such as cameras. Perhaps try a different brand.

I believe Paul was referring to the internal eMMC (not an SD card) on the Toshiba AC100. Not something that is easily replaceable. :( I should also point out that, having benchmarked many SD cards, I have yet to find any that offer decent performance on random writes, no matter how good they may be at linear writes - hence the interest in nilfs2.

Gordan
Re: Garbage Collection Method
Christian, many thanks for your reply.

1) Does it scan blocks from the tail of the file system forward sequentially?

Yes.

2) Does it reclaim blocks regardless of how dirty they are? Or does it reclaim in order of maximum dirtiness first, in order to reduce churn (and flash wear when used on flash media)?

The former.

3) What happens when it encounters a block that isn't dirty? Does it skip it and reclaim the next dirty block, leaving a "hole"? Or does it reclaim everything up to a reclaimable block to make the free space contiguous?

It is cleaned regardless. Free space appears to always be contiguous.

Hmm, so the GC causes completely unnecessary flash wear. That's really bad for the most advantageous use-case of nilfs2. :(

4) Assuming this isn't already how it works, how difficult would it be to modify the reclaim policy (along with associated book-keeping requirements) to reclaim blocks in dirtiest-block-first order?

5) If a suitable book-keeping bitmap was in place for 4), could this not be used for accurate df reporting?

Not being a NILFS developer, I can't answer either of these in detail. However, as I understand it, the filesystem driver does not depend on the current cleaning policy, and can skip cleaning specific blocks should those blocks be sufficiently clean. Segments need not be written sequentially, as each segment contains a pointer to the next segment that will be written - hence why lssu always lists two segments as active (the current segment and the next segment to be written).

> It's just that the current GC just cleans all segments sequentially. It's easier to just cycle through the segments in a circular fashion.

I see, so the sub-optimal reclaim and unnecessary churn are purely down to the userspace GC daemon? Is there scope for having a bitmap or a counter in each allocation unit to show how many dirty blocks there are in it? Such a bitmap would require 1MB of space for every 32GB of storage (assuming 1 bit per 4KB block).
This would make it possible to tell at a glance which block is dirtiest and thus should be reclaimed next, while at the same time stopping unnecessary churn.

What would be useful is to be able to select the write segment into which the cleaner will write live data. That way, the system could maintain two log "heads", one for active hot data and one for inactive cold data. All cleaning would then be done to the cold head, and all new writes to the hot head, on the assumption that a new write will either be temporary (and hence discarded sooner rather than later) or not be updated for some time (and hence cleaned to a cold segment by the cleaner), with the hope that we'll end up with a bimodal distribution of clean and dirty data. The cleaner can then concentrate on cleaning hot segments, with the occasional clean of cold segments.

I don't think distinguishing between hot and cold data is all that useful. Ultimately, the optimal solution would be to reclaim the AUs in dirtiest-first order. The other throttling provisions (not reclaiming until free space drops below a threshold) should do enough to stop premature flash wear.

Accurate df reporting is more tricky, as checkpoints and snapshots make it decidedly non-trivial to account for overwritten data. As such, the current df reporting is probably the best we can manage within the current constraints.

With the bitmap solution as described above, would we not be able to simply subtract the dirty blocks from the used space? Since the bitmap always contains the dirtiness information for all the blocks in the FS, this would make for a pretty simple solution, would it not? Is there anything in place that would prevent such a bitmap from being kept in the file system headers? It could even be kept in RAM and generated by the garbage collector for its own use at run-time. Thinking about it, 1MB per 32GB is not a lot (32MB per TB), and it could even be run-length encoded.
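The subtraction proposed above is straightforward in principle; a toy sketch with invented block counts (a real implementation would also have to handle blocks pinned by snapshots and checkpoints, which is exactly the complication raised earlier):

```shell
total_blocks=8388608   # 32 GiB of 4 KiB blocks (illustrative)
used_blocks=6000000    # blocks the log currently occupies (invented)
dead_blocks=1500000    # blocks the bitmap marks as garbage (invented)
free_blocks=$(( total_blocks - used_blocks + dead_blocks ))
echo "$(( free_blocks * 4 / 1024 )) MiB free or reclaimable"
```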
Right now, even just preventing reallocation of allocation units that are completely clean would be a big advantage in terms of performance and flash wear.

Gordan
Garbage Collection Method
Hi,

Quick question about the garbage collector: what does it reclaim, and in what order?

1) Does it scan blocks from the tail of the file system forward sequentially?

2) Does it reclaim blocks regardless of how dirty they are? Or does it reclaim in order of maximum dirtiness first, in order to reduce churn (and flash wear when used on flash media)?

3) What happens when it encounters a block that isn't dirty? Does it skip it and reclaim the next dirty block, leaving a "hole"? Or does it reclaim everything up to a reclaimable block to make the free space contiguous?

4) Assuming this isn't already how it works, how difficult would it be to modify the reclaim policy (along with associated book-keeping requirements) to reclaim blocks in dirtiest-block-first order?

5) If a suitable book-keeping bitmap was in place for 4), could this not be used for accurate df reporting?

TIA.

Gordan
Re: Cache Churn
On Tue, 23 Aug 2011 14:38:49 +0900 (JST), Ryusuke Konishi wrote:
Hi, On Wed, 10 Aug 2011 12:17:45 +0100, Gordan Bobic wrote:

Another performance-related problem I am seeing due to nilfs_cleanerd is that it causes unhealthy amounts of cache churn. Its reads and writes are buffered, which inevitably means that things it reads will get cached. Since it is going through all the blocks on the fs that have any garbage to collect, it will eat through all the available memory pretty quickly. It also means that it will push out of the caches things that really should stay in them.

Interesting report. nilfs_cleanerd only reads log headers and does not read payload blocks. Data blocks are instead read and copied by the nilfs kernel code, and they are freed every time a reclamation call for a few segments has ended. I guess the abnormal cache churn arose from other causes; DAT file access seems suspicious. (The DAT file holds metadata used to convert virtual block addresses to real disk block addresses.)

Since cleanerd's actual disk I/O is going to have no correlation with the actual file access pattern, is there a way to make cleanerd always operate with something like the O_DIRECT flag so that its reads won't fill up the page cache?

If the problem comes from internal metadata accesses like the DAT file access, O_DIRECT is not applicable.

This is a pretty serious problem on small machines running off cheap flash (think ARM machines with 512MB of RAM and slow flash media). The quick and dirty workaround I am pondering at the moment is to set up a cron job that runs once a minute, checks df, and starts/kills nilfs_cleanerd depending on how much free space is available, but that's not really a solution.

Gordan

Is your kernel version equal to or newer than v2.6.37?

I am running 2.6.38.8 + chromeos patches (running on Tegra2 ARM).

Last year, we changed cache usage for the DAT file on that kernel. This might influence the issue.

I am running 2.0.23 nilfs-utils.
The cache churn issue is trivial to reproduce:

1) On an otherwise idle machine, set the thresholds appropriately to make nilfs_cleanerd reclaim some space.

2) echo 3 > /proc/sys/vm/drop_caches

3) Observe top and iotop to establish that:
- nilfs_cleanerd is the only thing running and doing anything
- cache memory is growing at the same rate at which iotop says nilfs_cleanerd is doing I/O

Gordan
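To put numbers on step 3, one can watch the page-cache figure directly. A sketch, fed a sample /proc/meminfo line so it runs anywhere (on a live system, read the real file and repeat the command to see the figure climb):

```shell
# Extract the page-cache size in kB, as one would while watching churn.
echo 'Cached:           412340 kB' | awk '/^Cached:/ { print $2 }'
# -> 412340
```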
Re: More on nilfs_cleanerd and excessive writes (1 month flash card life expectancy)
As of which version? I'm running 2.0.15.

Gordan

On Fri, 12 Aug 2011 09:51:03 -0400, Jérôme Poulin wrote:
I do not know what version of the NILFS tools you're using, but the latest is configurable in this way, so that it will only clean when space is critical.

Envoyé de mon appareil mobile. Jérôme Poulin Solutions G.A.

On 2011-08-12, at 06:38, Gordan Bobic wrote:
I just did some basic measuring and it looks like the total writes by nilfs_cleanerd on my SD card come to about 1GB/minute (16MB/second, all my card can handle). Since the system is used all the time while it is on, there are always things that need to be garbage collected, so it runs all the time. Even assuming its performance isn't an issue (running at nice 19 and ionice -c3, and performance IS an issue), that still means that the SD card will get 1,440GB of writes per day (1.4TB!). It's a 32GB MLC flash card, so assuming a 5,000 erase-cycle life for 32nm MLC (and ignoring any inevitable write amplification), that gives a life expectancy of 160TB or, at the given rate of nilfs_cleanerd churn, about 12 days of usage. Call it a month with the assumption the machine isn't used all day every day. This is quite thoroughly unacceptable for usage on any flash media.

Ignoring any other optimizations that might be applicable (e.g. a smaller block size to minimize the number of blocks that have to be re-written), my immediate redneck solution is running this every minute as a cron job:

==
#!/bin/bash

# Substitute /dev/mmcblk1p4 for your nilfs partition
used=`df | grep /dev/mmcblk1p4 | awk '{ print $5; }' | sed -e 's/%//'`

# If disk usage is more than 90%...
if [ $used -gt 90 ]; then
    # If nilfs_cleanerd is not running...
    if (! pgrep nilfs_cleanerd > /dev/null ); then
        nohup nice -n 19 ionice -c 3 /sbin/nilfs_cleanerd > /dev/null 2>&1 &
    fi
# If disk usage is less than 80%...
elif [ $used -lt 80 ]; then
    pkill nilfs_cleanerd > /dev/null 2>&1
fi
==

This could of course be improved and "enterpriseified" further, e.g.
check for all nilfs partitions and do the checks on all of them, or make the free-space thresholds 1/3 and 2/3 of free space (fs size - du). But this problem shouldn't really be looking for a solution in a cron job.

It's not ideal, and nilfs_cleanerd should be configurable to moderate itself in a similar way, but until that happens I don't see any alternative to the above cron job. The write performance is fantastic for tasks that do a lot of writing, but the life expectancy issue is a very real one.

Gordan
More on nilfs_cleanerd and excessive writes (1 month flash card life expectancy)
I just did some basic measuring and it looks like the total writes by nilfs_cleanerd on my SD card come to about 1GB/minute (16MB/second, all my card can handle). Since the system is used all the time while it is on, there are always things that need to be garbage collected, so it runs all the time. Even assuming its performance isn't an issue (running at nice 19 and ionice -c3, and performance IS an issue), that still means that the SD card will get 1,440GB of writes per day (1.4TB!). It's a 32GB MLC flash card, so assuming a 5,000 erase-cycle life for 32nm MLC (and ignoring any inevitable write amplification), that gives a life expectancy of 160TB or, at the given rate of nilfs_cleanerd churn, about 12 days of usage. Call it a month with the assumption the machine isn't used all day every day. This is quite thoroughly unacceptable for usage on any flash media.

Ignoring any other optimizations that might be applicable (e.g. a smaller block size to minimize the number of blocks that have to be re-written), my immediate redneck solution is running this every minute as a cron job:

==
#!/bin/bash

# Substitute /dev/mmcblk1p4 for your nilfs partition
used=`df | grep /dev/mmcblk1p4 | awk '{ print $5; }' | sed -e 's/%//'`

# If disk usage is more than 90%...
if [ $used -gt 90 ]; then
    # If nilfs_cleanerd is not running...
    if (! pgrep nilfs_cleanerd > /dev/null ); then
        nohup nice -n 19 ionice -c 3 /sbin/nilfs_cleanerd > /dev/null 2>&1 &
    fi
# If disk usage is less than 80%...
elif [ $used -lt 80 ]; then
    pkill nilfs_cleanerd > /dev/null 2>&1
fi
==

This could of course be improved and "enterpriseified" further, e.g. check for all nilfs partitions and do the checks on all of them, or make the free-space thresholds 1/3 and 2/3 of free space (fs size - du). But this problem shouldn't really be looking for a solution in a cron job.
It's not ideal, and nilfs_cleanerd should be configurable to moderate itself in a similar way, but until that happens I don't see any alternative to the above cron job. The write performance is fantastic for tasks that do a lot of writing, but the life expectancy issue is a very real one.

Gordan
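The "check all nilfs partitions" extension mentioned above mostly comes down to enumerating nilfs2 mounts. A sketch, fed sample /proc/mounts lines so it runs anywhere (on a real system, read /proc/mounts itself and loop over the result):

```shell
# Select the device column of every nilfs2 entry in mounts-format input.
printf '%s\n' \
  '/dev/mmcblk1p4 / nilfs2 rw 0 0' \
  '/dev/sda1 /boot ext3 rw 0 0' |
  awk '$3 == "nilfs2" { print $1 }'
# -> /dev/mmcblk1p4
```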
Cache Churn
Another performance-related problem I am seeing due to nilfs_cleanerd is that it causes unhealthy amounts of cache churn. Its reads and writes are buffered, which inevitably means that things it reads will get cached. Since it is going through all the blocks on the fs that have any garbage to collect, it will eat through all the available memory pretty quickly. It also means that it will push out of the caches things that really should stay in them.

Since cleanerd's actual disk I/O is going to have no correlation with the actual file access pattern, is there a way to make cleanerd always operate with something like the O_DIRECT flag so that its reads won't fill up the page cache?

This is a pretty serious problem on small machines running off cheap flash (think ARM machines with 512MB of RAM and slow flash media). The quick and dirty workaround I am pondering at the moment is to set up a cron job that runs once a minute, checks df, and starts/kills nilfs_cleanerd depending on how much free space is available, but that's not really a solution.

Gordan
Re: nilfs_cleanerd using a lot of disk-write bandwidth
On Tue, 9 Aug 2011 17:19:01 +0200, dexen deVries wrote:
On Tuesday 09 of August 2011 14:25:07 you wrote:

Interesting. I still think something should be done to minimize the amount of writes required. How about something like the following. Divide situations into 3 classes (thresholds should be adjustable in nilfs_cleanerd.conf):

1) Free space good (e.g. space >= 25%)
Don't do any garbage collection at all, unless an entire block contains only garbage.

2) Free space low (e.g. 10% < space < 25%)
Run GC as now, with the nice/ionice applied. Only GC blocks where $block_free_space_percent >= $disk_free_space_percent. So as the disk free space starts to decrease, the number of blocks that get considered for GC increases, too.

3) Free space critical (e.g. space < 10%)
As 2), but start decreasing niceness/ioniceness (niceness by 3 for every 1% drop in free space), for example:
10% - 19
...
7% - 10
...
4% - 1
3% - -2
...
1% - -8

This would give a very gradual increase in GC aggressiveness that would both minimize unnecessary writes that shorten flash life and provide a softer landing in terms of performance degradation as space starts to run out. The other idea that comes to mind on top of this is to GC blocks in order of the % of space in the block being reclaimable. That would allow the minimum number of blocks to always be GC-ed to get the free space above the required threshold. Thoughts?

Could end up being too slow. A 2TB filesystem has about 260,000 segments (given the default size of 8MB). cleanerd already takes quite a bit of CPU power at times. Also, cleanerd can do a lot of HDD seeks if some parts of the metadata aren't in cache. Performing some 260,000 seeks on a hard drive would take anywhere from 1000 to 3000 seconds; that's not very interactive.

Actually, it gets dangerously close to an hour. However, if cleanerd did not have to follow this exact algorithm, but instead did something roughly similar (heuristics rather than an algorithm), it could be good enough.
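The 260,000-segment figure and the seek-time range quoted above check out arithmetically (a sketch; the 4-12 ms per-seek cost is my assumption for a typical HDD, not something stated in the thread):

```shell
segments=$(( 2 * 1024 * 1024 / 8 ))   # 2 TiB in MiB, divided by 8 MiB segments
echo "$segments segments"             # -> 262144 segments
echo "$(( segments * 4 / 1000 ))-$(( segments * 12 / 1000 )) seconds of seeking"
```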
Well, you could adjust all the numbers in the algorithm. :)

As an aside, why would you use nilfs on a multi-TB FS? What's the advantage? The way I see it, the killer application for nilfs is slow flash media with (probably) poorly implemented wear leveling. The idea of the above is that you don't end up suffering poor disk performance due to background clean-up until you actually have a plausible chance of running out of space. What is the point of GC-ing if there is already 80% of empty space ready for writing to? All you'll be doing is making the fs slow for no obvious gain.

Possibly related: I'd love it if cleanerd tended to do some mild defragmentation of files. Not necessarily full-blown, exact defragmentation, just placing quiet (rarely-changing) stuff close together.

If its garbage collection involves reading a block and re-writing it without the deleted data, then isn't that already effectively defragmenting the fs?

Gordan
Re: nilfs_cleanerd using a lot of disk-write bandwidth
On Tue, 9 Aug 2011 13:03:54 +0200, dexen deVries wrote:
Hi Gordan, On Tuesday 09 of August 2011 12:18:12 you wrote:

I'm seeing nilfs_cleanerd using a lot of disk write bandwidth according to iotop. It seems to be performing approximately equal amounts of reads and writes when it is running. Reads I can understand, but why is it writing so much in order to garbage collect? Should it not just be marking blocks as free? The disk I/O read/write symmetry implies that it is trying to do something like defragment the file system. Is there a way to configure this behaviour in some way? The main use-case I have for nilfs is cheap flash media that suffers from terrible random-write performance, but on such media this many writes are going to cause media failure very quickly. What can be done about this?

I'm not a NILFS2 developer, so don't rely too much on the following remarks!

NILFS2 considers the filesystem to be a (wrapped-around) list of segments, by default 8MB each. Those segments contain both file data and metadata. cleanerd operates on whole segments, normally either 2 or 4 in one pass (depending on remaining free space). It seems to me a segment is reclaimed when there is any amount of garbage in it, no matter how small. Thus you see, in some cases, about as much read as write. One way could be to make cleanerd configurable so it doesn't reclaim segments that have only very little garbage in them. That would probably be a trade-off between wasted disk space and lessened bandwidth use.

As for wearing flash media down, I believe NILFS2 is still very good for it, because it tends to write in large chunks - much larger than the original 512B sector - and not to overwrite once-written areas (until reclaimed by cleanerd, often much, much later). Once the flash's large erase unit is erased, NILFS2 append-writes to it, but does not overwrite already-written data. Which means the flash is erased almost as little as possible.

Interesting.
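The "don't reclaim nearly-clean segments" idea above reduces to a one-line predicate; a toy sketch (the function name and threshold are mine for illustration, not anything in nilfs-utils):

```shell
# should_clean DEAD_BLOCKS TOTAL_BLOCKS MIN_GARBAGE_PERCENT
# Succeeds only if the segment's garbage ratio reaches the threshold.
should_clean() {
  [ $(( $1 * 100 / $2 )) -ge "$3" ]
}

should_clean 10 2048 5 && echo clean || echo skip    # ~0.5% garbage -> skip
should_clean 900 2048 5 && echo clean || echo skip   # ~44% garbage  -> clean
```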
I still think something should be done to minimize the amount of writes required. How about something like the following. Divide situations into 3 classes (thresholds should be adjustable in nilfs_cleanerd.conf):

1) Free space good (e.g. space >= 25%)
Don't do any garbage collection at all, unless an entire block contains only garbage.

2) Free space low (e.g. 10% < space < 25%)
Run GC as now, with the nice/ionice applied. Only GC blocks where $block_free_space_percent >= $disk_free_space_percent. So as the disk free space starts to decrease, the number of blocks that get considered for GC increases, too.

3) Free space critical (e.g. space < 10%)
As 2), but start decreasing niceness/ioniceness (niceness by 3 for every 1% drop in free space), for example:
10% - 19
...
7% - 10
...
4% - 1
3% - -2
...
1% - -8

This would give a very gradual increase in GC aggressiveness that would both minimize unnecessary writes that shorten flash life and provide a softer landing in terms of performance degradation as space starts to run out.

The other idea that comes to mind on top of this is to GC blocks in order of the % of space in the block being reclaimable. That would allow the minimum number of blocks to always be GC-ed to get the free space above the required threshold.

Thoughts?

Gordan
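The ramp in class 3 reduces to the formula nice = 19 - 3 x (10 - free%); a sketch of the whole 3-class scheme (the function name is mine; the class boundaries are the ones proposed above):

```shell
# Map percent free space to a cleanerd niceness per the 3-class scheme.
gc_niceness() {
  free=$1
  if [ "$free" -ge 25 ]; then echo "off"        # class 1: no GC at all
  elif [ "$free" -ge 10 ]; then echo 19         # class 2: fully niced
  else echo $(( 19 - 3 * (10 - free) ))         # class 3: ramp toward -8
  fi
}

gc_niceness 7   # -> 10
gc_niceness 3   # -> -2
```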
nilfs_cleanerd using a lot of disk-write bandwidth
Hi,

I'm seeing nilfs_cleanerd using a lot of disk write bandwidth according to iotop. It seems to be performing approximately equal amounts of reads and writes when it is running. Reads I can understand, but why is it writing so much in order to garbage collect? Should it not just be marking blocks as free? The disk I/O read/write symmetry implies that it is trying to do something like defragment the file system. Is there a way to configure this behaviour in some way? The main use-case I have for nilfs is cheap flash media that suffers from terrible random-write performance, but on such media this many writes are going to cause media failure very quickly. What can be done about this?

Gordan
Re: Applying nice/ionice to nilfs-cleanerd
On Mon, 8 Aug 2011 10:09:31 +0200, dexen deVries wrote:
[lowering cleanerd priority]

However, there may be a downside to /always/ running cleanerd niced and ioniced. I believe that currently cleanerd's activity slows other processes down a lot when the filesystem is almost full - which means that it often won't become truly full, because cleanerd will free enough space for other processes to be able to complete their work. If, on the other hand, cleanerd was highly niced and ioniced, it could end up being starved of CPU and disk bandwidth and not freeing enough space, which could cause other processes to exhaust the free space on the filesystem and abort when not able to write to it.

I was just thinking about that. This would only be an issue on a system that is either very constrained in terms of disk space or is never idle, though.

Perhaps it would be enough to have cleanerd automatically switch priority based on available free space. For example, if I had

min_clean_segments 10%
max_clean_segments 12%

then also have

min_clean_segments_low_prio 8%
low_prio_nice 19
normal_prio_nice 0
low_prio_ionice_class idle
normal_prio_ionice_class realtime

which would mean `use low priority (nice & ionice) when there's at least 8% of free segments; if there's less, use higher priority' - so cleanerd would reclaim free space more aggressively when there's little free space left.

I was thinking about something similar. Realtime ionice is OTT, though; I don't think it should ever be ioniced above normal. But yes, I think this would be a good idea.

Gordan
Re: Applying nice/ionice to nilfs-cleanerd
On 08/08/2011 12:23 AM, Ryusuke Konishi wrote: Is there a way to set default nice/ionice levels for nilfs-cleanerd? At present, you have to manually invoke the cleanerd through the nice/ionice commands or run renice/ionice later specifying the process ID of the cleanerd. One way to make this convenient is introducing new directives in /etc/nilfs_cleanerd.conf as follows:

# Scheduling priority.
nice 19          # niceness -20~19

# IO scheduling class.
# Supported classes are default, idle, best-effort, and realtime.
ionice_class idle

# IO scheduling priority.
# 0-7 is valid for best-effort and realtime classes.
ionice_data 5

Do you think these extensions make sense? Yes, I think those would be really handy. It would also mean that the cleanerd could be scheduled to run more aggressively but at a lower priority, so the clean-up would potentially be more up to date while having less impact on system performance. Gordan
Applying nice/ionice to nilfs-cleanerd
Hi, Is there a way to set default nice/ionice levels for nilfs-cleanerd? Gordan
Re: SSD and non-SSD Suitability
David Arendt wrote: 4) As the data gets expired, and snapshots get deleted, this will inevitably lead to fragmentation, which will de-linearize writes as they have to go into whatever holes are available in the data. How does this affect nilfs write performance? My current understanding is that the nilfs garbage collector moves the live (in use) blocks to the end of the logs, so holes are not created (is that correct?). However, it leads to another issue: the garbage collector process, nilfs_cleanerd, consumes I/O. This is the major I/O performance bottleneck in the current implementation. Since this moves files, it sounds like this could be a major issue for flash media since it unnecessarily creates additional writes. Can this be suppressed? You can simply kill nilfs_cleanerd after you mount the nilfs partition. If you use the latest nilfs-utils, killing nilfs_cleanerd is no longer necessary. You can use mount -o nogc. This will not start nilfs_cleanerd. Another possibility is to let nilfs_cleanerd start and tweak min_clean_segments and max_clean_segments so that cleanerd will only do cleaning if necessary. What about making the gc run only if the disk has been idle for, say, 20ms, unless min_clean_segments is reached? Gordan
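For reference, the nogc route is just `mount -t nilfs2 -o nogc /dev/sdb1 /mnt` with a recent nilfs-utils; the tuning route is a conf fragment along these lines (parameter names are from nilfs_cleanerd.conf, the values are purely illustrative):

```
# /etc/nilfs_cleanerd.conf (excerpt; values illustrative)
min_clean_segments   10%   # start cleaning when free segments drop below this
max_clean_segments   20%   # stop cleaning once this much is free again
clean_check_interval 60    # seconds between free-space checks
```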
Re: SSD and non-SSD Suitability
Jiro SEKIBA wrote: 2) Mechanical disks suffer from slow random writes (or any random operation for that matter), too. Do the benefits of nilfs show in random write performance on mechanical disks? I think it may have benefits, for nilfs will write sequentially whatever data is located before writing it. But still some tweaks might be required to speed it up compared with an ordinary filesystem like ext3. Can you quantify what those tweaks may be, and when they might become available/implemented? I might have chosen the wrong word, but what I meant is that more hacking is required to improve write performance. It's not just a configuration matter :(. I understand what you meant. I just wanted to know when those hacks may be implemented and be available for those of us interested in using nilfs to optimize write-heavy workloads. 3) How does this affect real-world read performance if nilfs is used on a mechanical disk? How much additional file fragmentation in absolute terms does nilfs cause? The data is scattered if you modify the file again and again, but it'll be almost sequential at creation time. So it will matter a lot if files are modified frequently. Right. So bad for certain tasks, such as databases. Indeed. Maybe /var type directories, too. Interesting. So nilfs' suitability for write-heavy loads is actually quite limited on mechanical disks, as it isn't suitable for append-heavy situations such as databases and logging, but for use-cases that are write+delete heavy, such as mail servers or other spool type loads, it should still be advantageous. 4) As the data gets expired, and snapshots get deleted, this will inevitably lead to fragmentation, which will de-linearize writes as they have to go into whatever holes are available in the data. How does this affect nilfs write performance? My current understanding is that the nilfs garbage collector moves the live (in use) blocks to the end of the logs, so holes are not created (is that correct?).
However, it leads to another issue: the garbage collector process, nilfs_cleanerd, consumes I/O. This is the major I/O performance bottleneck in the current implementation. Since this moves files, it sounds like this could be a major issue for flash media since it unnecessarily creates additional writes. Can this be suppressed? You can simply kill nilfs_cleanerd after you mount the nilfs partition. In this case, of course, no garbage is reclaimed, and you finally end up with the disk full, even if the size of the files doesn't come close to the storage size. I don't have data for now, but it gave about twice the write performance compared with "with garbage collector". What about enabling garbage collection, but disabling defragmentation? De-allocating space that isn't used any more is a necessary evil, but defragmentation is rather pointless in a lot of cases (e.g. SSDs) and counter-productive in others (mechanical disks under heavy load). Also, what about making the garbage collector "lazy", so that it runs either just-in-time to overwrite discarded data (worst case scenario) or when the disks are idle (e.g. at ionice -c3, and even that only when there have been no disk transactions for some selectable number of ms)? Gordan
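One way to at least bound the extra writes is to raise the cleaner's protection period, so that no segment can be reclaimed (and hence rewritten) more than once per retention window; with a one-day window the whole device cannot be rewritten by GC more than once a day. As a conf fragment (parameter names from nilfs_cleanerd.conf, values illustrative):

```
# /etc/nilfs_cleanerd.conf (excerpt; values illustrative)
protection_period   86400   # seconds; segments younger than 1 day are never cleaned
cleaning_interval   5       # seconds between cleaning passes
nsegments_per_clean 2       # segments reclaimed per pass
```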
Re: SSD and non-SSD Suitability
This thread will continue off list because it seems to have lost all relevance to nilfs. Gordan
Re: SSD and non-SSD Suitability
Vincent Diepeveen wrote: 1) Modern SSDs (e.g. Intel) do this logical/physical mapping internally, so that the writes happen sequentially anyway. Could you explain that, as far as i know modern SSD's have 8 independent channels to do reads and writes, which is why they have that big read and write speed and can in theory therefore support 8 threads doing reads and writes. Each channel say using blocks of 4KB, so it's 64KB in total. I'm talking about something else. I'm talking about the fact that you can turn logical random writes into physical sequential writes by re-mapping logical blocks to sequential physical blocks. That's doing 2 steps back in history isn't it? Sorry, I don't see what you mean. Can you elaborate? I didn't investigate NILFS, but under all conditions what you want to avoid is some sort of central locking of the file system, because you're proposing all sorts of fancy stuff for the file system whereas you can already do your thing using the full bandwidth of the SSD. Are you actually claiming that you can achieve the same write throughput on random writes as on sequential writes on an SSD? Try that with the write caches on the drive disabled. It really is interesting to have a file system where you do a minimum number of actions to the file system so that other threads can do their work there. Any complicated datastructure manipulation that requires central locking or other forms of complicated locking will limit other i/o actions. I agree. Gordan
Re: SSD and non-SSD Suitability
Vincent Diepeveen wrote: The big speedup that SSD's deliver for average usage is ESPECIALLY because of the faster random access to the hardware. Sure - on reads. Writes are a different beast. Look at some reviews of SSDs of various types and generations. Until relatively recently, random write performance (and to a large extent, any write performance) on them has been very poor. Cheap flash media (e.g. USB sticks) still suffers from this. You wouldn't want to optimize a file system for hardware of the past, would you? Before a file system is at all mature, the hardware that is the standard today will be very common. There are a few problems with that line of reasoning. 1) Legacy support is important. If it wasn't, file systems would be strictly in the realm of fixed disk manufacturers, and we would all be using object based storage. This hasn't happened, nor is it likely to in the next decade. 2) We cannot optimize for hardware of the future, because this hardware may never arrive. 3) "Hardware of the past" is still very much in full production, and isn't going away any time soon. The only sane option is to optimize for what is prevalent right now. if you have some petabytes of storage, i guess the bigger bandwidth that SSD's deliver is not relevant, as the limitation is the network bandwidth anyway, so some raid5 with extra spare will deliver more than sufficient bandwidth. RAID3/4/5/6 is inherently unsuitable for fast random writes because of the read-modify-write cycle required to update the parity. Nearly all major supercomputers use raid5 with extra spare as well as most database servers. Can you quantify that bold statement? I would expect vastly higher levels of RAID than RAID5 on supercomputers, because RAID5 doesn't scale sufficiently. RAID6 is a bit better, but still doesn't really scale. It comes down to data error rates on disks. RAID5 with current error rates tops out at about 6-8TB, which is pitifully small on the supercomputer scale.
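The parity penalty is easy to put rough numbers on. In the classic model, each small random write to RAID5 costs four disk operations: read old data, read old parity, write new data, write new parity. A back-of-envelope sketch (which deliberately ignores controller caching and full-stripe writes, so treat the figures as illustrative):

```shell
# raid5_write_iops NDISKS DISK_IOPS
# Effective small-random-write IOPS of an N-disk RAID5 array, assuming each
# write costs 4 disk ops (read data + read parity + write data + write parity)
# spread across independent disks.
raid5_write_iops() {
    echo $(( $1 * $2 / 4 ))
}

# 8 spindles at ~150 IOPS each deliver only ~300 random-write IOPS in RAID5.
echo "8 x 150-IOPS disks, RAID5 random writes: $(raid5_write_iops 8 150) IOPS"
```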
Anybody deploying RAID5 on high-performance database servers that are expected to have more than about 1% write:read ratio has no business being a database administrator, IMO. Then again the fact that I have managed to optimize the performance of most systems I've been called to provide consultancy on by factors of between 10 and 1000 without requiring any new hardware shows me that the industry is full of people who haven't got a clue what they are doing. Stock exchange is more into raid10 type clustering, but those few harddrives that the stock exchange uses, is that relevant? You're pulling examples out of the air, and it is difficult to discuss them without in-depth system design information. And I doubt you have access to that level of the system design information of stock exchange systems unless you work for one. Do you? So a file system should benefit from the special properties of a SSD to be suited for this modern hardware. The only actual benefit is decreased latency. Which is mighty important; so the ONLY interesting type of filesystem for a SSD is a filesystem that is optimized for read and write latency rather than bandwidth IMHO. Indeed, I agree (up to a point). Random IOPS has long been the defining measure of disk performance for a reason. I'm always very careful saying a benchmark is holy. Most aren't, but every once in a while a meaningful one comes up. Random IOPS one is one such (relatively rare) example. Especially read latency i consider most important. Depends on your application. Remember that reads can be sped up by caching. Even relative simple caching is very difficult to improve, with random reads. The random read speed is of overwhelming influence. 20 years of experience in high-performance applications, databases and clusters showed me otherwise. Random read speed is only an issue until your caches are primed, or if your data set is sufficiently big to overwhelm any practical amount of RAM you could apply. 
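On the caching point, one way to sanity-check whether a working set can be served from cache is the database rule of thumb that aggregate index size should fit in 50-75% of RAM, leaving the rest for data page cache. A small sketch (the rule and the 30GB figure are illustrative):

```shell
# ram_range_gb INDEX_GB -> "min max" RAM in GB such that the indexes
# occupy between 75% (min RAM) and 50% (max RAM) of memory.
ram_range_gb() {
    echo "$(( $1 * 100 / 75 )) $(( $1 * 100 / 50 ))"
}

echo "30 GB of indexes -> $(ram_range_gb 30) GB RAM (min max)"
# prints: 30 GB of indexes -> 40 60 GB RAM (min max)
```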
I look after a number of systems running applications that are write-bound because the vast majority of reads can be satisfied from the page cache, but writes are unavoidable because transactions have to be committed to persistent storage. You're assuming the working set size fits in caching, which is a very interesting assumption. Not necessarily the whole working set, but a decent chunk of it, yes. If it doesn't, you probably need to re-assess what you're trying to do. For example, on databases, as a rule of thumb you need to size your RAM so that all indexes aggregated fit into 50-75% of your RAM. The rest of the RAM is used for page caches for the actual data. To put it into a different perspective - a typical RHEL server install is 5-6GB. That fits into the RAM of the machine on my desk, and almost fits into the RAM of the laptop I'm typing this email on. If your working set is measured in petabytes, then you are probably
Re: SSD and non-SSD Suitability
Vincent Diepeveen wrote: 1) Modern SSDs (e.g. Intel) do this logical/physical mapping internally, so that the writes happen sequentially anyway. Could you explain that, as far as i know modern SSD's have 8 independent channels to do reads and writes, which is why they have that big read and write speed and can in theory therefore support 8 threads doing reads and writes. Each channel say using blocks of 4KB, so it's 64KB in total. I'm talking about something else. I'm talking about the fact that you can turn logical random writes into physical sequential writes by re-mapping logical blocks to sequential physical blocks. That's doing 2 steps back in history isn't it? Sorry, I don't see what you mean. Can you elaborate? The big speedup that SSD's deliver for average usage is ESPECIALLY because of the faster random access to the hardware. Sure - on reads. Writes are a different beast. Look at some reviews of SSDs of various types and generations. Until relatively recently, random write performance (and to a large extent, any write performance) on them has been very poor. Cheap flash media (e.g. USB sticks) still suffers from this. Don't confuse fast random reads with fast random writes. if you have some petabytes of storage, i guess the bigger bandwidth that SSD's deliver is not relevant, as the limitation is the network bandwidth anyway, so some raid5 with extra spare will deliver more than sufficient bandwidth. RAID3/4/5/6 is inherently unsuitable for fast random writes because of the read-modify-write cycle required to update the parity. So a file system should benefit from the special properties of a SSD to be suited for this modern hardware. The only actual benefit is decreased latency. Which is mighty important; so the ONLY interesting type of filesystem for a SSD is a filesystem that is optimized for read and write latency rather than bandwidth IMHO. Indeed, I agree (up to a point).
Random IOPS has long been the defining measure of disk performance for a reason. Especially read latency i consider most important. Depends on your application. Remember that reads can be sped up by caching. I look after a number of systems running applications that are write-bound because the vast majority of reads can be satisfied from page cache, but writes are unavoidable because transactions have to be committed to persistent storage. You cannot limit your performance assessment to the use-case of an average desktop user running Firefox, Thunderbird and OpenOffice 99% of the time. Those are not the users that file systems advances of the past 30 years are aimed at. Of course i understand you skip ext4 as that obviously still has to get bugfixed. It seems to be deemed stable enough for several distros, and will be the default in RHEL6 in a few months' time, so that's less of a concern. I ran into severe problems with ext4 and i just used it at 1 harddrive, same experiences with other linux users. How recently have you tried it? RHEL6b has only been out for a month. Note i used ubuntu. I guess that explains some of your desktop-centric views. Stuff like RHEL is more expensive a copy than i have at my bank account. RHEL6b is a public beta, freely downloadable. CentOS is a community recompile of RHEL, 100% binary compatible, just with different artwork/logos. Freely available. As is Scientific Linux (a very similar project to CentOS, also a free recompile of RHEL). If you haven't found them, you can't have looked very hard. I am more interested in metrics for how much writing is required relative to the amount of data being transferred. 
For example, if I am restoring a full running system (call it 5GB) from a tar ball onto nilfs2, ext2, ext3, btrfs, etc., I am interested in how many blocks worth of writes actually hit the disk, and to a lesser extent how many of those end up being merged together (since merged operations, in theory, can cause less wear on an SSD because bigger blocks can be handled more efficiently if erasing is required). The most efficient blocksize for SSD's is 8 channels of 4KB blocks. I'm not going to bite and get involved in debating the correctness of this (somewhat limited) view. I'll just point out that it bears very little relevance to the paragraph that it appears to be responding to. Gordan
Re: SSD and non-SSD Suitability
Jiro SEKIBA wrote: I haven't got any particular quantitative data of my own, so I'll give a somewhat subjective opinion. Thanks, I appreciate it. :) I've got a somewhat broad question on the suitability of nilfs for various workloads and different backing storage devices. From what I understand from the documentation available, the idea is to always write sequentially, and thus avoid slow random writes on old/naive SSDs. Hence I have a few questions. 1) Modern SSDs (e.g. Intel) do this logical/physical mapping internally, so that the writes happen sequentially anyway. Does nilfs demonstrably provide additional benefits on such modern SSDs with sensible firmware? In terms of write performance, it may not have additional benefits, I guess. However, it still has benefits with regard to continuous snapshots. How does this compare with btrfs snapshots? When you say continuous, what are the breakpoints between them? 2) Mechanical disks suffer from slow random writes (or any random operation for that matter), too. Do the benefits of nilfs show in random write performance on mechanical disks? I think it may have benefits, for nilfs will write sequentially whatever data is located before writing it. But still some tweaks might be required to speed it up compared with an ordinary filesystem like ext3. Can you quantify what those tweaks may be, and when they might become available/implemented? 3) How does this affect real-world read performance if nilfs is used on a mechanical disk? How much additional file fragmentation in absolute terms does nilfs cause? The data is scattered if you modify the file again and again, but it'll be almost sequential at creation time. So it will matter a lot if files are modified frequently. Right. So bad for certain tasks, such as databases. 4) As the data gets expired, and snapshots get deleted, this will inevitably lead to fragmentation, which will de-linearize writes as they have to go into whatever holes are available in the data.
How does this affect nilfs write performance? My current understanding is that the nilfs garbage collector moves the live (in use) blocks to the end of the logs, so holes are not created (is that correct?). However, it leads to another issue: the garbage collector process, nilfs_cleanerd, consumes I/O. This is the major I/O performance bottleneck in the current implementation. Since this moves files, it sounds like this could be a major issue for flash media since it unnecessarily creates additional writes. Can this be suppressed? 5) How does the specific writing amount measure against other file systems (I'm specifically interested in comparisons vs. ext2). What I mean by specific writing amount is, for writing, say, 100,000 random sized files, how many write operations and MBs (or sectors) of writes are required for the exact same operation being performed on nilfs and ext2 (e.g. as measured by vmstat -d). You can find public benchmark results at the following links. However those are a bit old and current results may differ. http://www.phoronix.com/scan.php?page=article&item=ext4_btrfs_nilfs2&num=1 http://www.linux-mag.com/cache/7345/1.html Thanks. Gordan
Re: SSD and non-SSD Suitability
Vincent Diepeveen wrote: 1) Modern SSDs (e.g. Intel) do this logical/physical mapping internally, so that the writes happen sequentially anyway. Could you explain that, as far as i know modern SSD's have 8 independent channels to do reads and writes, which is why they have that big read and write speed and can in theory therefore support 8 threads doing reads and writes. Each channel say using blocks of 4KB, so it's 64KB in total. I'm talking about something else. I'm talking about the fact that you can turn logical random writes into physical sequential writes by re-mapping logical blocks to sequential physical blocks. Old, naive flash without clever firmware was always good at sequential writes but bad at random writes. Since fragmentation doesn't matter on flash (there is no seek time), modern SSDs use such re-mapping to prolong flash life, reduce the need for erasing blocks and improve random write performance by linearizing it. This is completely independent of the fact that you might be able to write to the flash chips in a more parallel fashion because the disk's ASIC has the ability to use more of them simultaneously. Does nilfs demonstrably provide additional benefits on such modern SSDs with sensible firmware? 2) Mechanical disks suffer from slow random writes (or any random operation for that matter), too. Do the benefits of nilfs show in random write performance on mechanical disks? 3) How does this affect real-world read performance if nilfs is used on a mechanical disk? How much additional file fragmentation in absolute terms does nilfs cause? Basically the main difference between SSD's and traditional disks is that SSD's have faster latency, have more than 1 channel and write small blocks of 4KB, whereas 64KB read/writes are already really small for a traditional disk. Which raises the question of why traditional disks only support multi-sector transfers of up to 16 sectors, but that's a different question.
So a file system should benefit from the special properties of a SSD to be suited for this modern hardware. The only actual benefit is decreased latency. 4) As the data gets expired, and snapshots get deleted, this will inevitably lead to fragmentation, which will de-linearize writes as they have to go into whatever holes are available in the data. How does this affect nilfs write performance? 5) How does the specific writing amount measure against other file systems (I'm specifically interested in comparisons vs. ext2). What I mean by specific writing amount is, for writing, say, 100,000 random sized files, how many write operations and MBs (or sectors) of writes are required for the exact same operation being performed on nilfs and ext2 (e.g. as measured by vmstat -d). Isn't ext2 a bit old? So? The point is that it has no journal, which means fewer writes. fsck on SSDs only takes a few minutes at most. Of course i understand you skip ext4 as that obviously still has to get bugfixed. It seems to be deemed stable enough for several distros, and will be the default in RHEL6 in a few months' time, so that's less of a concern. I am more interested in metrics for how much writing is required relative to the amount of data being transferred. For example, if I am restoring a full running system (call it 5GB) from a tar ball onto nilfs2, ext2, ext3, btrfs, etc., I am interested in how many blocks worth of writes actually hit the disk, and to a lesser extent how many of those end up being merged together (since merged operations, in theory, can cause less wear on an SSD because bigger blocks can be handled more efficiently if erasing is required). Gordan
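Besides vmstat -d, the same measurement can be taken directly from /proc/diskstats, where field 10 is the cumulative count of 512-byte sectors written to the device. A sketch (Linux-only; the device name sdb and the tar workload are illustrative assumptions):

```shell
# sectors_written DEV [STATSFILE]: cumulative 512-byte sectors written to a
# block device, taken from /proc/diskstats (field 10 = sectors written).
sectors_written() {
    awk -v d="$1" '$3 == d { print $10 }' "${2:-/proc/diskstats}"
}

# Usage sketch: snapshot before and after the restore, then diff.
# before=$(sectors_written sdb)
# tar -xf system-backup.tar -C /mnt/test && sync
# after=$(sectors_written sdb)
# echo "restore wrote $(( (after - before) / 2 )) KiB to sdb"
```

Running the identical restore onto each filesystem and comparing the deltas gives exactly the "blocks worth of writes that actually hit the disk" figure.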
SSD and non-SSD Suitability
I've got a somewhat broad question on the suitability of nilfs for various workloads and different backing storage devices. From what I understand from the documentation available, the idea is to always write sequentially, and thus avoid slow random writes on old/naive SSDs. Hence I have a few questions. 1) Modern SSDs (e.g. Intel) do this logical/physical mapping internally, so that the writes happen sequentially anyway. Does nilfs demonstrably provide additional benefits on such modern SSDs with sensible firmware? 2) Mechanical disks suffer from slow random writes (or any random operation for that matter), too. Do the benefits of nilfs show in random write performance on mechanical disks? 3) How does this affect real-world read performance if nilfs is used on a mechanical disk? How much additional file fragmentation in absolute terms does nilfs cause? 4) As the data gets expired, and snapshots get deleted, this will inevitably lead to fragmentation, which will de-linearize writes as they have to go into whatever holes are available in the data. How does this affect nilfs write performance? 5) How does the specific writing amount measure against other file systems (I'm specifically interested in comparisons vs. ext2). What I mean by specific writing amount is for writing, say, 100,000 random sized files, how many write operations and MBs (or sectors) of writes are required for the exact same operation being performed on nilfs and ext2 (e.g. as measured by vmstat -d). Many thanks. Gordan
TRIM Support?
Hi, I notice that the pitch for NILFS is that it is particularly suitable for flash based media. Does it have any sort of support for the TRIM command? If not, is there at least an equivalent of dumpfs that could be used to get the list of free blocks that could be passed to hdparm to issue TRIM commands to the SSD? TIA. Gordan
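If such a free-block list could be obtained, the plumbing towards hdparm would be a matter of coalescing the block numbers into the LBA:count sector ranges its trim option expects. A sketch (the 4 KiB block size and the range format are assumptions; actually issuing TRIM is destructive, so verify everything first):

```shell
# blocks_to_ranges: read sorted free 4 KiB block numbers on stdin, one per
# line, and emit contiguous runs as "LBA:count" ranges of 512-byte sectors.
blocks_to_ranges() {
    awk 'BEGIN { spb = 8 }                       # 4096 / 512 sectors per block
    {
        if (NR > 1 && $1 == prev + 1) { cnt += spb }       # extend current run
        else {
            if (NR > 1) print (start * spb) ":" cnt        # flush finished run
            start = $1; cnt = spb
        }
        prev = $1
    }
    END { if (NR) print (start * spb) ":" cnt }'
}

# e.g. free blocks 10,11,12 and 20 coalesce into two sector ranges:
printf '10\n11\n12\n20\n' | blocks_to_ranges
# prints: 80:24
#         160:8
```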