Re: Questions on nilfs_cleanerd

2012-03-07 Thread Gordan Bobic

These sound similar to the questions/concerns I raised a while back.


1) Does the daemon read/write the entire drive to look for dead blocks to clean?


Yes - sequentially (and then it rolls over to the beginning again).


2) What if there aren't any dead blocks to clean and the free space in
the drive is still less than 10% (the default min_clean_segments in
the conf file), does the daemon still process the drive? If so, how do
I change the cleaning interval so that it doesn't process the drive as
often?


The only worthwhile suggestion I have heard is to set the minimum history 
retention period (the FS is continuously checkpointing) to 1 day. That way 
you can guarantee the churn will never exceed one full write of the 
disk per day. Not ideal, but at least it puts some kind of hard limit 
on how quickly it'll wear your flash - at the expense of making the 
problem of the non-determinism of free space a little worse.
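
For reference, this maps onto a handful of directives in /etc/nilfs_cleanerd.conf. 
A minimal sketch follows; the directive names are as in nilfs-utils 2.1-era 
cleanerd and the values are purely illustrative assumptions, so check the man 
pages shipped with your version:

==
# /etc/nilfs_cleanerd.conf -- illustrative values only
protection_period     86400   # keep history for 1 day (seconds)
min_clean_segments    10%     # only start cleaning below this much free space
max_clean_segments    20%     # stop cleaning once this much is free again
cleaning_interval     5       # seconds between cleaner passes
nsegments_per_clean   2       # segments reclaimed per pass
==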


Gordan


Re: Garbage Collection Method

2012-01-27 Thread Gordan Bobic

On 01/27/2012 06:47 PM, Christian Smith wrote:

On Fri, Jan 27, 2012 at 04:26:23PM +, Gordan Bobic wrote:

Christian,

Many thanks for your reply.


1) Does it scan blocks from the tail of the file system forward
sequentially?


Yes


2) Does it reclaim blocks regardless of how dirty they are? Or does it
execute reclaiming in order of maximum dirtiness first, in order to
reduce churn (and flash wear when used on flash media)?


The former.


3) What happens when it encounters a block that isn't dirty? Does it
skip it and reclaim the next dirty block, leaving a "hole"? Or does
it  reclaim everything up to a reclaimable block to make the free
space  contiguous?


It is cleaned regardless. Free space appears to always be contiguous.


Hmm, so the GC causes completely unnecessary flash wear. That's really
bad for the most advantageous use-case of nilfs2. :(



I work round it by setting my protection period to about 1 day, so
I know that the whole device will not be written more than once per
day. Even with 3000 p/e cycle FLASH, that's eight years of use.


Hmm... So the GC is smart enough to stop reclaiming if the next block to 
be checked has a time stamp that is recent enough?



I find the biggest advantage of NILFS is avoiding the random
small writes that so quickly wear cheap flash out. Even with the
GC, I'd wager NILFS still beats ext3 (say) at avoiding write amp
due to its more sequential write nature, not to mention the
performance gains as a result. Random writes are so slow because
each random write might be doing a full block erase, which is
also why their write amp is so bad in the first place. But hey,
they're cheap and designed for camera-like write patterns
(writing big files in long contiguous chunks).


Except they are also the main envisaged storage medium for things like 
ARM machines, most of which are capable of replacing an average desktop 
box if only they didn't lack SATA for a proper SSD.



4) Assuming this isn't already how it works, how difficult would it
be  to modify the reclaim policy (along with associated book-keeping
requirements) to reclaim blocks in the order of dirtiest-block-first?

5) If a suitable book-keeping bitmap was in place for 4), could this
not  be used for accurate df reporting?



Not being a NILFS developer, I can't answer either of these in detail.

However, as I understand it, the filesystem driver does not depend on the
current cleaning policy, and can skip cleaning specific blocks should those
blocks be sufficiently clean. Segments need not be written sequentially,
as each segment contains a pointer to the next segment that will be written,
which is why lssu always lists two segments as active (the current segment
and the next segment to be written).

It's just that the current GC cleans all segments sequentially; it's
easier to just cycle through the segments in a circular fashion.


I see, so the sub-optimal reclaim and unnecessary churn are purely down
to the userspace GC daemon?

Is there scope for having a bitmap or a counter in each allocation unit
to show how many dirty blocks there are in it? Such a bitmap would
require 1MB of space for every 32GB of storage (assuming 1 bit per 4KB
block). This would make it possible to tell at a glance which block
is dirtiest and thus should be reclaimed next, while at the same time
stopping unnecessary churn.



Is 1 bit enough? At what point do you turn the bit on? Half dead segment?
I can't see 1 bit being useful enough to make the overhead worthwhile.
Also, we're not just talking about live current data. There is also
snapshot and checkpoint visible data to consider. Not easy to represent
with a bitmap.


I'm talking about 1 bit per 4KB block. Hence 1MB per 32GB. Since the 
smallest write size is always going to be 1 block (4KB), there is no 
need to track smaller units. And it also means that a single 4KB block 
is either clean or dirty, and nothing in between.
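
To sanity-check that sizing, here is a quick back-of-the-envelope shell 
calculation; the only assumption is the 4KB block size mentioned above:

==
#!/bin/bash
# 1 dirty-bit per 4KB block: how much bitmap per 32GB of storage?
fs_bytes=$(( 32 * 1024 * 1024 * 1024 ))   # 32GB of storage
block=4096                                # 4KB blocks
blocks=$(( fs_bytes / block ))            # 8,388,608 blocks
bitmap_bytes=$(( blocks / 8 ))            # 1 bit per block
echo "$(( bitmap_bytes / 1024 / 1024 ))MB of bitmap per 32GB"   # prints: 1MB
==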



What would be useful is to be able to select the write segment into
which the cleaner will write live data. That way, the system could
maintain two
log "heads", one for active hot data, and one for inactive cold data. Then
all cleaning would be done to the cold head, and all new writes to the hot
head on the assumption that the new write will either be temporary (and
hence discarded sooner rather than later) or not be updated for some time
(and hence cleaned to a cold segment by the cleaner) with the hope that
we'll have a bimodal distribution of clean and dirty data. Then the
cleaner can concentrate on cleaning hot segments, with the occasional
clean
of cold segments.


I don't think distinguishing between hot and cold data is all that
useful. Ultimately, the optimal solution would be to reclaim the AUs in
dirtiest-first order. The other throttling provisions (not reclaiming
until free space drops below a threshold) should do enough to stop
premature flash wear.

Re: Odd problem starting nilfs_cleanerd due to an eMMC misbehaviour

2012-01-27 Thread Gordan Bobic

Christian Smith wrote:

On Thu, Jan 26, 2012 at 05:52:03PM +0400, Paul Fertser wrote:

Hi,

I'm using nilfs2 for the root filesystem on an ARM-based netbook
(Toshiba ac100) with Debian hardfloat. Custom kernel is based on 3.0.8
and nilfs-tools is 2.1.0-1 from the Debian repository.

I wanted to try the threaded I/O test from the Phoronix test suite and
somehow it happened that during the test the garbage collecting daemon
failed and never came back. So I got the filesystem 100% full, and
after I noticed it I tried running the daemon manually. It didn't
start even after a reboot. Surprisingly, the eMMC error went away on its
own after fully powering off the whole device, and after that the
daemon started to work properly.

I'm not sure what conclusion might be made from this, but I'd still
appreciate any comments, especially suggestions on what to do if
the error didn't "recover".


Remember, SD cards contain their own embedded controller to do the
block mapping between LBA and FLASH blocks. There may even be an
ARM-based controller in the SD card. Under the stress of a benchmark, the
firmware probably just got itself into a bit of a state and needed a
hard reset to recover.

What brand of SD card is it? Most SD cards are designed for
low-stress, low-speed I/O in devices such as cameras. Perhaps try a
different brand.


I believe Paul was referring to the internal eMMC (not an SD card) on 
the Toshiba AC100. Not something that is easily replaceable. :(


I should also point out that having benchmarked many SD cards, I have 
yet to find any that offer decent performance on random-writes, no 
matter how good they may be at linear writes - hence the interest in nilfs2.


Gordan


Re: Garbage Collection Method

2012-01-27 Thread Gordan Bobic

Christian,

Many thanks for your reply.

1) Does it scan blocks from the tail of the file system forward  
sequentially?


Yes

2) Does it reclaim blocks regardless of how dirty they are? Or does it
execute reclaiming in order of maximum dirtiness first, in order to
reduce churn (and flash wear when used on flash media)?


The former.

3) What happens when it encounters a block that isn't dirty? Does it  
skip it and reclaim the next dirty block, leaving a "hole"? Or does it  
reclaim everything up to a reclaimable block to make the free space  
contiguous?


It is cleaned regardless. Free space appears to always be contiguous.


Hmm, so the GC causes completely unnecessary flash wear. That's really 
bad for the most advantageous use-case of nilfs2. :(


4) Assuming this isn't already how it works, how difficult would it be  
to modify the reclaim policy (along with associated book-keeping  
requirements) to reclaim blocks in the order of dirtiest-block-first?


5) If a suitable book-keeping bitmap was in place for 4), could this not  
be used for accurate df reporting?



Not being a NILFS developer, I can't answer either of these in detail.

However, as I understand it, the filesystem driver does not depend on the
current cleaning policy, and can skip cleaning specific blocks should those
blocks be sufficiently clean. Segments need not be written sequentially,
as each segment contains a pointer to the next segment that will be written,
which is why lssu always lists two segments as active (the current segment
and the next segment to be written).

It's just that the current GC cleans all segments sequentially; it's
easier to just cycle through the segments in a circular fashion.


I see, so the sub-optimal reclaim and unnecessary churn are purely down 
to the userspace GC daemon?


Is there scope for having a bitmap or a counter in each allocation unit 
to show how many dirty blocks there are in it? Such a bitmap would 
require 1MB of space for every 32GB of storage (assuming 1 bit per 4KB 
block). This would make it possible to tell at a glance which block 
is dirtiest and thus should be reclaimed next, while at the same time 
stopping unnecessary churn.


What would be useful is to be able to select the write segment into which
the cleaner will write live data. That way, the system could maintain two
log "heads", one for active hot data, and one for inactive cold data. Then
all cleaning would be done to the cold head, and all new writes to the hot
head on the assumption that the new write will either be temporary (and
hence discarded sooner rather than later) or not be updated for some time
(and hence cleaned to a cold segment by the cleaner) with the hope that
we'll have a bimodal distribution of clean and dirty data. Then the
cleaner can concentrate on cleaning hot segments, with the occasional clean
of cold segments.


I don't think distinguishing between hot and cold data is all that 
useful. Ultimately, the optimal solution would be to reclaim the AUs in 
dirtiest-first order. The other throttling provisions (not reclaiming 
until free space drops below a threshold) should do enough to stop 
premature flash wear.



Accurate df reporting is more tricky, as checkpoints and snapshots make it
decidedly not trivial to account for overwritten data. As such, the current
df reporting is probably the best we can manage within the current
constraints.


With the bitmap solution as described above, would we not be able to 
simply subtract the dirty blocks from the used space? Since the bitmap 
always contains the dirtiness information on all the blocks in the FS, 
this would make for a pretty simple solution, would it not?


Is there anything in place that would prevent such a bitmap from being 
kept in the file system headers? It could even be kept in RAM and 
generated by the garbage collector for its own use at run-time. 
Thinking about it, 1MB per 32GB is not a lot (32MB per TB), and it could 
even be run-length encoded.


Right now, even just preventing reallocation of allocation units that 
are completely clean would be a big advantage in terms of performance 
and flash wear.


Gordan


Garbage Collection Method

2012-01-26 Thread Gordan Bobic

Hi,

Quick question about the garbage collector and what it reclaims and in 
what order.


1) Does it scan blocks from the tail of the file system forward 
sequentially?


2) Does it reclaim blocks regardless of how dirty they are? Or does it
execute reclaiming in order of maximum dirtiness first, in order to
reduce churn (and flash wear when used on flash media)?


3) What happens when it encounters a block that isn't dirty? Does it 
skip it and reclaim the next dirty block, leaving a "hole"? Or does it 
reclaim everything up to a reclaimable block to make the free space 
contiguous?


4) Assuming this isn't already how it works, how difficult would it be 
to modify the reclaim policy (along with associated book-keeping 
requirements) to reclaim blocks in the order of dirtiest-block-first?


5) If a suitable book-keeping bitmap was in place for 4), could this not 
be used for accurate df reporting?


TIA.

Gordan


Re: Cache Churn

2011-08-23 Thread Gordan Bobic
On Tue, 23 Aug 2011 14:38:49 +0900 (JST), Ryusuke Konishi wrote:

Hi,
On Wed, 10 Aug 2011 12:17:45 +0100, Gordan Bobic wrote:
 Another performance related problem I am seeing due to nilfs_cleanerd
 is that it causes unhealthy amounts of cache churn. Its reads and
 writes are buffered, which inevitably means that things it reads will
 get cached. Since it is going through all the blocks on the fs that have
 any garbage to collect, it will eat through all the available memory
 pretty quickly. It also means that it will push out of caches things
 that really should stay in caches.


Interesting report.  nilfs_cleanerd only reads log headers and does not
read payload blocks.  Data blocks are instead read and copied by the
nilfs kernel code, and they are freed every time a reclamation call for a
few segments has ended.

I guess the abnormal cache churn arose from other causes; DAT file
access seems suspicious.  (The DAT file holds metadata used to
convert virtual block addresses to real disk block addresses.)

 Since cleanerd's actual disk I/O is going to have no correlation with
 actual file access patterns, is there a way to make cleanerd always
 operate with something like the O_DIRECT flag so that its reads won't
 fill up the page cache?


If the problem comes from internal metadata accesses like the DAT file
access, O_DIRECT is not applicable.


 This is a pretty serious problem on small machines running off cheap
 flash (think ARM machines with 512MB of RAM and slow flash media).

 The quick and dirty workaround I am pondering at the moment is to set
 up a cron job that runs once/minute, checks df, and starts/kills
 nilfs_cleanerd depending on how much free space is available, but that's
 not really a solution.

 Gordan

 Gordan


Is your kernel version equal to or newer than v2.6.37?


I am running 2.6.38.8 + ChromeOS patches (running on Tegra2 ARM).


Last year, we changed cache usage for the DAT file on that kernel.
This might influence the issue.


I am running 2.0.23 nilfs-utils.

The cache churn issue is trivial to reproduce:

1) On an otherwise idle machine, set the thresholds appropriately to 
make nilfs_cleanerd reclaim some space


2) echo 3 > /proc/sys/vm/drop_caches

3) Observe top and iotop to establish that:
- nilfs_cleanerd is the only thing running and doing anything
- cache memory is growing at the same rate at which iotop is saying 
nilfs_cleanerd is doing I/O
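
If anyone wants to watch step 3 without iotop, here is a rough sketch using 
only the standard /proc interfaces (nothing nilfs-specific; drop_caches needs 
root):

==
#!/bin/bash
# Drop caches, then print the page cache size once a second while
# nilfs_cleanerd runs, so growth can be attributed to the cleaner.
sync
echo 3 > /proc/sys/vm/drop_caches
while pgrep nilfs_cleanerd > /dev/null; do
    echo "$(date +%T) $(awk '/^Cached:/ { print $2 }' /proc/meminfo) kB cached"
    sleep 1
done
==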


Gordan


Re: More on nilfs_cleanerd and excessive writes (1 month flash card life expectancy)

2011-08-12 Thread Gordan Bobic

As of which version? I'm running 2.0.15.

Gordan

On Fri, 12 Aug 2011 09:51:03 -0400, Jérôme Poulin wrote:

I do not know what version of the nilfs-tools you're using, but the
latest can be configured this way, so that it will only clean when space
is critical.

Sent from my mobile device.

Jérôme Poulin
Solutions G.A.

On 2011-08-12, at 06:38, Gordan Bobic  wrote:

I just did some basic measuring and it looks like the total writes 
by nilfs_cleanerd on my SD card total about 1GB/minute (16MB/second, 
all my card can handle). Since the system is used all the time while 
it is on, that involves there always being things that need to be 
garbage collected, so it runs all the time. Even assuming its 
performance isn't an issue (running at nice 19 and ionice -c3, and 
performance IS an issue), that still means that the SD card will get 
1,440GB of writes/day (1.4TB!). It's a 32GB MLC flash card, so 
assuming a 5,000 erase cycle life of 32nm MLC (ignoring any inevitable 
write amplification), that gives life expectancy of 160TB, or at the 
given rate of nilfs_cleanerd churn, about 12 days of usage. Call it a 
month with the assumption the machine isn't used all day every day.


This is quite thoroughly unacceptable for usage on any flash media. 
Ignoring any other optimizations that might be applicable (e.g. 
smaller block size to minimize the number of blocks that have to be 
re-written), my immediate redneck solution is running this every 
minute as a cron job:



==
#!/bin/bash

# Substitute your nilfs partition for /dev/mmcblk1p4
used=`df | grep /dev/mmcblk1p4 | awk '{ print $5; }' | sed -e 's/%//'`

# If disk usage is more than 90%...
if [ $used -gt 90 ]; then
    # If nilfs_cleanerd is not running...
    if (! pgrep nilfs_cleanerd > /dev/null ); then
        nohup nice -n 19 ionice -c 3 /sbin/nilfs_cleanerd > /dev/null 2>&1 &
    fi
# If disk usage is less than 80%...
elif [ $used -lt 80 ]; then
    pkill nilfs_cleanerd > /dev/null 2>&1
fi
==

This could of course be improved and "enterpriseified" further, e.g. 
check for all nilfs partitions and do the checks on all of them, make 
the free space amount thresholds based on 1/3 and 2/3 of free space 
(fs size - du), but this problem shouldn't really be looking for a 
solution in a cron job.


It's not ideal and nilfs_cleanerd should be configurable to moderate 
itself in a similar way, but until that happens, I don't see any 
alternative to the above cron job. The write performance is fantastic 
for tasks that do a lot of writing, but the life expectancy issue is a 
very real one.


Gordan


More on nilfs_cleanerd and excessive writes (1 month flash card life expectancy)

2011-08-12 Thread Gordan Bobic
I just did some basic measuring and it looks like the total writes by 
nilfs_cleanerd on my SD card total about 1GB/minute (16MB/second, all my 
card can handle). Since the system is used all the time while it is on, 
that involves there always being things that need to be garbage 
collected, so it runs all the time. Even assuming its performance isn't 
an issue (running at nice 19 and ionice -c3, and performance IS an 
issue), that still means that the SD card will get 1,440GB of writes/day 
(1.4TB!). It's a 32GB MLC flash card, so assuming a 5,000 erase cycle 
life of 32nm MLC (ignoring any inevitable write amplification), that 
gives life expectancy of 160TB, or at the given rate of nilfs_cleanerd 
churn, about 12 days of usage. Call it a month with the assumption the 
machine isn't used all day every day.
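
For anyone who wants to reproduce the measurement without iotop, here is a 
rough sketch using the kernel's per-task I/O accounting (an assumption that 
your kernel has CONFIG_TASK_IO_ACCOUNTING; write_bytes counts what actually 
goes to the block layer, and reading the cleaner's /proc entry typically 
needs root):

==
#!/bin/bash
# Approximate nilfs_cleanerd's write rate over one minute.
pid=$(pgrep -o nilfs_cleanerd) || exit 1
start=$(awk '/^write_bytes:/ { print $2 }' /proc/$pid/io)
sleep 60
end=$(awk '/^write_bytes:/ { print $2 }' /proc/$pid/io)
echo "$(( (end - start) / 1024 / 1024 )) MB written by cleanerd in 60 seconds"
==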


This is quite thoroughly unacceptable for usage on any flash media. 
Ignoring any other optimizations that might be applicable (e.g. smaller 
block size to minimize the number of blocks that have to be re-written), 
my immediate redneck solution is running this every minute as a cron 
job:



==
#!/bin/bash

# Substitute your nilfs partition for /dev/mmcblk1p4
used=`df | grep /dev/mmcblk1p4 | awk '{ print $5; }' | sed -e 's/%//'`

# If disk usage is more than 90%...
if [ $used -gt 90 ]; then
    # If nilfs_cleanerd is not running...
    if (! pgrep nilfs_cleanerd > /dev/null ); then
        nohup nice -n 19 ionice -c 3 /sbin/nilfs_cleanerd > /dev/null 2>&1 &
    fi
# If disk usage is less than 80%...
elif [ $used -lt 80 ]; then
    pkill nilfs_cleanerd > /dev/null 2>&1
fi
==

This could of course be improved and "enterpriseified" further, e.g. 
check for all nilfs partitions and do the checks on all of them, make 
the free space amount thresholds based on 1/3 and 2/3 of free space (fs 
size - du), but this problem shouldn't really be looking for a solution 
in a cron job.
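
A rough sketch of the multi-partition variant (untested; thresholds kept at 
90%/80% for simplicity, and it assumes nilfs_cleanerd accepts the device and 
mount point as arguments, so check your version's man page):

==
#!/bin/bash
# Start/stop nilfs_cleanerd per nilfs2 mount, with hysteresis thresholds.
HIGH=90   # start cleaning when usage rises above this
LOW=80    # stop cleaning once usage falls below this
awk '$3 == "nilfs2" { print $1, $2 }' /proc/mounts | while read dev mnt; do
    used=$(df -P "$mnt" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    if [ "$used" -gt "$HIGH" ]; then
        if ! pgrep -f "nilfs_cleanerd $dev" > /dev/null; then
            nohup nice -n 19 ionice -c 3 /sbin/nilfs_cleanerd "$dev" "$mnt" \
                > /dev/null 2>&1 &
        fi
    elif [ "$used" -lt "$LOW" ]; then
        pkill -f "nilfs_cleanerd $dev" > /dev/null 2>&1
    fi
done
==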


It's not ideal and nilfs_cleanerd should be configurable to moderate 
itself in a similar way, but until that happens, I don't see any 
alternative to the above cron job. The write performance is fantastic 
for tasks that do a lot of writing, but the life expectancy issue is a 
very real one.


Gordan


Cache Churn

2011-08-10 Thread Gordan Bobic
Another performance related problem I am seeing due to nilfs_cleanerd 
is that it causes unhealthy amounts of cache churn. Its reads and 
writes are buffered, which inevitably means that things it reads will 
get cached. Since it is going through all the blocks on the fs that have 
any garbage to collect, it will eat through all the available memory 
pretty quickly. It also means that it will push out of caches things 
that really should stay in caches.


Since cleanerd's actual disk I/O is going to have no correlation with 
actual file access pattern, is there a way to make cleanerd always 
operate with something like the O_DIRECT flag so that its reads won't 
fill up the page cache?


This is a pretty serious problem on small machines running off cheap 
flash (think ARM machines with 512MB of RAM and slow flash media).


The quick and dirty workaround I am pondering at the moment is to set 
up a cron job that runs once/minute, checks df, and starts/kills 
nilfs_cleanerd depending on how much free space is available, but that's 
not really a solution.


Gordan


Re: nilfs_cleanerd using a lot of disk-write bandwidth

2011-08-09 Thread Gordan Bobic
On Tue, 9 Aug 2011 17:19:01 +0200, dexen deVries wrote:

On Tuesday 09 of August 2011 14:25:07 you wrote:

 Interesting. I still think something should be done to minimize the
 amount of writes required. How about something like the following.
 Divide situations into 3 classes (thresholds should be adjustable in
 nilfs_cleanerd.conf):

 1) Free space good (e.g. space >= 25%)
 Don't do any garbage collection at all, unless an entire block contains
 only garbage.

 2) Free space low (e.g. 10% < space < 25%)
 Run GC as now, with the nice/ionice applied. Only GC blocks where
 $block_free_space_percent >= $disk_free_space_percent. So as the disk
 space starts to decrease, the number of blocks that get considered for
 GC increase, too.

 3) Free space critical (e.g. space < 10%)
 As 2) but start decreasing niceness/ioniceness (niceness by 3 for every
 1% drop in free space), so for example:
 10% - 19
 ...
 7% - 10
 ...
 4% - 1
 3% - -2
 ...
 1% - -8

 This would give a very gradual increase in GC aggressiveness that would
 both minimize unnecessary writes that shorten flash life and provide a
 softer landing in terms of performance degradation as space starts to
 run out.

 The other idea that comes to mind on top of this is to GC blocks in
 order of % of space in the block being reclaimable. That would allow for
 the minimum number of blocks to always be GC-ed to get the free space
 above the required threshold.

 Thoughts?



Could end up being too slow. A 2TB filesystem has about 260'000
segments (given the default size of 8MB). cleanerd already takes quite
a bit of CPU power at times.

Also, cleanerd can do a lot of HDD seeks, if some parts of the metadata
aren't in cache. Performing some 260'000 seeks on a hard drive would take
anywhere from 1000 to 3000 seconds; that's not very interactive. Actually,
it gets dangerously close to an hour.

However, if the cleanerd did not have to follow this exact algorithm, but
instead did something roughly similar (heuristics rather than an exact
algorithm), it could be good enough.


Well, you could adjust all the numbers in the algorithm. :)

As an aside, why would you use nilfs on a multi-TB FS? What's the 
advantage? The way I see it the killer application for nilfs is slow 
flash media with (probably) poorly implemented wear leveling.


The idea of the above is that you don't end up suffering poor disk 
performance due to background clean-up until you actually have a 
plausible chance of running out of space. What is the point of GC-ing if 
there is already 80% of empty space ready for writing to? All you'll be 
doing is making the fs slow for no obvious gain.



Possibly related, I'd love it if cleanerd tended to do some mild
de-fragmentation of files. Not necessarily full-blown, exact
defragmentation, just placing stuff close together.


If its garbage collection involves reading a block and re-writing it 
without the deleted data, then isn't that already effectively 
defragmenting the fs?


Gordan


Re: nilfs_cleanerd using a lot of disk-write bandwidth

2011-08-09 Thread Gordan Bobic
On Tue, 9 Aug 2011 13:03:54 +0200, dexen deVries wrote:

Hi Gordan,


On Tuesday 09 of August 2011 12:18:12 you wrote:
 I'm seeing nilfs_cleanerd using a lot of disk write bandwidth according
 to iotop. It seems to be performing approximately equal amounts of reads
 and writes when it is running. Reads I can understand, but why is it
 writing so much in order to garbage collect? Should it not be just
 trying to mark blocks as free? The disk I/O r/w symmetry implies that it
 is trying to do something like defragment the file system. Is there a
 way to configure this behaviour in some way? The main use-case I have
 for nilfs is cheap flash media that suffers from terrible random-write
 performance, but on such media this many writes are going to cause media
 failure very quickly. What can be done about this?



I'm not a NILFS2 developer, so don't rely too much on the following 
remarks!


NILFS2 considers the filesystem as a (wrapped-around) list of segments,
by default 8MB each. Those segments contain both file data and metadata.

cleanerd operates on whole segments; normally either 2 or 4 in one pass
(depending on remaining free space). It seems to me a segment is reclaimed
when there is any amount of garbage in it, no matter how small. Thus you
see, in some cases, about as much read as write.

One way could be to make cleanerd configurable so it doesn't reclaim
segments that have only very little garbage in them. That would probably
be a trade-off between wasted disk space and lessened bandwidth use.

As for wearing flash media down, I believe NILFS2 is still very good for
them, because it tends to write in large chunks -- much larger than the
original 512B sector -- and it does not over-write once-written areas
(until reclaimed by cleanerd, often much, much later). Once the flash's
large erase unit is erased, NILFS2 append-writes to it, but does not
over-write already written data. Which means the flash is erased almost
as little as possible.


Interesting. I still think something should be done to minimize the 
amount of writes required. How about something like the following. 
Divide situations into 3 classes (thresholds should be adjustable in 
nilfs_cleanerd.conf):


1) Free space good (e.g. space >= 25%)
Don't do any garbage collection at all, unless an entire block contains 
only garbage.


2) Free space low (e.g. 10% < space < 25%)
Run GC as now, with the nice/ionice applied. Only GC blocks where 
$block_free_space_percent >= $disk_free_space_percent. So as the disk 
space starts to decrease, the number of blocks that get considered for 
GC increase, too.


3) Free space critical (e.g. space < 10%)
As 2) but start decreasing niceness/ioniceness (niceness by 3 for every 
1% drop in free space), so for example:

10% - 19
...
7% - 10
...
4% - 1
3% - -2
...
1% - -8

This would give a very gradual increase in GC aggressiveness that would 
both minimize unnecessary writes that shorten flash life and provide a 
softer landing in terms of performance degradation as space starts to 
run out.
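
That ramp is just nice = 19 - 3*(10 - free%). A throwaway sketch of a wrapper 
applying it to an already-running cleaner (the renice-by-PID approach is an 
assumption, and negative niceness needs root):

==
#!/bin/bash
# Map free-space percentage to a niceness for nilfs_cleanerd:
# >= 10% free -> nice 19, then drop by 3 for every 1% below 10%.
free=$1                                   # free space, in percent
if [ "$free" -ge 10 ]; then
    nice=19
else
    nice=$(( 19 - 3 * (10 - free) ))      # 7% -> 10, 4% -> 1, 1% -> -8
fi
pid=$(pgrep -o nilfs_cleanerd) && renice -n "$nice" -p "$pid"
==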


The other idea that comes to mind on top of this is to GC blocks in 
order of % of space in the block being reclaimable. That would allow for 
the minimum number of blocks to always be GC-ed to get the free space 
above the required threshold.


Thoughts?

Gordan


nilfs_cleanerd using a lot of disk-write bandwidth

2011-08-09 Thread Gordan Bobic

Hi,

I'm seeing nilfs_cleanerd using a lot of disk write bandwidth according 
to iotop. It seems to be performing approximately equal amounts of reads 
and writes when it is running. Reads I can understand, but why is it 
writing so much in order to garbage collect? Should it not be just 
trying to mark blocks as free? The disk I/O r/w symmetry implies that it 
is trying to do something like defragment the file system. Is there a 
way to configure this behaviour in some way? The main use-case I have 
for nilfs is cheap flash media that suffers from terrible random-write 
performance, but on such media this many writes are going to cause media 
failure very quickly. What can be done about this?


Gordan


Re: Applying nice/ionice to nilfs-cleanerd

2011-08-08 Thread Gordan Bobic
On Mon, 8 Aug 2011 10:09:31 +0200, dexen deVries wrote:


[lowering cleanerd priority]


However, there may be a downside to /always/ running cleanerd niced and
ioniced.  I believe that currently cleanerd's activity slows other
processes down a lot when the filesystem is almost full -- which means
that it often won't become truly full, because cleanerd will free enough
space for other processes to be able to complete their work. If, on the
other hand, cleanerd was highly niced and ioniced, it could end up being
starved of CPU and disk bandwidth and not freeing enough space, which
could cause other processes to exhaust the free space on the filesystem
and abort when not able to write to it.


I was just thinking about that. This would only be an issue on a system 
that is either very constrained in terms of disk space or is never idle, 
though.


Perhaps it would be enough to have cleanerd automatically switch priority
based on available free space. For example, if I had
min_clean_segments  10%
max_clean_segments  12%

then also have
min_clean_segments_low_prio 8%

low_prio_nice 19
normal_prio_nice 0

low_prio_ionice_class idle
normal_prio_ionice_class realtime

which would mean `use low priority (nice & ionice) when there's at least
8% of free segments; if there's less, use higher priority' -- so cleanerd
would reclaim free space more aggressively when there's little free space
left.


I was thinking about something similar. Realtime ionice is OTT, though, 
I don't think it should ever be ioniced over normal. But yes, I think 
this would be a good idea.


Gordan


Re: Applying nice/ionice to nilfs-cleanerd

2011-08-07 Thread Gordan Bobic

On 08/08/2011 12:23 AM, Ryusuke Konishi wrote:


  Is there a way to set default nice/ionice levels for nilfs-cleanerd?


At present, you have to manually invoke the cleanerd through the
nice/ionice commands or to run renice/ionice later specifying the
process ID of the cleanerd.
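
Concretely, something along these lines works today (standard coreutils and 
util-linux commands; the device and mount point are placeholders, so check 
nilfs_cleanerd(8) for the exact invocation on your version):

==
# Start the cleaner at low priority (device/mount point are examples):
nice -n 19 ionice -c 3 nilfs_cleanerd /dev/sdb1 /mnt/nilfs

# ...or lower the priority of an already-running instance by PID:
pid=$(pgrep -o nilfs_cleanerd)
renice -n 19 -p "$pid"
ionice -c 3 -p "$pid"
==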

One way to make this convenient is introducing new directives in
/etc/nilfs_cleanerd.conf as follows:

  # Scheduling priority.
nice          19      # niceness -20~19

  # IO scheduling class.
  # Supported classes are default, idle, best-effort, and realtime.
  ionice_class  idle

  # IO scheduling priority.
  # 0-7 is valid for best-effort and realtime classes.
  ionice_data   5

Do you think these extensions make sense ?


Yes, I think those would be really handy. It would also mean that the 
cleanerd could be scheduled to run more aggressively but at lower 
priority, so the clean-up would be potentially more up to date while 
having less impact on the system performance.


Gordan


Applying nice/ionice to nilfs-cleanerd

2011-08-07 Thread Gordan Bobic

Hi,

Is there a way to set default nice/ionice levels for nilfs-cleanerd?

Gordan


Re: SSD and non-SSD Suitability

2010-05-29 Thread Gordan Bobic

David Arendt wrote:

4) As the data gets expired, and snapshots get deleted, this will 
inevitably lead to fragmentation, which will de-linearize writes as they 
have to go into whatever holes are available in the data. How does this 
affect nilfs write performance?


For now, my understanding is that the nilfs garbage collector moves the live
(in-use) blocks to the end of the logs, so holes are not created (is that
correct?). However, this leads to another issue: the garbage collector
process, which is nilfs_cleanerd, will consume I/O. This is the major I/O
performance bottleneck of the current implementation.
  
Since this moves files, it sounds like this could be a major issue for 
flash media since it unnecessarily creates additional writes. Can this 
be suppressed?


You can simply kill nilfs_cleanerd after you mount the nilfs partition.
  

If you use the latest nilfs_utils, killing nilfs_cleanerd is no longer
necessary. You can use mount -o nogc. This will not start
nilfs_cleanerd. Another possibility is to let nilfs_cleanerd start and
tweak min_free_segments and max_free_segments so that cleanerd will only
do cleaning if necessary.
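
For example, a sketch of the nogc route (the option is what recent nilfs-utils 
document, so check mount.nilfs2 on your version; device and mount point are 
placeholders):

==
# Mount without starting the garbage collector at all:
mount -t nilfs2 -o nogc /dev/sdb1 /mnt/nilfs
==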


What about making the gc run only if the disk has been idle for, say, 
20ms, unless min_free_segments is reached?


Gordan


Re: SSD and non-SSD Suitability

2010-05-29 Thread Gordan Bobic

Jiro SEKIBA wrote:

2) Mechanical disks suffer from slow random writes (or any random 
operation for that matter), too. Do the benefits of nilfs show in random 
write performance on mechanical disks?

I think it may have benefits, for nilfs will write sequentially whatever
data is located before writing it.  But still some tweaks might be required
to speed it up compared with an ordinary filesystem like ext3.
Can you quantify what those tweaks may be, and when they might become 
available/implemented?


I might have chosen the wrong word, but what I meant is that more hacking is
required to improve write performance.  It is not just a configuration
matter :(.


I understand what you meant. I just wanted to know when those hacks may 
be implemented and be available for those of us interested in using 
nilfs to optimize write-heavy workloads.


3) How does this affect real-world read performance if nilfs is used on 
a mechanical disk? How much additional file fragmentation in absolute 
terms does nilfs cause?

The data is scattered if you modified the file again and again,
but it'll be almost sequential at the creation time.  So it will
affect much if files are modified frequently.

Right. So bad for certain tasks, such as databases.


Indeed. maybe /var type of directories too.


Interesting. So nilfs' suitability for write heavy loads is actually 
quite limited on mechanical disks, as it isn't suitable for append-heavy 
situations such as databases and logging, but for use-cases that are 
write+delete heavy such as mail servers or other spool type loads it 
should still be advantageous.


4) As the data gets expired, and snapshots get deleted, this will 
inevitably lead to fragmentation, which will de-linearize writes as they 
have to go into whatever holes are available in the data. How does this 
affect nilfs write performance?

For now, my understanding is that the nilfs garbage collector moves the live
(in-use) blocks to the end of the logs, so holes are not created (is that
correct?). However, this leads to another issue: the garbage collector
process, which is nilfs_cleanerd, will consume I/O. This is the major I/O
performance bottleneck of the current implementation.
Since this moves files, it sounds like this could be a major issue for 
flash media since it unnecessarily creates additional writes. Can this 
be suppressed?


You can simply kill nilfs_cleanerd after you mount the nilfs partition.
In this case, of course, no garbage is reclaimed, and you finally end up
with a full disk even if the total size of the files doesn't fill the
storage.

I don't have data for now, but it gave about twice the write performance
compared with running with the garbage collector.


What about enabling garbage collection, but disabling defragmentation? 
De-allocating space that isn't used any more is a necessary evil, but 
defragmentation is rather pointless in a lot of cases (e.g. SSDs) and 
counter-productive in others (mechanical disks under heavy load). Also, 
what about making the garbage collector "lazy", so that it runs either 
just in time to overwrite discarded data (worst case scenario) or runs 
when the disks are idle (e.g. at ionice -c3, and even that only when 
there have been no disk transactions for some selectable number of ms)?


Gordan


Re: SSD and non-SSD Suitability

2010-05-28 Thread Gordan Bobic
This thread will continue off list because it seems to have lost all 
relevance to nilfs.



Gordan


Re: SSD and non-SSD Suitability

2010-05-28 Thread Gordan Bobic

Vincent Diepeveen wrote:

1) Modern SSDs (e.g. Intel) do this logical/physical mapping 
internally, so that the writes happen sequentially anyway.
Could you explain that, as far as I know modern SSD's have 8 
independent channels to do reads and writes, which is why they are 
having that big read and write speed and can in theory therefore 
support 8 threads doing reads and writes. Each channel say using 
blocks of 4KB, so it's 64KB in total.


I'm talking about something else. I'm talking about the fact that 
you can turn logical random writes into physical sequential writes 
by re-mapping logical blocks to sequential physical blocks.

That's doing 2 steps back in history isn't it?


Sorry, I don't see what you mean. Can you elaborate?


I didn't investigate NILFS, but under all conditions what you want to 
avoid is some sort of central locking of the file system,
because if you're proposing all sorts of fancy stuff to the file system 
whereas you can already do your thing using full bandwidth of the SSD.


Are you actually claiming that you can achieve full write throughput on 
random writes that you can achieve on sequential writes on an SSD? Try 
that with write caches on the drive disabled.


It really is interesting to have a file system where you do a minimum 
number of actions to the file system
so that other threads can do their work there. Any complicated 
datastructure manipulation that requires central locking

or other forms of complicated locking will limit other i/o actions.


I agree.

Gordan


Re: SSD and non-SSD Suitability

2010-05-28 Thread Gordan Bobic

Vincent Diepeveen wrote:

The big speedup that SSD's deliver for average usage is ESPECIALLY 
because of the faster random access to the hardware.


Sure - on reads. Writes are a different beast. Look at some reviews of 
SSDs of various types and generations. Until relatively recently, 
random write performance (and to a large extent, any write 
performance) on them has been very poor. Cheap flash media (e.g. USB 
sticks) still suffers from this.




You wouldn't want to optimize a file system for hardware of the past, would you?

Before a file system is at all mature, the hardware that is the standard 
today will be very common.


There are a few problems with that line of reasoning.

1) Legacy support is important. If it wasn't, file systems would be 
strictly in the realm of fixed disk manufacturers, and we would all be 
using object based storage. This hasn't happened, nor is it likely to in 
the next decade.


2) We cannot optimize for hardware of the future, because this hardware 
may never arrive.


3) "Hardware of the past" is still very much in full production, and 
isn't going away any time soon.


The only sane option is to optimize for what is prevalent right now.

if you have some petabytes of storage, i guess the bigger bandwidth 
that SSD's deliver is not relevant, as the limitation
is the network bandwidth anyway, so some raid5 with extra spare will 
deliver more than sufficient bandwidth.


RAID3/4/5/6 is inherently unsuitable for fast random writes because of 
the write-read-write cycle required to update the parity.




Nearly all major supercomputers use raid5 with extra spare as well as 
most database servers.


Can you quantify that bold statement?

I would expect vastly higher levels of RAID than RAID5 on 
supercomputers, because RAID5 doesn't scale sufficiently. RAID6 is a bit 
better, but still doesn't really scale. It comes down to data error 
rates on disks. RAID5 with current error rates tops out at about 6-8TB, 
which is pitifully small on the supercomputer scale.


Anybody deploying RAID5 on high-performance database servers that are 
expected to have more than about 1% write:read ratio has no business 
being a database administrator, IMO.


Then again the fact that I have managed to optimize the performance of 
most systems I've been called to provide consultancy on by factors of 
between 10 and 1000 without requiring any new hardware shows me that the 
industry is full of people who haven't got a clue what they are doing.



Stock exchange is more into raid10 type clustering,
but those few harddrives that the stock exchange uses, is that relevant?


You're pulling examples out of the air, and it is difficult to discuss 
them without in-depth system design information. And I doubt you have 
access to that level of the system design information of stock exchange 
systems unless you work for one. Do you?


So a file system should benefit from the special properties of a 
SSD to be suited for this modern hardware.


The only actual benefit is decreased latency.
Which is mighty important; so the ONLY interesting type of filesystem 
for a SSD is a filesystem

that is optimized for read and write latency rather than bandwidth IMHO.


Indeed, I agree (up to a point). Random IOPS has long been the 
defining measure of disk performance for a reason.


I'm always very careful saying a benchmark is holy.


Most aren't, but every once in a while a meaningful one comes up. Random 
IOPS one is one such (relatively rare) example.



Especially read latency i consider most important.


Depends on your application. Remember that reads can be sped up by 
caching.


Even relative simple caching is very difficult to improve, with random 
reads.


The random read speed is of overwhelming influence.


20 years of experience in high-performance applications, databases and 
clusters showed me otherwise. Random read speed is only an issue until 
your caches are primed, or if your data set is sufficiently big to 
overwhelm any practical amount of RAM you could apply.


I look after a number of systems running applications that are 
write-bound because the vast majority of reads can be satisfied from 
page cache, but writes are unavoidable because transactions have to be 
committed to persistent storage.


You're assuming the working set size fits in caching, which is a very 
interesting assumption.


Not necessarily the whole working set, but a decent chunk of it, yes. If 
it doesn't, you probably need to re-assess what you're trying to do.


For example, on databases, as a rule of thumb you need to size your RAM 
so that all indexes aggregated fit into 50-75% of your RAM. The rest of 
the RAM is used for page caches for the actual data.


To put it into a different perspective - a typical RHEL server install 
is 5-6GB. That fits into the RAM on the machine on my desk, and almost 
fits into the RAM of the laptop on typing up this email on.


If your working set is measured in petabytes, then you are probably

Re: SSD and non-SSD Suitability

2010-05-28 Thread Gordan Bobic

Vincent Diepeveen wrote:

1) Modern SSDs (e.g. Intel) do this logical/physical mapping 
internally, so that the writes happen sequentially anyway.
Could you explain that, as far as I know modern SSD's have 8 
independent channels to do reads and writes, which is why they are 
having that big read and write speed and can in theory therefore 
support 8 threads doing reads and writes. Each channel say using 
blocks of 4KB, so it's 64KB in total.


I'm talking about something else. I'm talking about the fact that you 
can turn logical random writes into physical sequential writes by 
re-mapping logical blocks to sequential physical blocks.


That's doing 2 steps back in history isn't it?


Sorry, I don't see what you mean. Can you elaborate?

The big speedup that SSD's deliver for average usage is ESPECIALLY 
because of the faster random access to the hardware.


Sure - on reads. Writes are a different beast. Look at some reviews of 
SSDs of various types and generations. Until relatively recently, random 
write performance (and to a large extent, any write performance) on them 
has been very poor. Cheap flash media (e.g. USB sticks) still suffers 
from this.


Don't confuse fast random reads with fast random writes.

if you have some petabytes of storage, i guess the bigger bandwidth that 
SSD's deliver is not relevant, as the limitation
is the network bandwidth anyway, so some raid5 with extra spare will 
deliver more than sufficient bandwidth.


RAID3/4/5/6 is inherently unsuitable for fast random writes because of the 
write-read-write cycle required to update the parity.


So a file system should benefit from the special properties of a SSD 
to be suited for this modern hardware.


The only actual benefit is decreased latency.


Which is mighty important; so the ONLY interesting type of filesystem 
for a SSD is a filesystem

that is optimized for read and write latency rather than bandwidth IMHO.


Indeed, I agree (up to a point). Random IOPS has long been the defining 
measure of disk performance for a reason.



Especially read latency i consider most important.


Depends on your application. Remember that reads can be sped up by caching.

I look after a number of systems running applications that are 
write-bound because the vast majority of reads can be satisfied from 
page cache, but writes are unavoidable because transactions have to be 
committed to persistent storage.


You cannot limit your performance assessment to the use-case of an 
average desktop user running Firefox, Thunderbird and OpenOffice 99% of 
the time. Those are not the users that file systems advances of the past 
30 years are aimed at.


Of course i understand you skip ext4 as that obviously still has to 
get bugfixed.


It seems to be deemed stable enough for several distros, and will be 
the default in RHEL6 in a few months' time, so that's less of a concern.




I ran into severe problems with ext4 and i just used it at 1 harddrive, 
same experiences with other linux users.


How recently have you tried it? RHEL6b has only been out for a month.


Note i used ubuntu.


I guess that explains some of your desktop-centric views.


Stuff like RHEL costs more per copy than I have in my bank account.


RHEL6b is a public beta, freely downloadable.

CentOS is a community recompile of RHEL, 100% binary compatible, just 
with different artwork/logos. Freely available. As is Scientific Linux 
(a very similar project to CentOS, also a free recompile of RHEL). If 
you haven't found them, you can't have looked very hard.


I am more interested in metrics for how much writing is required 
relative to the amount of data being transferred. For example, if I am 
restoring a full running system (call it 5GB) from a tar ball onto 
nilfs2, ext2, ext3, btrfs, etc., I am interested in how many blocks 
worth of writes actually hit the disk, and to a lesser extent how many 
of those end up being merged together (since merged operations, in 
theory, can cause less wear on an SSD because bigger blocks can be 
handled more efficiently if erasing is required).


The most efficient blocksize for SSD's is 8 channels of 4KB blocks.


I'm not going to bite and get involved in debating the correctness of 
this (somewhat limited) view. I'll just point out that it bears very 
little relevance to the paragraph that it appears to be responding to.


Gordan


Re: SSD and non-SSD Suitability

2010-05-28 Thread Gordan Bobic

Jiro SEKIBA wrote:


I haven't got any particular quantitative data of my own,
so I'll write a somewhat subjective opinion.


Thanks, I appreciate it. :)

I've got a somewhat broad question on the suitability of nilfs for 
various workloads and different backing storage devices. From what I 
understand from the documentation available, the idea is to always write 
sequentially, and thus avoid slow random writes on old/naive SSDs. Hence 
I have a few questions.


1) Modern SSDs (e.g. Intel) do this logical/physical mapping internally, 
so that the writes happen sequentially anyway. Does nilfs demonstrably 
provide additional benefits on such modern SSDs with sensible firmware?


In terms of writing performance, it may not have additional benefits I guess.
However, it still has benefits with regard to continuous snapshots.


How does this compare with btrfs snapshots? When you say continuous, 
what are the breakpoints between them?


2) Mechanical disks suffer from slow random writes (or any random 
operation for that matter), too. Do the benefits of nilfs show in random 
write performance on mechanical disks?


I think it may have benefits, for nilfs will write sequentially whatever
data is located before writing it.  But still some tweaks might be required
to speed it up compared with an ordinary filesystem like ext3.


Can you quantify what those tweaks may be, and when they might become 
available/implemented?


3) How does this affect real-world read performance if nilfs is used on 
a mechanical disk? How much additional file fragmentation in absolute 
terms does nilfs cause?


The data is scattered if you modified the file again and again,
but it'll be almost sequential at the creation time.  So it will
affect much if files are modified frequently.


Right. So bad for certain tasks, such as databases.

4) As the data gets expired, and snapshots get deleted, this will 
inevitably lead to fragmentation, which will de-linearize writes as they 
have to go into whatever holes are available in the data. How does this 
affect nilfs write performance?


For now, my understanding is that the nilfs garbage collector moves the live
(in-use) blocks to the end of the logs, so holes are not created (is that
correct?). However, this leads to another issue: the garbage collector
process, which is nilfs_cleanerd, will consume I/O. This is the major I/O
performance bottleneck of the current implementation.


Since this moves files, it sounds like this could be a major issue for 
flash media since it unnecessarily creates additional writes. Can this 
be suppressed?


5) How does the specific writing amount measure against other file 
systems (I'm specifically interested in comparisons vs. ext2). What I 
mean by specific writing amount is for writing, say, 100,000 random 
sized files, how many write operations and MBs (or sectors) of writes 
are required for the exact same operation being performed on nilfs and 
ext2 (e.g. as measured by vmstat -d).


You can find public benchmark results at the following links.
However those are a bit old and current results may differ.

http://www.phoronix.com/scan.php?page=article&item=ext4_btrfs_nilfs2&num=1
http://www.linux-mag.com/cache/7345/1.html


Thanks.

Gordan


Re: SSD and non-SSD Suitability

2010-05-28 Thread Gordan Bobic

Vincent Diepeveen wrote:

1) Modern SSDs (e.g. Intel) do this logical/physical mapping 
internally, so that the writes happen sequentially anyway.


Could you explain that, as far as I know modern SSD's have 8 independent 
channels to do reads and writes, which is why they are having that big 
read and write speed and can in theory therefore support 8 threads doing 
reads and writes. Each channel say using blocks of 4KB, so it's 64KB in 
total.


I'm talking about something else. I'm talking about the fact that you 
can turn logical random writes into physical sequential writes by 
re-mapping logical blocks to sequential physical blocks. Old, naive 
flash without clever firmware was always good at sequential writes but 
bad at random writes. Since fragmentation on flash doesn't matter since 
there is no seek time, modern SSDs use such re-mapping to prolong flash 
life, reduce the need for erasing blocks and improve random write 
performance by linearizing it.


This is completely independent of the fact that you might be able to 
write to the flash chips in a more parallel fashion because the disk 
ASIC has the ability to use more of them simultaneously.


Does nilfs demonstrably provide additional benefits on such modern 
SSDs with sensible firmware?


2) Mechanical disks suffer from slow random writes (or any random 
operation for that matter), too. Do the benefits of nilfs show in 
random write performance on mechanical disks?


3) How does this affect real-world read performance if nilfs is used 
on a mechanical disk? How much additional file fragmentation in 
absolute terms does nilfs cause?




Basically the main difference between SSD's and traditional disks is 
that SSD's have a faster latency, have more than 1 channel and write 
small blocks of 4KB, whereas 64KB read/writes are already real small for 
a traditional disk.


Which begs the question why the traditional disks only support 
multi-sector transfers of up to 16 sectors, but that's a different question.


So a file system should benefit from the special properties of a SSD to 
be suited for this modern hardware.


The only actual benefit is decreased latency.

4) As the data gets expired, and snapshots get deleted, this will 
inevitably lead to fragmentation, which will de-linearize writes as 
they have to go into whatever holes are available in the data. How 
does this affect nilfs write performance?


5) How does the specific writing amount measure against other file 
systems (I'm specifically interested in comparisons vs. ext2). What I 
mean by specific writing amount is for writing, say, 100,000 random 
sized files, how many write operations and MBs (or sectors) of writes 
are required for the exact same operation being performed on nilfs and 
ext2 (e.g. as measured by vmstat -d).


Isn't ext2 a bit old?


So? The point is that it has no journal, which means fewer writes. fsck 
on SSDs only takes a few minutes at most.


Of course i understand you skip ext4 as that obviously still has to get 
bugfixed.


It seems to be deemed stable enough for several distros, and will be the 
default in RHEL6 in a few months' time, so that's less of a concern.


I am more interested in metrics for how much writing is required 
relative to the amount of data being transferred. For example, if I am 
restoring a full running system (call it 5GB) from a tar ball onto 
nilfs2, ext2, ext3, btrfs, etc., I am interested in how many blocks 
worth of writes actually hit the disk, and to a lesser extent how many 
of those end up being merged together (since merged operations, in 
theory, can cause less wear on an SSD because bigger blocks can be 
handled more efficiently if erasing is required).


Gordan


SSD and non-SSD Suitability

2010-05-26 Thread Gordan Bobic
I've got a somewhat broad question on the suitability of nilfs for 
various workloads and different backing storage devices. From what I 
understand from the documentation available, the idea is to always write 
sequentially, and thus avoid slow random writes on old/naive SSDs. Hence 
I have a few questions.


1) Modern SSDs (e.g. Intel) do this logical/physical mapping internally, 
so that the writes happen sequentially anyway. Does nilfs demonstrably 
provide additional benefits on such modern SSDs with sensible firmware?


2) Mechanical disks suffer from slow random writes (or any random 
operation for that matter), too. Do the benefits of nilfs show in random 
write performance on mechanical disks?


3) How does this affect real-world read performance if nilfs is used on 
a mechanical disk? How much additional file fragmentation in absolute 
terms does nilfs cause?


4) As the data gets expired, and snapshots get deleted, this will 
inevitably lead to fragmentation, which will de-linearize writes as they 
have to go into whatever holes are available in the data. How does this 
affect nilfs write performance?


5) How does the specific writing amount measure against other file 
systems (I'm specifically interested in comparisons vs. ext2). What I 
mean by specific writing amount is for writing, say, 100,000 random 
sized files, how many write operations and MBs (or sectors) of writes 
are required for the exact same operation being performed on nilfs and 
ext2 (e.g. as measured by vmstat -d).


Many thanks.

Gordan


TRIM Support?

2010-05-25 Thread Gordan Bobic

Hi,

I notice that the pitch for NILFS is that it is particularly suitable 
for flash based media. Does it have any sort of support for TRIM 
command? If not, is there at least an equivalent of dumpfs that could be 
used to get the list of free blocks that could be passed to hdparm to 
issue TRIM commands to the SSD?


TIA.

Gordan