RE: Question about synchronous write on SSD

2008-02-19 Thread Kyungmin Park
> >
> > Agreed, but consider the following sequence.
> >
> > __generic_make_request() calls q->make_request_fn(q, bio), which
> > blk_init_queue_node() sets to __make_request().
> > There are two paths through __make_request():
> > Case 1, get_rq
> > Case 2, out or merged (the unplug case you mean, I assume)
> >
> > In case 1, if BIO_RW_SYNC is set, the request gets REQ_RW_SYNC,
> > and REQ_RW_SYNC is documented as
> > "include/linux/blkdev.h":112: __REQ_RW_SYNC,  /* request is sync (O_DIRECT) */
> > which means it acts like the O_DIRECT flag. Is that right?
> > Case 2 behaves the same way: the device is unplugged,
> > so the next bio has no chance to be merged???
> 
> But that still doesn't make it sync. I think you are approaching this
> the wrong way. For an SSD we still want merging, and plugging also makes
> sense to some degree, though it probably should be minimized. It only
> costs an initial latency; for busy IO workloads you won't be
> plugging/unplugging much anyway.
> 
> In fact your patch makes things WORSE, since the io schedulers will now
> treat the IO as sync and introduce idling for the disk head. And you
> definitely don't want that.

Yes, you're right. It's only for testing.
I just want to know the worst (corner) case, i.e. when all writes are synchronous.
Of course I could measure that with the tiotest "do write synchronous" option.
So do you think that is the worst case?

Kyungmin Park



RE: Question about synchronous write on SSD

2008-02-19 Thread Kyungmin Park
> On Tue, 19 Feb 2008 14:48:18 +0900,
> "Kyungmin Park" <[EMAIL PROTECTED]> wrote:
> 
> > +   /* Write synchronous */
> > +   bio->bi_rw |= (1 << BIO_RW_SYNC);
> 
> Adding BIO_RW_SYNC doesn't make generic_make_request() synchronous in
> the sense of "generic_make_request() returns only after write completion".
> BIO_RW_SYNC only asks the I/O layer to unplug immediately;
> generic_make_request() still returns before the I/O completes, and the
> completion is notified asynchronously.
> 
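(For reference, a minimal sketch of what a truly synchronous submission
looks like at the bio level: the caller has to wait for the completion
callback itself. This is only an illustration, not part of any posted
patch; sync_write_endio and the surrounding context are made up for the
example.)

static void sync_write_endio(struct bio *bio, int error)
{
	complete((struct completion *)bio->bi_private);
}

	/* caller side */
	struct completion done;

	init_completion(&done);
	bio->bi_private = &done;
	bio->bi_end_io = sync_write_endio;
	submit_bio(WRITE, bio);
	wait_for_completion(&done);	/* returns only after the write has completed */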

Agreed, but consider the following sequence.

__generic_make_request() calls q->make_request_fn(q, bio), which
blk_init_queue_node() sets to __make_request().
There are two paths through __make_request():
Case 1, get_rq
Case 2, out or merged (the unplug case you mean, I assume)

In case 1, if BIO_RW_SYNC is set, the request gets REQ_RW_SYNC, and
REQ_RW_SYNC is documented as
"include/linux/blkdev.h":112: __REQ_RW_SYNC,  /* request is sync (O_DIRECT) */
which means it acts like the O_DIRECT flag. Is that right?
Case 2 behaves the same way: the device is unplugged, so the next bio
has no chance to be merged.

Is my understanding right?
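For reference, this is roughly the path I mean in __make_request()
(abridged from memory, so a sketch rather than a verbatim quote of
block/blk-core.c):

static int __make_request(struct request_queue *q, struct bio *bio)
{
	const int sync = bio_sync(bio);	/* tests BIO_RW_SYNC */
	int rw_flags;

	/* ... merge attempts elided; a successful merge jumps to out ... */

get_rq:
	/* case 1: no merge possible, allocate a new request */
	rw_flags = bio_data_dir(bio);
	if (sync)
		rw_flags |= REQ_RW_SYNC;	/* io scheduler now treats it as sync */

	/* ... request setup and insertion elided ... */

out:
	if (sync)
		__generic_unplug_device(q);	/* unplug immediately */

	/* ... */
	return 0;
}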

BR,
Kyungmin Park



Question about synchronous write on SSD

2008-02-18 Thread Kyungmin Park
Hi,

Do you remember the topic "solid state drive access and context
switching" [1]?
I want to measure whether it really gives better performance on an SSD.

To make writes to the SSD synchronous, I hacked
generic_make_request() [2] and got the following results.

# echo 3 > /proc/sys/vm/drop_caches
# tiotest -f 100 -R -d /dev/sdd1
Tiotest results for 4 concurrent io threads:
| Item           | Time     | Rate         | Usr CPU  | Sys CPU |
| Write  400 MBs |    4.7 s |  84.306 MB/s |   8.4 %  |  77.5 % |
| Read   400 MBs |    4.3 s |  92.945 MB/s |   7.2 %  |  53.5 % |
Tiotest latency results:
| Item  | Average latency | Maximum latency | % >2 sec | % >10 sec |
| Write |        0.126 ms |      706.379 ms |      0.0 |       0.0 |
| Read  |        0.161 ms |      311.738 ms |      0.0 |       0.0 |

# echo 3 > /proc/sys/vm/drop_caches
# tiotest -f 1000 -R -d /dev/sdd1
Tiotest results for 4 concurrent io threads:
| Item           | Time     | Rate         | Usr CPU  | Sys CPU |
| Write 4000 MBs |   47.5 s |  84.124 MB/s |   7.0 %  |  83.6 % |
| Read  4000 MBs |   41.9 s |  95.530 MB/s |   7.8 %  |  55.6 % |
Tiotest latency results:
| Item  | Average latency | Maximum latency | % >2 sec | % >10 sec |
| Write |        0.176 ms |      714.677 ms |      0.0 |       0.0 |
| Read  |        0.161 ms |      311.815 ms |      0.0 |       0.0 |

However, the performance is the same as before, which means the patch
has no effect.

Could you tell me whether this is the proper place to hack, or is there
a better one?

Thank you,
Kyungmin Park

P.S. Of course I got the following warning:
WARNING: at block/blk-core.c:1351 generic_make_request+0x126/0x3d8()

1. http://lkml.org/lkml/2007/12/3/247
2. simple hack
diff --git a/block/blk-core.c b/block/blk-core.c
index e9754dc..7262720 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1345,6 +1345,14 @@ static inline void __generic_make_request(struct bio *bio)
if (bio_check_eod(bio, nr_sectors))
goto end_io;

+   /* FIXME simple hack by kmpark */
+   if (MINOR(bio->bi_bdev->bd_dev) == 49 &&
+   MAJOR(bio->bi_bdev->bd_dev) == 8 && bio_data_dir(bio) == WRITE) {
+   WARN_ON_ONCE(1);
+   /* Write synchronous */
+   bio->bi_rw |= (1 << BIO_RW_SYNC);
+   }
+
/*
 * Resolve the mapping until finished. (drivers are
 * still free to implement/resolve their own stacking


Re: [PATCH] CRAMFS: Uncompressed files support

2008-01-27 Thread Kyungmin Park
> > This patch enables uncompressed file support in cramfs.
> >
> > The term 'uncompressed file' comes from linear cramfs (aka Application XIP).
> > In linear cramfs it is used to support XIP on NOR flash, but it is also
> > useful on OneNAND: it makes the filesystem faster by removing the
> > decompression overhead.
> > In XIP mode the file executes in place; in non-XIP mode the data is
> > copied to RAM and run from there.
> >
> > In a simple test, copying busybox (compressed vs. uncompressed), it
> > cuts the time by about 50%, from 0.40s to 0.19s.
> > Yes, it increases the filesystem size, but nowadays flash has plenty
> > of capacity; it's a trade-off between size and performance.
> >
> > This patch also uses the page cache directly.
> > The previous implementation used a local buffer. Why? The data is
> > already uncompressed and fits the page size, so the page can be used
> > directly, removing a useless memory copy.
> >
> > It's compatible with existing linear cramfs images and the original format.
> >
> > Any comments are welcome.
> >
> > diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> > index 3d194a2..edba28f 100644
> > --- a/fs/cramfs/inode.c
> > +++ b/fs/cramfs/inode.c
>
> I don't know what kernel this is against but it generates a lot of rejects
> against present mainline.
>

Sorry, that was generated against another tree instead of mainline. The
following patch is against mainline.

>
> Are you sure about the kmap/kunmap handling in this patch?  It looks like
> it might be unbalanced.
>

Yes, it's balanced. cramfs_read() returns a virtual address: if it
mapped a page, it returns the kmap()-ed address, and the caller later
kunmap()s and releases it; otherwise no page pointer is involved.

Anyway, here's a simpler method using the get_block style.
With this patch it could also do multi-page reads via a
cramfs_readpages, although that showed no big improvement in my
environment.

The goal is the same: use the requested page directly instead of doing
a useless memory copy.
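For illustration, a readpages hook on top of cramfs_get_block() could be
wired through the generic mpage helper roughly like this (a sketch only,
not part of the patch below; it would be hooked up as .readpages in
cramfs_aops):

static int cramfs_readpages(struct file *file, struct address_space *mapping,
			    struct list_head *pages, unsigned nr_pages)
{
	struct inode *inode = mapping->host;

	/* Only uncompressed (XIP-marked) files map 1:1 onto the device */
	if (!CRAMFS_INODE_IS_XIP(inode))
		return 0;	/* readahead is best-effort; ->readpage handles the rest */

	return mpage_readpages(mapping, pages, nr_pages, cramfs_get_block);
}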

Any comments are welcome.

Thank you,
Kyungmin Park

Signed-off-by: Kyungmin Park <[EMAIL PROTECTED]>

diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index 350680f..2d6736d 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include <linux/mpage.h>
 #include 

 #include 
@@ -40,6 +41,7 @@ static DEFINE_MUTEX(read_mutex);
 #define CRAMINO(x) (((x)->offset && (x)->size)?(x)->offset<<2:1)
 #define OFFSET(x)  ((x)->i_ino)

+#define CRAMFS_INODE_IS_XIP(x) ((x)->i_mode & S_ISVTX)

 static int cramfs_iget5_test(struct inode *inode, void *opaque)
 {
@@ -468,16 +470,32 @@ static struct dentry * cramfs_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
return NULL;
 }

+static int cramfs_get_block(struct inode *inode, sector_t iblock,
+   struct buffer_head *bh_result, int create)
+{
+   /* A write? */
+   BUG_ON(unlikely(create));
+
+   iblock += (PAGE_ALIGN(OFFSET(inode)) >> PAGE_CACHE_SHIFT);
+   map_bh(bh_result, inode->i_sb, iblock);
+
+   return 0;
+}
+
 static int cramfs_readpage(struct file *file, struct page * page)
 {
struct inode *inode = page->mapping->host;
+   struct super_block *sb = inode->i_sb;
u32 maxblock, bytes_filled;
void *pgdata;

maxblock = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
bytes_filled = 0;
-   if (page->index < maxblock) {
-   struct super_block *sb = inode->i_sb;
+
+   /* Handle uncompressed files */
+   if (CRAMFS_INODE_IS_XIP(inode) && page->index < maxblock) {
+   return mpage_readpage(page, cramfs_get_block);
+   } else if (page->index < maxblock) {
u32 blkptr_offset = OFFSET(inode) + page->index*4;
u32 start_offset, compr_len;


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Kyungmin Park
On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote:
> On Tuesday 30 October 2007, Dongjun Shin wrote:
> > There is an ongoing discussion about adding 'Trim' ATA command for notifying
> > the drive about the deleted blocks.
> >
> > http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf
> >
> > This is especially useful for the storage device like Solid State Drive 
> > (SSD).
> >
> This make me curious, why would t13 want to invent a new command when
> there is already the erase command from CFA?
>
> It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE
> should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
> http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf
>

IMHO, the main difference is whether a physical operation is required.
CFA_ERASE erases the freed blocks, so it requires a physical erase
operation. Trim, on the other hand, just unmaps the freed blocks at the
FTL level; no physical operation is required.
That saves time, and the FTL can do a lot of work internally with this
information.

Thank you,
Kyungmin Park



Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Kyungmin Park
On 10/31/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote:
> On Tuesday 30 October 2007, Jörn Engel wrote:
> > On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
> > > On 10/30/07, Arnd Bergmann <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Not sure. Why shouldn't you be able to reorder the hints provided that
> > > > they don't overlap with read/write bios for the same block?
> > >
> > > You're right. The bios can be reordered if they don't overlap with hint.
> >
> > I would keep things simpler. Bios can be reordered, full stop. If an
> > erase and a write overlap, the caller (filesystem?) has to add a
> > barrier.
>
> I thought bios were already ordered if they affect the same blocks.
> Either way, I agree that an erase should not be treated special on
> the bio layer, its ordering should be handled the same way we do it
> for writes.
>

To support the new ATA command (trim, or dataset management), the
suggested hint is not enough.
We have to send the bio with data (at least one sector or more), since
the new ATA command expects the dataset information as its payload.
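As a rough illustration of why payload data is needed: each range in the
dataset-management (trim) command is packed into an 8-byte entry with a
48-bit starting LBA and a 16-bit sector count (this is the layout as the
command was eventually specified; details in the 2007 proposal may
differ), and the entries are sent as ordinary 512-byte sectors of data.
The helper name below is made up.

static void fill_dsm_range(__le64 *entry, u64 lba, u16 nr_sectors)
{
	/* bits 0-47: starting LBA, bits 48-63: number of sectors to trim */
	*entry = cpu_to_le64((lba & 0x0000ffffffffffffULL) |
			     ((u64)nr_sectors << 48));
}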

We also have to strictly follow the ordering at the filesystem level,
using a barrier or some other method.
For example, take the delete operation in ext3:
1. delete some file
2. ext3_delete_inode() is called
3. ... -> ext3_free_blocks_sb() releases the free blocks
4. If the hint is sent here, it breaks the ext3 power-off recovery
scheme, because the device trims the data based on this information
even across a reboot
5. Only after the transaction, when all dirty pages have been flushed,
can the free blocks be trimmed safely (see the sketch after this list).
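A minimal sketch of the ordering constraint in steps 4 and 5.
ext3_free_blocks() is the real function; defer_trim_hint() and
issue_deferred_trim_hints() are hypothetical helpers, used only to show
where the hint must not be sent:

	/* steps 2-3: while the transaction is still running */
	ext3_free_blocks(handle, inode, block, count);
	/*
	 * Step 4 (wrong): issuing the trim hint here would let the device
	 * drop the data before the transaction commits, breaking power-off
	 * recovery.
	 */
	defer_trim_hint(handle, block, count);	/* hypothetical: just record the range */

	/*
	 * Step 5: only after the transaction has committed and dirty pages
	 * are flushed is it safe to really send the hints.
	 */
	issue_deferred_trim_hints(sb);		/* hypothetical */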

Another approach is to modify the block framework: at the I/O scheduler
level, don't merge the hint bio (in my terminology, a bio carrying
control info) with regular bios. In that case we also have to consider
the reordering problem.
I'm not sure that is possible at this time.

Thank you,
Kyungmin Park