Re: [PATCH 5/9] zram: pass queue_limits to blk_mq_alloc_disk

2024-02-19 Thread Sergey Senozhatsky
On (24/02/15 08:10), Christoph Hellwig wrote:
> 
> Pass the queue limits directly to blk_alloc_disk instead of setting them
> one at a time.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Sergey Senozhatsky 
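
(for context, the general shape of such a conversion - a minimal sketch only,
with the field selection and the post-conversion blk_alloc_disk() signature
assumed from the patch description rather than copied from the actual patch:

	struct queue_limits lim = {
		.logical_block_size	= ZRAM_LOGICAL_BLOCK_SIZE,
	};

	/* limits are applied at allocation time instead of via
	 * blk_queue_*() setters afterwards */
	zram->disk = blk_alloc_disk(&lim, NUMA_NO_NODE);
	if (IS_ERR(zram->disk))
		return PTR_ERR(zram->disk);
)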



Re: [PATCH v3, resend] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into <linux/blkdev.h>

2018-03-15 Thread Sergey Senozhatsky
On (03/14/18 15:48), Bart Van Assche wrote:
> Note: the SECTOR_SIZE / SECTOR_SHIFT / SECTOR_BITS definitions have
> not been removed from uapi header files nor from NAND drivers in
> which these constants are used for another purpose than converting
> block layer offsets and sizes into a number of sectors.
> 
> Signed-off-by: Bart Van Assche 
> Reviewed-by: Johannes Thumshirn 
> Reviewed-by: Martin K. Petersen 
> Cc: Sergey Senozhatsky 

FWIW,
Reviewed-by: Sergey Senozhatsky 

-ss


Re: [PATCH] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into <linux/blkdev.h>

2018-02-13 Thread Sergey Senozhatsky
On (02/12/18 11:05), Bart Van Assche wrote:
[..]
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index ac4740cf74be..cf17626604c2 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1026,14 +1026,25 @@ static inline int blk_rq_cur_bytes(const struct 
> request *rq)
>  
>  extern unsigned int blk_rq_err_bytes(const struct request *rq);
>  
> +/*
> + * Variables of type sector_t represent an offset or size that is a multiple 
> of
> + * 2**9 bytes. Hence these two constants.
> + */
> +#ifndef SECTOR_SHIFT
> +enum { SECTOR_SHIFT = 9 };
> +#endif
> +#ifndef SECTOR_SIZE
> +enum { SECTOR_SIZE = 512 };
> +#endif

Shouldn't SECTOR_SIZE depend on SECTOR_SHIFT?

1 << SECTOR_SHIFT
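
i.e. something along these lines (just a sketch of what I mean):

	#ifndef SECTOR_SHIFT
	enum { SECTOR_SHIFT = 9 };
	#endif
	#ifndef SECTOR_SIZE
	enum { SECTOR_SIZE = (1 << SECTOR_SHIFT) };
	#endif

so the two can never silently go out of sync.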

-ss


Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-30 Thread Sergey Senozhatsky
Hello Peter,

On (08/30/17 10:47), Peter Zijlstra wrote:
[..]
> On Wed, Aug 30, 2017 at 10:42:07AM +0200, Peter Zijlstra wrote:
> > 
> > So the overhead looks to be spread out over all sorts, which makes it
> > harder to find and fix.
> > 
> > stack unwinding is done lots and is fairly expensive, I've not yet
> > checked if crossrelease does too much of that.
> 
> Aah, we do an unconditional stack unwind for every __lock_acquire() now.
> It keeps a trace in the xhlocks[].
> 
> Does the below cure most of that overhead?

thanks.

mostly yes. the kernel is not so dramatically slower now. it still
seems to be a bit slower, which is expected I suppose, but overall
it's much better

real    1m35.209s
user    4m12.467s
sys     0m49.457s

// approx 10 seconds slower.

-ss


Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-29 Thread Sergey Senozhatsky
Hi,

On (08/30/17 14:43), Byungchul Park wrote:
[..]
> > notably slower than earlier 4.13 linux-next. (e.g. scrolling in vim
> > is irritatingly slow)
> 
> To Ingo,
> 
> I cannot decide if we have to roll back CONFIG_LOCKDEP_CROSSRELEASE
> dependency on CONFIG_PROVE_LOCKING in Kconfig. With them enabled,
> lockdep detection becomes strong but has performance impact. But,
> it's anyway a debug option so IMHO we don't have to take case of the
> performance impact. Please let me know your decision.

well, I expected it :)

I've been running lockdep-enabled kernels for years, and was OK with
the performance. but now it's just too much and I'm looking at disabling
lockdep.

a more relevant test -- compilation of a relatively small project

  LOCKDEP -CROSSRELEASE -COMPLETIONS     LOCKDEP +CROSSRELEASE +COMPLETIONS

   real    1m23.722s                      real    2m9.969s
   user    4m11.300s                      user    4m15.458s
   sys     0m49.386s                      sys     2m3.594s


you don't want to know how much time it now takes to recompile the
kernel ;)

-ss


Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-29 Thread Sergey Senozhatsky
On (08/23/17 09:03), Byungchul Park wrote:
[..]
> > Byungchul, did you add the crosslock checks to lockdep? Can you have a look 
> > at
> > the above report? That report namely doesn't make sense to me.
> 
> The report is talking about the following lockup:
> 
> A work in a worker                      A task work on exit to user
> ------------------                      ---------------------------
> mutex_lock(&bdev->bd_mutex)
>                                         mutex_lock(&bdev->bd_mutex)
> blk_execute_rq()
>    wait_for_completion_io_timeout(&A)
>                                         complete(&A)
> 
[..]
> To Peterz,
> 
> Anyway I wanted to avoid lockdep reports in the case using a timeout
> interface. Do you think it's still worth reporting the kind of lockup?
> I'm ok if you do.

Byungchul, a quick question.
have you measured the performance impact? somehow my linux-next is
notably slower than earlier 4.13 linux-next. (e.g. scrolling in vim
is irritatingly slow)


`time dmesg' shows some difference, but probably that's not a good
test.

         !LOCKDEP       LOCKDEP        LOCKDEP -CROSSRELEASE -COMPLETIONS
real     0m0.661s       0m2.290s       0m1.920s
user     0m0.010s       0m0.105s       0m0.000s
sys      0m0.636s       0m2.224s       0m1.888s

anyone else "sees"/"can confirm" the slowdown?


it gets back to "usual normal" when I disable CROSSRELEASE and COMPLETIONS.

---

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index b19c491cbc4e..cdc30ef81c5e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1091,8 +1091,6 @@ config PROVE_LOCKING
select DEBUG_MUTEXES
select DEBUG_RT_MUTEXES if RT_MUTEXES
select DEBUG_LOCK_ALLOC
-   select LOCKDEP_CROSSRELEASE
-   select LOCKDEP_COMPLETIONS
select TRACE_IRQFLAGS
default n
help

---

-ss


Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-23 Thread Sergey Senozhatsky
Hi,

On (08/24/17 12:39), Boqun Feng wrote:
> On Wed, Aug 23, 2017 at 02:55:17PM +0900, Sergey Senozhatsky wrote:
> > On (08/23/17 13:35), Boqun Feng wrote:
> > > > KERN_CONT and "\n" should not be together. "\n" flushes the cont
> > > > buffer immediately.
> > > > 
> > > 
> > > Hmm.. Not quite familiar with printk() stuffs, but I could see several
> > > usages of printk(KERN_CONT "...\n") in kernel.
> > > 
> > > Did a bit research myself, and I now think the inappropriate use is to
> > > use a KERN_CONT printk *after* another printk ending with a "\n".
> > 
> > ah... I didn't check __print_lock_name(): it leaves unflushed cont buffer
> > upon the return. sorry, your code is correct.
> > 
> 
> So means printk(KERN_CON "..."); + printk(KERN_CONT "...\n") is a
> correct usage, right?

well, yes. with one precondition - there should be no printk-s from other
CPUs/tasks in between

printk(KERN_CON "...");  +  printk(KERN_CONT "...\n")
   ^
here we can have a preliminary flush and broken
cont line. but it's been this way forever.

-ss


Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-22 Thread Sergey Senozhatsky
On (08/23/17 13:35), Boqun Feng wrote:
> > KERN_CONT and "\n" should not be together. "\n" flushes the cont
> > buffer immediately.
> > 
> 
> Hmm.. Not quite familiar with printk() stuffs, but I could see several
> usages of printk(KERN_CONT "...\n") in kernel.
> 
> Did a bit research myself, and I now think the inappropriate use is to
> use a KERN_CONT printk *after* another printk ending with a "\n".

ah... I didn't check __print_lock_name(): it leaves unflushed cont buffer
upon the return. sorry, your code is correct.

-ss

> > >   printk("\n *** DEADLOCK ***\n\n");
> > > + } else if (cross_lock(src->instance)) {
> > > + printk(" Possible unsafe locking scenario by crosslock:\n\n");
> > > + printk("   CPU0CPU1\n");
> > > + printk("   \n");
> > > + printk("  lock(");
> > > + __print_lock_name(target);
> > > + printk(KERN_CONT ");\n");
> > > + printk("  lock(");
> > > + __print_lock_name(source);
> > > + printk(KERN_CONT ");\n");
> > > + printk("   lock(");
> > > + __print_lock_name(parent == source ? target : parent);
> > > + printk(KERN_CONT ");\n");
> > > + printk("   unlock(");
> > > + __print_lock_name(source);
> > > + printk(KERN_CONT ");\n");
> > > + printk("\n *** DEADLOCK ***\n\n");
> > >   } else {
> > >   printk(" Possible unsafe locking scenario:\n\n");
> > >   printk("   CPU0CPU1\n");
> > > -- 
> > > 2.14.1
> > > 




Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-22 Thread Sergey Senozhatsky
On (08/23/17 13:35), Boqun Feng wrote:
[..]
> > >   printk(KERN_CONT ");\n");
> > 
> > KERN_CONT and "\n" should not be together. "\n" flushes the cont
> > buffer immediately.
> > 
> 
> Hmm.. Not quite familiar with printk() stuffs, but I could see several
> usages of printk(KERN_CONT "...\n") in kernel.
> 
> Did a bit research myself, and I now think the inappropriate use is to
> use a KERN_CONT printk *after* another printk ending with a "\n". Am I
> missing some recent changes or rules of KERN_CONT?

has been this way for quite some time (if not always).

LOG_NEWLINE results in cont_flush(), which log_store()-s the content
of the KERN_CONT buffer.

if we see that the supplied message has no \n then we store it in a
dedicated buffer (the cont buffer):

if (!(lflags & LOG_NEWLINE))
return cont_add();

return log_store();

we flush that buffer (move its content to the kernel log buffer) when
we receive a message with a \n or when printk() from another task/context
interrupts the current cont line and, thus, forces us to flush.
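
to illustrate both patterns (sketch):

	/* fine: the first part has no \n, so it goes into the cont buffer;
	 * the trailing \n in the second call flushes it (log_store) */
	printk(KERN_INFO "lock(");
	printk(KERN_CONT "foo);\n");

	/* problematic: the first message ends with \n and is flushed
	 * immediately, so the KERN_CONT below has nothing to continue */
	printk(KERN_INFO "lock(foo);\n");
	printk(KERN_CONT "bar\n");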

-ss


Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-22 Thread Sergey Senozhatsky
On (08/23/17 12:38), Boqun Feng wrote:
[..]
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 642fb5362507..a3709e15f609 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -1156,6 +1156,23 @@ print_circular_lock_scenario(struct held_lock *src,
>   __print_lock_name(target);
>   printk(KERN_CONT ");\n");

KERN_CONT and "\n" should not be together. "\n" flushes the cont
buffer immediately.

-ss

>   printk("\n *** DEADLOCK ***\n\n");
> + } else if (cross_lock(src->instance)) {
> + printk(" Possible unsafe locking scenario by crosslock:\n\n");
> + printk("   CPU0CPU1\n");
> + printk("   \n");
> + printk("  lock(");
> + __print_lock_name(target);
> + printk(KERN_CONT ");\n");
> + printk("  lock(");
> + __print_lock_name(source);
> + printk(KERN_CONT ");\n");
> + printk("   lock(");
> + __print_lock_name(parent == source ? target : parent);
> + printk(KERN_CONT ");\n");
> + printk("   unlock(");
> + __print_lock_name(source);
> + printk(KERN_CONT ");\n");
> + printk("\n *** DEADLOCK ***\n\n");
>   } else {
>   printk(" Possible unsafe locking scenario:\n\n");
>   printk("   CPU0CPU1\n");
> -- 
> 2.14.1
> 


Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-22 Thread Sergey Senozhatsky
On (08/23/17 09:03), Byungchul Park wrote:
[..]

aha, ok

> The report is talking about the following lockup:
> 
> A work in a worker                      A task work on exit to user
> ------------------                      ---------------------------
> mutex_lock(&bdev->bd_mutex)
>                                         mutex_lock(&bdev->bd_mutex)
> blk_execute_rq()
>    wait_for_completion_io_timeout(&A)
>                                         complete(&A)
> 
> Is this impossible?

I was really confused about how this "unlock" may lead to a deadlock

> > >  other info that might help us debug this:
> > >  Possible unsafe locking scenario by crosslock:
> > >CPU0CPU1
> > >
> > >   lock(&bdev->bd_mutex);
> > >   lock((complete)&wait#2);
> > >lock(&bdev->bd_mutex);
> > >unlock((complete)&wait#2);


any chance the report can be improved? mention timeout, etc?
// well, if this functionality stays.


p.s.
Bart Van Assche, thanks for Cc-ing Park Byungchul. I was really
sure I didn't enable cross-release, but apparently I was wrong:
 CONFIG_LOCKDEP_CROSSRELEASE=y
 CONFIG_LOCKDEP_COMPLETIONS=y

-ss


possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-22 Thread Sergey Senozhatsky
Hello,

==
WARNING: possible circular locking dependency detected
4.13.0-rc6-next-20170822-dbg-00020-g39758ed8aae0-dirty #1746 Not tainted
--
fsck.ext4/148 is trying to acquire lock:
 (&bdev->bd_mutex){+.+.}, at: [] __blkdev_put+0x33/0x190

 but now in release context of a crosslock acquired at the following:
 ((complete)&wait#2){+.+.}, at: [] blk_execute_rq+0xbb/0xda

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 ((complete)&wait#2){+.+.}:
   lock_acquire+0x176/0x19e
   __wait_for_common+0x50/0x1e3
   blk_execute_rq+0xbb/0xda
   scsi_execute+0xc3/0x17d [scsi_mod]
   sd_revalidate_disk+0x112/0x1549 [sd_mod]
   rescan_partitions+0x48/0x2c4
   __blkdev_get+0x14b/0x37c
   blkdev_get+0x191/0x2c0
   device_add_disk+0x2b4/0x3e5
   sd_probe_async+0xf8/0x17e [sd_mod]
   async_run_entry_fn+0x34/0xe0
   process_one_work+0x2af/0x4d1
   worker_thread+0x19a/0x24f
   kthread+0x133/0x13b
   ret_from_fork+0x27/0x40

 -> #0 (&bdev->bd_mutex){+.+.}:
   __blkdev_put+0x33/0x190
   blkdev_close+0x24/0x27
   __fput+0xee/0x18a
   task_work_run+0x79/0xa0
   prepare_exit_to_usermode+0x9b/0xb5

 other info that might help us debug this:
 Possible unsafe locking scenario by crosslock:
   CPU0CPU1
   
  lock(&bdev->bd_mutex);
  lock((complete)&wait#2);
   lock(&bdev->bd_mutex);
   unlock((complete)&wait#2);

  *** DEADLOCK ***
4 locks held by fsck.ext4/148:
 #0:  (&bdev->bd_mutex){+.+.}, at: [] __blkdev_put+0x33/0x190
 #1:  (rcu_read_lock){}, at: [] rcu_lock_acquire+0x0/0x20
 #2:  (&(&host->lock)->rlock){-.-.}, at: [] 
ata_scsi_queuecmd+0x23/0x74 [libata]
 #3:  (&x->wait#14){-...}, at: [] complete+0x18/0x50

 stack backtrace:
CPU: 1 PID: 148 Comm: fsck.ext4 Not tainted 
4.13.0-rc6-next-20170822-dbg-00020-g39758ed8aae0-dirty #1746
Call Trace:
 dump_stack+0x67/0x8e
 print_circular_bug+0x2a1/0x2af
 ? zap_class+0xc5/0xc5
 check_prev_add+0x76/0x20d
 ? __lock_acquire+0xc27/0xcc8
 lock_commit_crosslock+0x327/0x35e
 complete+0x24/0x50
 scsi_end_request+0x8d/0x176 [scsi_mod]
 scsi_io_completion+0x1be/0x423 [scsi_mod]
 __blk_mq_complete_request+0x112/0x131
 ata_scsi_simulate+0x212/0x218 [libata]
 __ata_scsi_queuecmd+0x1be/0x1de [libata]
 ata_scsi_queuecmd+0x41/0x74 [libata]
 scsi_dispatch_cmd+0x194/0x2af [scsi_mod]
 scsi_queue_rq+0x1e0/0x26f [scsi_mod]
 blk_mq_dispatch_rq_list+0x193/0x2a7
 ? _raw_spin_unlock+0x2e/0x40
 blk_mq_sched_dispatch_requests+0x132/0x176
 __blk_mq_run_hw_queue+0x59/0xc5
 __blk_mq_delay_run_hw_queue+0x5f/0xc1
 blk_mq_flush_plug_list+0xfc/0x10b
 blk_flush_plug_list+0xc6/0x1eb
 blk_finish_plug+0x25/0x32
 generic_writepages+0x56/0x63
 do_writepages+0x36/0x70
 __filemap_fdatawrite_range+0x59/0x5f
 filemap_write_and_wait+0x19/0x4f
 __blkdev_put+0x5f/0x190
 blkdev_close+0x24/0x27
 __fput+0xee/0x18a
 task_work_run+0x79/0xa0
 prepare_exit_to_usermode+0x9b/0xb5
 entry_SYSCALL_64_fastpath+0xab/0xad
RIP: 0033:0x7ff5755a2f74
RSP: 002b:7ffe46fce038 EFLAGS: 0246 ORIG_RAX: 0003
RAX:  RBX: 555ddeddded0 RCX: 7ff5755a2f74
RDX: 1000 RSI: 555ddede2580 RDI: 0004
RBP:  R08: 555ddede2580 R09: 555ddedde080
R10: 00010800 R11: 0246 R12: 555ddedddfa0
R13: 7ff576523680 R14: 1000 R15: 555ddeddc2b0

-ss


Re: [PATCH 2/2] zram: do not count duplicated pages as compressed

2017-05-17 Thread Sergey Senozhatsky
Hello Minchan,

On (05/17/17 17:32), Minchan Kim wrote:
[..]
> > what we can return now is a `partially updated' data, with some new
> > and some stale pages. this is quite unlikely to end up anywhere good.
> > am I wrong?
> > 
> > why does `rd block 4' in your case causes Oops? as a worst case scenario?
> > application does not expect page to be 'all A' at this point. pages are
> > likely to belong to some mappings/files/etc., and there is likely a data
> > dependency between them, dunno C++ objects that span across pages or
> > JPEG images, etc. so returning "new data   new data   stale data" is a bit
> > fishy.
> 
> I thought more about it and start to confuse. :/

sorry, I'm not sure I see what the source of your confusion is :)

my point is - we should not let a READ succeed if we know that the WRITE
failed. take a JPEG image as an example:


over-write block 1 aaa->xxx OK
over-write block 2 bbb->yyy OK
over-write block 3 ccc->zzz error

reading that JPEG file

read block 1 xxx OK
read block 2 yyy OK
read block 3 ccc OK   << we should not return OK here, because
                         "xxxyyyccc" is not the correct JPEG file
                         anyway.

do you agree that telling the application that read() succeeded and at
the same time returning corrupted "xxxyyyccc" instead of "xxxyyyzzz" is
not correct?



so how about this,

- if we fail to compress a page (S/W or H/W compressor error, depending
  on the particular setup) let's store it uncompressed (a PAGE_SIZE-d
  zspool object).

?

this should do the trick. at least we will have correct data:
xxx  - compressed
yyy  - compressed
zzz  - uncompressed, because the compression back-end returned an error.
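
a pseudo-ish sketch of the write path change I have in mind (not against
any particular tree, names are approximate, error handling/kunmap omitted):

	ret = zcomp_compress(zstrm, src, &comp_len);
	if (unlikely(ret)) {
		/* S/W or H/W compressor error: keep the data anyway and
		 * store the page as-is in a PAGE_SIZE-d zspool object,
		 * so a later read returns what was actually written */
		comp_len = PAGE_SIZE;
		src = kmap_atomic(page);
	}

	handle = zs_malloc(meta->mem_pool, comp_len, GFP_NOIO);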

thoughts?

-ss


Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses

2017-03-06 Thread Sergey Senozhatsky
On (03/06/17 11:23), Johannes Thumshirn wrote:
> zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
> the NVMe over Fabrics loopback target which potentially sends a huge bulk of
> pages attached to the bio's bvec this results in a kernel panic because of
> array out of bounds accesses in zram_decompress_page().
> 
> Signed-off-by: Johannes Thumshirn 

Cc Andrew
Link: lkml.kernel.org/r/20170306102335.9180-1-jthumsh...@suse.de


Reviewed-by: Sergey Senozhatsky 

thanks!
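
(for context: as I read it, the fix boils down to telling the block layer
to never feed zram more than SECTORS_PER_PAGE per request, conceptually
something along the lines of

	blk_queue_max_hw_sectors(zram->disk->queue, SECTORS_PER_PAGE);

this is just a reminder of the general direction, not the actual hunk from
the patch above.)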

-ss


Re: [RFC] blk: increase logical_block_size to unsigned int

2017-01-09 Thread Sergey Senozhatsky
On (01/09/17 14:04), Minchan Kim wrote:
> Mostly, zram is used as swap system on embedded world so it want to do IO
> as PAGE_SIZE aligned/size IO unit. For that, one of the problem was
> blk_queue_logical_block_size(zram->disk->queue, PAGE_SIZE) made overflow
> in *64K page system* so [1] changed it to constant 4096.
> Since that, partial IO can happen so zram should handle it which makes zram
> complicated[2].
> 

I thought that zram partial IO support is there because some file
systems (FAT, for example) cannot cope with a large logical_block_size.
am I wrong?

-ss


Re: [PATCH 2/3] zram: support page-based parallel write

2016-10-20 Thread Sergey Senozhatsky
Hello Minchan,

On (10/17/16 14:04), Minchan Kim wrote:
> Hi Sergey,
> 
> On Fri, Oct 07, 2016 at 03:33:22PM +0900, Minchan Kim wrote:
> 
> < snip >
> 
> > > so the question is -- can we move this parallelization out of zram
> > > and instead flush bdi in more than one kthread? how bad that would
> > > be? can anyone else benefit from this?
> > 
> > Isn't it blk-mq you mentioned? With blk-mq, I have some concerns.
> > 
> > 1. read speed degradation
> > 2. no work with rw_page
> > 3. more memory footprint by bio/request queue allocation
> > 
> > Having said, it's worth to look into it in detail more.
> > I will have time to see that approach to know what I can do
> > with that.
> 
> queue_mode=2 bs=4096 nr_devices=1 submit_queues=4 hw_queue_depth=128
> 
> Last week, I played with null_blk and blk-mq.c to get an idea how
> blk-mq works and I realized it's not good for zram because it aims
> to solve 1) dispatch queue bottleneck 2) cache-friendly IO completion
> through IRQ so 3) avoids remote memory accesses.
> 
> For zram which is used for embedded as primary purpose, ones listed
> abvoe are not a severe problem. Most imporant thing is there is no
> model to support that a process queueing IO request on *a* CPU while
> other CPUs issues the queued IO to driver.
> 
> Anyway, Although blk-mrq can support that model, it is blk-layer thing.
> IOW, it's software stuff for fast IO delievry but what we need is
> device parallelism of zram itself. So, although we follow blk-mq,
> we still need multiple threads to compress in parallel which is most of
> code I wrote in this patchset.

yes. but at least wb can be multi-threaded. well, sort of. seems like.
sometimes.

> If I cannot get huge benefit(e.g., reduce a lot of zram-speicif code
> to support such model) with blk-mq, I don't feel to switch to request
> model at the cost of reasons I stated above.

thanks.
I'm looking at your patches.

-ss


Re: [PATCH 2/3] zram: support page-based parallel write

2016-10-09 Thread Sergey Senozhatsky
Hello Minchan,

On (10/07/16 15:33), Minchan Kim wrote:
[..]
> > as soon as wb flush kworker can't keep up anymore things are going off
> > the rails. most of the time, fio-template-static-buffer are in D state,
> > while the biggest bdi flush kworker is doing the job (a lot of job):
> > 
> >   PID USER  PR  NIVIRTRES  %CPU %MEM TIME+ S COMMAND
> >  6274 root  20   00.0m   0.0m 100.0  0.0   1:15.60 R [kworker/u8:1]
> > 11169 root  20   0  718.1m   1.6m  16.6  0.0   0:01.88 D fio 
> > ././conf/fio-template-static-buffer
> > 11171 root  20   0  718.1m   1.6m   3.3  0.0   0:01.15 D fio 
> > ././conf/fio-template-static-buffer
> > 11170 root  20   0  718.1m   3.3m   2.6  0.1   0:00.98 D fio 
> > ././conf/fio-template-static-buffer
> > 
> > 
> > and still working...
> > 
> >  6274 root  20   00.0m   0.0m 100.0  0.0   3:05.49 R [kworker/u8:1]
> > 12048 root  20   0  718.1m   1.6m  16.7  0.0   0:01.80 R fio 
> > ././conf/fio-template-static-buffer
> > 12047 root  20   0  718.1m   1.6m   3.3  0.0   0:01.12 D fio 
> > ././conf/fio-template-static-buffer
> > 12049 root  20   0  718.1m   1.6m   3.3  0.0   0:01.12 D fio 
> > ././conf/fio-template-static-buffer
> > 12050 root  20   0  718.1m   1.6m   2.0  0.0   0:00.98 D fio 
> > ././conf/fio-template-static-buffer
> > 
> > and working...
[..]
> Isn't it blk-mq you mentioned? With blk-mq, I have some concerns.
> 
> 1. read speed degradation
> 2. no work with rw_page
> 3. more memory footprint by bio/request queue allocation

yes, I did. and I've seen your concerns in another email - I
just don't have enough knowledge at the moment to say something
not entirely stupid. gotta look more at the whole thing.

> Having said, it's worth to look into it in detail more.
> I will have time to see that approach to know what I can do
> with that.

thanks a lot!
will keep looking as well.

-ss


Re: [PATCH 2/3] zram: support page-based parallel write

2016-10-06 Thread Sergey Senozhatsky
Hello Minchan,

On (10/05/16 11:01), Minchan Kim wrote:
[..]
> 1. just changed ordering of test execution - hope to reduce testing time due 
> to
>block population before the first reading or reading just zero pages
> 2. used sync_on_close instead of direct io
> 3. Don't use perf to avoid noise
> 4. echo 0 > /sys/block/zram0/use_aio to test synchronous IO for old behavior

ok, will use it in the tests below.

> 1. ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=async FIO_LOOPS=2 MAX_ITER=1 
> ./zram-fio-test.sh
> 2. modify script to disable aio via /sys/block/zram0/use_aio
>ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=sync FIO_LOOPS=2 MAX_ITER=1 
> ./zram-fio-test.sh
>
>   seq-write 380930 474325 124.52%
>  rand-write 286183 357469 124.91%
>seq-read 266813 265731  99.59%
>   rand-read 211747 210670  99.49%
>mixed-seq(R) 145750 171232 117.48%
>mixed-seq(W) 145736 171215 117.48%
>   mixed-rand(R) 115355 125239 108.57%
>   mixed-rand(W) 115371 125256 108.57%

no_aio   use_aio

WRITE:  1432.9MB/s   1511.5MB/s
WRITE:  1173.9MB/s   1186.9MB/s
READ:   912699KB/s   912170KB/s
WRITE:  912497KB/s   911968KB/s
READ:   725658KB/s   726747KB/s
READ:   579003KB/s   594543KB/s
READ:   373276KB/s   373719KB/s
WRITE:  373572KB/s   374016KB/s

seconds elapsed        45.399702511     44.280199716

> LZO compression is fast and a CPU for queueing while 3 CPU for compressing
> it cannot saturate CPU full bandwidth. Nonetheless, it shows 24% enhancement.
> It could be more in slow CPU like embedded.
> 
> I tested it with deflate. The result is 300% enhancement.
> 
>   seq-write  33598 109882 327.05%
>  rand-write  32815 102293 311.73%
>seq-read 154323 153765  99.64%
>   rand-read 129978 129241  99.43%
>mixed-seq(R)  15887  44995 283.22%
>mixed-seq(W)  15885  44990 283.22%
>   mixed-rand(R)  25074  55491 221.31%
>   mixed-rand(W)  25078  55499 221.31%
>
> So, curious with your test.
> Am my test sync with yours? If you cannot see enhancment in job1, could
> you test with deflate? It seems your CPU is really fast.

interesting observation.

        no_aio         use_aio
WRITE:  47882KB/s      158931KB/s
WRITE:  47714KB/s      156484KB/s
READ:   42914KB/s      137997KB/s
WRITE:  42904KB/s      137967KB/s
READ:   333764KB/s     332828KB/s
READ:   293883KB/s     294709KB/s
READ:   51243KB/s      129701KB/s
WRITE:  51284KB/s      129804KB/s

seconds elapsed        480.869169882    181.678431855

yes, looks like with lzo the CPU manages to process bdi writeback fast enough
to keep the fio-template-static-buffer workers active.

to prove this theory: direct=1 cures zram-deflate.

        no_aio         use_aio
WRITE:  41873KB/s      34257KB/s
WRITE:  41455KB/s      34087KB/s
READ:   36705KB/s      28960KB/s
WRITE:  36697KB/s      28954KB/s
READ:   327902KB/s     327270KB/s
READ:   316217KB/s     316886KB/s
READ:   35980KB/s      28131KB/s
WRITE:  36008KB/s      28153KB/s

seconds elapsed        515.575252170    629.114626795



as soon as the wb flush kworker can't keep up anymore, things go off
the rails. most of the time, the fio-template-static-buffer workers are in
D state, while the biggest bdi flush kworker is doing the job (a lot of it):

  PID USER  PR  NIVIRTRES  %CPU %MEM TIME+ S COMMAND
 6274 root  20   00.0m   0.0m 100.0  0.0   1:15.60 R [kworker/u8:1]
11169 root  20   0  718.1m   1.6m  16.6  0.0   0:01.88 D fio 
././conf/fio-template-static-buffer
11171 root  20   0  718.1m   1.6m   3.3  0.0   0:01.15 D fio 
././conf/fio-template-static-buffer
11170 root  20   0  718.1m   3.3m   2.6  0.1   0:00.98 D fio 
././conf/fio-template-static-buffer


and still working...

 6274 root  20   00.0m   0.0m 100.0  0.0   3:05.49 R [kworker/u8:1]
12048 root  20   0  718.1m   1.6m  16.7  0.0   0:01.80 R fio 
././conf/fio-template-static-buffer
12047 root  20   0  718.1m   1.6m   3.3  0.0   0:01.12 D fio 
././conf/fio-template-static-buffer
12049 root  20   0  718.1m   1.6m   3.3  0.0   0:01.12 D fio 
././conf/fio-template-static-buffer
12050 root  20   0  718.1m   1.6m   2.0  0.0   0:00.98 D fio 
././conf/fio-template-static-buffer

and working...


[ 4159.338731] CPU: 0 PID: 105 Comm: kworker/u8:4
[ 4159.338734] Workqueue: writeback wb_workfn (flush-254:0)
[ 4159.338746]  [] zram_make_request+0x4a3/0x67b [zram]
[ 4159.338748]  [] ? try_to_wake_up+0x201/0x213
[ 4159.338750]  [] ? mempool_alloc+0x5e/0x124
[ 4159.338752]  [] generic_make_request+0xb8/0x156
[ 4159.

Re: [PATCH 15/15] block: Add FIXME comment to handle device_add_disk error

2016-08-17 Thread Sergey Senozhatsky
On (08/17/16 15:15), Fam Zheng wrote:
[..]
>   (
> rc = device_add_disk(e1, e2, e3);
>   |
>   + /* FIXME: handle error. */
> device_add_disk(e1, e2, e3);

or use __must_check for the device_add_disk() function?
/* which is __attribute__((warn_unused_result)) */
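
i.e. something like this (sketch, using the 3-argument prototype from this
series):

	int __must_check device_add_disk(struct device *parent,
					 struct gendisk *disk,
					 const struct attribute_group *attr_group);

then every caller that ignores the return value gets a -Wunused-result
warning at compile time instead of a /* FIXME */ comment.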

-ss


Re: [PATCH 11/15] zram: Pass attribute group to device_add_disk

2016-08-17 Thread Sergey Senozhatsky
On (08/18/16 10:59), Sergey Senozhatsky wrote:
[..]
> I like the previous "Error creating sysfs group for device" string better,
> than "Error creating disk", because the latter one is much less informative.
> 
> do you want to do something like below?
> 
> int device_add_disk(struct device *parent, struct gendisk *disk,
> ...
>if (attr_group) {
>retval = sysfs_create_group(&disk_to_dev(disk)->kobj,
>attr_group);
> 
> + pr_err("Error creating sysfs group for device ...\n", ...);

d'oh... a typo. pr_err() is meant to be in the `if (retval)' branch.

>if (retval)
>goto fail;
>}
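
IOW, something like this (sketch; printing disk_name here is just an example
of what the message could include):

	if (attr_group) {
		retval = sysfs_create_group(&disk_to_dev(disk)->kobj,
					    attr_group);
		if (retval) {
			pr_err("Error creating sysfs group for device %s\n",
			       disk->disk_name);
			goto fail;
		}
	}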

-ss


Re: [PATCH 11/15] zram: Pass attribute group to device_add_disk

2016-08-17 Thread Sergey Senozhatsky
Hello,

On (08/17/16 15:15), Fam Zheng wrote:
[..]
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 20920a2..2331788 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1298,13 +1298,10 @@ static int zram_add(void)
>   zram->disk->queue->limits.discard_zeroes_data = 0;
>   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
>  
> - device_add_disk(NULL, zram->disk, NULL);
> + ret = device_add_disk(NULL, zram->disk, &zram_disk_attr_group);
>  
> - ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
> - &zram_disk_attr_group);
>   if (ret < 0) {
> - pr_err("Error creating sysfs group for device %d\n",
> - device_id);
> + pr_err("Error creating disk %d\n", device_id);
>   goto out_free_disk;
>   }
>   strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));

I like the previous "Error creating sysfs group for device" string better
than "Error creating disk", because the latter one is much less informative.

do you want to do something like below?

int device_add_disk(struct device *parent, struct gendisk *disk,
...
   if (attr_group) {
   retval = sysfs_create_group(&disk_to_dev(disk)->kobj,
   attr_group);

+   pr_err("Error creating sysfs group for device ...\n", ...);

   if (retval)
   goto fail;
   }

-ss