Re: [Devel] [PATCH 2/3] ploop: deadcode cleanup

2016-06-15 Thread Maxim Patlasov

Acked-by: Maxim Patlasov 

On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:

The (rw & REQ_FUA) branch is impossible because REQ_FUA was cleared one line above.
The logic was moved to ploop_req_delay_fua_possible() a long time ago.

Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/io_direct.c | 9 -
  1 file changed, 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 74a554a..10d2314 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -108,15 +108,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
rw &= ~(REQ_FLUSH | REQ_FUA);
  
  
-	/* In case of eng_state != COMPLETE, we'll do FUA in
-	 * ploop_index_update(). Otherwise, we should mark
-	 * last bio as FUA here. */
-   if (rw & REQ_FUA) {
-   rw &= ~REQ_FUA;
-   if (preq->eng_state == PLOOP_E_COMPLETE)
-   postfua = 1;
-   }
-
bio_list_init(&bl);
  
  	if (iblk == PLOOP_ZERO_INDEX)




Re: [Devel] [PATCH 1/3] ploop: skip redundant fsync for REQ_FUA in post_submit

2016-06-15 Thread Maxim Patlasov

ACKed, but see a minor nit below

On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:

Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/io_direct.c | 22 +-
  1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index b844a80..74a554a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -517,16 +517,18 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
struct ploop_device *plo = preq->plo;
sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
loff_t clu_siz = 1 << (preq->plo->cluster_log + 9);
+   int force_sync = preq->req_rw & REQ_FUA;
int err;
  
  	file_start_write(io->files.file);
  
-	/* Here io->io_count is even ... */
-	spin_lock_irq(&plo->lock);
-   io->io_count++;
-   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
-   spin_unlock_irq(&plo->lock);
-
+   if (!force_sync) {
+   /* Here io->io_count is even ... */
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
+   spin_unlock_irq(&plo->lock);
+   }
err = io->files.file->f_op->fallocate(io->files.file,
  FALLOC_FL_CONVERT_UNWRITTEN,
  (loff_t)sec << 9, clu_siz);
@@ -535,9 +537,11 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
if (!err && (preq->req_rw & REQ_FUA))


s/(preq->req_rw & REQ_FUA)/force_sync/

Thanks,
Max


err = io->ops->sync(io);
  
-	spin_lock_irq(&plo->lock);
-	io->io_count++;
-	spin_unlock_irq(&plo->lock);
+   if (!force_sync) {
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   spin_unlock_irq(&plo->lock);
+   }
/* and here io->io_count is even (+2) again. */
  
  	file_end_write(io->files.file);
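
With the nit applied, the check becomes (this is also how the hunk reads
in the respin of this patch later in the thread):

	/* highly unlikely case: FUA coming to a block not provisioned yet */
	if (!err && force_sync)
		err = io->ops->sync(io);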




Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

2016-06-15 Thread Maxim Patlasov

Dima,

I agree that the ploop barrier code is broken in many ways, but I don't 
think the patch actually fixes it. I hope you would agree that the 
completion of REQ_FUA guarantees only that that particular bio has landed 
on the disk; it says nothing about flushing previously submitted (and 
completed) bio-s, and it is also possible that a power outage catches us 
when this REQ_FUA has already landed on the disk but previous bio-s have 
not yet.


Hence, for RELOC_{A|S} requests we actually need something like that:

 RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
 RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA

(i.e. we do need to flush all previously submitted data before starting 
to update BAT on disk)


not simply:


RELOC_S: R1, W2, WBI:FUA
RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA


Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and 
PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we 
could remove them completely (along with that optimization delaying 
incoming FUA) and re-implement all this stuff from scratch:


1) The final "NULLIFY:FUA" is a piece of cake -- it's enough to set 
REQ_FUA in preq->req_rw before calling ->submit(preq)


2) For "FLUSH:WB, WBI:FUA" it is actually enough to send the bio updating 
the BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for 
RELOC_A|S in ploop_index_update and map_wb_complete


3) For that optimization delaying incoming FUA (what we do now if 
ploop_req_delay_fua_possible() returns true) we could introduce a new 
ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update 
and map_wb_complete (the same thing as 2) above). And, yes, let's 
WARN_ON if we somehow missed its processing.
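
To illustrate item 1), a minimal sketch (names per the call traces in
this thread; treat it as pseudocode rather than a tested change):

	/* Before the final NULLIFY write of a RELOC_A request, tag the
	 * request itself; the engine then sees FUA on the ordinary
	 * ->submit(preq) path. */
	if (preq->eng_state == PLOOP_E_RELOC_NULLIFY)
		preq->req_rw |= REQ_FUA;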


The only complication I foresee is about how to teach kaio to pre-flush 
in kaio_write_page -- it's doable, but involves kaio_resubmit that's 
already pretty convoluted.


Btw, I accidentally noticed an awfully silly bug in kaio_complete_io_state(): 
we check for REQ_FUA after clearing it! This makes all FUA-s on the 
ordinary kaio_submit path silently lost...


Thanks,
Maxim


On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:

barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
But a request can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
   ->delta->allocate
  ->io->submit_alloc: dio_submit_alloc
->dio_submit_pad
E_DATA_WBI : data written, time to update index
   ->delta->allocate_complete:ploop_index_update
 ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
 ->write_page
 ->ploop_map_wb_complete
   ->ploop_wb_complete_post_process
 ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:

->submit()

This patch unifies barrier handling as follows:
- Add an assertion to ploop_complete_request for FORCE_{FLUSH,FUA} state
- Perform an explicit FUA inside index_update for RELOC requests.

This makes the reloc sequence optimal:
RELOC_S: R1, W2, WBI:FUA
RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/dev.c | 10 +++---
  drivers/block/ploop/map.c | 29 -
  2 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..998fe71 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,11 @@ static void ploop_complete_request(struct ploop_request * preq)
  
  	__TRACE("Z %p %u\n", preq, preq->req_cluster);
  
+	if (!preq->error) {
+		unsigned long state = READ_ONCE(preq->state);
+		WARN_ON(state & (1 << PLOOP_REQ_FORCE_FUA));
+		WARN_ON(state & (1 << PLOOP_REQ_FORCE_FLUSH));
+	}
 	while (preq->bl.head) {
 		struct bio * bio = preq->bl.head;
 		preq->bl.head = bio->bi_next;
@@ -2530,9 +2535,8 @@ restart:
top_delta = ploop_top_delta(plo);
sbl.head = sbl.tail = preq->aux_bio;
  
-		/* Relocated data write required sync before BAT updatee */
-		set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+		/* Relocated data write required sync before BAT updatee
+		 * this will happen inside index_update */
if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
plo->st.bio_out++;
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 3a6365d..c17e598 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -896,6 +896,7 @@ void ploop_index_update(struct ploop_request * preq)
struct ploop_device * plo = preq->plo;
struct map_node * m = preq->map;
 

Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

2016-06-16 Thread Maxim Patlasov

On 06/16/2016 09:30 AM, Dmitry Monakhov wrote:

Dmitry Monakhov  writes:


Maxim Patlasov  writes:


Dima,

I agree that the ploop barrier code is broken in many ways, but I don't
think the patch actually fixes it. I hope you would agree that the
completion of REQ_FUA guarantees only that that particular bio has landed
on the disk; it says nothing about flushing previously submitted (and
completed) bio-s, and it is also possible that a power outage catches us
when this REQ_FUA has already landed on the disk but previous bio-s have
not yet.

Actually it does (but implicitly): Linux handles FUA as FLUSH,W,FLUSH.
So yes, it would be more correct to tag WBI with FLUSH_FUA.

Hence, for RELOC_{A|S} requests we actually need something like that:

   RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
   RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA

(i.e. we do need to flush all previously submitted data before starting
to update BAT on disk)


Correct sequence:
RELOC_S: R1, W2, WBI:FLUSH_FUA
RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA


not simply:


RELOC_S: R1, W2, WBI:FUA
RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA

Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and
PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we
could remove them completely (along with that optimization delaying
incoming FUA) and re-implement all this stuff from scratch:

1) The final "NULLIFY:FUA" is a piece of cake -- it's enough to set
REQ_FUA in preq->req_rw before calling ->submit(preq)

2) For "FLUSH:WB, WBI:FUA" it is actually enough to send the bio updating
the BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for
RELOC_A|S in ploop_index_update and map_wb_complete

3) For that optimization delaying incoming FUA (what we do now if
ploop_req_delay_fua_possible() returns true) we could introduce a new
ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update
and map_wb_complete (the same thing as 2) above). And, yes, let's
WARN_ON if we somehow missed its processing.

Yes. This was one of my ideas.
1) FORCE_FLUSH, FORCE_FUA are redundant states which simply mirror
RELOC_{A,S} semantics. Let's get rid of that crap and simply introduce
PLOOP_IO_FLUSH_DELAYED.
2) Fix ->write_page to handle flush as it does with fua.

The only complication I foresee is about how to teach kaio to pre-flush
in kaio_write_page -- it's doable, but involves kaio_resubmit that's
already pretty convoluted.


Yes. kio_submit is correct, but kaio_write_page does not care about REQ_FLUSH.

Crap. Currently kaio can handle fsync only via kaio_queue_fsync_req,
which is async and not suitable for page_write.


I think it's doable to process page_write via kaio_fsync_thread, but 
it's tricky.



Max, let's make an agreement about terminology.
The reason I wrote this is that Linux internally interprets FUA as
preflush,write,postflush, which is wrong from an academic point of view,
but it is the world we live in.


Are you sure that this  (FUA == preflush,write,postflush) is universally 
true (i.e. no exceptions)? What about bio-based block-device drivers?



This is the reason I read the code
differently from the way it was designed.
Let's state that ploop is an ideal world where:
FLUSH ==> preflush
FUA   ==> WRITE,postflush


In an ideal world FUA is not obliged to be handled by a postflush: it's enough 
to guarantee that *this* particular request went to the platter; other 
requests may remain not-flushed-yet. 
Documentation/block/writeback_cache_control.txt is absolutely clear 
about it:


The REQ_FUA flag can be ORed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is only
signaled after the data has been committed to non-volatile storage.
...
If the FUA bit is not natively supported the block
layer turns it into an empty REQ_FLUSH request after the actual write.
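
In the bio API used throughout these patches (the old submit_bio(rw, bio)
signature), the distinction looks like this -- illustrative only:

	/* FUA alone: only *this* write must be durable before completion
	 * is signaled; earlier completed writes may still sit in cache. */
	submit_bio(WRITE | REQ_FUA, bio);

	/* To also order against previously completed writes, an explicit
	 * preflush is needed: */
	submit_bio(WRITE | REQ_FLUSH | REQ_FUA, bio);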




For this reason we can perform the reloc scheme as:

RELOC_A: R1,W2:FUA,WBI:FUA,W1:NULLIFY|FUA
RELOC_S: R1,W2:FUA,WBI:FUA

This allows us to handle FUA effectively and convert it to DELAYED_FLUSH where
possible.


Ploop has the concept of map_multi_updates. In short, while you're 
handling one update, many others may come to the PLOOP_E_INDEX_DELAY state. 
And, as soon as the first one is done, we modify many indices in one 
loop (see map_wb_complete), then write that page to disk only once. 
Having map_multi_update in mind, it may be suboptimal to make many 
W2:FUA-s -- it may be better to do many ordinary W2-s instead, and only 
one pre-FLUSH later -- when we're going to write the BAT page to disk.
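
Schematically (a sketch only; data_bio1/bat_page_bio are illustrative
names, not the actual ploop code):

	/* data writes of the batch go out as plain writes (W2, no FUA) */
	submit_bio(WRITE, data_bio1);
	submit_bio(WRITE, data_bio2);
	/* the single BAT page write then carries the barrier for all of
	 * them: the preflush covers the completed W2-s, FUA covers the
	 * index page itself */
	submit_bio(WRITE | REQ_FLUSH | REQ_FUA, bat_page_bio);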



Also let's clarify may_fua_delay semantics to the exact eng_state:

may_fua_delay {

   int may_delay = 1;
   /* effectively this is equivalent to
      preq->eng_state != PLOOP_E_COMPLETE,
      but it is more readable and less error-prone in the future */
   if (preq->eng_state != PLOOP_E_DATA_WBI)
       may_delay = 0;
   if ((test_bi

[Devel] [PATCH rh7] ploop: fix counting bio_qlen

2016-06-16 Thread Maxim Patlasov
The commit ec1eeb868 (May 22 2015) ported the "separate queue for discard bio"
patch from the RHEL6-based kernel incorrectly. The original patch stated clearly
that if we want to decrement bio_discard_qlen, bio_qlen must not change:

@@ -500,7 +502,7 @@ ploop_bio_queue(struct ploop_device * pl
(err = ploop_discard_add_bio(plo->fbd, bio))) {
BIO_ENDIO(bio, err);
list_add(&preq->list, &plo->free_list);
-   plo->bio_qlen--;
+   plo->bio_discard_qlen--;
plo->bio_total--;
return;
}

but that port did the opposite:

@@ -521,6 +523,7 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * bio,
BIO_ENDIO(plo->queue, bio, err);
list_add(&preq->list, &plo->free_list);
plo->bio_qlen--;
+   plo->bio_discard_qlen--;
plo->bio_total--;
    return;
}

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index db55be3..e1fbfcf 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -523,7 +523,6 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * bio,
}
BIO_ENDIO(plo->queue, bio, err);
list_add(&preq->list, &plo->free_list);
-   plo->bio_qlen--;
plo->bio_discard_qlen--;
plo->bio_total--;
return;



[Devel] [PATCH rh7] ploop: io_kaio: fix silly bug in kaio_complete_io_state()

2016-06-16 Thread Maxim Patlasov
It's useless to check for preq->req_rw & REQ_FUA after:
preq->req_rw &= ~REQ_FUA;

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/io_kaio.c |2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index 79aa9af..de26319 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -71,8 +71,6 @@ static void kaio_complete_io_state(struct ploop_request * preq)
return;
}
 
-   preq->req_rw &= ~REQ_FUA;
-
/* Convert requested fua to fsync */
if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state))



Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

2016-06-20 Thread Maxim Patlasov

Dima,

On 06/19/2016 06:06 AM, Dmitry Monakhov wrote:

Maxim Patlasov  writes:


On 06/16/2016 09:30 AM, Dmitry Monakhov wrote:

Dmitry Monakhov  writes:


Maxim Patlasov  writes:


Dima,

I agree that the ploop barrier code is broken in many ways, but I don't
think the patch actually fixes it. I hope you would agree that the
completion of REQ_FUA guarantees only that that particular bio has landed
on the disk; it says nothing about flushing previously submitted (and
completed) bio-s, and it is also possible that a power outage catches us
when this REQ_FUA has already landed on the disk but previous bio-s have
not yet.

Actually it does (but implicitly): Linux handles FUA as FLUSH,W,FLUSH.
So yes, it would be more correct to tag WBI with FLUSH_FUA.

Hence, for RELOC_{A|S} requests we actually need something like that:

RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA

(i.e. we do need to flush all previously submitted data before starting
to update BAT on disk)


Correct sequence:
RELOC_S: R1, W2, WBI:FLUSH_FUA
RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA


not simply:


RELOC_S: R1, W2, WBI:FUA
RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA

Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and
PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we
could remove them completely (along with that optimization delaying
incoming FUA) and re-implement all this stuff from scratch:

1) The final "NULLIFY:FUA" is a piece of cake -- it's enough to set
REQ_FUA in preq->req_rw before calling ->submit(preq)

2) For "FLUSH:WB, WBI:FUA" it is actually enough to send the bio updating
the BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for
RELOC_A|S in ploop_index_update and map_wb_complete

3) For that optimization delaying incoming FUA (what we do now if
ploop_req_delay_fua_possible() returns true) we could introduce a new
ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update
and map_wb_complete (the same thing as 2) above). And, yes, let's
WARN_ON if we somehow missed its processing.

Yes. This was one of my ideas.
1) FORCE_FLUSH, FORCE_FUA are redundant states which simply mirror
RELOC_{A,S} semantics. Let's get rid of that crap and simply introduce
PLOOP_IO_FLUSH_DELAYED.
2) Fix ->write_page to handle flush as it does with fua.

The only complication I foresee is about how to teach kaio to pre-flush
in kaio_write_page -- it's doable, but involves kaio_resubmit that's
already pretty convoluted.


Yes. kio_submit is correct, but kaio_write_page does not care about REQ_FLUSH.

Crap. Currently kaio can handle fsync only via kaio_queue_fsync_req,
which is async and not suitable for page_write.

I think it's doable to process page_write via kaio_fsync_thread, but
it's tricky.


Max, let's make an agreement about terminology.
The reason I wrote this is that Linux internally interprets FUA as
preflush,write,postflush, which is wrong from an academic point of view,
but it is the world we live in.

Are you sure that this  (FUA == preflush,write,postflush) is universally
true (i.e. no exceptions)? What about bio-based block-device drivers?


This is the reason I read the code
differently from the way it was designed.
Let's state that ploop is an ideal world where:
FLUSH ==> preflush
FUA   ==> WRITE,postflush

In an ideal world FUA is not obliged to be handled by a postflush: it's enough
to guarantee that *this* particular request went to the platter; other
requests may remain not-flushed-yet.
Documentation/block/writeback_cache_control.txt is absolutely clear
about it:


The REQ_FUA flag can be ORed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is only
signaled after the data has been committed to non-volatile storage.
...
If the FUA bit is not natively supported the block
layer turns it into an empty REQ_FLUSH request after the actual write.



For this reason we can perform the reloc scheme as:

RELOC_A: R1,W2:FUA,WBI:FUA,W1:NULLIFY|FUA
RELOC_S: R1,W2:FUA,WBI:FUA

This allows us to handle FUA effectively and convert it to DELAYED_FLUSH where
possible.

Ploop has the concept of map_multi_updates. In short, while you're
handling one update, many others may come to the PLOOP_E_INDEX_DELAY state.
And, as soon as the first one is done, we modify many indices in one
loop (see map_wb_complete), then write that page to disk only once.
Having map_multi_update in mind, it may be suboptimal to make many
W2:FUA-s -- it may be better to do many ordinary W2-s instead, and only
one pre-FLUSH later -- when we're going to write the BAT page to disk.


Also let's clarify may_fua_delay semantics to the exact eng_state:

may_fua_delay {

int may_delay = 1;
/* effectively this is equivalent to
   preq->eng_state != PLOOP_E_COMPLETE,
   but it is more readable and less error-prone in the future */
if (preq-

Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling v2

2016-06-20 Thread Maxim Patlasov

Dima,

I agree with the general approach of this patch, but there are some 
(easy-to-fix) issues. Please see the inline comments below...


On 06/20/2016 11:58 AM, Dmitry Monakhov wrote:

barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
But a request can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
   ->delta->allocate
   ->io->submit_alloc: dio_submit_alloc
->dio_submit_pad
E_DATA_WBI : data written, time to update index
   ->delta->allocate_complete:ploop_index_update
 ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
 ->write_page
 ->ploop_map_wb_complete
   ->ploop_wb_complete_post_process
 ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:

->submit()

BUG#2: currently kaio write_page silently ignores REQ_FUA


Sorry, I can't agree, it actually does not ignore:


static void
kaio_write_page(struct ploop_io * io, struct ploop_request * preq,
 struct page * page, sector_t sec, int fua)
{
/* No FUA in kaio, convert it to fsync */
if (fua)
set_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state);




BUG#3: io_direct:dio_submit: if fua_delay is not possible we MUST tag all bios
with REQ_FUA, not just the latest one.


No need to tag *all*. See inline comments below.


This patch unifies barrier handling as follows:
- Get rid of FORCE_{FLUSH,FUA}
- Introduce DELAYED_FLUSH; currently it is supported only by io_direct
- Fix up FUA handling for dio_submit

This makes the reloc sequence optimal:
io_direct
RELOC_S: R1, W2, WBI:FLUSH|FUA
RELOC_A: R1, W2, WBI:FLUSH|FUA, W1:NULLIFY|FUA
io_kaio
RELOC_S: R1, W2:FUA, WBI:FUA
RELOC_A: R1, W2:FUA, WBI:FUA, W1:NULLIFY|FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/dev.c   |  8 +---
  drivers/block/ploop/io_direct.c | 29 +-
  drivers/block/ploop/io_kaio.c   | 17 ++--
  drivers/block/ploop/map.c   | 45 ++---
  include/linux/ploop/ploop.h |  8 
  5 files changed, 54 insertions(+), 53 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..fbc5f2f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,9 @@ static void ploop_complete_request(struct ploop_request * preq)
  
  	__TRACE("Z %p %u\n", preq, preq->req_cluster);
  
+	if (!preq->error) {
+		WARN_ON(test_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state));
+	}
while (preq->bl.head) {
struct bio * bio = preq->bl.head;
preq->bl.head = bio->bi_next;
@@ -2530,9 +2533,8 @@ restart:
top_delta = ploop_top_delta(plo);
sbl.head = sbl.tail = preq->aux_bio;
  
-		/* Relocated data write required sync before BAT updatee */
-		set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+		/* Relocated data write required sync before BAT updatee
+		 * this will happen inside index_update */
if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
plo->st.bio_out++;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index a6d83fe..d7ecd4a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -90,21 +90,12 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
trace_submit(preq);
  
  	preflush = !!(rw & REQ_FLUSH);

-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state))
-   preflush = 1;
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
-   postfua = 1;
-
-   if (!postfua && ploop_req_delay_fua_possible(rw, preq)) {
-
+   postfua = !!(rw & REQ_FUA);
+   if (ploop_req_delay_fua_possible(rw, preq)) {
/* Mark req that delayed flush required */
-   set_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
-   } else if (rw & REQ_FUA) {
-   postfua = 1;
+   set_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state);
+   postfua = 0;
}


"postfua" is a horrible name, let us see if we can get rid of it 
completely. Also, the way how ploop_req_delay_fua_possible implemented 
is prone to errors (see below an issue in kaio_complete_io_state). Let's 
rework it like this:


static inline bool ploop_req_delay_fua_possible(struct ploop_request *preq)
{
	return preq->eng_state == PLOOP_E_DATA_WBI;
}


Then, that chunk in dio_submit above might look like:


/* If we can delay, mark req that delayed flush required */
if ((r

Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling v2

2016-06-21 Thread Maxim Patlasov

On 06/21/2016 12:25 AM, Dmitry Monakhov wrote:

Maxim Patlasov  writes:


Dima,

I agree with the general approach of this patch, but there are some
(easy-to-fix) issues. Please see the inline comments below...

On 06/20/2016 11:58 AM, Dmitry Monakhov wrote:

barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
But a request can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
   ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
->delta->allocate
   ->io->submit_alloc: dio_submit_alloc
 ->dio_submit_pad
E_DATA_WBI : data written, time to update index
->delta->allocate_complete:ploop_index_update
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
  ->write_page
  ->ploop_map_wb_complete
->ploop_wb_complete_post_process
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:

 ->submit()

BUG#2: currently kaio write_page silently ignores REQ_FUA

Sorry, I can't agree, it actually does not ignore:

I mistyped. I meant to say REQ_FLUSH.

static void
kaio_write_page(struct ploop_io * io, struct ploop_request * preq,
  struct page * page, sector_t sec, int fua)
{
 /* No FUA in kaio, convert it to fsync */
 if (fua)
 set_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state);



BUG#3: io_direct:dio_submit: if fua_delay is not possible we MUST tag all bios
with REQ_FUA, not just the latest one.

No need to tag *all*. See inline comments below.


This patch unifies barrier handling as follows:
- Get rid of FORCE_{FLUSH,FUA}
- Introduce DELAYED_FLUSH; currently it is supported only by io_direct
- Fix up FUA handling for dio_submit

This makes the reloc sequence optimal:
io_direct
RELOC_S: R1, W2, WBI:FLUSH|FUA
RELOC_A: R1, W2, WBI:FLUSH|FUA, W1:NULLIFY|FUA
io_kaio
RELOC_S: R1, W2:FUA, WBI:FUA
RELOC_A: R1, W2:FUA, WBI:FUA, W1:NULLIFY|FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov 
---
   drivers/block/ploop/dev.c   |  8 +---
   drivers/block/ploop/io_direct.c | 29 +-
   drivers/block/ploop/io_kaio.c   | 17 ++--
   drivers/block/ploop/map.c   | 45 
++---
   include/linux/ploop/ploop.h |  8 
   5 files changed, 54 insertions(+), 53 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..fbc5f2f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,9 @@ static void ploop_complete_request(struct ploop_request * preq)
   
   	__TRACE("Z %p %u\n", preq, preq->req_cluster);
   
+	if (!preq->error) {
+		WARN_ON(test_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state));
+	}
while (preq->bl.head) {
struct bio * bio = preq->bl.head;
preq->bl.head = bio->bi_next;
@@ -2530,9 +2533,8 @@ restart:
top_delta = ploop_top_delta(plo);
sbl.head = sbl.tail = preq->aux_bio;
   
-		/* Relocated data write required sync before BAT updatee */
-		set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+		/* Relocated data write required sync before BAT updatee
+		 * this will happen inside index_update */
if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
plo->st.bio_out++;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index a6d83fe..d7ecd4a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -90,21 +90,12 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
trace_submit(preq);
   
   	preflush = !!(rw & REQ_FLUSH);

-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state))
-   preflush = 1;
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
-   postfua = 1;
-
-   if (!postfua && ploop_req_delay_fua_possible(rw, preq)) {
-
+   postfua = !!(rw & REQ_FUA);
+   if (ploop_req_delay_fua_possible(rw, preq)) {
/* Mark req that delayed flush required */
-   set_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
-   } else if (rw & REQ_FUA) {
-   postfua = 1;
+   set_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state);
+   postfua = 0;
}

"postfua" is a horrible name, let us see if we can get rid of it
completely. Also, the way how ploop_req_delay_fua_possible implemented
is prone to error

Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling v3

2016-06-21 Thread Maxim Patlasov

Dima,

After more thinking I realized that the whole idea of 
PLOOP_REQ_DELAYED_FLUSH might be bogus: it is possible that we simply do 
not have enough incoming FUA-s to make delaying lucrative. This 
patch actually mixes three things: 1) fix barriers for RELOC_A|S 
requests, 2) fix barriers for ordinary requests, 3) the DELAYED_FLUSH 
optimization. So, please, split the patch into three and make some 
measurements demonstrating that applying the "DELAYED_FLUSH optimization" 
patch on top of the previous patches improves performance.


I have an idea about how to fix barriers for ordinary requests -- please 
see the patch I'll send soon. The key point is that handling FLUSH-es 
is broken the same way as FUA: if you observe (rw & REQ_FLUSH) and send 
the first bio marked as REQ_FLUSH, it guarantees nothing unless you wait for 
its completion before submitting further bio-s! And ploop simply does not 
have the logic of waiting for the first before sending the others. And, to make 
things worse, not only dio_submit is affected; dio_submit_pad and 
dio_io_page need to be fixed too.


There are also some inline comments below...

On 06/21/2016 06:55 AM, Dmitry Monakhov wrote:

barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
But a request can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
   ->delta->allocate
  ->io->submit_alloc: dio_submit_alloc
->dio_submit_pad
E_DATA_WBI : data written, time to update index
   ->delta->allocate_complete:ploop_index_update
 ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
 ->write_page
 ->ploop_map_wb_complete
   ->ploop_wb_complete_post_process
 ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:

->submit()

BUG#2: currently kaio write_page silently ignores REQ_FLUSH
BUG#3: io_direct:dio_submit: if fua_delay is not possible we MUST tag all bios
with REQ_FUA, not just the latest one.
This patch unifies barrier handling as follows:
- Get rid of FORCE_{FLUSH,FUA}
- Introduce DELAYED_FLUSH
- Fix FUA handling for dio_submit
- BUG_ON for REQ_FLUSH in kaio_page_write

This makes the reloc sequence optimal:
io_direct
RELOC_S: R1, W2, WBI:FLUSH|FUA
RELOC_A: R1, W2, WBI:FLUSH|FUA, W1:NULLIFY|FUA
io_kaio
RELOC_S: R1, W2:FUA, WBI:FUA
RELOC_A: R1, W2:FUA, WBI:FUA, W1:NULLIFY|FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/dev.c   |  8 +---
  drivers/block/ploop/io_direct.c | 30 ++-
  drivers/block/ploop/io_kaio.c   | 23 +
  drivers/block/ploop/map.c   | 45 ++---
  include/linux/ploop/ploop.h | 19 +
  5 files changed, 60 insertions(+), 65 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..fbc5f2f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,9 @@ static void ploop_complete_request(struct ploop_request * preq)
  
  	__TRACE("Z %p %u\n", preq, preq->req_cluster);
  
+	if (!preq->error) {
+		WARN_ON(test_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state));
+	}
while (preq->bl.head) {
struct bio * bio = preq->bl.head;
preq->bl.head = bio->bi_next;
@@ -2530,9 +2533,8 @@ restart:
top_delta = ploop_top_delta(plo);
sbl.head = sbl.tail = preq->aux_bio;
  
-		/* Relocated data write required sync before BAT updatee */
-		set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+		/* Relocated data write required sync before BAT updatee
+		 * this will happen inside index_update */
if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
plo->st.bio_out++;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index a6d83fe..303eb70 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -83,28 +83,19 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
int err;
struct bio_list_walk bw;
int preflush;
-   int postfua = 0;
+   int fua = 0;
int write = !!(rw & REQ_WRITE);
int bio_num;


Your patch obsoletes bio_num. Please remove it.

  
  	trace_submit(preq);
  
  	preflush = !!(rw & REQ_FLUSH);

-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state))
-   preflush = 1;
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
-   postfua = 1;
-
-   if (!postfua && ploop_req_delay_fua_possible(rw, preq)) {
-
+   fua =

[Devel] [PATCH rh7] ploop: fix barriers for ordinary requests

2016-06-21 Thread Maxim Patlasov
The way io_direct.c handles FLUSH|FUA (b1:FLUSH,b2,b3,b4,b5:FLUSH|FUA)
is completely wrong: to make sure that b1:FLUSH took effect, we have to
wait for its completion. Similarly, even if we're sure that FUA will be
processed as a post-FLUSH (also dubious!), we have to wait for the
completion of b1..b4 to make sure that that flush covers them.

The patch fixes all these issues pretty simply: let's mark outgoing
bio-s with FLUSH|FUA based on those flags in the *corresponding* incoming
bio-s.

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c   |1 -
 drivers/block/ploop/io_direct.c |   47 ---
 2 files changed, 15 insertions(+), 33 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 2ef1449..6b5702f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -498,7 +498,6 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * bio,
preq->req_sector = bio->bi_sector;
preq->req_size = bio->bi_size >> 9;
preq->req_rw = bio->bi_rw;
-   bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
preq->eng_state = PLOOP_E_ENTRY;
preq->state = 0;
preq->error = 0;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 6ef9cd8..84c9a48 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -92,7 +92,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
int preflush;
int postfua = 0;
int write = !!(rw & REQ_WRITE);
-   int bio_num;
 
trace_submit(preq);
 
@@ -233,13 +232,13 @@ flush_bio:
goto flush_bio;
}
 
+   bio->bi_rw |= bw.cur->bi_rw & (REQ_FLUSH | REQ_FUA);
bw.bv_off += copy;
size -= copy >> 9;
sec += copy >> 9;
}
ploop_extent_put(em);
 
-   bio_num = 0;
while (bl.head) {
struct bio * b = bl.head;
unsigned long rw2 = rw;
@@ -255,11 +254,10 @@ flush_bio:
preflush = 0;
}
if (unlikely(postfua && !bl.head))
-   rw2 |= (REQ_FUA | ((bio_num) ? REQ_FLUSH : 0));
+   rw2 |= REQ_FUA;
 
ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
-   submit_bio(rw2, b);
-   bio_num++;
+   submit_bio(rw2 | b->bi_rw, b);
}
 
ploop_complete_io_request(preq);
@@ -567,7 +565,6 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request * preq,
sector_t sec, end_sec, nsec, start, end;
struct bio_list_walk bw;
int err;
-   int preflush = !!(preq->req_rw & REQ_FLUSH);
 
bio_list_init(&bl);
 
@@ -598,14 +595,17 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request * preq,
while (sec < end_sec) {
struct page * page;
unsigned int poff, plen;
+   bool zero_page;
 
if (sec < start) {
+   zero_page = true;
page = ZERO_PAGE(0);
poff = 0;
plen = start - sec;
if (plen > (PAGE_SIZE>>9))
plen = (PAGE_SIZE>>9);
} else if (sec >= end) {
+   zero_page = true;
page = ZERO_PAGE(0);
poff = 0;
plen = end_sec - sec;
@@ -614,6 +614,7 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request * preq,
} else {
/* sec >= start && sec < end */
struct bio_vec * bv;
+   zero_page = false;
 
if (sec == start) {
bw.cur = sbl->head;
@@ -672,6 +673,10 @@ flush_bio:
goto flush_bio;
}
 
+   /* Handle FLUSH here, dio_post_submit will handle FUA */
+   if (!zero_page)
+   bio->bi_rw |= bw.cur->bi_rw & REQ_FLUSH;
+
bw.bv_off += (plen<<9);
BUG_ON(plen == 0);
sec += plen;
@@ -688,13 +693,9 @@ flush_bio:
b->bi_private = preq;
b->bi_end_io = dio_endio_async;
 
-   rw = sbl->head->bi_rw | WRITE;
-   if (unlikely(preflush)) {
-   rw |= REQ_FLUSH;
-   preflush = 0;
-   }
+   rw = preq->req_rw & ~(REQ_FLUSH | REQ_FUA);
ploop_acc_ff_out(preq->plo, rw | b->bi_rw);
-   submit_bio(rw, b);
+   submit_bio(rw | b->bi_rw, b);
}
 
ploop_complete_i

Re: [Devel] [PATCH 1/3] ploop: skip redundant fsync for REQ_FUA in post_submit

2016-06-22 Thread Maxim Patlasov

Kostya,

The patch is OK per se; please commit it with:

Acked-by: Maxim Patlasov 

Thanks,
Maxim

On 06/21/2016 06:55 AM, Dmitry Monakhov wrote:

Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/io_direct.c | 24 ++--
  1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index b844a80..58d7580 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -517,27 +517,31 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
struct ploop_device *plo = preq->plo;
sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
loff_t clu_siz = 1 << (preq->plo->cluster_log + 9);
+   int force_sync = preq->req_rw & REQ_FUA;
int err;
  
  	file_start_write(io->files.file);
  
-	/* Here io->io_count is even ... */
-	spin_lock_irq(&plo->lock);
-   io->io_count++;
-   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
-   spin_unlock_irq(&plo->lock);
-
+   if (!force_sync) {
+   /* Here io->io_count is even ... */
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
+   spin_unlock_irq(&plo->lock);
+   }
err = io->files.file->f_op->fallocate(io->files.file,
  FALLOC_FL_CONVERT_UNWRITTEN,
  (loff_t)sec << 9, clu_siz);
  
  	/* highly unlikely case: FUA coming to a block not provisioned yet */

-   if (!err && (preq->req_rw & REQ_FUA))
+   if (!err && force_sync)
err = io->ops->sync(io);
  
-	spin_lock_irq(&plo->lock);
-	io->io_count++;
-	spin_unlock_irq(&plo->lock);
+   if (!force_sync) {
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   spin_unlock_irq(&plo->lock);
+   }
/* and here io->io_count is even (+2) again. */
  
  	file_end_write(io->files.file);




Re: [Devel] [PATCH 2/3] ploop: deadcode cleanup

2016-06-22 Thread Maxim Patlasov

Kostya,

The patch is OK per se; please commit it with:

Acked-by: Maxim Patlasov 

Thanks,
Maxim

On 06/21/2016 06:55 AM, Dmitry Monakhov wrote:

The (rw & REQ_FUA) branch is impossible because REQ_FUA was cleared one line above.
The logic was moved to ploop_req_delay_fua_possible() a long time ago.

Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/io_direct.c | 9 -
  1 file changed, 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 58d7580..a6d83fe 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -108,15 +108,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
rw &= ~(REQ_FLUSH | REQ_FUA);
  
  
-	/* In case of eng_state != COMPLETE, we'll do FUA in
-	 * ploop_index_update(). Otherwise, we should mark
-	 * last bio as FUA here. */
-   if (rw & REQ_FUA) {
-   rw &= ~REQ_FUA;
-   if (preq->eng_state == PLOOP_E_COMPLETE)
-   postfua = 1;
-   }
-
bio_list_init(&bl);
  
  	if (iblk == PLOOP_ZERO_INDEX)




Re: [Devel] [PATCH rh7] ploop: fix barriers for ordinary requests

2016-06-22 Thread Maxim Patlasov

On 06/22/2016 06:41 AM, Dmitry Monakhov wrote:

Maxim Patlasov  writes:


The way io_direct.c handles FLUSH|FUA (b1:FLUSH,b2,b3,b4,b5:FLUSH|FUA)
is completely wrong: to make sure that b1:FLUSH took effect, we have to
wait for its completion. Similarly, even if we're sure that FUA will be
processed as a post-FLUSH (also dubious!), we have to wait for the
completion of b1..b4 to make sure that that flush covers them.

The patch fixes all these issues pretty simply: let's mark outgoing
bio-s with FLUSH|FUA based on those flags in the *corresponding* incoming
bio-s.

One more thing please see below.

Signed-off-by: Maxim Patlasov 
---
  drivers/block/ploop/dev.c   |1 -
  drivers/block/ploop/io_direct.c |   47 ---
  2 files changed, 15 insertions(+), 33 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 2ef1449..6b5702f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -498,7 +498,6 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * bio,
preq->req_sector = bio->bi_sector;
preq->req_size = bio->bi_size >> 9;
preq->req_rw = bio->bi_rw;
-   bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
preq->eng_state = PLOOP_E_ENTRY;
preq->state = 0;
preq->error = 0;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 6ef9cd8..84c9a48 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -92,7 +92,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
int preflush;
int postfua = 0;
int write = !!(rw & REQ_WRITE);
-   int bio_num;
  
  	trace_submit(preq);
  
@@ -233,13 +232,13 @@ flush_bio:

goto flush_bio;
}
  
+		bio->bi_rw |= bw.cur->bi_rw & (REQ_FLUSH | REQ_FUA);

bw.bv_off += copy;
size -= copy >> 9;
sec += copy >> 9;
}
ploop_extent_put(em);
  
-	bio_num = 0;

while (bl.head) {
struct bio * b = bl.head;
unsigned long rw2 = rw;
@@ -255,11 +254,10 @@ flush_bio:
preflush = 0;
}
if (unlikely(postfua && !bl.head))
-   rw2 |= (REQ_FUA | ((bio_num) ? REQ_FLUSH : 0));
+   rw2 |= REQ_FUA;
  
  		ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);

-   submit_bio(rw2, b);
-   bio_num++;
+   submit_bio(rw2 | b->bi_rw, b);
}
  
  	ploop_complete_io_request(preq);

@@ -567,7 +565,6 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request * preq,
sector_t sec, end_sec, nsec, start, end;
struct bio_list_walk bw;
int err;
-   int preflush = !!(preq->req_rw & REQ_FLUSH);
  
  	bio_list_init(&bl);
  
@@ -598,14 +595,17 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request * preq,

while (sec < end_sec) {
struct page * page;
unsigned int poff, plen;
+   bool zero_page;
  
  		if (sec < start) {

+   zero_page = true;
page = ZERO_PAGE(0);
poff = 0;
plen = start - sec;
if (plen > (PAGE_SIZE>>9))
plen = (PAGE_SIZE>>9);
} else if (sec >= end) {
+   zero_page = true;
page = ZERO_PAGE(0);
poff = 0;
plen = end_sec - sec;
@@ -614,6 +614,7 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request * preq,
} else {
/* sec >= start && sec < end */
struct bio_vec * bv;
+   zero_page = false;
  
  			if (sec == start) {

bw.cur = sbl->head;
@@ -672,6 +673,10 @@ flush_bio:
goto flush_bio;
}
  
+		/* Handle FLUSH here, dio_post_submit will handle FUA */

submit_pad may be called w/o post_submit flag from here:
->dio_submit_alloc
   if (io->files.em_tree->_get_extent) {
->dio_fallocate
->dio_submit_pad
   ..
  }


We never have _get_extent set. This is legacy code for PCSS support; 
we'll remove it. For now, we can safely ignore this.


Thanks,
Maxim


Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling v3

2016-06-22 Thread Maxim Patlasov

Dima,

I'm uneasy that we still have RELOC_A|S handling broken. It seems we 
have full agreement that for such requests we can do an unconditional 
FLUSH|FUA when we call write_page from ploop_index_update() and 
map_wb_complete(). And your idea to implement it by passing FLUSH|FUA 
for io_direct and post_fsync=1 for io_kaio is smart and OK. Will you 
send a patch for that (fixing barriers for RELOC_A|S requests)?


Thanks,
Maxim

On 06/21/2016 04:56 PM, Maxim Patlasov wrote:

Dima,

After more thinking I realized that the whole idea of 
PLOOP_REQ_DELAYED_FLUSH might be bogus: it is possible that we simply 
do not have enough incoming FUA-s to make delaying lucrative. 
This patch actually mixes three things: 1) fix barriers for RELOC_A|S 
requests, 2) fix barriers for ordinary requests, 3) the DELAYED_FLUSH 
optimization. So, please, split the patch into three and make some 
measurements demonstrating that applying the "DELAYED_FLUSH optimization" 
patch on top of the previous patches improves performance.


I have an idea about how to fix barriers for ordinary requests -- please 
see the patch I'll send soon. The key point is that handling 
FLUSH-es is broken the same way as FUA: if you observe (rw & 
REQ_FLUSH) and send the first bio marked as REQ_FLUSH, it guarantees 
nothing unless you wait for its completion before submitting further 
bio-s! And ploop simply does not have the logic of waiting for the first 
before sending the others. And, to make things worse, not only dio_submit 
is affected; dio_submit_pad and dio_io_page need to be fixed too.


There are also some inline comments below...

On 06/21/2016 06:55 AM, Dmitry Monakhov wrote:

barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} 
correctly.
But a request can also go through ->dio_submit_alloc()->dio_submit_pad 
and write_page (for indexes).

So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
   ->delta->allocate
  ->io->submit_alloc: dio_submit_alloc
->dio_submit_pad
E_DATA_WBI : data written, time to update index
->delta->allocate_complete:ploop_index_update
 ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
 ->write_page
 ->ploop_map_wb_complete
   ->ploop_wb_complete_post_process
 ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:

->submit()

BUG#2: currently kaio write_page silently ignores REQ_FLUSH
BUG#3: io_direct:dio_submit: if fua_delay is not possible we MUST tag 
all bios with REQ_FUA, not just the latest one.
This patch unifies barrier handling as follows:
- Get rid of FORCE_{FLUSH,FUA}
- Introduce DELAYED_FLUSH
- Fix FUA handling for dio_submit
- BUG_ON for REQ_FLUSH in kaio_page_write

This makes the reloc sequence optimal:
io_direct
RELOC_S: R1, W2, WBI:FLUSH|FUA
RELOC_A: R1, W2, WBI:FLUSH|FUA, W1:NULLIFY|FUA
io_kaio
RELOC_S: R1, W2:FUA, WBI:FUA
RELOC_A: R1, W2:FUA, WBI:FUA, W1:NULLIFY|FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/dev.c   |  8 +---
  drivers/block/ploop/io_direct.c | 30 ++-
  drivers/block/ploop/io_kaio.c   | 23 +
  drivers/block/ploop/map.c   | 45 
++---

  include/linux/ploop/ploop.h | 19 +
  5 files changed, 60 insertions(+), 65 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..fbc5f2f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,9 @@ static void ploop_complete_request(struct ploop_request * preq)

__TRACE("Z %p %u\n", preq, preq->req_cluster);
+	if (!preq->error) {
+		WARN_ON(test_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state));
+	}
  while (preq->bl.head) {
  struct bio * bio = preq->bl.head;
  preq->bl.head = bio->bi_next;
@@ -2530,9 +2533,8 @@ restart:
  top_delta = ploop_top_delta(plo);
  sbl.head = sbl.tail = preq->aux_bio;
-	/* Relocated data write required sync before BAT updatee */
-	set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+	/* Relocated data write required sync before BAT updatee
+	 * this will happen inside index_update */
  if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
  preq->eng_state = PLOOP_E_DATA_WBI;
  plo->st.bio_out++;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c

index a6d83fe..303eb70 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -83,28 +83,19 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,

  int err;
  

Re: [Devel] [RH7 PATCH 0/6] RFC ploop: Barrier fix patch set v3

2016-06-23 Thread Maxim Patlasov

Dima,

On 06/23/2016 10:25 AM, Dmitry Monakhov wrote:

Here is the 3rd version of the barrier fix patches, based on recent fixes.
This is an RFC version. I do not have time to test it before tomorrow;
Max, please review it briefly and tell me your opinion about the general idea.


It's hard to review w/o context, and the series fails to apply to our vz7 
tree. So, I spent a pretty long while trying to find a tag or commit where 
it's possible to apply your patches w/o rejects. The first patch wants 
PLOOP_REQ_ALLOW_READS in ploop.h:



@@ -471,6 +471,7 @@ enum
 PLOOP_REQ_POST_SUBMIT, /* preq needs post_submit processing */
 PLOOP_REQ_PUSH_BACKUP, /* preq was ACKed by userspace push_backup */
 PLOOP_REQ_ALLOW_READS, /* READs are allowed for given req_cluster */
+PLOOP_REQ_DEL_CONV,/* post_submit: conversion required */
 PLOOP_REQ_FSYNC_DONE,  /* fsync_thread() performed f_op->fsync() */
 };


We removed ALLOW_READS by 06e7586 (Jun 3), so you must have 
rh7-3.10.0-327.18.2.vz7.14.11 or earlier. But the third patch has:


@@ -562,7 +551,6 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request * preq,

 sector_t sec, end_sec, nsec, start, end;
 struct bio_list_walk bw;
 int err;
-
 bio_list_init(&bl);

 /* sec..end_sec is the range which we are going to write */


while after applying the first and the second, it looks like:


static void
dio_submit_pad(struct ploop_io *io, struct ploop_request * preq,
   struct bio_list * sbl, unsigned int size,
   struct extent_map *em)
{
struct bio_list bl;
struct bio * bio = NULL;
sector_t sec, end_sec, nsec, start, end;
struct bio_list_walk bw;
int err;
int preflush = !!(preq->req_rw & REQ_FLUSH);

bio_list_init(&bl);


So the tree you used didn't have that "int preflush = !!(preq->req_rw & 
REQ_FLUSH);" line. But the patch removing this line was committed only 
yesterday, Jun 22 (c2247f3745).


After applying c2247f3745 before your series, another conflict happens 
while applying the third patch: the first hunk assumes the following 
lines in dio_submit:



 rw &= ~(REQ_FLUSH | REQ_FUA);
-
-
 bio_list_init(&bl);


But we always (since 2013) had:


rw &= ~(REQ_FLUSH | REQ_FUA);


/* In case of eng_state != COMPLETE, we'll do FUA in
 * ploop_index_update(). Otherwise, we should mark
 * last bio as FUA here. */
if (rw & REQ_FUA) {
rw &= ~REQ_FUA;
if (preq->eng_state == PLOOP_E_COMPLETE)
postfua = 1;
}

bio_list_init(&bl);


Hence, I drew the conclusion that we need one of your previous patches to be 
applied first. After a very long row of trials and errors I eventually 
succeeded:


$ git show c2247f3745 > /tmp/my.diff
$ git checkout rh7-3.10.0-327.18.2.vz7.14.11
$ patch -p1 < ploop-fix-barriers-for-ordinary-requests
$ patch -p1 < ploop-skip-redundant-fsync-for-REQ_FUA-in-post_submit
$ patch -p1 < ploop-deadcode-cleanup
$ patch -p1 < ploop-generalize-post_submit-stage
$ patch -p1 < ploop-generalize-issue_flush
$ patch -p1 < ploop-add-delayed-flush-support
$ patch -p1 < ploop-io_kaio-support-PLOOP_REQ_DEL_FLUSH
$ patch -p1 < ploop-fixup-barrier-handling-during-relocation
$ patch -p1 < patch-ploop_state_debugging.patch



The basic idea is to use the post_submit state to issue an empty FLUSH barrier
in order to complete FUA requests. This allows us to unify all engines (direct and kaio).

This makes FUA processing optimal:
SUBMIT:FUA   :W1{b1,b2,b3,b4..},WAIT,post_submit:FLUSH
SUBMIT_ALLOC:FUA :W1{b1,b2,b3,b4..},WAIT,post_submit:FLUSH, WBI:FUA


The above would be optimal only if all three statements below are true:

1) Lower layers process FUA as post-FLUSH.

I remember that you wrote that in real (linux kernel) life it is always 
true, but somehow I'm not sure about this "always"... Of course, we can 
investigate and eventually nail down the question, but for now I'm not 
convinced.


2) The list of (incoming) bio-s has more than one marked as FUA. 
Otherwise (i.e. if only one is FUA), it must be equivalent (from a 
performance perspective) to submit FUA now, or to submit FLUSH later 
(modulo 1) above).


For now, I know only two entities generating FUA: ext4 writing its superblock 
and jbd2 committing a transaction. In both cases it is one page-sized bio, and 
there is no way to have more than one FUA in the queue. Do you know other cases?
(Of course, we know about fsync(), but it generates a zero-size FLUSH. The 
way ploop processes it is not affected by the patches we discussed.)


3) The benefits of delayed FLUSH must outweigh the performance loss 
we'll have from the extra WAIT introduced. I have not verified it yet, but I 
suspect it must be observable (e.g. by blktrace) that one page-sized bio 
marked as FLUSH|FUA completes faster than: submit one page-sized bio 
marked as FLUSH, wait for completion, submit a zero-size FLUSH, wait for 
completion again. Makes sense?
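
Such a comparison could be made with standard blktrace usage (the device
name is of course illustrative): run both patterns against the device
backing the ploop image and compare the per-bio completion latencies:

	blktrace -d /dev/sdb -o - | blkparse -i -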


If the three statements above are correct, and given some complexity 
added by the series, and poss

Re: [Devel] [RH7 PATCH 1/6] ploop: generalize post_submit stage

2016-06-23 Thread Maxim Patlasov

On 06/23/2016 10:25 AM, Dmitry Monakhov wrote:

Currently post_submit() is used only for convert_unwritten_extents.
But post_submit() is a good transition point, where all submitted
data has been completed by the lower layer and a new state is about to be
processed. It is an ideal point where we can perform transition actions,
for example:
  io_direct: convert unwritten extents
  io_direct: issue an empty barrier bio in order to simulate a postflush
  io_direct,io_kaio: queue to the fsync queue
  Etc.

This patch does not change anything, but prepares post_submit for
more logic which will be added later.


If we decide to have DEL_FLUSH, I'm OK with this approach. Maybe with 
some renaming:


s/PLOOP_REQ_DEL_FLUSH/PLOOP_REQ_FLUSH_DELAYED
s/PLOOP_REQ_DEL_CONV/PLOOP_REQ_CONV_DELAYED
s/post_submit/pre_process
s/PLOOP_REQ_POST_SUBMIT/PLOOP_REQ_PRE_PROCESS
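
For instance, the enum entries would then read (a sketch only, with the
comments adjusted to the new names):

	PLOOP_REQ_CONV_DELAYED,  /* pre_process: conversion required */
	PLOOP_REQ_FLUSH_DELAYED, /* pre_process: empty FLUSH required */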



Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/dev.c   | 10 ++
  drivers/block/ploop/io_direct.c | 15 ---
  include/linux/ploop/ploop.h | 12 +++-
  3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index e405232..e8b0304 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2351,10 +2351,12 @@ static void ploop_req_state_process(struct ploop_request * preq)
preq->prealloc_size = 0; /* only for sanity */
}
  
-	if (test_bit(PLOOP_REQ_POST_SUBMIT, &preq->state)) {
-		preq->eng_io->ops->post_submit(preq->eng_io, preq);
-		clear_bit(PLOOP_REQ_POST_SUBMIT, &preq->state);
+	if (test_and_clear_bit(PLOOP_REQ_POST_SUBMIT, &preq->state)) {
+		struct ploop_io *io = preq->eng_io;
+
preq->eng_io = NULL;
+   if (preq->eng_io->ops->post_submit(io, preq))
+   goto out;
}
  
  restart:

@@ -2633,7 +2635,7 @@ restart:
default:
BUG();
}
-
+out:
if (release_ioc) {
struct io_context * ioc = current->io_context;
current->io_context = saved_ioc;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index f1812fe..ec905b4 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -416,8 +416,8 @@ try_again:
}
  
  		preq->iblock = iblk;

-   preq->eng_io = io;
-   set_bit(PLOOP_REQ_POST_SUBMIT, &preq->state);
+   set_bit(PLOOP_REQ_DEL_CONV, &preq->state);
+   ploop_add_post_submit(io, preq);
dio_submit_pad(io, preq, sbl, size, em);
err = 0;
goto end_write;
@@ -501,7 +501,7 @@ end_write:
  }
  
  static void

-dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
+dio_convert_extent(struct ploop_io *io, struct ploop_request * preq)
  {
struct ploop_device *plo = preq->plo;
sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
@@ -540,6 +540,15 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
}
  }
  
+static int
+dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
+{
+   if (test_and_clear_bit(PLOOP_REQ_DEL_CONV, &preq->state))
+   dio_convert_extent(io, preq);
+
+   return 0;
+}
+
  /* Submit the whole cluster. If preq contains only partial data
   * within the cluster, pad the rest of cluster with zeros.
   */
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index 0fba25e..4c52a40 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -148,7 +148,7 @@ struct ploop_io_ops
			struct bio_list *sbl, iblock_t iblk, unsigned int size);
void(*submit_alloc)(struct ploop_io *, struct ploop_request *,
struct bio_list *sbl, unsigned int size);
-   void(*post_submit)(struct ploop_io *, struct ploop_request *);
+   int (*post_submit)(struct ploop_io *, struct ploop_request *);
  
  	int	(*disable_merge)(struct ploop_io * io, sector_t isector, unsigned int len);

int (*fastmap)(struct ploop_io * io, struct bio *orig_bio,
@@ -471,6 +471,7 @@ enum
PLOOP_REQ_POST_SUBMIT, /* preq needs post_submit processing */
PLOOP_REQ_PUSH_BACKUP, /* preq was ACKed by userspace push_backup */
PLOOP_REQ_ALLOW_READS, /* READs are allowed for given req_cluster */
+   PLOOP_REQ_DEL_CONV,/* post_submit: conversion required */
PLOOP_REQ_FSYNC_DONE,  /* fsync_thread() performed f_op->fsync() */
  };
  
@@ -479,6 +480,8 @@ enum

  #define PLOOP_REQ_RELOC_S_FL (1 << PLOOP_REQ_RELOC_S)
  #define PLOOP_REQ_DISCARD_FL (1 << PLOOP_REQ_DISCARD)
  #define PLOOP_REQ_ZERO_FL (1 << PLOOP_REQ_ZERO)
+#define PLOOP_REQ_POST_SUBMIT_FL (1 << PLOOP_REQ_POST_SUBMIT)
+#define PLOOP_REQ_DEL_CONV_FL (1 << PLOOP_REQ_DEL_CONV)
  
  enum

  {
@@ -767,6 +770,13 @@ static inline void ploop_entry_qlen_dec(stru

Re: [Devel] [RH7 PATCH 3/6] ploop: add delayed flush support

2016-06-23 Thread Maxim Patlasov

On 06/23/2016 10:25 AM, Dmitry Monakhov wrote:

dio_submit and dio_submit_pad may produce several bios. This makes
processing of REQ_FUA complicated because in order to preserve correctness
we have to TAG each bio with FUA flag which is suboptimal.
Obviously there is room for optimization here: once all bios were acknowledged
by the lower layer we may issue an empty barrier aka ->issue_flush().
The post_submit callback is the place where all bios have completed already.

b1:FUA, b2:FUA, b3:FUA =>  b1,b2,b3,wait_for_bios,bX:FLUSH

This allows us to remove all this REQ_FORCE_{FLUSH,FUA} crap and


It seems we can remove REQ_FORCE_{FLUSH,FUA} right now. Only RELOC_A|S 
needs it and we can fix them in a simple and straightforward way -- 
essentially your 5th patch must be enough.


Btw, this patch doesn't disable the logic of passing FUA from incoming 
bio-s to outgoing (commit c2247f3745). Was that by mistake or deliberate?
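As a stand-alone illustration of the transformation Dmitry describes
(b1:FUA, b2:FUA, b3:FUA => b1, b2, b3, wait, bX:FLUSH) -- toy code,
nothing below is ploop:

#include <stdio.h>

#define NBIOS 3

int main(void)
{
	int fua_requested = 1;	/* the incoming request carried REQ_FUA */

	/* delayed-flush scheme: plain writes... */
	for (int i = 1; i <= NBIOS; i++)
		printf("submit b%d (no FUA)\n", i);

	/* ...wait until all NBIOS bios are acknowledged... */

	/* ...then one empty barrier instead of NBIOS FUA tags */
	if (fua_requested)
		printf("submit bX with FLUSH\n");
	return 0;
}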




Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/io_direct.c | 48 +
  include/linux/ploop/ploop.h |  2 ++
  2 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 195d318..752a9c3e 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -82,31 +82,13 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
sector_t sec, nsec;
int err;
struct bio_list_walk bw;
-   int preflush;
-   int postfua = 0;
+   int preflush = !!(rw & REQ_FLUSH);
+   int postflush = !!(rw & REQ_FUA);
int write = !!(rw & REQ_WRITE);
  
  	trace_submit(preq);
  
-	preflush = !!(rw & REQ_FLUSH);

-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state))
-   preflush = 1;
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
-   postfua = 1;
-
-   if (!postfua && ploop_req_delay_fua_possible(rw, preq)) {
-
-   /* Mark req that delayed flush required */
-   set_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
-   } else if (rw & REQ_FUA) {
-   postfua = 1;
-   }
-
rw &= ~(REQ_FLUSH | REQ_FUA);
-
-
bio_list_init(&bl);
  
  	if (iblk == PLOOP_ZERO_INDEX)

@@ -237,13 +219,14 @@ flush_bio:
rw2 |= REQ_FLUSH;
preflush = 0;
}
-   if (unlikely(postfua && !bl.head))
-   rw2 |= REQ_FUA;
-
ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
submit_bio(rw2, b);
}
-
+   /* TODO: minor optimization is possible for single bio case */
+   if (postflush) {
+   set_bit(PLOOP_REQ_DEL_FLUSH, &preq->state);
+   ploop_add_post_submit(io, preq);
+   }
ploop_complete_io_request(preq);
return;
  
@@ -523,9 +506,10 @@ dio_convert_extent(struct ploop_io *io, struct ploop_request * preq)

  (loff_t)sec << 9, clu_siz);
  
  	/* highly unlikely case: FUA coming to a block not provisioned yet */

-   if (!err && force_sync)
+   if (!err && force_sync) {
+   clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state);
err = io->ops->sync(io);
-
+   }
if (!force_sync) {
spin_lock_irq(&plo->lock);
io->io_count++;
@@ -546,7 +530,12 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
if (test_and_clear_bit(PLOOP_REQ_DEL_CONV, &preq->state))
dio_convert_extent(io, preq);
  
+	if (test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state)) {

+   io->ops->issue_flush(io, preq);
+   return 1;
+   }
return 0;
+
  }
  
  /* Submit the whole cluster. If preq contains only partial data

@@ -562,7 +551,6 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request * 
preq,
sector_t sec, end_sec, nsec, start, end;
struct bio_list_walk bw;
int err;
-
bio_list_init(&bl);
  
  	/* sec..end_sec is the range which we are going to write */

@@ -694,7 +682,11 @@ flush_bio:
ploop_acc_ff_out(preq->plo, rw | b->bi_rw);
submit_bio(rw, b);
}
-
+   /* TODO: minor optimization is possible for single bio case */
+   if (preq->req_rw &  REQ_FUA) {
+   set_bit(PLOOP_REQ_DEL_FLUSH, &preq->state);
+   ploop_add_post_submit(io, preq);
+   }
ploop_complete_io_request(preq);
return;
  
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h

index 4c52a40..5076f16 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -472,6 +472,7 @@ enum
PLOOP_REQ_PUSH_BACKUP, /* preq was ACKed by userspace push_backup */
PLOOP_REQ_ALLOW_READS, /* READs are allowed for given req_cluster */
PLOOP_REQ_DEL_CONV,/* post_submit: conversion required */

Re: [Devel] [RH7 PATCH 4/6] ploop: io_kaio support PLOOP_REQ_DEL_FLUSH

2016-06-23 Thread Maxim Patlasov

On 06/23/2016 10:25 AM, Dmitry Monakhov wrote:

Currently no one tags preqs with such a bit, but let it be here for symmetry


I hate dead code (things that are impossible to verify by any test). Can we 
add this "symmetry" check later, along with a patch for kaio setting 
this bit? (i.e.: kaio doesn't set it; ergo it must not check it)




Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/io_kaio.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index bee2cee..5341fd5 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -73,6 +73,7 @@ static void kaio_complete_io_state(struct ploop_request * 
preq)
  
  	/* Convert requested fua to fsync */

if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
+   test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state) ||
test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state))
post_fsync = 1;
  


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [RH7 PATCH 5/6] ploop: fixup barrier handling during relocation

2016-06-23 Thread Maxim Patlasov
No reason to keep it along with optimization patches. Please see the 
port I'll send later today.


On 06/23/2016 10:25 AM, Dmitry Monakhov wrote:

barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
But a request can also go through ->dio_submit_alloc()->dio_submit_pad and 
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
   ->delta->allocate
  ->io->submit_allloc: dio_submit_alloc
->dio_submit_pad
E_DATA_WBI : data written, time to update index
   ->delta->allocate_complete:ploop_index_update
 ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
 ->write_page
 ->ploop_map_wb_complete
   ->ploop_wb_complete_post_process
 ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:

->submit()

Once we have the delayed_flush engine it is easy to implement a correct scheme for
both engines.

E_RELOC_DATA_READ ->submit_allloc => wait->post_submit->issue_flush
E_DATA_WBI ->ploop_index_update with FUA
E_RELOC_NULLIFY ->submit: => wait->post_submit->issue_flush

This makes reloc sequence optimal:
RELOC_S: R1, W2,WAIT,FLUSH, WBI:FUA
RELOC_A: R1, W2,WAIT,FLUSH, WBI:FUA, W1:NULLIFY,WAIT, FLUSH

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/dev.c |  2 +-
  drivers/block/ploop/io_kaio.c |  3 +--
  drivers/block/ploop/map.c | 28 ++--
  3 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 95e3067..090cd2d 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2533,7 +2533,7 @@ restart:
sbl.head = sbl.tail = preq->aux_bio;
  
  		/* Relocated data write required sync before BAT updatee */

-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
+   preq->req_rw |= REQ_FUA;
  
  		if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {

preq->eng_state = PLOOP_E_DATA_WBI;
diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index 5341fd5..5217ab4 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -72,8 +72,7 @@ static void kaio_complete_io_state(struct ploop_request * 
preq)
}
  
  	/* Convert requested fua to fsync */

-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
-   test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state) ||
+   if (test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state) ||
test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state))
post_fsync = 1;
  
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c

index 3a6365d..ef351fb 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -901,6 +901,8 @@ void ploop_index_update(struct ploop_request * preq)
int old_level;
struct page * page;
sector_t sec;
+   int fua = !!(preq->req_rw & REQ_FUA);
+   unsigned long state = READ_ONCE(preq->state);
  
  	/* No way back, we are going to initiate index write. */
  
@@ -954,12 +956,11 @@ void ploop_index_update(struct ploop_request * preq)

plo->st.map_single_writes++;
top_delta->ops->map_index(top_delta, m->mn_start, &sec);
/* Relocate requires consistent writes, mark such reqs appropriately */
-   if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
-   test_bit(PLOOP_REQ_RELOC_S, &preq->state))
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
-   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec,
- !!(preq->req_rw & REQ_FUA));
+   if (state & (PLOOP_REQ_RELOC_A_FL | PLOOP_REQ_RELOC_S_FL)) {
+   WARN_ON(state & PLOOP_REQ_DEL_FLUSH_FL);
+   fua = 1;
+   }
+   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, fua);
put_page(page);
return;
  
@@ -1063,7 +1064,7 @@ static void map_wb_complete_post_process(struct ploop_map *map,

 * (see dio_submit()). So fsync of EXT4 image doesnt help us.
 * We need to force sync of nullified blocks.
 */
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
+   preq->req_rw |= REQ_FUA;
top_delta->io.ops->submit(&top_delta->io, preq, preq->req_rw,
				  &sbl, preq->iblock, 1 << plo->cluster_log);
[...]
	list_for_each_safe(cursor, tmp, &m->io_queue) {

struct ploop_request * preq;
+   unsigned long state;
  
  		preq = list_entry(cursor, struct ploop_request, list);

+   state = READ_ONCE(preq->state);
  
  		switch (preq->eng_state) {


Re: [Devel] [RH7 PATCH 6/6] patch ploop_state_debugging.patch

2016-06-23 Thread Maxim Patlasov

OK

On 06/23/2016 10:25 AM, Dmitry Monakhov wrote:

Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/dev.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 090cd2d..9bf8592 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1232,6 +1232,12 @@ static void ploop_complete_request(struct ploop_request 
* preq)
}
preq->bl.tail = NULL;
  
+	if (!preq->error) {

+   unsigned long state = READ_ONCE(preq->state);
+   WARN_ON(state & (PLOOP_REQ_POST_SUBMIT_FL|
+PLOOP_REQ_DEL_CONV_FL |
+PLOOP_REQ_DEL_FLUSH_FL ));
+   }
if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
if (preq->error)


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 1/9] ploop: deadcode cleanup

2016-06-23 Thread Maxim Patlasov
From: Dmitry Monakhov 

Rebase Dima's patch on top of rh7-3.10.0-327.18.2.vz7.14.19:

(rw & REQ_FUA) branch is impossible because REQ_FUA was cleared a line above.
The logic was moved to ploop_req_delay_fua_possible() a long time ago.

Signed-off-by: Dmitry Monakhov 
Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/io_direct.c |   10 --
 1 file changed, 10 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 50c0ed1..3acae79 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -113,16 +113,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * 
preq,
 
rw &= ~(REQ_FLUSH | REQ_FUA);
 
-
-   /* In case of eng_state != COMPLETE, we'll do FUA in
-* ploop_index_update(). Otherwise, we should mark
-* last bio as FUA here. */
-   if (rw & REQ_FUA) {
-   rw &= ~REQ_FUA;
-   if (preq->eng_state == PLOOP_E_COMPLETE)
-   postfua = 1;
-   }
-
bio_list_init(&bl);
 
if (iblk == PLOOP_ZERO_INDEX)

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 3/9] ploop: resurrect delayed_fua for io_kaio

2016-06-23 Thread Maxim Patlasov
After long thinking, it now seems clear how
delayed_fua was supposed to work for io_kaio:

1) An incoming bio marked as REQ_FUA leads to this bit
set in preq->req_rw.

2) kaio_submit does nothing with this bit in preq->req_rw.
It only initiates sending data by aio_kernel_submit.

3) When userspace ACKs this WRITE, kaio_complete_io_state
discovers that even though REQ_FUA bit is set in preq->req_rw,
eng_state == E_DATA_WBI, so we can delay flush until index
update.

NB: It is crucial here that preq->req_rw still has the REQ_FUA
bit set!

4) index update calls ->write_page() with fua=1 because
it detects REQ_FUA bit set in preq->req_rw.

5) kaio_write_page observes fua=1 and so set PLOOP_REQ_KAIO_FSYNC
in preq->state. Then it initiates sending data (BAT update).

6) When userspace ACKs this WRITE (BAT update),
kaio_complete_io_state detects PLOOP_REQ_KAIO_FSYNC bit set,
so it clears it and enforces post_fsync=1.

The patch fixes 3) that was broken so far.
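A compact model of the resulting decision (a simplified sketch; the
enum values and helpers are stand-ins for the real ploop definitions):

#include <stdbool.h>

enum eng_state { E_DATA_WBI, E_COMPLETE };

/* FUA may be delayed only while the index update is still ahead */
static bool delay_fua_possible(enum eng_state st)
{
	return st == E_DATA_WBI;
}

/* Should kaio convert the request's FUA into an explicit fsync now? */
static bool need_post_fsync(bool req_fua, bool kaio_fsync_bit,
			    enum eng_state st)
{
	if (kaio_fsync_bit)	/* set by kaio_write_page, see step 5) */
		return true;
	return req_fua && !delay_fua_possible(st);
}

int main(void)
{
	/* data WRITE with FUA, index update still pending: delay it */
	return need_post_fsync(true, false, E_DATA_WBI);
}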

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/io_kaio.c |   12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index 69df456..e4e4411 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -68,6 +68,7 @@ static void kaio_complete_io_state(struct ploop_request * 
preq)
struct ploop_device * plo   = preq->plo;
unsigned long flags;
int post_fsync = 0;
+   int need_fua = !!(preq->req_rw & REQ_FUA);
 
if (preq->error || !(preq->req_rw & REQ_FUA) ||
preq->eng_state == PLOOP_E_INDEX_READ ||
@@ -80,14 +81,11 @@ static void kaio_complete_io_state(struct ploop_request * 
preq)
 
/* Convert requested fua to fsync */
if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
-   test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state))
+   test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state) ||
+   (need_fua && !ploop_req_delay_fua_possible(preq))) {
post_fsync = 1;
-
-   if (!post_fsync &&
-   !(ploop_req_delay_fua_possible(preq) && (preq->req_rw & REQ_FUA)))
-   post_fsync = 1;
-
-   preq->req_rw &= ~REQ_FUA;
+   preq->req_rw &= ~REQ_FUA;
+   }
 
if (post_fsync) {
spin_lock_irqsave(&plo->lock, flags);

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 2/9] ploop: minor rework of ploop_req_delay_fua_possible

2016-06-23 Thread Maxim Patlasov
No functional changes. The patch simplifies ploop_req_delay_fua_possible
to make it more suitable for the next patch. As was recently discussed,
"eng_state == E_DATA_WBI" is less prone to errors than
"eng_state != E_COMPLETE".

Note how the patch makes a bug in kaio_complete_io_state() obvious:
if !(preq->req_rw & REQ_FUA), it must not matter what
ploop_req_delay_fua_possible() returns! I.e., eng_state==E_COMPLETE is
not sufficient ground for post_fsync=1 if no REQ_FUA set.

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/io_direct.c |2 +-
 drivers/block/ploop/io_kaio.c   |3 +--
 include/linux/ploop/ploop.h |   15 ++-
 3 files changed, 4 insertions(+), 16 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 3acae79..0907540 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -103,7 +103,7 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
postfua = 1;
 
-   if (!postfua && ploop_req_delay_fua_possible(rw, preq)) {
+   if (!postfua && ploop_req_delay_fua_possible(preq) && (rw & REQ_FUA)) {
 
/* Mark req that delayed flush required */
set_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index 81da1c5..69df456 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -84,8 +84,7 @@ static void kaio_complete_io_state(struct ploop_request * 
preq)
post_fsync = 1;
 
if (!post_fsync &&
-   !ploop_req_delay_fua_possible(preq->req_rw, preq) &&
-   (preq->req_rw & REQ_FUA))
+   !(ploop_req_delay_fua_possible(preq) && (preq->req_rw & REQ_FUA)))
post_fsync = 1;
 
preq->req_rw &= ~REQ_FUA;
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index 3441e7e..e1d8686 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -613,20 +613,9 @@ void ploop_preq_drop(struct ploop_device * plo, struct 
list_head *drop_list,
  int keep_locked);
 
 
-static inline int ploop_req_delay_fua_possible(unsigned long rw,
-   struct ploop_request *preq)
+static inline int ploop_req_delay_fua_possible(struct ploop_request *preq)
 {
-   int delay_fua = 0;
-
-   /* In case of eng_state != COMPLETE, we'll do FUA in
-* ploop_index_update(). Otherwise, we should post
-* fua.
-*/
-   if (rw & REQ_FUA) {
-   if (preq->eng_state != PLOOP_E_COMPLETE)
-   delay_fua = 1;
-   }
-   return delay_fua;
+   return preq->eng_state == PLOOP_E_DATA_WBI;
 }
 
 static inline void ploop_req_set_error(struct ploop_request * preq, int err)

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 4/9] ploop: minor rework of ->write_page() io method

2016-06-23 Thread Maxim Patlasov
From: Dmitry Monakhov 

No functional changes. Next patch will use this
rework to pass REQ_FLUSH to dio_write_page().

The patch is actually a part of Dima's patch:

> [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling v3

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/io_direct.c |5 ++---
 drivers/block/ploop/io_kaio.c   |8 +---
 drivers/block/ploop/map.c   |5 +++--
 include/linux/ploop/ploop.h |2 +-
 4 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 0907540..db82a61 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -1505,15 +1505,14 @@ dio_read_page(struct ploop_io * io, struct 
ploop_request * preq,
 
 static void
 dio_write_page(struct ploop_io * io, struct ploop_request * preq,
-  struct page * page, sector_t sec, int fua)
+  struct page * page, sector_t sec, unsigned long rw)
 {
if (!(io->files.file->f_mode & FMODE_WRITE)) {
PLOOP_FAIL_REQUEST(preq, -EBADF);
return;
}
 
-   dio_io_page(io, WRITE | (fua ? REQ_FUA : 0) | REQ_SYNC,
-   preq, page, sec);
+   dio_io_page(io, rw | WRITE | REQ_SYNC, preq, page, sec);
 }
 
 static int
diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index e4e4411..73edc5e 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -612,12 +612,14 @@ kaio_read_page(struct ploop_io * io, struct ploop_request 
* preq,
 
 static void
 kaio_write_page(struct ploop_io * io, struct ploop_request * preq,
-struct page * page, sector_t sec, int fua)
+struct page * page, sector_t sec, unsigned long rw)
 {
ploop_prepare_tracker(preq, sec);
 
-   /* No FUA in kaio, convert it to fsync */
-   if (fua)
+   /* No FUA in kaio, convert it to fsync. Don't care
+  about REQ_FLUSH: only io_direct relies on it,
+  io_kaio implements delay_fua in another way... */
+   if (rw & REQ_FUA)
set_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state);
 
kaio_io_page(io, IOCB_CMD_WRITE_ITER, preq, page, sec);
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 3ba8a22..ae6cc15 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -966,7 +966,7 @@ void ploop_index_update(struct ploop_request * preq)
set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
 
top_delta->io.ops->write_page(&top_delta->io, preq, page, sec,
- !!(preq->req_rw & REQ_FUA));
+ preq->req_rw & REQ_FUA);
put_page(page);
return;
 
@@ -1210,7 +1210,8 @@ static void map_wb_complete(struct map_node * m, int err)
if (force_fua)
set_bit(PLOOP_REQ_FORCE_FUA, &main_preq->state);
 
-   top_delta->io.ops->write_page(&top_delta->io, main_preq, page, sec, 
fua);
+   top_delta->io.ops->write_page(&top_delta->io, main_preq, page, sec,
+ fua ? REQ_FUA : 0);
put_page(page);
 }
 
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index e1d8686..3e53b35 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -164,7 +164,7 @@ struct ploop_io_ops
void(*read_page)(struct ploop_io * io, struct ploop_request * preq,
 struct page * page, sector_t sec);
void(*write_page)(struct ploop_io * io, struct ploop_request * preq,
- struct page * page, sector_t sec, int fua);
+ struct page * page, sector_t sec, unsigned long 
rw);
 
 
int (*sync_read)(struct ploop_io * io, struct page * page,

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 6/9] ploop: remove preflush from dio_submit

2016-06-23 Thread Maxim Patlasov
After commit c2247f3745 fixing barriers for ordinary
requests and the previous patch fixing delay_fua,
the legacy code in dio_submit that processes
(preq->req_rw & REQ_FLUSH) by setting REQ_FLUSH on
the first outgoing bio must die: it is incorrect
anyway (we don't wait for completion of the first
bio before sending the others).

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/io_direct.c |7 ---
 1 file changed, 7 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 1ea2008..ee3cd5c 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -89,15 +89,12 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
sector_t sec, nsec;
int err;
struct bio_list_walk bw;
-   int preflush;
int postfua = 0;
int write = !!(rw & REQ_WRITE);
int delayed_fua = 0;
 
trace_submit(preq);
 
-   preflush = !!(rw & REQ_FLUSH);
-
if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
postfua = 1;
 
@@ -236,10 +233,6 @@ flush_bio:
b->bi_private = preq;
b->bi_end_io = dio_endio_async;
 
-   if (unlikely(preflush)) {
-   rw2 |= REQ_FLUSH;
-   preflush = 0;
-   }
if (unlikely(postfua && !bl.head))
rw2 |= REQ_FUA;
 

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 5/9] ploop: resurrect delay_fua for io_direct

2016-06-23 Thread Maxim Patlasov
Recent commit c2247f3745, while fixing barriers for ordinary
requests, accidentally smashed the delay_fua optimization for
io_direct by:

> +   bio->bi_rw |= bw.cur->bi_rw & (REQ_FLUSH | REQ_FUA);

The idea is the following: if at least one incoming bio is marked as
FUA (it is actually equivalent to the (rw & REQ_FUA) check), and
eng_state == E_DATA_WBI, we can delay FUA until the index update and
implement it there by REQ_FLUSH.

It is not clear if this optimization provides any benefits, but since
we have lived with it this long, let's keep it for now.

The patch removes PLOOP_REQ_FORCE_FLUSH entirely because it's
easier to use the REQ_FLUSH bit in preq->req_rw instead.
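The mechanism in a nutshell, as a stand-alone sketch (the flag values
and names are invented, not the kernel's):

#define REQ_FLUSH	(1u << 0)
#define REQ_FUA		(1u << 1)
#define E_DATA_WBI	1

struct preq { unsigned req_rw; int eng_state; };

/* Returns 1 if FUA was delayed: data bios then go out plain, and
 * ploop_index_update() later sees the bits in preq->req_rw and does
 * the FLUSH together with the index write.  Returns 0 if the bios
 * must keep their own FUA bits. */
static int maybe_delay_fua(struct preq *p, unsigned rw)
{
	if ((rw & REQ_FUA) && p->eng_state == E_DATA_WBI) {
		p->req_rw |= REQ_FLUSH | REQ_FUA;
		return 1;
	}
	return 0;
}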

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/io_direct.c |   15 ++-
 drivers/block/ploop/map.c   |   25 ++---
 include/linux/ploop/ploop.h |1 -
 3 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index db82a61..1ea2008 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -92,23 +92,19 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
int preflush;
int postfua = 0;
int write = !!(rw & REQ_WRITE);
+   int delayed_fua = 0;
 
trace_submit(preq);
 
preflush = !!(rw & REQ_FLUSH);
 
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state))
-   preflush = 1;
-
if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
postfua = 1;
 
-   if (!postfua && ploop_req_delay_fua_possible(preq) && (rw & REQ_FUA)) {
-
+   if ((rw & REQ_FUA) && ploop_req_delay_fua_possible(preq)) {
/* Mark req that delayed flush required */
-   set_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
-   } else if (rw & REQ_FUA) {
-   postfua = 1;
+   preq->req_rw |= (REQ_FLUSH | REQ_FUA);
+   delayed_fua = 1;
}
 
rw &= ~(REQ_FLUSH | REQ_FUA);
@@ -222,7 +218,8 @@ flush_bio:
goto flush_bio;
}
 
-   bio->bi_rw |= bw.cur->bi_rw & (REQ_FLUSH | REQ_FUA);
+   bio->bi_rw |= bw.cur->bi_rw &
+   (REQ_FLUSH | (delayed_fua ? 0 : REQ_FUA));
bw.bv_off += copy;
size -= copy >> 9;
sec += copy >> 9;
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index ae6cc15..f87fb08 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -908,6 +908,7 @@ void ploop_index_update(struct ploop_request * preq)
int old_level;
struct page * page;
sector_t sec;
+   unsigned long rw;
 
/* No way back, we are going to initiate index write. */
 
@@ -965,8 +966,14 @@ void ploop_index_update(struct ploop_request * preq)
test_bit(PLOOP_REQ_RELOC_S, &preq->state))
set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
 
-   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec,
- preq->req_rw & REQ_FUA);
+   rw = (preq->req_rw & (REQ_FUA | REQ_FLUSH));
+
+   /* We've just set REQ_FLUSH in rw, ->write_page() below
+  will do the FLUSH */
+   preq->req_rw &= ~REQ_FLUSH;
+
+   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, rw);
+
put_page(page);
return;
 
@@ -1085,7 +1092,8 @@ static void map_wb_complete(struct map_node * m, int err)
int delayed = 0;
unsigned int idx;
sector_t sec;
-   int fua, force_fua;
+   int force_fua;
+   unsigned long rw;
 
/* First, complete processing of written back indices,
 * finally instantiate indices in mapping cache.
@@ -1155,7 +1163,7 @@ static void map_wb_complete(struct map_node * m, int err)
copy_index_for_wb(page, m, top_delta->level);
 
main_preq = NULL;
-   fua = 0;
+   rw = 0;
force_fua = 0;
 
list_for_each_safe(cursor, tmp, &m->io_queue) {
@@ -1175,8 +1183,11 @@ static void map_wb_complete(struct map_node * m, int err)
break;
}
 
-   if (preq->req_rw & REQ_FUA)
-   fua = 1;
+   rw |= (preq->req_rw & (REQ_FLUSH | REQ_FUA));
+
+   /* We've just set REQ_FLUSH in rw, ->write_page() below
+  will do the FLUSH */
+   preq->req_rw &= ~REQ_FLUSH;
 
if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
test_bit(PLOOP_

[Devel] [PATCH rh7 8/9] ploop: fix barriers for PLOOP_E_RELOC_NULLIFY

2016-06-23 Thread Maxim Patlasov
The last step of processing a RELOC_A request is
nullifying the BAT block. We smartly noticed that a flush
is needed after that, but fsync is not enough:

>   /*
>* Lately we think we does sync of nullified blocks at format
>* driver by image fsync before header update.
>* But we write this data directly into underlying device
>* bypassing EXT4 by usage of extent map tree
>* (see dio_submit()). So fsync of EXT4 image doesnt help us.
>* We need to force sync of nullified blocks.
>*/
>   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>   top_delta->io.ops->submit(&top_delta->io, preq, preq->req_rw,
> &sbl, preq->iblock, 1 << plo->cluster_log);

Unfortunately, the way we handle FORCE_FUA in dio_submit
(sending the last bio with the REQ_FUA bit set) is not safe: firstly because
we decided that ploop shouldn't strongly rely on the assumption of
equivalence of REQ_FUA and post-FLUSH; and secondly because dio_submit
cannot ensure that the last bio marked as REQ_FUA won't be actually
processed before the others.

To fix this problem the patch makes an explicit ->issue_flush to flush
the nullified block.
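Schematically (a stand-alone model; all helpers are invented stand-ins
for the real submission/flush primitives):

#include <stdio.h>

static void submit_nullify_bios(void)	{ printf("write zeroes\n"); }
static void wait_all_bios(void)		{ printf("wait for completions\n"); }
static void issue_flush(void)		{ printf("empty FLUSH bio\n"); }

int main(void)
{
	submit_nullify_bios();	/* plain writes, no FUA tagging */
	wait_all_bios();	/* nothing can be reordered past this */
	issue_flush();		/* the actual barrier */
	return 0;
}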

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c   |   11 ++-
 drivers/block/ploop/io_direct.c |3 ++-
 drivers/block/ploop/map.c   |4 +++-
 include/linux/ploop/ploop.h |1 +
 4 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 557ddba..2b60dfa 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1305,6 +1305,8 @@ static void ploop_complete_request(struct ploop_request * 
preq)
}
preq->bl.tail = NULL;
 
+   WARN_ON(!preq->error && test_bit(PLOOP_REQ_ISSUE_FLUSH, &preq->state));
+
if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
if (preq->error)
@@ -2429,6 +2431,13 @@ static void ploop_req_state_process(struct ploop_request 
* preq)
preq->eng_io = NULL;
}
 
+   if (test_bit(PLOOP_REQ_ISSUE_FLUSH, &preq->state)) {
+   preq->eng_io->ops->issue_flush(preq->eng_io, preq);
+   clear_bit(PLOOP_REQ_ISSUE_FLUSH, &preq->state);
+   preq->eng_io = NULL;
+   goto out;
+   }
+
 restart:
BUG_ON(test_bit(PLOOP_REQ_POST_SUBMIT, &preq->state));
__TRACE("ST %p %u %lu\n", preq, preq->req_cluster, preq->eng_state);
@@ -2705,7 +2714,7 @@ restart:
default:
BUG();
}
-
+out:
if (release_ioc) {
struct io_context * ioc = current->io_context;
current->io_context = saved_ioc;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 94936c7..c4d0f63 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -413,6 +413,7 @@ try_again:
 
preq->iblock = iblk;
preq->eng_io = io;
+   BUG_ON(test_bit(PLOOP_REQ_ISSUE_FLUSH, &preq->state));
set_bit(PLOOP_REQ_POST_SUBMIT, &preq->state);
dio_submit_pad(io, preq, sbl, size, em);
err = 0;
@@ -1819,7 +1820,7 @@ static void dio_issue_flush(struct ploop_io * io, struct 
ploop_request *preq)
 
atomic_inc(&preq->io_count);
ploop_acc_ff_out(io->plo, preq->req_rw | bio->bi_rw);
-   submit_bio(preq->req_rw, bio);
+   submit_bio(WRITE_FLUSH, bio);
ploop_complete_io_request(preq);
 }
 
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index f87fb08..915a216 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -1077,7 +1077,9 @@ static void map_wb_complete_post_process(struct ploop_map 
*map,
 * (see dio_submit()). So fsync of EXT4 image doesnt help us.
 * We need to force sync of nullified blocks.
 */
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
+   preq->eng_io = &top_delta->io;
+   BUG_ON(test_bit(PLOOP_REQ_POST_SUBMIT, &preq->state));
+   set_bit(PLOOP_REQ_ISSUE_FLUSH, &preq->state);
top_delta->io.ops->submit(&top_delta->io, preq, preq->req_rw,
				  &sbl, preq->iblock, 1 << plo->cluster_log);
 }
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index af222f1..920daf7 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -479,6 +479,7 @@ enum
PLOOP_REQ_POST_SUBMIT, /* preq needs post_submit processing */
PLOOP_REQ_PUSH_BACKUP, /* preq was ACKed by userspace push_backup */
	PLOOP_REQ_FSYNC_DONE,  /* fsync_thread() performed f_op->fsync() */

[Devel] [PATCH rh7 9/9] ploop: fixup barrier handling during relocation

2016-06-23 Thread Maxim Patlasov
Rebase Dima's patch on top of rh7-3.10.0-327.18.2.vz7.14.19,
but without help of delayed_flush engine:

To ensure consistency on crash/power outage/hard reboot
events, ploop must implement the following barrier logic
for RELOC_A|S requests:

1) After we store data to the new place, but before updating
the BAT on disk, we have to FLUSH everything (in fact, flushing
just those data would be enough, but it is simpler to flush
everything).

2) We should not proceed handling RELOC_A|S until we are
100% sure the new BAT value went to disk platters. So far as
the new BAT is only one page, it's OK to mark the corresponding
bio with the FUA flag for the io_direct case. For io_kaio, not
having a FUA API, we have to post_fsync the BAT update.

PLOOP_REQ_FORCE_FLUSH/PLOOP_REQ_FORCE_FUA, introduced
a long time ago, were probably intended to ensure the
logic above, but they actually didn't.

The patch removes PLOOP_REQ_FORCE_FLUSH/PLOOP_REQ_FORCE_FUA,
and implements barriers in a straightforward and simple way:
check for RELOC_A|S explicitly and make FLUSH/FUA where
needed.
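Put together as a sketch (helper names invented; only the ordering
matters), a RELOC_A request then goes through:

static void read_old_pos(void)		{ /* R1 */ }
static void write_new_pos(void)		{ /* W2: plain writes */ }
static void wait_and_flush(void)	{ }
static void write_bat_fua(void)		{ /* fsync instead for io_kaio */ }
static void nullify_old_pos(void)	{ }

static void reloc_one_cluster(void)
{
	read_old_pos();
	write_new_pos();
	wait_and_flush();	/* (1) data durable before BAT update */
	write_bat_fua();	/* (2) BAT page hits the platters */
	nullify_old_pos();	/* RELOC_A only */
	wait_and_flush();	/* nullified block durable too */
}

int main(void)
{
	reloc_one_cluster();
	return 0;
}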

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c   |4 ++--
 drivers/block/ploop/io_direct.c |7 ---
 drivers/block/ploop/io_kaio.c   |8 +---
 drivers/block/ploop/map.c   |   22 ++
 include/linux/ploop/ploop.h |1 -
 5 files changed, 17 insertions(+), 25 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 2b60dfa..40768b6 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2610,8 +2610,8 @@ restart:
top_delta = ploop_top_delta(plo);
sbl.head = sbl.tail = preq->aux_bio;
 
-   /* Relocated data write required sync before BAT updatee */
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
+   /* Relocated data write required sync before BAT update
+* this will happen inside index_update */
 
if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index c4d0f63..266f041 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -89,15 +89,11 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
sector_t sec, nsec;
int err;
struct bio_list_walk bw;
-   int postfua = 0;
int write = !!(rw & REQ_WRITE);
int delayed_fua = 0;
 
trace_submit(preq);
 
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
-   postfua = 1;
-
if ((rw & REQ_FUA) && ploop_req_delay_fua_possible(preq)) {
/* Mark req that delayed flush required */
preq->req_rw |= (REQ_FLUSH | REQ_FUA);
@@ -233,9 +229,6 @@ flush_bio:
b->bi_private = preq;
b->bi_end_io = dio_endio_async;
 
-   if (unlikely(postfua && !bl.head))
-   rw2 |= REQ_FUA;
-
ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
submit_bio(rw2, b);
}
diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index ed550f4..85863df 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -69,6 +69,8 @@ static void kaio_complete_io_state(struct ploop_request * 
preq)
unsigned long flags;
int post_fsync = 0;
int need_fua = !!(preq->req_rw & REQ_FUA);
+   unsigned long state = READ_ONCE(preq->state);
+   int reloc = !!(state & (PLOOP_REQ_RELOC_A_FL|PLOOP_REQ_RELOC_S_FL));
 
if (preq->error || !(preq->req_rw & REQ_FUA) ||
preq->eng_state == PLOOP_E_INDEX_READ ||
@@ -80,9 +82,9 @@ static void kaio_complete_io_state(struct ploop_request * 
preq)
}
 
/* Convert requested fua to fsync */
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
-   test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state) ||
-   (need_fua && !ploop_req_delay_fua_possible(preq))) {
+   if (test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state) ||
+   (need_fua && !ploop_req_delay_fua_possible(preq)) ||
+   (reloc && ploop_req_delay_fua_possible(preq))) {
post_fsync = 1;
preq->req_rw &= ~REQ_FUA;
}
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 915a216..1883674 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -909,6 +909,7 @@ void ploop_index_update(struct ploop_request * preq)
struct page * page;
sector_t sec;
unsigned long rw;
+   unsigned long state = READ_ONCE(preq->state);
 
/* No way back, we are going to initiate index write. */
 
@@ -961,10 +962,6 @@ void ploop_index_update

[Devel] [PATCH rh7 0/9] ploop: fix barriers for reloc requests

2016-06-23 Thread Maxim Patlasov
The series firstly fixes a few issues in handling
barriers in ordinary requests (what was overlooked
in the previous patch -- see commit c2247f3745).

Then there are a few minor reworks w/o functional
changes that ease the main patches (the last two).

And finally the series fixes handling barriers
for RELOC_A|S requests.

The main complexity comes from the following bug:
for direct_io it's not enough to send FUA to flush
the whole nullified cluster block. See details in
the "fix barriers for PLOOP_E_RELOC_NULLIFY" patch.

---

Dmitry Monakhov (3):
  ploop: deadcode cleanup
  ploop: minor rework of ->write_page() io method
  ploop: generalize issue_flush

Maxim Patlasov (6):
  ploop: minor rework of ploop_req_delay_fua_possible
  ploop: resurrect delayed_fua for io_kaio
  ploop: resurrect delay_fua for io_direct
  ploop: remove preflush from dio_submit
  ploop: fix barriers for PLOOP_E_RELOC_NULLIFY
  ploop: fixup barrier handling during relocation


 drivers/block/ploop/dev.c   |   16 ++--
 drivers/block/ploop/io_direct.c |   48 -
 drivers/block/ploop/io_kaio.c   |   26 ++--
 drivers/block/ploop/map.c   |   50 ---
 include/linux/ploop/ploop.h |   20 +++-
 5 files changed, 71 insertions(+), 89 deletions(-)

--
Signature
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 7/9] ploop: generalize issue_flush

2016-06-23 Thread Maxim Patlasov
From: Dmitry Monakhov 

Rebase Dima's patch to rh7-3.10.0-327.18.2.vz7.14.19:

Currently io->ops->issue_flush is called only from a single place,
but it has the potential to be generic. The patch does not change the actual
logic, but allows calling ->issue_flush from various places.

Signed-off-by: Dmitry Monakhov 
Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c   |1 +
 drivers/block/ploop/io_direct.c |1 -
 drivers/block/ploop/io_kaio.c   |1 -
 3 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 6b5702f..557ddba 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2063,6 +2063,7 @@ ploop_entry_request(struct ploop_request * preq)
if (preq->req_size == 0) {
if (preq->req_rw & REQ_FLUSH &&
!test_bit(PLOOP_REQ_FSYNC_DONE, &preq->state)) {
+   preq->eng_state = PLOOP_E_COMPLETE;
if (top_io->ops->issue_flush) {
top_io->ops->issue_flush(top_io, preq);
return;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index ee3cd5c..94936c7 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -1818,7 +1818,6 @@ static void dio_issue_flush(struct ploop_io * io, struct 
ploop_request *preq)
bio->bi_private = preq;
 
atomic_inc(&preq->io_count);
-   preq->eng_state = PLOOP_E_COMPLETE;
ploop_acc_ff_out(io->plo, preq->req_rw | bio->bi_rw);
submit_bio(preq->req_rw, bio);
ploop_complete_io_request(preq);
diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index 73edc5e..ed550f4 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -957,7 +957,6 @@ static void kaio_issue_flush(struct ploop_io * io, struct 
ploop_request *preq)
 {
struct ploop_delta *delta = container_of(io, struct ploop_delta, io);
 
-   preq->eng_state = PLOOP_E_COMPLETE;
preq->req_rw &= ~REQ_FLUSH;
 
spin_lock_irq(&io->plo->lock);

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7 6/9] ploop: remove preflush from dio_submit

2016-06-24 Thread Maxim Patlasov

On 06/24/2016 07:42 AM, Dmitry Monakhov wrote:

Maxim Patlasov  writes:


After commit c2247f3745 fixing barriers for ordinary
requests and previous patch fixing delay_fua,
that legacy code in dio_submit processing
(preq->req_rw & REQ_FLUSH) by setting REQ_FLUSH in
the first outgoing bio must die: it is incorrect
anyway (we don't wait for completion of the first
bio before sending others).

Wow. This is so true. BTW: a reasonable way to handle FLUSH
is to queue such a preq to a preflush_queue, similar to fsync_queue in the
fsync_thread infrastructure.


This would add another WAIT. Sometimes it may be beneficial (many 
incoming bio-s marked as REQ_FLUSH), sometimes not (only one bio with 
REQ_FLUSH -- so we'd mark only one of the outgoing bio-s with REQ_FLUSH). 
Who knows how often we have more than one REQ_FLUSH in the queue...






Signed-off-by: Maxim Patlasov 
---
  drivers/block/ploop/io_direct.c |7 ---
  1 file changed, 7 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 1ea2008..ee3cd5c 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -89,15 +89,12 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
sector_t sec, nsec;
int err;
struct bio_list_walk bw;
-   int preflush;
int postfua = 0;
int write = !!(rw & REQ_WRITE);
int delayed_fua = 0;
  
  	trace_submit(preq);
  
-	preflush = !!(rw & REQ_FLUSH);

-
if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
postfua = 1;
  
@@ -236,10 +233,6 @@ flush_bio:

b->bi_private = preq;
b->bi_end_io = dio_endio_async;
  
-		if (unlikely(preflush)) {

-   rw2 |= REQ_FLUSH;
-   preflush = 0;
-   }
if (unlikely(postfua && !bl.head))
rw2 |= REQ_FUA;
  


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default

2016-06-28 Thread Maxim Patlasov
Overlayfs is in "TECH PREVIEW" state right now. Letting CT users freely
mount and exercise overlayfs, we risk having the whole node crashed.

Let's disable it for CT users by default. Customers who need it (e.g. to
run Docker in CT) may enable it like this:

# echo 1 > /proc/sys/fs/experimental_fs_enable

The patch is a temporary (awkward) workaround until we make overlayfs
production-ready. Then we'll roll back the patch.

https://jira.sw.ru/browse/PSBM-47981

Signed-off-by: Maxim Patlasov 
---
 fs/filesystems.c |7 ++-
 fs/overlayfs/super.c |2 +-
 include/linux/fs.h   |2 ++
 include/linux/ve.h   |1 +
 kernel/sysctl.c  |7 +++
 kernel/ve/ve.c   |1 +
 6 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index beaba56..38fe4e0 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include <linux/ve.h>
 
 /*
  * Handling of filesystem drivers list.
@@ -219,7 +220,11 @@ int __init get_filesystem_list(char *buf)
 
 static inline bool filesystem_permitted(const struct file_system_type *fs)
 {
-   return ve_is_super(get_exec_env()) || (fs->fs_flags & FS_VIRTUALIZED);
+   return ve_is_super(get_exec_env()) ||
+   (fs->fs_flags & FS_VIRTUALIZED) ||
+   ((fs->fs_flags & FS_EXPERIMENTAL) &&
+get_exec_env()->experimental_fs_enable &&
+get_ve0()->experimental_fs_enable);
 }
 
 #ifdef CONFIG_PROC_FS
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index c20cfe9..d5c57b4 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -1129,7 +1129,7 @@ static struct file_system_type ovl_fs_type = {
.name   = "overlay",
.mount  = ovl_mount,
.kill_sb= kill_anon_super,
-   .fs_flags   = FS_VIRTUALIZED,
+   .fs_flags   = FS_EXPERIMENTAL,
 };
 MODULE_ALIAS_FS("overlay");
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7203dba..6c91e4b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2108,6 +2108,8 @@ struct file_system_type {
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16	/* A userns mount does not imply MNT_NODEV */
 #define FS_VIRTUALIZED		64	/* Can mount this fstype inside ve */
+#define FS_EXPERIMENTAL		128	/* Ability to mount this fstype inside ve
+					 * is governed by experimental_fs_enable */
 #define FS_HAS_RM_XQUOTA	256	/* KABI: fs has the rm_xquota quota op */
 #define FS_HAS_INVALIDATE_RANGE	512	/* FS has new ->invalidatepage with length arg */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 247cadb..1fc6eb5 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -101,6 +101,7 @@ struct ve_struct {
 
int odirect_enable;
int fsync_enable;
+   int experimental_fs_enable;
 
u64 _uevent_seqnum;
struct nsproxy __rcu*ve_ns;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c8f7bc3..c1c410f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1774,6 +1774,13 @@ static struct ctl_table fs_table[] = {
.proc_handler   = proc_dointvec_virtual,
},
{
+   .procname   = "experimental_fs_enable",
+   .data   = &ve0.experimental_fs_enable,
+   .maxlen = sizeof(int),
+   .mode   = 0644 | S_ISVTX,
+   .proc_handler   = proc_dointvec_virtual,
+   },
+   {
.procname   = "pipe-max-size",
.data   = &pipe_max_size,
.maxlen = sizeof(int),
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index d196e3e..0a2892f 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -643,6 +643,7 @@ static struct cgroup_subsys_state *ve_create(struct cgroup 
*cg)
 
ve->odirect_enable = 2;
ve->fsync_enable = 2;
+   ve->experimental_fs_enable = 2;
 
 #ifdef CONFIG_VE_IPTABLES
ve->ipt_mask = ve_setup_iptables_mask(VE_IP_DEFAULT);

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [RH7 PATCH] ploop: reloc vs extent_conversion race fix

2016-07-01 Thread Maxim Patlasov

Acked-by: Maxim Patlasov 

On 06/30/2016 06:34 PM, Dmitry Monakhov wrote:

We have fixed most relocation bugs while working on 
https://jira.sw.ru/browse/PSBM-47107

Currently reloc_a looks as follows:

  1->read_data_from_old_pos
  2->write_to_new_pos
 ->submit_alloc
   ->submit_pad
   ->post_submit->convert_unwritten
  3->update_index ->write_page with FLUSH|FUA
  4->nullify_old_pos
  5->issue_flush

But at step 3 the extent conversion is not yet stable because it belongs to an
uncommitted transaction. We MUST call ->fsync inside ->post_submit as we do for
REQ_FUA requests. Let's tag relocation requests as FUA from the very beginning
in order to assert sync semantics.

https://jira.sw.ru/browse/PSBM-49143
Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/dev.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 40768b6..e5f010b 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device * plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = 0;
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC;
+   preq->req_rw = WRITE_SYNC|REQ_FUA;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
preq->error = 0;
@@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct ploop_device 
*plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = ~0U; /* uninitialized */
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC;
+   preq->req_rw = WRITE_SYNC|REQ_FUA;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
preq->error = 0;


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default

2016-07-04 Thread Maxim Patlasov

On 07/04/2016 08:53 AM, Vladimir Davydov wrote:


On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote:
...

@@ -643,6 +643,7 @@ static struct cgroup_subsys_state *ve_create(struct cgroup 
*cg)
  
  	ve->odirect_enable = 2;

ve->fsync_enable = 2;
+   ve->experimental_fs_enable = 2;

For odirect_enable and fsync_enable, 2 means follow the host's config, 1
means enable unconditionally, and 0 means disable unconditionally. But
we don't want to allow a user inside a CT to enable this feature, right?


I thought it's OK to allow user inside CT to enable it if host sysadmin 
is OK about it. The same logic as for odirect: by default 
ve0->experimental_fs_enable = 0, so whatever user inside CT writes to 
this knob, the feature is disabled. If sysadmin writes '1' to ve0->..., 
the feature becomes enabled. If an user wants to voluntarily disable it 
inside CT, that's OK too.



This is confusing. May be, we'd better add a new VE_FEATURE for the
purpose?


Not sure right now. I'll look at it and let you know later.


Thanks,
Maxim



  
  #ifdef CONFIG_VE_IPTABLES

ve->ipt_mask = ve_setup_iptables_mask(VE_IP_DEFAULT);



___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests

2016-07-05 Thread Maxim Patlasov
Commit 9f860e606 introduced an engine to delay fsync: after doing
fallocate(FALLOC_FL_CONVERT_UNWRITTEN), dio_post_submit marks the
io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens
later, when an incoming FLUSH|FUA comes.

That was deemed important because (PSBM-47026):

> This optimization becomes more important due to the fact that customers tend 
> to use pcompact heavily => ploop images grow each day.

Now, we can easily re-use the engine to delay fsync for reloc
requests as well. As explained in the description of commit
5aa3fe09:

> 1->read_data_from_old_pos
> 2->write_to_new_pos
>   ->submit_alloc
>  ->submit_pad
>  ->post_submit->convert_unwritten
> 3->update_index ->write_page with FLUSH|FUA
> 4->nullify_old_pos
>5->issue_flush

by the time of step 3 the extent conversion is not yet stable because it
belongs to an uncommitted transaction. But instead of doing fsync
inside ->post_submit, we can fsync later, as the very first step
of write_page for index_update.

https://jira.sw.ru/browse/PSBM-47026

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c   |4 ++--
 drivers/block/ploop/io_direct.c |   25 -
 drivers/block/ploop/io_kaio.c   |3 ++-
 drivers/block/ploop/map.c   |   17 -
 include/linux/ploop/ploop.h |3 ++-
 5 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index e5f010b..40768b6 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device * plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = 0;
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC|REQ_FUA;
+   preq->req_rw = WRITE_SYNC;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
preq->error = 0;
@@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct ploop_device 
*plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = ~0U; /* uninitialized */
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC|REQ_FUA;
+   preq->req_rw = WRITE_SYNC;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
preq->error = 0;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 1086850..0a5fb15 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -1494,13 +1494,36 @@ dio_read_page(struct ploop_io * io, struct 
ploop_request * preq,
 
 static void
 dio_write_page(struct ploop_io * io, struct ploop_request * preq,
-  struct page * page, sector_t sec, unsigned long rw)
+  struct page * page, sector_t sec, unsigned long rw,
+  int do_fsync_if_delayed)
 {
if (!(io->files.file->f_mode & FMODE_WRITE)) {
PLOOP_FAIL_REQUEST(preq, -EBADF);
return;
}
 
+   if (do_fsync_if_delayed &&
+   test_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state)) {
+   struct ploop_device * plo = io->plo;
+   u64 io_count;
+   int err;
+
+   spin_lock_irq(&plo->lock);
+   io_count = io->io_count;
+   spin_unlock_irq(&plo->lock);
+
+   err = io->ops->sync(io);
+   if (err) {
+   PLOOP_FAIL_REQUEST(preq, -EBADF);
+   return;
+   }
+
+   spin_lock_irq(&plo->lock);
+   if (io_count == io->io_count && !(io_count & 1))
+   clear_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
+   spin_unlock_irq(&plo->lock);
+   }
+
dio_io_page(io, rw | WRITE | REQ_SYNC, preq, page, sec);
 }
 
diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index 85863df..0d731ef 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -614,7 +614,8 @@ kaio_read_page(struct ploop_io * io, struct ploop_request * 
preq,
 
 static void
 kaio_write_page(struct ploop_io * io, struct ploop_request * preq,
-struct page * page, sector_t sec, unsigned long rw)
+   struct page * page, sector_t sec, unsigned long rw,
+   int do_fsync_if_delayed)
 {
ploop_prepare_tracker(preq, sec);
 
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 1883674..96e428b 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -910,6 +910,7 @@ void ploop_index_update(struct ploop_request * preq)
   

Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default

2016-07-05 Thread Maxim Patlasov

Vova,


On 07/04/2016 11:03 AM, Maxim Patlasov wrote:

On 07/04/2016 08:53 AM, Vladimir Davydov wrote:


On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote:
...
@@ -643,6 +643,7 @@ static struct cgroup_subsys_state 
*ve_create(struct cgroup *cg)

ve->odirect_enable = 2;
  ve->fsync_enable = 2;
+ve->experimental_fs_enable = 2;

For odirect_enable and fsync_enable, 2 means follow the host's config, 1
means enable unconditionally, and 0 means disable unconditionally. But
we don't want to allow a user inside a CT to enable this feature, right?


I thought it's OK to allow user inside CT to enable it if host 
sysadmin is OK about it. The same logic as for odirect: by default 
ve0->experimental_fs_enable = 0, so whatever user inside CT writes to 
this knob, the feature is disabled. If sysadmin writes '1' to 
ve0->..., the feature becomes enabled. If an user wants to voluntarily 
disable it inside CT, that's OK too.



This is confusing. May be, we'd better add a new VE_FEATURE for the
purpose?


Not sure right now. I'll look at it and let you know later.


Technically, it is very easy to implement a new VE_FEATURE for overlayfs. 
But this approach is less flexible because we return EPERM from 
ve_write_64 if the CT is running, and we'd need to involve the userspace team 
to make the feature configurable and (possibly) persistent. Do you think 
it's worth it for something we'll get rid of soon anyway (I mean, as soon 
as PSBM-47981 is resolved)?


Thanks,
Maxim
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] overlayfs: verify upper dentry before unlink and rename

2016-07-05 Thread Maxim Patlasov
Without this patch it is easy to crash node by fiddling
with overlayfs dirs. Backport commit 11f37104 from ms:

From: Miklos Szeredi 

ovl: verify upper dentry before unlink and rename

Unlink and rename in overlayfs checked the upper dentry for staleness by
verifying upper->d_parent against upperdir.  However the dentry can go
stale also by being unhashed, for example.

Expand the verification to actually look up the name again (under parent
lock) and check if it matches the upper dentry.  This matches what the VFS
does before passing the dentry to the filesystem's unlink/rename methods, which
excludes any inconsistency caused by overlayfs.

Signed-off-by: Miklos Szeredi 

https://jira.sw.ru/browse/PSBM-47981
---
 fs/overlayfs/dir.c |   59 +---
 1 file changed, 38 insertions(+), 21 deletions(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 33c4771..229b9e4 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -596,21 +596,25 @@ static int ovl_remove_upper(struct dentry *dentry, bool 
is_dir)
 {
struct dentry *upperdir = ovl_dentry_upper(dentry->d_parent);
struct inode *dir = upperdir->d_inode;
-   struct dentry *upper = ovl_dentry_upper(dentry);
+   struct dentry *upper;
int err;
 
mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
+   upper = lookup_one_len(dentry->d_name.name, upperdir,
+  dentry->d_name.len);
+   err = PTR_ERR(upper);
+   if (IS_ERR(upper))
+   goto out_unlock;
+
err = -ESTALE;
-   if (upper->d_parent == upperdir) {
-   /* Don't let d_delete() think it can reset d_inode */
-   dget(upper);
+   if (upper == ovl_dentry_upper(dentry)) {
if (is_dir)
err = vfs_rmdir(dir, upper);
else
err = vfs_unlink(dir, upper, NULL);
-   dput(upper);
ovl_dentry_version_inc(dentry->d_parent);
}
+   dput(upper);
 
/*
 * Keeping this dentry hashed would mean having to release
@@ -619,6 +623,7 @@ static int ovl_remove_upper(struct dentry *dentry, bool 
is_dir)
 * now.
 */
d_drop(dentry);
+out_unlock:
mutex_unlock(&dir->i_mutex);
 
return err;
@@ -839,29 +844,39 @@ static int ovl_rename2(struct inode *olddir, struct 
dentry *old,
 
trap = lock_rename(new_upperdir, old_upperdir);
 
-   olddentry = ovl_dentry_upper(old);
-   newdentry = ovl_dentry_upper(new);
-   if (newdentry) {
+
+   olddentry = lookup_one_len(old->d_name.name, old_upperdir,
+  old->d_name.len);
+   err = PTR_ERR(olddentry);
+   if (IS_ERR(olddentry))
+   goto out_unlock;
+
+   err = -ESTALE;
+   if (olddentry != ovl_dentry_upper(old))
+   goto out_dput_old;
+
+   newdentry = lookup_one_len(new->d_name.name, new_upperdir,
+  new->d_name.len);
+   err = PTR_ERR(newdentry);
+   if (IS_ERR(newdentry))
+   goto out_dput_old;
+
+   err = -ESTALE;
+   if (ovl_dentry_upper(new)) {
if (opaquedir) {
-   newdentry = opaquedir;
-   opaquedir = NULL;
+   if (newdentry != opaquedir)
+   goto out_dput;
} else {
-   dget(newdentry);
+   if (newdentry != ovl_dentry_upper(new))
+   goto out_dput;
}
} else {
new_create = true;
-   newdentry = lookup_one_len(new->d_name.name, new_upperdir,
-  new->d_name.len);
-   err = PTR_ERR(newdentry);
-   if (IS_ERR(newdentry))
-   goto out_unlock;
+   if (!d_is_negative(newdentry) &&
+   (!new_opaque || !ovl_is_whiteout(newdentry)))
+   goto out_dput;
}
 
-   err = -ESTALE;
-   if (olddentry->d_parent != old_upperdir)
-   goto out_dput;
-   if (newdentry->d_parent != new_upperdir)
-   goto out_dput;
if (olddentry == trap)
goto out_dput;
if (newdentry == trap)
@@ -917,6 +932,8 @@ static int ovl_rename2(struct inode *olddir, struct dentry *old,
 
 out_dput:
dput(newdentry);
+out_dput_old:
+   dput(olddentry);
 out_unlock:
unlock_rename(new_upperdir, old_upperdir);
 out_revert_creds:



Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default

2016-07-06 Thread Maxim Patlasov

On 07/06/2016 02:26 AM, Vladimir Davydov wrote:


On Tue, Jul 05, 2016 at 04:45:10PM -0700, Maxim Patlasov wrote:

Vova,


On 07/04/2016 11:03 AM, Maxim Patlasov wrote:

On 07/04/2016 08:53 AM, Vladimir Davydov wrote:


On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote:
...

@@ -643,6 +643,7 @@ static struct cgroup_subsys_state
*ve_create(struct cgroup *cg)
ve->odirect_enable = 2;
  ve->fsync_enable = 2;
+ve->experimental_fs_enable = 2;

For odirect_enable and fsync_enable, 2 means follow the host's config, 1
means enable unconditionally, and 0 means disable unconditionally. But
we don't want to allow a user inside a CT to enable this feature, right?

I thought it's OK to allow user inside CT to enable it if host sysadmin is
OK about it. The same logic as for odirect: by default
ve0->experimental_fs_enable = 0, so whatever user inside CT writes to this
knob, the feature is disabled. If sysadmin writes '1' to ve0->..., the
feature becomes enabled. If an user wants to voluntarily disable it inside
CT, that's OK too.


This is confusing. May be, we'd better add a new VE_FEATURE for the
purpose?

Not sure right now. I'll look at it and let you know later.

Technically, it is very easy to implement new VE_FEATURE for overlayfs. But
this approach is less flexible because we return EPERM from ve_write_64 if
CT is running, and we'll need to involve userspace team to make the feature
configurable and (possibly) persistent. Do you think it's worthy for
something we'll get rid of soon anyway (I mean as soon as PSBM-47981
resolved)?

Fair enough, not much point in introducing yet another feature for the
purpose, at least right now, sysctl should do for the beginning.

Come to think of it, do we really need this sysctl inside containers? I
mean, by enabling this sysctl on the host we open a possible system-wide
security hole, which a CT admin won't be able to mitigate by disabling
overlayfs inside her CT. So why would she need it for? To prevent
non-privileged CT users from mounting overlayfs inside a user ns? But
overlayfs is not permitted to be mounted by a userns root anyway AFAICS.
May be, just drop in-CT sysctl then?


Currently, anyone who can log in to a CT as root may mount overlayfs and
then try to exploit its weak spots. This is a problem.


Until we ensure that overlayfs is production-ready (at least does not 
have obvious breaches), let's disable it by default (of course, if ve != 
ve0). Those who want to play with overlayfs at their own risk will 
enable it by turning on some knob on host system (ve == ve0).


I don't think that mixing trusted (overlayfs-enabled) CTs and untrusted
(overlayfs-disabled) CTs on the same physical node is an important
use-case for now. So, any simple system-wide knob should do.
Essentially, the same scheme as with odirect: by default it is '0' in ve0
and the root inside a CT cannot turn it on; and if it is manually set to
'1' in ve0, the behavior will depend on what the per-CT root wants.


Thanks,
Maxim


Re: [Devel] [PATCH rh7] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests

2016-07-06 Thread Maxim Patlasov

Dima,

On 07/06/2016 04:58 AM, Dmitry Monakhov wrote:


Maxim Patlasov  writes:


Commit 9f860e606 introduced an engine to delay fsync: doing
fallocate(FALLOC_FL_CONVERT_UNWRITTEN) dio_post_submit marks
io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens
later, when incoming FLUSH|FUA comes.

That was deemed as important because (PSBM-47026):


This optimization becomes more important due to the fact that customers tend to 
use pcompact heavily => ploop images grow each day.

Now, we can easily re-use the engine to delay fsync for reloc
requests as well. As explained in the description of commit
5aa3fe09:


 1->read_data_from_old_pos
 2->write_to_new_pos
   ->submit_alloc
  ->submit_pad
  ->post_submit->convert_unwritten
 3->update_index ->write_page with FLUSH|FUA
 4->nullify_old_pos
5->issue_flush

by the time of step 3 the extent conversion is not yet stable because it
belongs to an uncommitted transaction. But instead of doing fsync
inside ->post_submit, we can fsync later, as the very first step
of write_page for index_update.

NAK from me. What is the advantage of this patch?


The advantage is the following: in case of BAT multi-updates, instead of
doing many fsync-s (one per dio_post_submit), we'll do only one (when the
final ->write_page is called).



Does it make the code more optimal? No


Yes, it does. In the same sense as 9f860e606: saving some fsync-s.


Does it make main ploop more asynchronous? No.


Correct, the patch optimizes ploop in a different way. It's not about 
making ploop more asynchronous.





If you want to make an optimization then it is reasonable to
queue a preq with PLOOP_IO_FSYNC_DELAYED to top_io->fsync_queue
before processing the PLOOP_E_DATA_WBI state for a preq with FUA.
So the sequence will look as follows:
->submit_alloc
   ->submit_pad
   ->post_submit->convert_unwritten-> tag PLOOP_IO_FSYNC_DELAYED
->ploop_req_state_process
   case PLOOP_E_DATA_WBI:
   if (preq->state & PLOOP_IO_FSYNC_DELAYED_FL) {
   preq->state &= ~PLOOP_IO_FSYNC_DELAYED_FL
   list_add_tail(&preq->list, &top_io->fsync_queue)
   return;
}
##Let fsync_thread do its work
->ploop_req_state_process
case PLOOP_E_DATA_WBI:
update_index->write_page with FUA (FLUSH is not required because we
have already done fsync)


That's another type of optimization: making ploop more asynchronous. I 
thought about it, but didn't come to a conclusion on whether it's worth it 
w.r.t. adding more complexity to the ploop state machine and the possible 
bugs introduced with that.
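
For reference, the deferral Dmitry sketches above could look roughly like
this (a sketch only: it reuses PLOOP_REQ_FSYNC_DONE, fsync_queue and
fsync_waitq as they appear elsewhere in this thread; the exact locking is
an assumption, not final code):

	case PLOOP_E_DATA_WBI:
		/* If the data write left unsynced extent conversions
		 * behind, hand the preq to the io fsync thread once */
		if (test_bit(PLOOP_IO_FSYNC_DELAYED, &top_io->io_state) &&
		    !test_and_set_bit(PLOOP_REQ_FSYNC_DONE, &preq->state)) {
			spin_lock_irq(&plo->lock);
			list_add_tail(&preq->list, &top_io->fsync_queue);
			top_io->fsync_qlen++;
			if (waitqueue_active(&top_io->fsync_waitq))
				wake_up_interruptible(&top_io->fsync_waitq);
			spin_unlock_irq(&plo->lock);
			return;
		}
		/* fsync already done: the index update may use FUA alone,
		 * without REQ_FLUSH */
		ploop_index_update(preq);
		break;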


Thanks,
Maxim




https://jira.sw.ru/browse/PSBM-47026

Signed-off-by: Maxim Patlasov 
---
  drivers/block/ploop/dev.c   |4 ++--
  drivers/block/ploop/io_direct.c |   25 -
  drivers/block/ploop/io_kaio.c   |3 ++-
  drivers/block/ploop/map.c   |   17 -
  include/linux/ploop/ploop.h |3 ++-
  5 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index e5f010b..40768b6 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device * plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = 0;
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC|REQ_FUA;
+   preq->req_rw = WRITE_SYNC;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
preq->error = 0;
@@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct ploop_device 
*plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = ~0U; /* uninitialized */
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC|REQ_FUA;
+   preq->req_rw = WRITE_SYNC;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
preq->error = 0;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 1086850..0a5fb15 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -1494,13 +1494,36 @@ dio_read_page(struct ploop_io * io, struct 
ploop_request * preq,
  
  static void

  dio_write_page(struct ploop_io * io, struct ploop_request * preq,
-  struct page * page, sector_t sec, unsigned long rw)
+  struct page * page, sector_t sec, unsigned long rw,
+  int do_fsync_if_delayed)
  {
if (!(io->files.file->f_mode & FMODE_WRITE)) {
PLOOP_FAIL_REQUEST(preq, -EBADF);
return;
}
  
+	if (do_fsync_if_delayed &&

+   test_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state)) {
+

Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default

2016-07-07 Thread Maxim Patlasov

On 07/07/2016 04:10 AM, Vladimir Davydov wrote:


On Wed, Jul 06, 2016 at 10:33:07AM -0700, Maxim Patlasov wrote:

On 07/06/2016 02:26 AM, Vladimir Davydov wrote:


On Tue, Jul 05, 2016 at 04:45:10PM -0700, Maxim Patlasov wrote:

Vova,


On 07/04/2016 11:03 AM, Maxim Patlasov wrote:

On 07/04/2016 08:53 AM, Vladimir Davydov wrote:


On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote:
...

@@ -643,6 +643,7 @@ static struct cgroup_subsys_state
*ve_create(struct cgroup *cg)
ve->odirect_enable = 2;
  ve->fsync_enable = 2;
+ve->experimental_fs_enable = 2;

For odirect_enable and fsync_enable, 2 means follow the host's config, 1
means enable unconditionally, and 0 means disable unconditionally. But
we don't want to allow a user inside a CT to enable this feature, right?

I thought it's OK to allow user inside CT to enable it if host sysadmin is
OK about it. The same logic as for odirect: by default
ve0->experimental_fs_enable = 0, so whatever user inside CT writes to this
knob, the feature is disabled. If sysadmin writes '1' to ve0->..., the
feature becomes enabled. If an user wants to voluntarily disable it inside
CT, that's OK too.


This is confusing. May be, we'd better add a new VE_FEATURE for the
purpose?

Not sure right now. I'll look at it and let you know later.

Technically, it is very easy to implement new VE_FEATURE for overlayfs. But
this approach is less flexible because we return EPERM from ve_write_64 if
CT is running, and we'll need to involve userspace team to make the feature
configurable and (possibly) persistent. Do you think it's worthy for
something we'll get rid of soon anyway (I mean as soon as PSBM-47981
resolved)?

Fair enough, not much point in introducing yet another feature for the
purpose, at least right now, sysctl should do for the beginning.

Come to think of it, do we really need this sysctl inside containers? I
mean, by enabling this sysctl on the host we open a possible system-wide
security hole, which a CT admin won't be able to mitigate by disabling
overlayfs inside her CT. So why would she need it for? To prevent
non-privileged CT users from mounting overlayfs inside a user ns? But
overlayfs is not permitted to be mounted by a userns root anyway AFAICS.
May be, just drop in-CT sysctl then?

Currently, anyone who can login into CT as root may mount overlayfs, then
try to exploit its weak sides. This is a problem.

Until we ensure that overlayfs is production-ready (at least does not have
obvious breaches), let's disable it by default (of course, if ve != ve0).
Those who want to play with overlayfs at their own risk will enable it by
turning on some knob on host system (ve == ve0).

I don't think that mixing trusted (overlayfs-enabled) CTs and not trusted
(overlayfs-disabled) CTs on the same physical node is important use-case for
now. So, any simple system-wide knob must work.




Essentially, the same scheme
with odirect: by default it is '0' in ve0 and the root inside CT cannot turn
it on; and if it is manually set to '1' in ve0, the behavior will depend on
per-CT root willing.

No, that's not how it works. AFAICS (see may_use_odirect),

   ve0 sysctl   ve sysctl   odirect allowed in ve?
        x            0                0
        x            1                1
        x            2                x

i.e. system-wide sysctl can't be used to disallow odirect inside a VE,
while you want a different behavior AFAIU - you want to enable overlayfs
if both ve0 sysctl and ve sysctl are set. That's why the patch looks
confusing to me.


Oh, yeah, it's my fault -- I didn't read may_use_odirect() carefully 
enough. Now I see it checks ve0 only if the per-CT sysctl is '2'.
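
For reference, that table translates to roughly the following check (a
sketch of may_use_odirect() reconstructed from the table; the exact ve0
accessor, here get_ve0(), is an assumption):

	static inline bool may_use_odirect(void)
	{
		struct ve_struct *ve = get_exec_env();

		switch (ve->odirect_enable) {
		case 0:		/* disabled unconditionally */
			return false;
		case 1:		/* enabled unconditionally */
			return true;
		default:	/* 2: follow the host's (ve0) setting */
			return get_ve0()->odirect_enable != 0;
		}
	}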




Let's only leave system-wide sysctl for permitting overlayfs. VE sysctl
doesn't make any sense - only root user is allowed to mount overlayfs
inside a CT and she can set this sysctl anyway.


OK, I agree.


[Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default (v2)

2016-07-07 Thread Maxim Patlasov
Overlayfs is in "TECH PREVIEW" state right now. Letting CT users freely
mount and exercise overlayfs, we risk crashing the whole node.

Let's disable it for CT users by default. Customers who need it (e.g. to
run Docker in CT) may enable it like this:

# echo 1 > /proc/sys/fs/experimental_fs_enable

The patch is a temporary (awkward) workaround until we make overlayfs
production-ready. Then we'll roll back the patch.

Changed in v2:
 - let's only leave system-wide sysctl for permitting overlayfs; the sysctl
   is "rw" in ve0, but "ro" inside CT.

https://jira.sw.ru/browse/PSBM-47981
---
 fs/filesystems.c |8 +++-
 fs/overlayfs/super.c |2 +-
 include/linux/fs.h   |4 
 kernel/sysctl.c  |7 +++
 4 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index beaba56..670d228 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -16,6 +16,9 @@
 #include 
 #include 
 
+/* Affects ability of CT users to mount fs marked as FS_EXPERIMENTAL */
+int sysctl_experimental_fs_enable;
+
 /*
  * Handling of filesystem drivers list.
  * Rules:
@@ -219,7 +222,10 @@ int __init get_filesystem_list(char *buf)
 
 static inline bool filesystem_permitted(const struct file_system_type *fs)
 {
-   return ve_is_super(get_exec_env()) || (fs->fs_flags & FS_VIRTUALIZED);
+   return ve_is_super(get_exec_env()) ||
+   (fs->fs_flags & FS_VIRTUALIZED) ||
+   ((fs->fs_flags & FS_EXPERIMENTAL) &&
+sysctl_experimental_fs_enable);
 }
 
 #ifdef CONFIG_PROC_FS
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index c20cfe9..d5c57b4 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -1129,7 +1129,7 @@ static struct file_system_type ovl_fs_type = {
.name   = "overlay",
.mount  = ovl_mount,
.kill_sb= kill_anon_super,
-   .fs_flags   = FS_VIRTUALIZED,
+   .fs_flags   = FS_EXPERIMENTAL,
 };
 MODULE_ALIAS_FS("overlay");
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7203dba..f1c3d5b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -59,6 +59,8 @@ extern struct inodes_stat_t inodes_stat;
 extern int leases_enable, lease_break_time;
 extern int sysctl_protected_symlinks;
 extern int sysctl_protected_hardlinks;
+extern int sysctl_experimental_fs_enable;
+
 
 struct buffer_head;
 typedef int (get_block_t)(struct inode *inode, sector_t iblock,
@@ -2108,6 +2110,8 @@ struct file_system_type {
 #define FS_USERNS_MOUNT	8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16	/* A userns mount does not imply MNT_NODEV */
 #define FS_VIRTUALIZED		64	/* Can mount this fstype inside ve */
+#define FS_EXPERIMENTAL	128	/* Ability to mount this fstype inside ve
+				 * is governed by experimental_fs_enable */
 #define FS_HAS_RM_XQUOTA	256	/* KABI: fs has the rm_xquota quota op */
 #define FS_HAS_INVALIDATE_RANGE	512	/* FS has new ->invalidatepage with length arg */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c8f7bc3..e59dd3b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1781,6 +1781,13 @@ static struct ctl_table fs_table[] = {
.proc_handler   = &pipe_proc_fn,
.extra1 = &pipe_min_size,
},
+   {
+   .procname   = "experimental_fs_enable",
+   .data   = &sysctl_experimental_fs_enable,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
{ }
 };
 



[Devel] [PATCH rh7 1/4] ploop: fix fsync_reqs accounting

2016-07-11 Thread Maxim Patlasov
io->fsync_qlen stands for the number of ploop requests waiting for processing
by the io fsync thread.

The fix is obvious: each time we add a preq to the fsync thread queue, we have
to increment fsync_qlen; each time we delete it from the queue, we have to
decrement it.

The fix should not affect anything because currently nobody cares about the
value of io->fsync_qlen. The patch is useful because we expose the value
as /sys/block/ploopN/pstate/fsync_reqs.

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c   |1 +
 drivers/block/ploop/io_direct.c |3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 40768b6..9992ae5 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2055,6 +2055,7 @@ ploop_entry_request(struct ploop_request * preq)
!test_bit(PLOOP_REQ_FSYNC_DONE, &preq->state)) {
spin_lock_irq(&plo->lock);
list_add_tail(&preq->list, &top_io->fsync_queue);
+   top_io->fsync_qlen++;
if (waitqueue_active(&top_io->fsync_waitq))
wake_up_interruptible(&top_io->fsync_waitq);
spin_unlock_irq(&plo->lock);
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 0a5fb15..fb1ddea 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -477,9 +477,10 @@ try_again:
ploop_acc_flush_skip_locked(plo, preq->req_rw);
preq->iblock = iblk;
list_add_tail(&preq->list, &io->fsync_queue);
+   io->fsync_qlen++;
plo->st.bio_syncwait++;
if ((test_bit(PLOOP_REQ_SYNC, &preq->state) ||
-++io->fsync_qlen >= plo->tune.fsync_max) &&
+io->fsync_qlen >= plo->tune.fsync_max) &&
waitqueue_active(&io->fsync_waitq))
wake_up_interruptible(&io->fsync_waitq);
else if (!timer_pending(&io->fsync_timer))



[Devel] [PATCH rh7 0/4] ploop: fix free_list starvation

2016-07-11 Thread Maxim Patlasov
The first patch of the patch-set fixes a minor unrelated
problem. It is trivial.

The remaining three patches try to solve the following
problem:

Under high load, and when push_backup is in progress, it is
possible that all preq-s from free_list will be consumed by
either incoming bio-s waiting for out-of-band processing by
the backup tool, or some incoming bio-s blocked on the former ones.

Then ploop reaches the maximum possible plo->active_reqs and
goes to sleep waiting for something. But this something is
actually the backup tool, which is blocked on reading from
the ploop device. Deadlock.

See per-patch descriptions for details.

https://jira.sw.ru/browse/PSBM-49454

---

Maxim Patlasov (4):
  ploop: fix fsync_reqs accounting
  ploop: introduce plo->free_qlen counter
  ploop: introduce plo->blockable_reqs counter
  ploop: fix free_list starvation


 drivers/block/ploop/dev.c |  107 -
 drivers/block/ploop/io_direct.c   |3 +
 drivers/block/ploop/push_backup.c |   74 +-
 drivers/block/ploop/push_backup.h |6 ++
 drivers/block/ploop/sysfs.c   |   24 
 include/linux/ploop/ploop.h   |5 ++
 6 files changed, 202 insertions(+), 17 deletions(-)



[Devel] [PATCH rh7 2/4] ploop: introduce plo->free_qlen counter

2016-07-11 Thread Maxim Patlasov
The ploop device maintains a list of free ploop requests: plo->free_list.
Let's count the number of items in the list: plo->free_qlen. The counter
will be used in the next patches of this patch-set.

The patch also introduces the plo->free_qmax counter -- the total number of
allocated ploop requests. This is useful as a baseline to compare
plo->free_qlen against (in case plo->tune.max_requests is changed in flight).

https://jira.sw.ru/browse/PSBM-49454

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c   |   15 ++-
 drivers/block/ploop/sysfs.c |   12 
 include/linux/ploop/ploop.h |2 ++
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 9992ae5..cc33b2d 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -191,6 +191,7 @@ ploop_alloc_request(struct ploop_device * plo)
 
preq = list_entry(plo->free_list.next, struct ploop_request, list);
list_del_init(&preq->list);
+   plo->free_qlen--;
ploop_congest(plo);
return preq;
 }
@@ -231,6 +232,7 @@ void ploop_preq_drop(struct ploop_device * plo, struct list_head *drop_list,
  int keep_locked)
 {
struct ploop_request * preq;
+   int drop_qlen = 0;
 
list_for_each_entry(preq, drop_list, list) {
if (preq->ioc) {
@@ -240,11 +242,13 @@ void ploop_preq_drop(struct ploop_device * plo, struct list_head *drop_list,
}
 
BUG_ON (test_bit(PLOOP_REQ_ZERO, &preq->state));
+   drop_qlen++;
}
 
spin_lock_irq(&plo->lock);
 
list_splice_init(drop_list, plo->free_list.prev);
+   plo->free_qlen += drop_qlen;
if (waitqueue_active(&plo->req_waitq))
wake_up(&plo->req_waitq);
else if (test_bit(PLOOP_S_WAIT_PROCESS, &plo->state) &&
@@ -489,9 +493,11 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * bio,
 {
struct ploop_request * preq;
 
-   BUG_ON (list_empty(&plo->free_list));
+   BUG_ON(list_empty(&plo->free_list));
+   BUG_ON(plo->free_qlen <= 0);
preq = list_entry(plo->free_list.next, struct ploop_request, list);
list_del_init(&preq->list);
+   plo->free_qlen--;
 
preq->req_cluster = bio->bi_sector >> plo->cluster_log;
bio->bi_next = NULL;
@@ -529,6 +535,7 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * bio,
}
BIO_ENDIO(plo->queue, bio, err);
list_add(&preq->list, &plo->free_list);
+   plo->free_qlen++;
plo->bio_discard_qlen--;
plo->bio_total--;
return;
@@ -1387,6 +1394,7 @@ static void ploop_complete_request(struct ploop_request * preq)
} else {
ploop_uncongest(plo);
list_add(&preq->list, &plo->free_list);
+   plo->free_qlen++;
if (waitqueue_active(&plo->req_waitq))
wake_up(&plo->req_waitq);
else if (test_bit(PLOOP_S_WAIT_PROCESS, &plo->state) &&
@@ -3799,6 +3807,8 @@ static int ploop_start(struct ploop_device * plo, struct block_device *bdev)
preq->plo = plo;
INIT_LIST_HEAD(&preq->delay_list);
list_add(&preq->list, &plo->free_list);
+   plo->free_qlen++;
+   plo->free_qmax++;
}
 
list_for_each_entry_reverse(delta, &plo->map.delta_list, list) {
@@ -3951,8 +3961,11 @@ static int ploop_stop(struct ploop_device * plo, struct block_device *bdev)
 
preq = list_first_entry(&plo->free_list, struct ploop_request, list);
list_del(&preq->list);
+   plo->free_qlen--;
+   plo->free_qmax--;
kfree(preq);
}
+   BUG_ON(plo->free_qlen);
 
ploop_map_destroy(&plo->map);
if (plo->trans_map)
diff --git a/drivers/block/ploop/sysfs.c b/drivers/block/ploop/sysfs.c
index d6dcc83..c062c1e 100644
--- a/drivers/block/ploop/sysfs.c
+++ b/drivers/block/ploop/sysfs.c
@@ -425,6 +425,16 @@ static ssize_t print_push_backup_uuid(struct ploop_device * plo, char * page)
return snprintf(page, PAGE_SIZE, "%pUB\n", uuid);
 }
 
+static u32 show_free_reqs(struct ploop_device * plo)
+{
+   return plo->free_qlen;
+}
+
+static u32 show_free_qmax(struct ploop_device * plo)
+{
+   return plo->free_qmax;
+}
+
 #define _TUNE_U32(_name)   \
 static u32 show_##_name(struct ploop_device * plo) \
 {  \
@@ -507,6 +517,8 @@ s

[Devel] [PATCH rh7 3/4] ploop: introduce plo->blockable_reqs counter

2016-07-11 Thread Maxim Patlasov
The counter represents the number of ploop requests that can
potentially be blocked due to push_backup: let's call
them "blockable" requests. In other words, they are
the ones that will depend on the userspace backup tool.

We claim a preq as "blockable" if, at the time of converting an
incoming bio to the preq, we observe the corresponding bit in
pbd->ppb_map set, and the corresponding bit in pbd->reported_map
clear.

In case of in-flight conversion (ploop_make_request calling
process_bio_queue) the decision is postponed until the ploop thread
processes the preq in ploop_req_state_process(). This is done
intentionally, to avoid complicating the locking scheme.

The counter will be used by the next patch of this patch-set.

https://jira.sw.ru/browse/PSBM-49454

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c |   36 ++--
 drivers/block/ploop/push_backup.c |   22 ++
 drivers/block/ploop/push_backup.h |1 +
 drivers/block/ploop/sysfs.c   |6 ++
 include/linux/ploop/ploop.h   |2 ++
 5 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index cc33b2d..6795b95 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -227,6 +227,20 @@ static inline void preq_unlink(struct ploop_request * preq,
list_add(&preq->list, drop_list);
 }
 
+static void ploop_set_blockable(struct ploop_device *plo,
+   struct ploop_request *preq)
+{
+   if (!test_and_set_bit(PLOOP_REQ_BLOCKABLE, &preq->state))
+   plo->blockable_reqs++;
+}
+
+static void ploop_test_and_clear_blockable(struct ploop_device *plo,
+  struct ploop_request *preq)
+{
+   if (test_and_clear_bit(PLOOP_REQ_BLOCKABLE, &preq->state))
+   plo->blockable_reqs--;
+}
+
 /* always called with plo->lock released */
 void ploop_preq_drop(struct ploop_device * plo, struct list_head *drop_list,
  int keep_locked)
@@ -242,6 +256,7 @@ void ploop_preq_drop(struct ploop_device * plo, struct list_head *drop_list,
}
 
BUG_ON (test_bit(PLOOP_REQ_ZERO, &preq->state));
+   ploop_test_and_clear_blockable(plo, preq);
drop_qlen++;
}
 
@@ -489,7 +504,7 @@ insert_entry_tree(struct ploop_device * plo, struct ploop_request * preq0,
 
 static void
 ploop_bio_queue(struct ploop_device * plo, struct bio * bio,
-   struct list_head *drop_list)
+   struct list_head *drop_list, int account_blockable)
 {
struct ploop_request * preq;
 
@@ -511,6 +526,10 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * bio,
preq->iblock = 0;
preq->prealloc_size = 0;
 
+   if (account_blockable && (bio->bi_rw & REQ_WRITE) && bio->bi_size &&
+   ploop_pb_check_and_clear_bit(plo->pbd, preq->req_cluster))
+   ploop_set_blockable(plo, preq);
+
if (unlikely(bio->bi_rw & REQ_DISCARD)) {
int clu_size = 1 << plo->cluster_log;
int i = (clu_size - 1) & bio->bi_sector;
@@ -734,7 +753,9 @@ preallocate_bio(struct bio * orig_bio, struct ploop_device * plo)
return nbio;
 }
 
-static void process_bio_queue(struct ploop_device * plo, struct list_head *drop_list)
+static void process_bio_queue(struct ploop_device * plo,
+ struct list_head *drop_list,
+ int account_blockable)
 {
while (plo->bio_head && !list_empty(&plo->free_list)) {
struct bio *tmp = plo->bio_head;
@@ -744,7 +765,7 @@ static void process_bio_queue(struct ploop_device * plo, struct list_head *drop_
if (!plo->bio_head)
plo->bio_tail = NULL;
 
-   ploop_bio_queue(plo, tmp, drop_list);
+   ploop_bio_queue(plo, tmp, drop_list, account_blockable);
}
 }
 
@@ -796,7 +817,7 @@ process_discard_bio_queue(struct ploop_device * plo, struct list_head *drop_list
/* If PLOOP_S_DISCARD isn't set, ploop_bio_queue
 * will complete it with a proper error.
 */
-   ploop_bio_queue(plo, tmp, drop_list);
+   ploop_bio_queue(plo, tmp, drop_list, 0);
}
 }
 
@@ -1001,7 +1022,7 @@ queue:
ploop_congest(plo);
 
/* second chance to merge requests */
-   process_bio_queue(plo, &drop_list);
+   process_bio_queue(plo, &drop_list, 0);
 
 queued:
/* If main thread is waiting for requests, wake it up.
@@ -1371,6 +1392,7 @@ static void ploop_complete_request(struct ploop_request * preq)
 
del_lockout(preq);
del_pb_lockout(preq); /* preq may die via ploop_fail_immediat

[Devel] [PATCH rh7 4/4] ploop: fix free_list starvation

2016-07-11 Thread Maxim Patlasov
Under high load, and when push_backup is in progress, it is
possible that all preq-s from free_list will be consumed by
either incoming bio-s waiting for out-of-band processing by
the backup tool, or some incoming bio-s blocked on the former ones.

Then ploop reaches the maximum possible plo->active_reqs and
goes to sleep waiting for something. But this something is
actually the backup tool, which is blocked on reading from
the ploop device. Deadlock.

The patch fixes the problem by queueing incoming WRITE bio-s
(which would otherwise be blocked on the backup tool's out-of-band
processing anyway) to a separate queue:
plo->pbd->bio_pending_list. Thus, we always have some free
preq-s for processing incoming READ bio-s from the backup tool.
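
The gating predicate (see process_bio_queue_one() in the diff below) detains
a WRITE bio only when free preq-s are scarce and mostly eaten by blockable
requests; roughly (a sketch restating the hunk below):

	/* Detain the WRITE bio instead of consuming a preq when:
	 *  - push_backup would block it anyway (it hits a cluster not yet
	 *    reported by the tool, see ploop_pb_bio_detained()), and
	 *  - the free list is at least half-drained, and
	 *  - more than a quarter of all preq-s are already blockable */
	bool detain = (bio->bi_rw & REQ_WRITE) && bio->bi_size &&
		      plo->free_qlen <= plo->free_qmax / 2 &&
		      plo->blockable_reqs > plo->free_qmax / 4 &&
		      ploop_pb_bio_detained(plo->pbd, bio);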

https://jira.sw.ru/browse/PSBM-49454

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c |   67 +
 drivers/block/ploop/push_backup.c |   52 -
 drivers/block/ploop/push_backup.h |5 +++
 drivers/block/ploop/sysfs.c   |6 +++
 include/linux/ploop/ploop.h   |1 +
 5 files changed, 116 insertions(+), 15 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 6795b95..6d449b7 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -753,20 +753,43 @@ preallocate_bio(struct bio * orig_bio, struct ploop_device * plo)
return nbio;
 }
 
-static void process_bio_queue(struct ploop_device * plo,
- struct list_head *drop_list,
- int account_blockable)
+static void process_bio_queue_one(struct ploop_device * plo,
+ struct list_head *drop_list,
+ int check_push_backup)
+{
+   struct bio *bio = plo->bio_head;
+
+   BUG_ON (!plo->bio_tail);
+   plo->bio_head = plo->bio_head->bi_next;
+   if (!plo->bio_head)
+   plo->bio_tail = NULL;
+
+   if (check_push_backup &&
+   (bio->bi_rw & REQ_WRITE) && bio->bi_size &&
+   plo->free_qlen <= plo->free_qmax / 2 &&
+   plo->blockable_reqs > plo->free_qmax / 4 &&
+   ploop_pb_bio_detained(plo->pbd, bio))
+   plo->blocked_bios++;
+   else
+   ploop_bio_queue(plo, bio, drop_list, check_push_backup);
+}
+
+static void process_bio_queue_optional(struct ploop_device * plo,
+  struct list_head *drop_list)
 {
-   while (plo->bio_head && !list_empty(&plo->free_list)) {
-   struct bio *tmp = plo->bio_head;
+   while (plo->bio_head && !list_empty(&plo->free_list) &&
+  (!test_bit(PLOOP_S_PUSH_BACKUP, &plo->state) ||
+   plo->free_qlen > plo->free_qmax / 2))
+   process_bio_queue_one(plo, drop_list, 0);
+}
 
-   BUG_ON (!plo->bio_tail);
-   plo->bio_head = plo->bio_head->bi_next;
-   if (!plo->bio_head)
-   plo->bio_tail = NULL;
+static void process_bio_queue_main(struct ploop_device * plo,
+  struct list_head *drop_list)
+{
+   int check = test_bit(PLOOP_S_PUSH_BACKUP, &plo->state);
 
-   ploop_bio_queue(plo, tmp, drop_list, account_blockable);
-   }
+   while (plo->bio_head && !list_empty(&plo->free_list))
+   process_bio_queue_one(plo, drop_list, check);
 }
 
 static void ploop_unplug(struct blk_plug_cb *cb, bool from_schedule)
@@ -1022,7 +1045,7 @@ queue:
ploop_congest(plo);
 
/* second chance to merge requests */
-   process_bio_queue(plo, &drop_list, 0);
+   process_bio_queue_optional(plo, &drop_list);
 
 queued:
/* If main thread is waiting for requests, wake it up.
@@ -2858,6 +2881,20 @@ static void ploop_handle_enospc_req(struct ploop_request *preq)
preq->iblock = 0;
 }
 
+static void
+process_pending_bios(struct ploop_device * plo, struct list_head *drop_list)
+{
+   while (!ploop_pb_bio_list_empty(plo->pbd) &&
+  !list_empty(&plo->free_list) &&
+  (plo->free_qlen > plo->free_qmax / 2 ||
+   plo->blockable_reqs <= plo->free_qmax / 4)) {
+   struct bio *bio = ploop_pb_bio_get(plo->pbd);
+
+   ploop_bio_queue(plo, bio, drop_list, 1);
+   plo->blocked_bios--;
+   }
+}
+
 /* Main process. Processing queues in proper order, handling pre-barrier
  * flushes and queue suspend while processing a barrier
  */
@@ -2879,7 +2916,8 @@ static int ploop_thread(void * data)
again:
BUG_ON (!list_empty(&drop_list));
 
-   process_bio_queue(plo, &drop_list, 1);

Re: [Devel] [PATCH rh7 1/4] fs: do not fail on double freeze bdev w/o sb

2016-07-12 Thread Maxim Patlasov

Vova,


Let's keep the flies and the cutlets separate. It seems we can easily satisfy 
push-backup needs by implementing freeze/thaw ploop ioctls without 
touching generic code at all; see the patch in the attachment (unless I missed 
something obvious). And apart from this ploop/push-backup stuff, if you 
think your changes for freeze_bdev() and thaw_bdev() are useful, send 
them upstream, and we'll back-port them later, when they are accepted 
upstream (unless I missed some scenario for which those changes matter 
to us). In other words, I think we have to keep our vz7 generic 
code base closer to ms, unless we have a good reason to deviate.



Thanks,

Maxim


On 07/12/2016 03:04 AM, Vladimir Davydov wrote:

It's possible to freeze a bdev which is not mounted. In this case
freeze_bdev() only increments bd_fsfrozen_count in order to prevent the
bdev from being mounted and does nothing else. A second freeze attempt
on the same device is supposed to increment bd_fsfrozen_count again, but
it results in NULL ptr dereference, because freeze_bdev() doesn't check
the return value of get_super(). Fix that.

Signed-off-by: Vladimir Davydov 
---
  fs/block_dev.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4575c62d8b0b..325ee7161fbf 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -227,7 +227,8 @@ struct super_block *freeze_bdev(struct block_device *bdev)
 * thaw_bdev drops it.
 */
sb = get_super(bdev);
-   drop_super(sb);
+   if (sb)
+   drop_super(sb);
mutex_unlock(&bdev->bd_fsfreeze_mutex);
return sb;
}


The ioctls simply freeze and thaw the fs mounted over the ploop bdev.

Caveats:

1) If no fs is mounted, the ioctls have no effect.
2) No nested freeze: many PLOOP_IOC_FREEZE ioctls have the same effect as one.
3) The same for thaw.
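
For context, the intended usage from a backup tool would look roughly like
this (a hypothetical userspace snippet; the device path is made up, only
the ioctl names come from the patch below):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/ploop/ploop_if.h>

	int main(void)
	{
		int fd = open("/dev/ploop12345", O_RDONLY); /* hypothetical */

		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (ioctl(fd, PLOOP_IOC_FREEZE, 0))	/* freeze fs over ploop */
			perror("PLOOP_IOC_FREEZE");

		/* ... read a stable image for backup here ... */

		if (ioctl(fd, PLOOP_IOC_THAW, 0))	/* resume normal I/O */
			perror("PLOOP_IOC_THAW");
		close(fd);
		return 0;
	}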

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c  |   38 ++
 include/linux/ploop/ploop.h|1 +
 include/linux/ploop/ploop_if.h |6 ++
 3 files changed, 45 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 6d449b7..e583d10 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4892,6 +4892,38 @@ static int ploop_push_backup_stop(struct ploop_device *plo, unsigned long arg)
 	return copy_to_user((void*)arg, &ctl, sizeof(ctl));
 }
 
+static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev)
+{
+	struct super_block *sb = plo->sb;
+
+	if (sb)
+		return 0;
+
+	sb = freeze_bdev(bdev);
+	if (sb && IS_ERR(sb))
+		return PTR_ERR(sb);
+	if (!sb)
+		thaw_bdev(bdev, sb);
+
+	plo->sb = sb;
+	return 0;
+}
+
+static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev)
+{
+	struct super_block *sb = plo->sb;
+	int err;
+
+	if (!sb)
+		return 0;
+
+	err = thaw_bdev(bdev, sb);
+	if (!err)
+		plo->sb = NULL;
+
+	return err;
+}
+
 static int ploop_ioctl(struct block_device *bdev, fmode_t fmode, unsigned int cmd,
 		   unsigned long arg)
 {
@@ -5005,6 +5037,12 @@ static int ploop_ioctl(struct block_device *bdev, fmode_t fmode, unsigned int cm
 	case PLOOP_IOC_PUSH_BACKUP_STOP:
 		err = ploop_push_backup_stop(plo, arg);
 		break;
+	case PLOOP_IOC_FREEZE:
+		err = ploop_freeze(plo, bdev);
+		break;
+	case PLOOP_IOC_THAW:
+		err = ploop_thaw(plo, bdev);
+		break;
 	default:
 		err = -EINVAL;
 	}
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index 859fe51..e60ada4 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -414,6 +414,7 @@ struct ploop_device
 	struct block_device	*bdev;
 	struct request_queue	*queue;
 	struct task_struct	*thread;
+	struct super_block	*sb;
 	struct rb_node		link;
 
 	/* someone who wants to quiesce state-machine waits
diff --git a/include/linux/ploop/ploop_if.h b/include/linux/ploop/ploop_if.h
index a098ca9..302ace9 100644
--- a/include/linux/ploop/ploop_if.h
+++ b/include/linux/ploop/ploop_if.h
@@ -352,6 +352,12 @@ struct ploop_track_extent
 /* Stop push backup */
 #define PLOOP_IOC_PUSH_BACKUP_STOP _IOR(PLOOPCTLTYPE, 31, struct ploop_push_backup_stop_ctl)
 
+/* Freeze FS mounted over ploop */
+#define PLOOP_IOC_FREEZE	_IO(PLOOPCTLTYPE, 32)
+
+/* Unfreeze FS mounted over ploop */
+#define PLOOP_IOC_THAW		_IO(PLOOPCTLTYPE, 33)
+
 /* Events exposed via /sys/block/ploopN/pstate/event */
 #define PLOOP_EVENT_ABORTED	1
 #define PLOOP_EVENT_STOPPED	2


Re: [Devel] [PATCH rh7 1/4] fs: do not fail on double freeze bdev w/o sb

2016-07-13 Thread Maxim Patlasov

On 07/13/2016 02:46 AM, Vladimir Davydov wrote:


On Tue, Jul 12, 2016 at 03:02:11PM -0700, Maxim Patlasov wrote:

Let's keep flies and cutlets separately. It seems we can easily satisfy
push-backup needs by implementing freeze/thaw ploop ioctls without tackling
generic code at all, see a patch in attachment (unless I missed something
obvious). And apart from these ploop/push-backup stuff, if you think your
changes for freeze_bdev() and thaw_bdev() are useful, send them upstream, so
we'll back-port them later, when they are accepted upstream (unless I missed
some scenario for which those changes matter for us). In the other words, I
think we have to keep our vz7 generic code base closer to ms, unless we have
good reason to deviate.

Agree. Generally, I like your patch more than mine, but I've a concern
about it - see below.


On 07/12/2016 03:04 AM, Vladimir Davydov wrote:

It's possible to freeze a bdev which is not mounted. In this case
freeze_bdev() only increments bd_fsfrozen_count in order to prevent the
bdev from being mounted and does nothing else. A second freeze attempt
on the same device is supposed to increment bd_fsfrozen_count again, but
it results in NULL ptr dereference, because freeze_bdev() doesn't check
the return value of get_super(). Fix that.

Signed-off-by: Vladimir Davydov 
---
  fs/block_dev.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4575c62d8b0b..325ee7161fbf 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -227,7 +227,8 @@ struct super_block *freeze_bdev(struct block_device *bdev)
 * thaw_bdev drops it.
 */
sb = get_super(bdev);
-   drop_super(sb);
+   if (sb)
+   drop_super(sb);
mutex_unlock(&bdev->bd_fsfreeze_mutex);
return sb;
}

The ioctls simply freeze and thaw ploop bdev.

Caveats:

1) If no fs mounted, the ioctls have no effect.
2) No nested freeze: many PLOOP_IOC_FREEZE ioctls have the same effect as one.
3) The same for thaw.

I think #2 and #3 are OK. But regarding #1 - what if we want to make a
backup of a secondary ploop which is not mounted? So we try to freeze it
and succeed, but it isn't actually frozen, so it can be mounted and
modified while we're backing it up, which is incorrect AFAIU.

What about something like this on top of your patch?


You're right. The patch looks correct and it works for me. You can add
Acked-by: Maxim Patlasov 



diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 9a9cc8b0b934..d52975eaaa36 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4819,16 +4819,15 @@ static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev)
  {
struct super_block *sb = plo->sb;
  
-	if (sb)

+   if (test_bit(PLOOP_S_FROZEN, &plo->state))
return 0;
  
  	sb = freeze_bdev(bdev);

if (sb && IS_ERR(sb))
return PTR_ERR(sb);
-   if (!sb)
-   thaw_bdev(bdev, sb);
  
  	plo->sb = sb;

+   set_bit(PLOOP_S_FROZEN, &plo->state);
return 0;
  }
  
@@ -4837,12 +4836,14 @@ static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev)

struct super_block *sb = plo->sb;
int err;
  
-	if (!sb)

+   if (!test_bit(PLOOP_S_FROZEN, &plo->state))
return 0;
  
  	err = thaw_bdev(bdev, sb);

-   if (!err)
+   if (!err) {
plo->sb = NULL;
+   clear_bit(PLOOP_S_FROZEN, &plo->state);
+   }
  
  	return err;

  }
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index 6ae96c4486fe..7864edf17f19 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -61,6 +61,7 @@ enum {
   (for minor mgmt only) */
PLOOP_S_ONCE,   /* An event (e.g. printk once) happened */
PLOOP_S_PUSH_BACKUP,/* Push_backup is in progress */
+   PLOOP_S_FROZEN  /* Frozen PLOOP_IOC_FREEZE */
  };
  
  struct ploop_snapdata




Re: [Devel] [PATCH rh7] ploop: release plo->ctl_mutex for thaw_bdev in PLOOP_IOC_THAW handler

2016-07-15 Thread Maxim Patlasov

On 07/15/2016 01:10 AM, Vladimir Davydov wrote:

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index d52975eaaa36..3dc94ca5c393 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4839,11 +4839,12 @@ static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev)
if (!test_bit(PLOOP_S_FROZEN, &plo->state))
return 0;
  
+	plo->sb = NULL;

+   clear_bit(PLOOP_S_FROZEN, &plo->state);
+
+   mutex_unlock(&plo->ctl_mutex);


From this point on, nothing in the ploop state hints that the device is still 
frozen. Someone (another instance of the backup tool?) may mistakenly try to 
freeze it again (before we call thaw_bdev) and succeed (because 
the S_FROZEN bit is already cleared). The result would be the "double 
freeze" that we tried to avoid with the initial patch. The fix should be 
simple; I'll send a patch soon.


Thanks,
Maxim


err = thaw_bdev(bdev, sb);
-   if (!err) {
-   plo->sb = NULL;
-   clear_bit(PLOOP_S_FROZEN, &plo->state);
-   }
+   mutex_lock(&plo->ctl_mutex);
  
  	return err;

  }




[Devel] [PATCH rh7] ploop: fix freeze/thaw ioctls

2016-07-15 Thread Maxim Patlasov
The current implementation suffers from several problems:

1) If someone, e.g. another instance of the push-backup tool, mistakenly
attempts to freeze ploop while thawing is in progress, we can
end up in a double freeze.
2) After initiating thawing, there is no way to detect it via sysctl or /sys.
3) Handling of the PLOOP_S_FROZEN bit is not synchronized with the ploop
STOP/CLEAR ioctls. It's not nice if ploop releases the bdev keeping it in
frozen state.

The patch fixes the above in a straightforward way: a more descriptive
plo->freeze_state, visible via /sys/block/ploopN/pstate/freeze_state, and
special checks in the ioctl-s to ensure that freeze/thaw is allowed only on
running ploops and that thaw must precede ploop STOP.
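
The resulting state machine, as implemented below (a sketch):

	/*
	 *  PLOOP_F_NORMAL --FREEZE--> PLOOP_F_FROZEN --THAW--> PLOOP_F_THAWING
	 *        ^                         ^                        |
	 *        |                         +--- thaw_bdev failed ---+
	 *        +------------------------ thaw_bdev succeeded -----+
	 *
	 * All transitions happen under plo->ctl_mutex; the mutex is dropped
	 * only around the blocking thaw_bdev() call, which is why the
	 * intermediate THAWING state is needed.
	 */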

https://jira.sw.ru/browse/PSBM-49699

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c   |   34 ++
 drivers/block/ploop/sysfs.c |6 ++
 include/linux/ploop/ploop.h |8 +++-
 3 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 3dc94ca..81d463f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3905,6 +3905,13 @@ static int ploop_stop(struct ploop_device * plo, struct block_device *bdev)
return -EBUSY;
}
 
+   if (plo->freeze_state != PLOOP_F_NORMAL) {
+   if (printk_ratelimit())
+   printk(KERN_INFO "stop ploop%d failed (freeze_state=%d)\n",
+  plo->index, plo->freeze_state);
+   return -EBUSY;
+   }
+
clear_bit(PLOOP_S_PUSH_BACKUP, &plo->state);
ploop_pb_stop(plo->pbd, true);
 
@@ -4819,15 +4826,21 @@ static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev)
 {
struct super_block *sb = plo->sb;
 
-   if (test_bit(PLOOP_S_FROZEN, &plo->state))
+   if (!test_bit(PLOOP_S_RUNNING, &plo->state))
+   return -EINVAL;
+
+   if (plo->freeze_state == PLOOP_F_FROZEN)
return 0;
 
+   if (plo->freeze_state == PLOOP_F_THAWING)
+   return -EBUSY;
+
sb = freeze_bdev(bdev);
if (sb && IS_ERR(sb))
return PTR_ERR(sb);
 
plo->sb = sb;
-   set_bit(PLOOP_S_FROZEN, &plo->state);
+   plo->freeze_state = PLOOP_F_FROZEN;
return 0;
 }
 
@@ -4836,16 +4849,29 @@ static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev)
struct super_block *sb = plo->sb;
int err;
 
-   if (!test_bit(PLOOP_S_FROZEN, &plo->state))
+   if (!test_bit(PLOOP_S_RUNNING, &plo->state))
+   return -EINVAL;
+
+   if (plo->freeze_state == PLOOP_F_NORMAL)
return 0;
 
+   if (plo->freeze_state == PLOOP_F_THAWING)
+   return -EBUSY;
+
plo->sb = NULL;
-   clear_bit(PLOOP_S_FROZEN, &plo->state);
+   plo->freeze_state = PLOOP_F_THAWING;
 
mutex_unlock(&plo->ctl_mutex);
err = thaw_bdev(bdev, sb);
mutex_lock(&plo->ctl_mutex);
 
+   BUG_ON(plo->freeze_state != PLOOP_F_THAWING);
+
+   if (!err)
+   plo->freeze_state = PLOOP_F_NORMAL;
+   else
+   plo->freeze_state = PLOOP_F_FROZEN;
+
return err;
 }
 
diff --git a/drivers/block/ploop/sysfs.c b/drivers/block/ploop/sysfs.c
index d6dcc83..71b2a20 100644
--- a/drivers/block/ploop/sysfs.c
+++ b/drivers/block/ploop/sysfs.c
@@ -425,6 +425,11 @@ static ssize_t print_push_backup_uuid(struct ploop_device * plo, char * page)
return snprintf(page, PAGE_SIZE, "%pUB\n", uuid);
 }
 
+static u32 show_freeze_state(struct ploop_device * plo)
+{
+   return plo->freeze_state;
+}
+
 #define _TUNE_U32(_name)   \
 static u32 show_##_name(struct ploop_device * plo) \
 {  \
@@ -507,6 +512,7 @@ static struct attribute *state_attributes[] = {
_A3(cookie),
_A3(push_backup_uuid),
_A(open_count),
+   _A(freeze_state),
NULL
 };
 
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index 7864edf..8ab4477 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -61,7 +61,12 @@ enum {
   (for minor mgmt only) */
PLOOP_S_ONCE,   /* An event (e.g. printk once) happened */
PLOOP_S_PUSH_BACKUP,/* Push_backup is in progress */
-   PLOOP_S_FROZEN  /* Frozen PLOOP_IOC_FREEZE */
+};
+
+enum {
+   PLOOP_F_NORMAL, /* Default: not yet freezed or unfrozen */
+   PLOOP_F_FROZEN, /* Frozen PLOOP_IOC_FREEZE */
+   PLOOP_F_THAWING,/* thaw_bdev is in progress */
 };
 
 struct ploop_snapdata
@@ -411,6 +416,7 @@ struct ploop_device
struct request_queue*queue;
   

Re: [Devel] [PATCH rh7] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests

2016-07-19 Thread Maxim Patlasov

Dima,


I have not heard from you since 07/06/2016. Do you agree with the 
reasoning I provided in my last email? What is your objection against the 
patch now?



Thanks,

Maxim


On 07/06/2016 11:10 AM, Maxim Patlasov wrote:

Dima,

On 07/06/2016 04:58 AM, Dmitry Monakhov wrote:


Maxim Patlasov  writes:


Commit 9f860e606 introduced an engine to delay fsync: doing
fallocate(FALLOC_FL_CONVERT_UNWRITTEN) dio_post_submit marks
io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens
later, when incoming FLUSH|FUA comes.

That was deemed as important because (PSBM-47026):

This optimization becomes more important due to the fact that 
customers tend to use pcompact heavily => ploop images grow each day.

Now, we can easily re-use the engine to delay fsync for reloc
requests as well. As explained in the description of commit
5aa3fe09:


 1->read_data_from_old_pos
 2->write_to_new_pos
   ->submit_alloc
  ->submit_pad
  ->post_submit->convert_unwritten
 3->update_index ->write_page with FLUSH|FUA
 4->nullify_old_pos
5->issue_flush

by the time of step 3 the extent conversion is not yet stable because it
belongs to an uncommitted transaction. But instead of doing fsync
inside ->post_submit, we can fsync later, as the very first step
of write_page for index_update.

NAK from me. What is the advantage of this patch?


The advantage is the following: in case of BAT multi-updates, instead 
of doing many fsync-s (one per dio_post_submit), we'll do only one 
(when the final ->write_page is called).



Does it make the code more optimal? No


Yes, it does. In the same sense as 9f860e606: saving some fsync-s.


Does it make main ploop more asynchronous? No.


Correct, the patch optimizes ploop in a different way. It's not about 
making ploop more asynchronous.





If you want to make an optimization then it is reasonable to
queue a preq with PLOOP_IO_FSYNC_DELAYED to top_io->fsync_queue
before processing the PLOOP_E_DATA_WBI state for a preq with FUA.
So the sequence will look as follows:
->submit_alloc
   ->submit_pad
   ->post_submit->convert_unwritten-> tag PLOOP_IO_FSYNC_DELAYED
->ploop_req_state_process
   case PLOOP_E_DATA_WBI:
   if (preq->state & PLOOP_IO_FSYNC_DELAYED_FL) {
   preq->state &= ~PLOOP_IO_FSYNC_DELAYED_FL
   list_add_tail(&preq->list, &top_io->fsync_queue)
   return;
}
##Let fsync_thread do its work
->ploop_req_state_process
case PLOOP_E_DATA_WBI:
update_index->write_page with FUA (FLUSH is not required because
we have already done fsync)


That's another type of optimization: making ploop more asynchronous. I 
thought about it, but didn't come to a conclusion on whether it's worth it 
w.r.t. adding more complexity to the ploop state machine and the possible 
bugs introduced with that.


Thanks,
Maxim




https://jira.sw.ru/browse/PSBM-47026

Signed-off-by: Maxim Patlasov 
---
  drivers/block/ploop/dev.c   |4 ++--
  drivers/block/ploop/io_direct.c |   25 -
  drivers/block/ploop/io_kaio.c   |3 ++-
  drivers/block/ploop/map.c   |   17 -
  include/linux/ploop/ploop.h |3 ++-
  5 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index e5f010b..40768b6 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device 
* plo)

  preq->bl.tail = preq->bl.head = NULL;
  preq->req_cluster = 0;
  preq->req_size = 0;
-preq->req_rw = WRITE_SYNC|REQ_FUA;
+preq->req_rw = WRITE_SYNC;
  preq->eng_state = PLOOP_E_ENTRY;
  preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
  preq->error = 0;
@@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct 
ploop_device *plo)

  preq->bl.tail = preq->bl.head = NULL;
  preq->req_cluster = ~0U; /* uninitialized */
  preq->req_size = 0;
-preq->req_rw = WRITE_SYNC|REQ_FUA;
+preq->req_rw = WRITE_SYNC;
  preq->eng_state = PLOOP_E_ENTRY;
  preq->state = (1 << PLOOP_REQ_SYNC) | (1 << 
PLOOP_REQ_RELOC_S);

  preq->error = 0;
diff --git a/drivers/block/ploop/io_direct.c 
b/drivers/block/ploop/io_direct.c

index 1086850..0a5fb15 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -1494,13 +1494,36 @@ dio_read_page(struct ploop_io * io, struct 
ploop_request * preq,

static void
  dio_write_page(struct ploop_io * io, struct ploop_request * preq,
-   struct page * page, sector_t sec, unsigned long rw)
+   struct page * page, sector_t sec, unsigned long rw,
+   int do_fsync_if_delayed)
  {
  if (!(io->files.file->f_mode & FMODE_WRITE)) {
  PLOOP_FAIL_REQUEST(pre

[Devel] [PATCH] fuse: fsync() did not return IO errors

2016-07-19 Thread Maxim Patlasov
From: Alexey Kuznetsov 

Due to the implementation of fuse writeback, filemap_write_and_wait_range()
does not catch errors. We have to check them directly after fuse_sync_writes().

Signed-off-by: Alexey Kuznetsov 
Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c |   15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9154f86..ad1da83 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -462,6 +462,21 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
goto out;
 
fuse_sync_writes(inode);
+
+   /*
+* Due to implementation of fuse writeback
+* filemap_write_and_wait_range() does not catch errors.
+* We have to do this directly after fuse_sync_writes()
+*/
+   if (test_bit(AS_ENOSPC, &file->f_mapping->flags) &&
+   test_and_clear_bit(AS_ENOSPC, &file->f_mapping->flags))
+   err = -ENOSPC;
+   if (test_bit(AS_EIO, &file->f_mapping->flags) &&
+   test_and_clear_bit(AS_EIO, &file->f_mapping->flags))
+   err = -EIO;
+   if (err)
+   goto out;
+
err = sync_inode_metadata(inode, 1);
if (err)
goto out;
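
The open-coded test_bit-before-test_and_clear_bit pattern above avoids an
atomic read-modify-write on the common error-free path. It could live in a
small helper (a sketch only; the helper name is made up -- mainline later
grew filemap_check_errors() for the same job):

	static int fuse_check_mapping_errors(struct address_space *mapping)
	{
		int err = 0;

		/* plain test_bit first: skip the atomic op if no error */
		if (test_bit(AS_ENOSPC, &mapping->flags) &&
		    test_and_clear_bit(AS_ENOSPC, &mapping->flags))
			err = -ENOSPC;
		if (test_bit(AS_EIO, &mapping->flags) &&
		    test_and_clear_bit(AS_EIO, &mapping->flags))
			err = -EIO;
		return err;
	}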



[Devel] [PATCH rh7] fuse: fuse_flush must check mapping->flags for errors

2016-07-19 Thread Maxim Patlasov
fuse_flush() calls filemap_write_and_wait() and checks
its status, but the actual writeback will happen later, in
fuse_sync_writes(). If an error happens, fuse_writepage_end()
will set the error bit in mapping->flags. So, we have to check
mapping->flags after fuse_sync_writes().

https://jira.sw.ru/browse/PSBM-49821

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 0ef7fe1..5e73dd0 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -599,6 +599,13 @@ static int fuse_flush(struct file *file, fl_owner_t id)
fuse_sync_writes(inode);
mutex_unlock(&inode->i_mutex);
 
+   if (test_and_clear_bit(AS_ENOSPC, &file->f_mapping->flags))
+   err = -ENOSPC;
+   if (test_and_clear_bit(AS_EIO, &file->f_mapping->flags))
+   err = -EIO;
+   if (err)
+   return err;
+
req = fuse_get_req_nofail_nopages(fc, file);
memset(&inarg, 0, sizeof(inarg));
inarg.fh = ff->fh;



[Devel] [PATCH] fuse: fuse_flush must check mapping->flags for errors

2016-07-19 Thread Maxim Patlasov
fuse_flush() calls write_inode_now(), which triggers writeback, but the actual
writeback will happen later, in fuse_sync_writes(). If an error happens,
fuse_writepage_end() will set the error bit in mapping->flags. So, we have to
check mapping->flags after fuse_sync_writes().

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c |9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ad1da83..b43401e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -413,6 +413,15 @@ static int fuse_flush(struct file *file, fl_owner_t id)
if (err)
return err;
 
+   if (test_bit(AS_ENOSPC, &file->f_mapping->flags) &&
+   test_and_clear_bit(AS_ENOSPC, &file->f_mapping->flags))
+   err = -ENOSPC;
+   if (test_bit(AS_EIO, &file->f_mapping->flags) &&
+   test_and_clear_bit(AS_EIO, &file->f_mapping->flags))
+   err = -EIO;
+   if (err)
+   return err;
+
inode_lock(inode);
fuse_sync_writes(inode);
inode_unlock(inode);



[Devel] [PATCH v2] fuse: fuse_flush must check mapping->flags for errors

2016-07-19 Thread Maxim Patlasov
fuse_flush() calls write_inode_now(), which triggers writeback, but the actual
writeback will happen later, in fuse_sync_writes(). If an error happens,
fuse_writepage_end() will set the error bit in mapping->flags. So, we have to
check mapping->flags after fuse_sync_writes().

Changed in v2:
 - fixed a silly typo: the check must be *after* fuse_sync_writes()

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c |9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ad1da83..6cac3dc 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -417,6 +417,15 @@ static int fuse_flush(struct file *file, fl_owner_t id)
fuse_sync_writes(inode);
inode_unlock(inode);
 
+   if (test_bit(AS_ENOSPC, &file->f_mapping->flags) &&
+   test_and_clear_bit(AS_ENOSPC, &file->f_mapping->flags))
+   err = -ENOSPC;
+   if (test_bit(AS_EIO, &file->f_mapping->flags) &&
+   test_and_clear_bit(AS_EIO, &file->f_mapping->flags))
+   err = -EIO;
+   if (err)
+   return err;
+
req = fuse_get_req_nofail_nopages(fc, file);
memset(&inarg, 0, sizeof(inarg));
inarg.fh = ff->fh;



[Devel] [PATCH rh7 1/3] ploop: factor ->write_page() out

2016-07-20 Thread Maxim Patlasov
A simple re-work; no logic changed. This will be useful for the next patch.

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/map.c   |   39 +--
 include/linux/ploop/ploop.h |2 ++
 2 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 1883674..5f7fd66 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -896,6 +896,25 @@ static void copy_index_for_wb(struct page * page, struct map_node * m, int level
}
 }
 
+
+void ploop_index_wb_proceed(struct ploop_request * preq)
+{
+   struct map_node * m = preq->map;
+   struct ploop_delta * top_delta = map_top_delta(m->parent);
+   struct page * page = preq->sinfo.wi.tpage;
+   unsigned long rw = preq->req_index_update_rw;
+   sector_t sec;
+
+   preq->eng_state = PLOOP_E_INDEX_WB;
+
+   top_delta->ops->map_index(top_delta, m->mn_start, &sec);
+
+   __TRACE("wbi-proceed %p %u %p\n", preq, preq->req_cluster, m);
+   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, rw);
+
+   put_page(page);
+}
+
 /* Data write is commited. Now we need to update index. */
 
 void ploop_index_update(struct ploop_request * preq)
@@ -907,8 +926,6 @@ void ploop_index_update(struct ploop_request * preq)
map_index_t blk;
int old_level;
struct page * page;
-   sector_t sec;
-   unsigned long rw;
unsigned long state = READ_ONCE(preq->state);
 
/* No way back, we are going to initiate index write. */
@@ -955,15 +972,13 @@ void ploop_index_update(struct ploop_request * preq)
 
((map_index_t*)page_address(page))[idx] = preq->iblock << ploop_map_log(plo);
 
-   preq->eng_state = PLOOP_E_INDEX_WB;
get_page(page);
preq->sinfo.wi.tpage = page;
 
__TRACE("wbi %p %u %p\n", preq, preq->req_cluster, m);
plo->st.map_single_writes++;
-   top_delta->ops->map_index(top_delta, m->mn_start, &sec);
 
-   rw = (preq->req_rw & (REQ_FUA | REQ_FLUSH));
+   preq->req_index_update_rw = (preq->req_rw & (REQ_FUA | REQ_FLUSH));
 
/* We've just set REQ_FLUSH in rw, ->write_page() below
   will do the FLUSH */
@@ -971,11 +986,9 @@ void ploop_index_update(struct ploop_request * preq)
 
/* Relocate requires consistent index update */
if (state & (PLOOP_REQ_RELOC_A_FL|PLOOP_REQ_RELOC_S_FL))
-   rw |= (REQ_FLUSH | REQ_FUA);
+   preq->req_index_update_rw |= (REQ_FLUSH | REQ_FUA);
 
-   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, rw);
-
-   put_page(page);
+   ploop_index_wb_proceed(preq);
return;
 
 enomem:
@@ -991,6 +1004,7 @@ out:
 }
 EXPORT_SYMBOL(ploop_index_update);
 
+
 int map_index(struct ploop_delta * delta, struct ploop_request * preq, 
unsigned long *sec)
 {
return delta->ops->map_index(delta, preq->map->mn_start, sec);
@@ -1094,7 +1108,6 @@ static void map_wb_complete(struct map_node * m, int err)
struct page * page = NULL;
int delayed = 0;
unsigned int idx;
-   sector_t sec;
unsigned long rw;
 
/* First, complete processing of written back indices,
@@ -1219,11 +1232,9 @@ static void map_wb_complete(struct map_node * m, int err)
 
__TRACE("wbi2 %p %u %p\n", main_preq, main_preq->req_cluster, m);
plo->st.map_multi_writes++;
-   top_delta->ops->map_index(top_delta, m->mn_start, &sec);
 
-   top_delta->io.ops->write_page(&top_delta->io, main_preq, page, sec,
- rw);
-   put_page(page);
+   main_preq->req_index_update_rw = rw;
+   ploop_index_wb_proceed(main_preq);
 }
 
 void
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index 7864edf..3d52f28 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -534,6 +534,7 @@ struct ploop_request
sector_treq_sector;
unsigned intreq_size;
unsigned intreq_rw;
+   unsigned intreq_index_update_rw;
unsigned long   tstamp;
struct io_context   *ioc;
 
@@ -803,6 +804,7 @@ void map_init(struct ploop_device *, struct ploop_map * 
map);
 void ploop_map_start(struct ploop_map * map, u64 bd_size);
 void ploop_map_destroy(struct ploop_map * map);
 void ploop_map_remove_delta(struct ploop_map * map, int level);
+void ploop_index_wb_proceed(struct ploop_request * preq);
 void ploop_index_update(struct ploop_request * preq);
 void ploop_index_wb_complete(struct ploop_request * preq);
 int __init ploop_map_init(void);



[Devel] [PATCH rh7 3/3] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests (v2)

2016-07-20 Thread Maxim Patlasov
Commit 9f860e606 introduced an engine to delay fsync: after doing
fallocate(FALLOC_FL_CONVERT_UNWRITTEN), dio_post_submit marks
the io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens
later, when an incoming FLUSH|FUA arrives.

That was deemed as important because (PSBM-47026):

> This optimization becomes more important due to the fact that customers tend 
> to use pcompact heavily => ploop images grow each day.

Now, we can easily re-use the engine to delay fsync for reloc
requests as well. As explained in the description of commit
5aa3fe09:

> 1->read_data_from_old_pos
> 2->write_to_new_pos
>   ->submit_alloc
>  ->submit_pad
>  ->post_submit->convert_unwritten
> 3->update_index ->write_page with FLUSH|FUA
> 4->nullify_old_pos
>5->issue_flush

by the time of step 3 the extent conversion is not yet stable because it
belongs to an uncommitted transaction. But instead of doing fsync
inside ->post_submit, we can fsync later, as the very first step
of write_page for index_update.

Changed in v2:
 - process delayed fsync asynchronously, via PLOOP_E_FSYNC_PENDED eng_state

https://jira.sw.ru/browse/PSBM-47026

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c   |9 +++--
 drivers/block/ploop/map.c   |   33 +
 include/linux/ploop/ploop.h |2 ++
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index df3eec9..ed60b1f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2720,6 +2720,11 @@ restart:
ploop_index_wb_complete(preq);
break;
 
+   case PLOOP_E_FSYNC_PENDED:
+   /* fsync done */
+   ploop_index_wb_proceed(preq);
+   break;
+
default:
BUG();
}
@@ -4106,7 +4111,7 @@ static void ploop_relocate(struct ploop_device * plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = 0;
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC|REQ_FUA;
+   preq->req_rw = WRITE_SYNC;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
preq->error = 0;
@@ -4410,7 +4415,7 @@ static void ploop_relocblks_process(struct ploop_device 
*plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = ~0U; /* uninitialized */
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC|REQ_FUA;
+   preq->req_rw = WRITE_SYNC;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
preq->error = 0;
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 5f7fd66..01e1064 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -915,6 +915,23 @@ void ploop_index_wb_proceed(struct ploop_request * preq)
put_page(page);
 }
 
+static void ploop_index_wb_proceed_or_delay(struct ploop_request * preq)
+{
+   if (test_and_clear_bit(PLOOP_REQ_FSYNC_IF_DELAYED, &preq->state)) {
+   struct map_node * m = preq->map;
+   struct ploop_delta * top_delta = map_top_delta(m->parent);
+   struct ploop_io * top_io = &top_delta->io;
+
+   if (test_bit(PLOOP_IO_FSYNC_DELAYED, &top_io->io_state)) {
+   preq->eng_state = PLOOP_E_FSYNC_PENDED;
+   ploop_add_req_to_fsync_queue(preq);
+   return;
+   }
+   }
+
+   ploop_index_wb_proceed(preq);
+}
+
 /* Data write is commited. Now we need to update index. */
 
 void ploop_index_update(struct ploop_request * preq)
@@ -985,10 +1002,12 @@ void ploop_index_update(struct ploop_request * preq)
preq->req_rw &= ~REQ_FLUSH;
 
/* Relocate requires consistent index update */
-   if (state & (PLOOP_REQ_RELOC_A_FL|PLOOP_REQ_RELOC_S_FL))
+   if (state & (PLOOP_REQ_RELOC_A_FL|PLOOP_REQ_RELOC_S_FL)) {
preq->req_index_update_rw |= (REQ_FLUSH | REQ_FUA);
+   set_bit(PLOOP_REQ_FSYNC_IF_DELAYED, &preq->state);
+   }
 
-   ploop_index_wb_proceed(preq);
+   ploop_index_wb_proceed_or_delay(preq);
return;
 
 enomem:
@@ -1109,6 +1128,7 @@ static void map_wb_complete(struct map_node * m, int err)
int delayed = 0;
unsigned int idx;
unsigned long rw;
+   int do_fsync_if_delayed = 0;
 
/* First, complete processing of written back indices,
 * finally instantiate indices in mapping cache.
@@ -1206,8 +1226,10 @@ static void map_wb_complete(struct map_node * m, int err)
 
state = READ_ONCE(preq->state);
/* Rel

[Devel] [PATCH rh7 2/3] ploop: factor out add preq to fsync_queue

2016-07-20 Thread Maxim Patlasov
Simple re-work. No logic changed. Will be useful for the next patch.

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c   |   19 ++-
 include/linux/ploop/ploop.h |1 +
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 3dc94ca..df3eec9 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2028,6 +2028,19 @@ static inline bool preq_is_special(struct ploop_request 
* preq)
PLOOP_REQ_ZERO_FL);
 }
 
+void ploop_add_req_to_fsync_queue(struct ploop_request * preq)
+{
+   struct ploop_device * plo   = preq->plo;
+   struct ploop_delta  * top_delta = ploop_top_delta(plo);
+   struct ploop_io * top_io= &top_delta->io;
+
+   spin_lock_irq(&plo->lock);
+   list_add_tail(&preq->list, &top_io->fsync_queue);
+   if (waitqueue_active(&top_io->fsync_waitq))
+   wake_up_interruptible(&top_io->fsync_waitq);
+   spin_unlock_irq(&plo->lock);
+}
+
 static void
 ploop_entry_request(struct ploop_request * preq)
 {
@@ -2053,11 +2066,7 @@ ploop_entry_request(struct ploop_request * preq)
if ((preq->req_rw & REQ_FLUSH) &&
test_bit(PLOOP_IO_FSYNC_DELAYED, &top_io->io_state) &&
!test_bit(PLOOP_REQ_FSYNC_DONE, &preq->state)) {
-   spin_lock_irq(&plo->lock);
-   list_add_tail(&preq->list, &top_io->fsync_queue);
-   if (waitqueue_active(&top_io->fsync_waitq))
-   wake_up_interruptible(&top_io->fsync_waitq);
-   spin_unlock_irq(&plo->lock);
+   ploop_add_req_to_fsync_queue(preq);
return;
}
 
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index 3d52f28..d8e01b6 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -809,6 +809,7 @@ void ploop_index_update(struct ploop_request * preq);
 void ploop_index_wb_complete(struct ploop_request * preq);
 int __init ploop_map_init(void);
 void ploop_map_exit(void);
+void ploop_add_req_to_fsync_queue(struct ploop_request * preq);
 
 
 void ploop_quiesce(struct ploop_device * plo);



[Devel] [PATCH rh7 3/3] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests (v3)

2016-07-20 Thread Maxim Patlasov
Commit 9f860e606 introduced an engine to delay fsync: after doing
fallocate(FALLOC_FL_CONVERT_UNWRITTEN), dio_post_submit marks
the io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens
later, when an incoming FLUSH|FUA arrives.

That was deemed as important because (PSBM-47026):

> This optimization becomes more important due to the fact that customers tend 
> to use pcompact heavily => ploop images grow each day.

Now, we can easily re-use the engine to delay fsync for reloc
requests as well. As explained in the description of commit
5aa3fe09:

> 1->read_data_from_old_pos
> 2->write_to_new_pos
>   ->submit_alloc
>  ->submit_pad
>  ->post_submit->convert_unwritten
> 3->update_index ->write_page with FLUSH|FUA
> 4->nullify_old_pos
>5->issue_flush

by the time of step 3 the extent conversion is not yet stable because it
belongs to an uncommitted transaction. But instead of doing fsync
inside ->post_submit, we can fsync later, as the very first step
of write_page for index_update.

Changed in v2:
 - process delayed fsync asynchronously, via PLOOP_E_FSYNC_PENDED eng_state

Changed in v3:
 - use extra arg for ploop_index_wb_proceed_or_delay() instead of ad-hoc 
PLOOP_REQ_FSYNC_IF_DELAYED

https://jira.sw.ru/browse/PSBM-47026

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/dev.c   |9 +++--
 drivers/block/ploop/map.c   |   32 
 include/linux/ploop/ploop.h |1 +
 3 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index df3eec9..ed60b1f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2720,6 +2720,11 @@ restart:
ploop_index_wb_complete(preq);
break;
 
+   case PLOOP_E_FSYNC_PENDED:
+   /* fsync done */
+   ploop_index_wb_proceed(preq);
+   break;
+
default:
BUG();
}
@@ -4106,7 +4111,7 @@ static void ploop_relocate(struct ploop_device * plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = 0;
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC|REQ_FUA;
+   preq->req_rw = WRITE_SYNC;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
preq->error = 0;
@@ -4410,7 +4415,7 @@ static void ploop_relocblks_process(struct ploop_device 
*plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = ~0U; /* uninitialized */
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC|REQ_FUA;
+   preq->req_rw = WRITE_SYNC;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
preq->error = 0;
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 5f7fd66..715dc15 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -915,6 +915,24 @@ void ploop_index_wb_proceed(struct ploop_request * preq)
put_page(page);
 }
 
+static void ploop_index_wb_proceed_or_delay(struct ploop_request * preq,
+   int do_fsync_if_delayed)
+{
+   if (do_fsync_if_delayed) {
+   struct map_node * m = preq->map;
+   struct ploop_delta * top_delta = map_top_delta(m->parent);
+   struct ploop_io * top_io = &top_delta->io;
+
+   if (test_bit(PLOOP_IO_FSYNC_DELAYED, &top_io->io_state)) {
+   preq->eng_state = PLOOP_E_FSYNC_PENDED;
+   ploop_add_req_to_fsync_queue(preq);
+   return;
+   }
+   }
+
+   ploop_index_wb_proceed(preq);
+}
+
 /* Data write is commited. Now we need to update index. */
 
 void ploop_index_update(struct ploop_request * preq)
@@ -927,6 +945,7 @@ void ploop_index_update(struct ploop_request * preq)
int old_level;
struct page * page;
unsigned long state = READ_ONCE(preq->state);
+   int do_fsync_if_delayed = 0;
 
/* No way back, we are going to initiate index write. */
 
@@ -985,10 +1004,12 @@ void ploop_index_update(struct ploop_request * preq)
preq->req_rw &= ~REQ_FLUSH;
 
/* Relocate requires consistent index update */
-   if (state & (PLOOP_REQ_RELOC_A_FL|PLOOP_REQ_RELOC_S_FL))
+   if (state & (PLOOP_REQ_RELOC_A_FL|PLOOP_REQ_RELOC_S_FL)) {
preq->req_index_update_rw |= (REQ_FLUSH | REQ_FUA);
+   do_fsync_if_delayed = 1;
+   }
 
-   ploop_index_wb_proceed(preq);
+   ploop_index_wb_proceed_or_delay(preq, do_fsync_if_delayed);
return;
 
 enomem:
@@ -1109,6 +1130,7 @@ static void map_wb_complete(struct map_node * m, int err)

[Devel] Bug 124651 - ext4 bugon panic when I mmap a file

2016-07-22 Thread Maxim Patlasov

Dima,


Just in case, does this:


https://bugzilla.kernel.org/show_bug.cgi?id=124651


affect us?


Thanks,

Maxim



[Devel] [PATCH rh7] ovl: verify upper dentry in ovl_remove_and_whiteout()

2016-07-22 Thread Maxim Patlasov
Backport from git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git:

commit cfc9fde0b07c3b44b570057c5f93dda59dca1c94
Author: Maxim Patlasov 
Date:   Thu Jul 21 18:24:26 2016 -0700

ovl: verify upper dentry in ovl_remove_and_whiteout()

The upper dentry may become stale before we call ovl_lock_rename_workdir.
For example, someone could (mistakenly or maliciously) manually unlink(2)
it directly from upperdir.

To ensure it is not stale, let's look it up after ovl_lock_rename_workdir
and check if it matches the upper dentry.

Essentially, it is the same problem and a similar solution as in
commit 11f3710417d0 ("ovl: verify upper dentry before unlink and rename").

Signed-off-by: Maxim Patlasov 
Signed-off-by: Miklos Szeredi 
Cc: 

https://jira.sw.ru/browse/PSBM-47981
---
 fs/overlayfs/dir.c |   54 +++-
 1 file changed, 24 insertions(+), 30 deletions(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 229b9e4..5402b9b 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -511,6 +511,7 @@ static int ovl_remove_and_whiteout(struct dentry *dentry, 
bool is_dir)
struct dentry *upper;
struct dentry *opaquedir = NULL;
int err;
+   int flags = 0;
 
if (WARN_ON(!workdir))
return -EROFS;
@@ -540,46 +541,39 @@ static int ovl_remove_and_whiteout(struct dentry *dentry, 
bool is_dir)
if (err)
goto out_dput;
 
-   whiteout = ovl_whiteout(workdir, dentry);
-   err = PTR_ERR(whiteout);
-   if (IS_ERR(whiteout))
+   upper = lookup_one_len(dentry->d_name.name, upperdir,
+  dentry->d_name.len);
+   err = PTR_ERR(upper);
+   if (IS_ERR(upper))
goto out_unlock;
 
-   upper = ovl_dentry_upper(dentry);
-   if (!upper) {
-   upper = lookup_one_len(dentry->d_name.name, upperdir,
-  dentry->d_name.len);
-   err = PTR_ERR(upper);
-   if (IS_ERR(upper))
-   goto kill_whiteout;
-
-   err = ovl_do_rename(wdir, whiteout, udir, upper, 0);
-   dput(upper);
-   if (err)
-   goto kill_whiteout;
-   } else {
-   int flags = 0;
+   err = -ESTALE;
+   if ((opaquedir && upper != opaquedir) ||
+   (!opaquedir && ovl_dentry_upper(dentry) &&
+upper != ovl_dentry_upper(dentry))) {
+   goto out_dput_upper;
+   }
 
-   if (opaquedir)
-   upper = opaquedir;
-   err = -ESTALE;
-   if (upper->d_parent != upperdir)
-   goto kill_whiteout;
+   whiteout = ovl_whiteout(workdir, dentry);
+   err = PTR_ERR(whiteout);
+   if (IS_ERR(whiteout))
+   goto out_dput_upper;
 
-   if (is_dir)
-   flags |= RENAME_EXCHANGE;
+   if (d_is_dir(upper))
+   flags = RENAME_EXCHANGE;
 
-   err = ovl_do_rename(wdir, whiteout, udir, upper, flags);
-   if (err)
-   goto kill_whiteout;
+   err = ovl_do_rename(wdir, whiteout, udir, upper, flags);
+   if (err)
+   goto kill_whiteout;
+   if (flags)
+   ovl_cleanup(wdir, upper);
 
-   if (is_dir)
-   ovl_cleanup(wdir, upper);
-   }
ovl_dentry_version_inc(dentry->d_parent);
 out_d_drop:
d_drop(dentry);
dput(whiteout);
+out_dput_upper:
+   dput(upper);
 out_unlock:
unlock_rename(workdir, upperdir);
 out_dput:



[Devel] [PATCH rh7] ext4: ext4_mkdir must set S_IOPS_WRAPPER bit

2016-07-25 Thread Maxim Patlasov
ext4_iget() sets this bit for directories. Let's do the same in ext4_mkdir().
Otherwise, the behaviour of vfs_rename (on top of ext4) varies depending on
how the in-core inode was born: via lookup or mkdir.

The key place in vfs_rename sensitive to the change is:

>   if (flags && !rename2)
>   return -EINVAL;

Signed-off-by: Maxim Patlasov 
---
 fs/ext4/namei.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 0adc6df..bebe698 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2413,6 +2413,7 @@ retry:
 
inode->i_op = &ext4_dir_inode_operations.ops;
inode->i_fop = &ext4_dir_operations;
+   inode->i_flags |= S_IOPS_WRAPPER;
err = ext4_init_new_dir(handle, dir, inode);
if (err)
goto out_clear_inode;



[Devel] [PATCH rh7 2/4] overlayfs: introduce d_select_inode dentry operation

2016-07-26 Thread Maxim Patlasov
The patch is a simplified defensive backport of upstream commit
4bacc9c9234c7c8eec44f5ed4e960d9f96fa0f01:

>overlayfs: Make f_path always point to the overlay and f_inode to the 
> underlay
>
>Make file->f_path always point to the overlay dentry so that the path in
>/proc/pid/fd is correct and to ensure that label-based LSMs have access to 
> the
>overlay as well as the underlay (path-based LSMs probably don't need it).
>...
>Signed-off-by: David Howells 
>Signed-off-by: Al Viro 

The original patch is prone to errors because other parts of the Linux kernel
are not prepared for such a change in the semantics of f_path and f_inode.
So, we only backport a simplified d_select_inode, keeping f_path and f_inode
intact.

https://jira.sw.ru/browse/PSBM-47981

Signed-off-by: Maxim Patlasov 
---
 fs/dcache.c  |5 -
 fs/overlayfs/inode.c |   15 +++
 fs/overlayfs/overlayfs.h |1 +
 fs/overlayfs/super.c |2 ++
 include/linux/dcache.h   |2 ++
 5 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 6433814..7db8aef 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1635,7 +1635,8 @@ void d_set_d_op(struct dentry *dentry, const struct 
dentry_operations *op)
DCACHE_OP_COMPARE   |
DCACHE_OP_REVALIDATE|
DCACHE_OP_WEAK_REVALIDATE   |
-   DCACHE_OP_DELETE ));
+   DCACHE_OP_DELETE|
+   DCACHE_OP_SELECT_INODE));
dentry->d_op = op;
if (!op)
return;
@@ -1651,6 +1652,8 @@ void d_set_d_op(struct dentry *dentry, const struct 
dentry_operations *op)
dentry->d_flags |= DCACHE_OP_DELETE;
if (op->d_prune)
dentry->d_flags |= DCACHE_OP_PRUNE;
+   if (op->d_select_inode)
+   dentry->d_flags |= DCACHE_OP_SELECT_INODE;
 
 }
 EXPORT_SYMBOL(d_set_d_op);
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 1d2c32f..5fe7acf 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -368,6 +368,21 @@ out:
return err;
 }
 
+struct inode *ovl_d_select_inode(struct dentry *dentry)
+{
+   struct path realpath;
+
+   if (d_is_dir(dentry))
+   return d_backing_inode(dentry);
+
+   ovl_path_real(dentry, &realpath);
+
+   if (realpath.dentry->d_flags & DCACHE_OP_SELECT_INODE)
+   return realpath.dentry->d_op->d_select_inode(realpath.dentry);
+
+   return d_backing_inode(realpath.dentry);
+}
+
 static const struct inode_operations_wrapper ovl_file_inode_operations = {
.ops = {
.setattr= ovl_setattr,
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 45d183b..8da9684 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -173,6 +173,7 @@ ssize_t ovl_getxattr(struct dentry *dentry, const char 
*name,
 void *value, size_t size);
 ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size);
 int ovl_removexattr(struct dentry *dentry, const char *name);
+struct inode *ovl_d_select_inode(struct dentry *dentry);
 
 struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
struct ovl_entry *oe);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index d5c57b4..24ec90b 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -316,10 +316,12 @@ static int ovl_dentry_weak_revalidate(struct dentry 
*dentry, unsigned int flags)
 
 static const struct dentry_operations ovl_dentry_operations = {
.d_release = ovl_dentry_release,
+   .d_select_inode = ovl_d_select_inode,
 };
 
 static const struct dentry_operations ovl_reval_dentry_operations = {
.d_release = ovl_dentry_release,
+   .d_select_inode = ovl_d_select_inode,
.d_revalidate = ovl_dentry_revalidate,
.d_weak_revalidate = ovl_dentry_weak_revalidate,
 };
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 2f6e4d8..267dbc6 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -160,6 +160,7 @@ struct dentry_operations {
char *(*d_dname)(struct dentry *, char *, int);
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(struct dentry *, bool);
+   struct inode *(*d_select_inode)(struct dentry *);
 } cacheline_aligned;
 
 /*
@@ -221,6 +222,7 @@ struct dentry_operations {
 #define DCACHE_FILE_TYPE   0x0400 /* Other file type */
 
 #define DCACHE_MAY_FREE0x0080
+#define DCACHE_OP_SELECT_INODE 0x2000 /* Unioned entry: dcache op 
selects inode */
 
 extern seqlock_t rename_lock;
 



[Devel] [PATCH rh7 0/4] overlayfs: missing detection of hardlinks in vfs_rename() on overlayfs

2016-07-26 Thread Maxim Patlasov
As rightly explained in CVE-2016-6198
(https://bugzilla.redhat.com/show_bug.cgi?id=1355654):

> It was found that the vfs_rename() function did not detect hard links on
> overlayfs. A local, unprivileged user could use the rename syscall on
> overlayfs on top of xfs to crash the system.

The series backport necessary bits from upstream to fix it.

---

Maxim Patlasov (4):
  VFS: Introduce inode-getting helpers for layered/unioned fs environments
  overlayfs: introduce d_select_inode dentry operation
  vfs: add vfs_select_inode() helper
  vfs: rename: check backing inode being equal


 fs/dcache.c  |5 +++
 fs/namei.c   |6 +++-
 fs/overlayfs/inode.c |   15 ++
 fs/overlayfs/overlayfs.h |1 +
 fs/overlayfs/super.c |2 +
 include/linux/dcache.h   |   69 ++
 6 files changed, 96 insertions(+), 2 deletions(-)



[Devel] [PATCH rh7 3/4] vfs: add vfs_select_inode() helper

2016-07-26 Thread Maxim Patlasov
The patch backports upstream commit 54d5ca871e72f2bb172ec9323497f01cd5091ec7:

>vfs: add vfs_select_inode() helper
>
>Signed-off-by: Miklos Szeredi 

The part about vfs_open is omitted because we don't use
d_op->d_select_inode() there. Our version of vfs_select_inode()
doesn't have "open_flags" arg because our d_select_inode()
doesn't have it.

https://jira.sw.ru/browse/PSBM-47981

Signed-off-by: Maxim Patlasov 
---
 include/linux/dcache.h |   10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 267dbc6..897814a 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -529,4 +529,14 @@ static inline struct dentry *d_backing_dentry(struct 
dentry *upper)
return upper;
 }
 
+static inline struct inode *vfs_select_inode(struct dentry *dentry)
+{
+   struct inode *inode = d_inode(dentry);
+
+   if (inode && unlikely(dentry->d_flags & DCACHE_OP_SELECT_INODE))
+   inode = dentry->d_op->d_select_inode(dentry);
+
+   return inode;
+}
+
 #endif /* __LINUX_DCACHE_H */



[Devel] [PATCH rh7 4/4] vfs: rename: check backing inode being equal

2016-07-26 Thread Maxim Patlasov
The patch backports upstream commit 9409e22acdfc9153f88d9b1ed2bd2a5b34d2d3ca:

> vfs: rename: check backing inode being equal
>
> If a file is renamed to a hardlink of itself POSIX specifies that rename(2)
> should do nothing and return success.
>
> This condition is checked in vfs_rename().  However it won't detect hard
> links on overlayfs where these are given separate inodes on the overlayfs
> layer.
>
> Overlayfs itself detects this condition and returns success without doing
> anything, but then vfs_rename() will proceed as if this was a successful
> rename (detach_mounts(), d_move()).
>
> The correct thing to do is to detect this condition before even calling
> into overlayfs.  This patch does this by calling vfs_select_inode() to get
> the underlying inodes.
>
> Signed-off-by: Miklos Szeredi 
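
As a minimal standalone illustration of the POSIX rule above (hypothetical
userspace test, not part of the patch):

	#include <stdio.h>
	#include <unistd.h>
	#include <fcntl.h>

	int main(void)
	{
		int fd = open("a", O_CREAT | O_WRONLY, 0644);
		int ret;

		close(fd);
		link("a", "b");            /* "b" becomes a hard link to "a" */
		ret = rename("a", "b");    /* same inode: must do nothing, return 0 */
		printf("rename returned %d\n", ret); /* expect 0; both names remain */
		return 0;
	}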

https://jira.sw.ru/browse/PSBM-47981

Signed-off-by: Maxim Patlasov 
---
 fs/namei.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 16820b1..427c740 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4103,7 +4103,11 @@ int vfs_rename(struct inode *old_dir, struct dentry 
*old_dentry,
unsigned max_links = new_dir->i_sb->s_max_links;
iop_rename2_t rename2;
 
-   if (source == target)
+   /*
+* Check source == target.
+* On overlayfs need to look at underlying inodes.
+*/
+   if (vfs_select_inode(old_dentry) == vfs_select_inode(new_dentry))
return 0;
 
error = may_delete(old_dir, old_dentry, is_dir);



[Devel] [PATCH rh7 1/4] VFS: Introduce inode-getting helpers for layered/unioned fs environments

2016-07-26 Thread Maxim Patlasov
The patch backports upstream commit 155e35d4daa804582f75acaa2c74ec797a89c615:

> VFS: Introduce inode-getting helpers for layered/unioned fs environments
>
> Introduce some function for getting the inode (and also the dentry) in an
> environment where layered/unioned filesystems are in operation.
>
> The problem is that we have places where we need *both* the union dentry and
> the lower source or workspace inode or dentry available, but we can only have
> a handle on one of them.  Therefore we need to derive the handle to the other
> from that.
>
> The idea is to introduce an extra field in struct dentry that allows the union
> dentry to refer to and pin the lower dentry.
>
> Signed-off-by: David Howells 
> Signed-off-by: Al Viro 

https://jira.sw.ru/browse/PSBM-47981

Signed-off-by: Maxim Patlasov 
---
 include/linux/dcache.h |   57 
 1 file changed, 57 insertions(+)

diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index b55fb2e..2f6e4d8 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -470,4 +470,61 @@ static inline unsigned long vfs_pressure_ratio(unsigned 
long val)
 {
return mult_frac(val, sysctl_vfs_cache_pressure, 100);
 }
+
+/**
+ * d_inode - Get the actual inode of this dentry
+ * @dentry: The dentry to query
+ *
+ * This is the helper normal filesystems should use to get at their own inodes
+ * in their own dentries and ignore the layering superimposed upon them.
+ */
+static inline struct inode *d_inode(const struct dentry *dentry)
+{
+   return dentry->d_inode;
+}
+
+/**
+ * d_inode_rcu - Get the actual inode of this dentry with ACCESS_ONCE()
+ * @dentry: The dentry to query
+ *
+ * This is the helper normal filesystems should use to get at their own inodes
+ * in their own dentries and ignore the layering superimposed upon them.
+ */
+static inline struct inode *d_inode_rcu(const struct dentry *dentry)
+{
+   return ACCESS_ONCE(dentry->d_inode);
+}
+
+/**
+ * d_backing_inode - Get upper or lower inode we should be using
+ * @upper: The upper layer
+ *
+ * This is the helper that should be used to get at the inode that will be used
+ * if this dentry were to be opened as a file.  The inode may be on the upper
+ * dentry or it may be on a lower dentry pinned by the upper.
+ *
+ * Normal filesystems should not use this to access their own inodes.
+ */
+static inline struct inode *d_backing_inode(const struct dentry *upper)
+{
+   struct inode *inode = upper->d_inode;
+
+   return inode;
+}
+
+/**
+ * d_backing_dentry - Get upper or lower dentry we should be using
+ * @upper: The upper layer
+ *
+ * This is the helper that should be used to get the dentry of the inode that
+ * will be used if this dentry were opened as a file.  It may be the upper
+ * dentry or it may be a lower dentry pinned by the upper.
+ *
+ * Normal filesystems should not use this to access their own dentries.
+ */
+static inline struct dentry *d_backing_dentry(struct dentry *upper)
+{
+   return upper;
+}
+
 #endif /* __LINUX_DCACHE_H */



Re: [Devel] [PATCH rh7 3/3] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests (v3)

2016-07-27 Thread Maxim Patlasov

Dima,


One week elapsed, still no feedback from you. Do you have something 
against this patch?



Thanks,

Maxim


On 07/20/2016 11:21 PM, Maxim Patlasov wrote:

Commit 9f860e606 introduced an engine to delay fsync: after doing
fallocate(FALLOC_FL_CONVERT_UNWRITTEN), dio_post_submit marks
the io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens
later, when an incoming FLUSH|FUA arrives.

That was deemed as important because (PSBM-47026):


This optimization becomes more important due to the fact that customers tend to 
use pcompact heavily => ploop images grow each day.

Now, we can easily re-use the engine to delay fsync for reloc
requests as well. As explained in the description of commit
5aa3fe09:


 1->read_data_from_old_pos
 2->write_to_new_pos
   ->submit_alloc
  ->submit_pad
  ->post_submit->convert_unwritten
 3->update_index ->write_page with FLUSH|FUA
 4->nullify_old_pos
5->issue_flush

by the time of step 3 the extent conversion is not yet stable because it
belongs to an uncommitted transaction. But instead of doing fsync
inside ->post_submit, we can fsync later, as the very first step
of write_page for index_update.

Changed in v2:
  - process delayed fsync asynchronously, via PLOOP_E_FSYNC_PENDED eng_state

Changed in v3:
  - use extra arg for ploop_index_wb_proceed_or_delay() instead of ad-hoc 
PLOOP_REQ_FSYNC_IF_DELAYED

https://jira.sw.ru/browse/PSBM-47026

Signed-off-by: Maxim Patlasov 
---
  drivers/block/ploop/dev.c   |9 +++--
  drivers/block/ploop/map.c   |   32 
  include/linux/ploop/ploop.h |1 +
  3 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index df3eec9..ed60b1f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2720,6 +2720,11 @@ restart:
ploop_index_wb_complete(preq);
break;
  
+	case PLOOP_E_FSYNC_PENDED:

+   /* fsync done */
+   ploop_index_wb_proceed(preq);
+   break;
+
default:
BUG();
}
@@ -4106,7 +4111,7 @@ static void ploop_relocate(struct ploop_device * plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = 0;
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC|REQ_FUA;
+   preq->req_rw = WRITE_SYNC;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
preq->error = 0;
@@ -4410,7 +4415,7 @@ static void ploop_relocblks_process(struct ploop_device 
*plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = ~0U; /* uninitialized */
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC|REQ_FUA;
+   preq->req_rw = WRITE_SYNC;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
preq->error = 0;
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 5f7fd66..715dc15 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -915,6 +915,24 @@ void ploop_index_wb_proceed(struct ploop_request * preq)
put_page(page);
  }
  
+static void ploop_index_wb_proceed_or_delay(struct ploop_request * preq,

+   int do_fsync_if_delayed)
+{
+   if (do_fsync_if_delayed) {
+   struct map_node * m = preq->map;
+   struct ploop_delta * top_delta = map_top_delta(m->parent);
+   struct ploop_io * top_io = &top_delta->io;
+
+   if (test_bit(PLOOP_IO_FSYNC_DELAYED, &top_io->io_state)) {
+   preq->eng_state = PLOOP_E_FSYNC_PENDED;
+   ploop_add_req_to_fsync_queue(preq);
+   return;
+   }
+   }
+
+   ploop_index_wb_proceed(preq);
+}
+
  /* Data write is commited. Now we need to update index. */
  
  void ploop_index_update(struct ploop_request * preq)

@@ -927,6 +945,7 @@ void ploop_index_update(struct ploop_request * preq)
int old_level;
struct page * page;
unsigned long state = READ_ONCE(preq->state);
+   int do_fsync_if_delayed = 0;
  
  	/* No way back, we are going to initiate index write. */
  
@@ -985,10 +1004,12 @@ void ploop_index_update(struct ploop_request * preq)

preq->req_rw &= ~REQ_FLUSH;
  
  	/* Relocate requires consistent index update */

-   if (state & (PLOOP_REQ_RELOC_A_FL|PLOOP_REQ_RELOC_S_FL))
+   if (state & (PLOOP_REQ_RELOC_A_FL|PLOOP_REQ_RELOC_S_FL)) {
preq->req_index_update_rw |= (REQ_FLUSH | REQ_FUA);
+   do_fsync_if_delayed = 1;
+   }
  
-	ploop_index_wb_proceed(preq);

+   ploop_index_wb_proceed_or_delay(preq, do_fsync_if_delayed);

Re: [Devel] [PATCH rh7] ext4: ext4_mkdir must set S_IOPS_WRAPPER bit

2016-07-29 Thread Maxim Patlasov
Kostya, ms is not affected. RedHat bz ticket:
https://bugzilla.redhat.com/show_bug.cgi?id=1361682



On 07/29/2016 08:15 AM, Konstantin Khorenko wrote:

Maxim, will you send the patch to mainstream as well?

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 07/26/2016 12:01 AM, Maxim Patlasov wrote:
ext4_iget() sets this bit for directories. Let's do the same in ext4_mkdir().
Otherwise, the behaviour of vfs_rename (on top of ext4) varies depending on
how the in-core inode was born: via lookup or mkdir.

The key place in vfs_rename sensitive to the change is:


if (flags && !rename2)
return -EINVAL;


Signed-off-by: Maxim Patlasov 
---
 fs/ext4/namei.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 0adc6df..bebe698 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2413,6 +2413,7 @@ retry:

 inode->i_op = &ext4_dir_inode_operations.ops;
 inode->i_fop = &ext4_dir_operations;
+inode->i_flags |= S_IOPS_WRAPPER;
 err = ext4_init_new_dir(handle, dir, inode);
 if (err)
 goto out_clear_inode;



[Devel] [PATCH rh7] overlayfs: fix dentry reference leak

2016-07-29 Thread Maxim Patlasov
Without this patch it is easy to crash the node by fiddling
with overlayfs dirs. Backport of commit ab79efab0 from ms:

From: David Howells 

In ovl_copy_up_locked(), newdentry is leaked if the function exits through
out_cleanup as this just goes to out after calling ovl_cleanup() - which
doesn't actually release the ref on newdentry.

The out_cleanup segment should instead exit through out2 as certainly
newdentry leaks - and possibly upper does also, though this isn't caught
given the catch of newdentry.

Without this fix, something like the following is seen:

BUG: Dentry 880023e9eb20{i=f861,n=#880023e82d90} still in use (1) 
[unmount of tmpfs tmpfs]
BUG: Dentry 880023ece640{i=0,n=bigfile}  still in use (1) [unmount of 
tmpfs tmpfs]

when unmounting the upper layer after an error occurred in copyup.

An error can be induced by creating a big file in a lower layer with
something like:

dd if=/dev/zero of=/lower/a/bigfile bs=65536 count=1 seek=$((0xf000))

to create a large file (4.1G).  Overlay an upper layer that is too small
(on tmpfs might do) and then induce a copy up by opening it writably.

Reported-by: Ulrich Obergfell 
Signed-off-by: David Howells 
Signed-off-by: Miklos Szeredi 

https://jira.sw.ru/browse/PSBM-47981
---
 fs/overlayfs/copy_up.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 3f3d1b0..afed35c 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -299,7 +299,7 @@ out:
 
 out_cleanup:
ovl_cleanup(wdir, newdentry);
-   goto out;
+   goto out2;
 }
 
 /*



[Devel] [PATCH rh7] mm: fix truncate_inode_pages_range() for filesystems without buffer-heads

2016-08-03 Thread Maxim Patlasov
File systems that don't use buffer-heads must not suffer from
the lack of the ->invalidatepage_range() address_space operation.
The logic of partial start/end truncation already implemented
in truncate_inode_pages_range() must suffice for them.
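
A worked example of that partial start/end logic (assuming PAGE_CACHE_SIZE ==
4096), using the formulas from the function below:

	/*
	 * Truncating the byte range [lstart = 1000, lend = 9999]:
	 *
	 *   partial_start = lstart & (PAGE_CACHE_SIZE - 1)     = 1000 & 4095 = 1000
	 *   partial_end   = (lend + 1) & (PAGE_CACHE_SIZE - 1) = 10000 & 4095 = 1808
	 *
	 * The first page is zeroed from byte 1000 to its end, the last page is
	 * zeroed from byte 0 up to byte 1808, and the pages fully inside the
	 * range are simply dropped -- none of which needs buffer-heads.
	 */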

https://jira.sw.ru/browse/PSBM-50629

Signed-off-by: Maxim Patlasov 
---
 mm/truncate.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 8dcfe94..cc852aa 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -274,6 +274,7 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
pgoff_t indices[PAGEVEC_SIZE];
pgoff_t index;
int i;
+   int bug_if_page_has_bh = 0;
 
cleancache_invalidate_inode(mapping);
if (mapping->nrpages == 0 && mapping->nrshadows == 0)
@@ -283,7 +284,7 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
partial_start = lstart & (PAGE_CACHE_SIZE - 1);
partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
if (!inode_has_invalidate_range(mapping->host))
-   BUG_ON(partial_end);
+   bug_if_page_has_bh = 1;
 
/*
 * 'start' and 'end' always covers the range of pages to be fully
@@ -368,9 +369,11 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
wait_on_page_writeback(page);
zero_user_segment(page, 0, partial_end);
cleancache_invalidate_page(mapping, page);
-   if (page_has_private(page))
+   if (page_has_private(page)) {
+   BUG_ON(bug_if_page_has_bh);
do_invalidatepage_range(page, 0,
partial_end);
+   }
unlock_page(page);
page_cache_release(page);
}



Re: [Devel] [PATCH criu] files: Allow to dump ploopX files opened

2016-08-11 Thread Maxim Patlasov

On 08/04/2016 06:43 AM, Cyrill Gorcunov wrote:


On Thu, Aug 04, 2016 at 04:34:57PM +0300, Pavel Emelyanov wrote:
  
+static int check_blkdev(struct fd_parms *p, int lfd)

+{
+   /*
+* @ploop_major is module parameter actually,
+* set to PLOOP_DEVICE_MAJOR by default. We may
+* need to scan module params or access
+* /sys/block/ploopX/dev to fetch major.
+*
+* For a while simply use predefined @major.
+*/
+   static const int ploop_major = 182;

Major numbers are typically macro-defined and sit in some header ;)

I don't want to spread Vz7 specific code into headers and such.


+   int maj = major(p->stat.st_rdev);
+
+   /*
+* It's been found that systemd-udevd sometimes
+* opens-up ploop device from inside of container,
+* so allow him to do that.
+*/
+   if (maj == ploop_major)
+   return 0;

This worries me :( Ploop has some internal state, so if we catch a proggie
that configures that state, live migrate it and just re-open the ploop it
will __continue__ configuring that state and, since we've re-opened the
ploop from the beginning, this configuration will continue with an error...

Then we need to lift up this code, and someone from the ploop camp should help
me gather the ploop props which we could dump into the image and restore later.


Assuming that we refuse to checkpoint ploop if its maintenance state != 
OFF, the following:

/sys/block/ploopX/pstate/*
/sys/block/ploopX/pdelta/*/*
may be enough. If it's not, let me know, and we'll think how to fix it.

Btw, please keep all the code interacting with ploop (ioctls, 
/sys/block/ploopX/*) in libploop.
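
For the major-number part mentioned in the patch comment, a minimal sketch
(hypothetical helper; reads /sys/block/ploopX/dev instead of hardcoding 182):

	#include <stdio.h>

	static int ploop_major_from_sysfs(const char *disk) /* e.g. "ploop12345" */
	{
		char path[128];
		int maj, min;
		FILE *f;

		snprintf(path, sizeof(path), "/sys/block/%s/dev", disk);
		f = fopen(path, "r");       /* file contains "major:minor" */
		if (!f)
			return -1;
		if (fscanf(f, "%d:%d", &maj, &min) != 2)
			maj = -1;
		fclose(f);
		return maj;
	}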


Thanks,
Maxim




Re: [Devel] [PATCH rh7] ploop: add support for dm-crypted ploops

2016-08-17 Thread Maxim Patlasov

Andrey,


ploop_freeze() must use __ploop_get_dm_crypt_bdev(), not 
ploop_get_dm_crypt_bdev()!



Another problem is that the patch does not ensure that 
plo->dm_crypt_bdev stays the same between PLOOP_IOC_FREEZE/THAW. I could 
easily get:


[  636.600190] BUG: unable to handle kernel NULL pointer dereference at 
0068

[  636.600934] IP: [] up_read+0x13/0x30
[  636.601446] PGD 222f5a067 PUD 224bc6067 PMD 0
[  636.601915] Oops: 0002 [#1] SMP
...

[  636.623264] Call Trace:
[  636.623597]  [] drop_super+0x16/0x30
[  636.624104]  [] freeze_bdev+0x60/0xe0
[  636.624627]  [] ploop_ioctl+0x2b8e/0x2c80 [ploop]
[  636.625237]  [] ? handle_mm_fault+0x5b4/0xf50
[  636.625822]  [] blkdev_ioctl+0x2df/0x770
[  636.626439]  [] block_ioctl+0x41/0x50
[  636.627162]  [] do_vfs_ioctl+0x255/0x4f0
[  636.627827]  [] SyS_ioctl+0x54/0xa0

due to this problem. A similar problem might exist in ploop_snapshot(): you
bdput plo->dm_crypt_bdev in find_and_freeze_bdev(), so it can disappear by the
time we need it for thaw_bdev after calling complete_snapshot.



Btw, it's usually a good idea to give a patch some simple testing prior to
sending it for review ;)



Thanks,

Maxim



On 08/15/2016 10:27 AM, Andrey Ryabinin wrote:

On a dm-crypted ploop the fs is mounted not on the ploop but on the dm-crypt
device. Thus the freeze/thaw used by some of ploop's ioctls doesn't freeze/thaw
the filesystem. To fix that, we store a pointer to the dm-crypt block device
inside the ploop_device struct, and use it to freeze/thaw the filesystem.

https://jira.sw.ru/browse/PSBM-50858

Signed-off-by: Andrey Ryabinin 
---
  drivers/block/ploop/dev.c   | 27 +--
  drivers/block/ploop/io_direct.c |  9 +
  drivers/md/dm-crypt.c   |  8 +++-
  drivers/md/dm.c |  6 ++
  drivers/md/dm.h |  2 ++
  include/linux/ploop/ploop.h | 38 ++
  6 files changed, 87 insertions(+), 3 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 83b0e32..acc120b 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3318,13 +3318,22 @@ void ploop_relax(struct ploop_device * plo)
  }
  
  /* search disk for first partition bdev with mounted fs and freeze it */

-static struct super_block *find_and_freeze_bdev(struct gendisk *disk,
+static struct super_block *find_and_freeze_bdev(struct ploop_device *plo,
struct block_device ** bdev_pp)
  {
struct super_block  * sb   = NULL;
struct block_device * bdev = NULL;
+   struct gendisk *disk = plo->disk;
int i;
  
+	bdev = ploop_get_dm_crypt_bdev(plo);

+   if (bdev) {
+   sb = freeze_bdev(bdev);
+   bdput(bdev);
+   *bdev_pp = bdev;
+   return sb;
+   }
+
for (i = 0; i <= (*bdev_pp)->bd_part_count; i++) {
bdev = bdget_disk(disk, i);
if (!bdev)
@@ -3398,7 +3407,7 @@ static int ploop_snapshot(struct ploop_device * plo, 
unsigned long arg,
/* freeze_bdev() may trigger ploop_bd_full() */
plo->maintenance_type = PLOOP_MNTN_SNAPSHOT;
mutex_unlock(&plo->ctl_mutex);
-   sb = find_and_freeze_bdev(plo->disk, &bdev);
+   sb = find_and_freeze_bdev(plo, &bdev);
mutex_lock(&plo->ctl_mutex);
plo->maintenance_type = PLOOP_MNTN_OFF;
if (IS_ERR(sb)) {
@@ -4916,6 +4925,7 @@ static int ploop_push_backup_stop(struct ploop_device 
*plo, unsigned long arg)
  static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev)
  {
struct super_block *sb = plo->sb;
+   struct block_device *dm_crypt_bdev;
  
  	if (!test_bit(PLOOP_S_RUNNING, &plo->state))

return -EINVAL;
@@ -4926,7 +4936,12 @@ static int ploop_freeze(struct ploop_device *plo, struct 
block_device *bdev)
if (plo->freeze_state == PLOOP_F_THAWING)
return -EBUSY;
  
+	dm_crypt_bdev = ploop_get_dm_crypt_bdev(plo);

+   if (dm_crypt_bdev)
+   bdev = dm_crypt_bdev;
sb = freeze_bdev(bdev);
+   ploop_put_dm_crypt_bdev(dm_crypt_bdev);
+
if (sb && IS_ERR(sb))
return PTR_ERR(sb);
  
@@ -4938,6 +4953,7 @@ static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev)

  static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev)
  {
struct super_block *sb = plo->sb;
+   struct block_device *dm_crypt_bdev;
int err;
  
  	if (!test_bit(PLOOP_S_RUNNING, &plo->state))

@@ -4952,8 +4968,15 @@ static int ploop_thaw(struct ploop_device *plo, struct 
block_device *bdev)
plo->sb = NULL;
plo->freeze_state = PLOOP_F_THAWING;
  
+	dm_crypt_bdev = __ploop_get_dm_crypt_bdev(plo);

+   if (dm_crypt_bdev)
+   bdev = dm_crypt_bdev;
+
mutex_unlock(&plo->ctl_mutex);
+
err = thaw_bdev(bdev, sb);
+   pl

Re: [Devel] [PATCH rh7] ploop: add support for dm-crypted ploops

2016-08-18 Thread Maxim Patlasov

On 08/18/2016 09:49 AM, Andrey Ryabinin wrote:



On 08/18/2016 04:21 AM, Maxim Patlasov wrote:

Andrey,


ploop_freeze() must use __ploop_get_dm_crypt_bdev(), not 
ploop_get_dm_crypt_bdev()!


YUp.


Another problem is that the patch does not ensure that plo->dm_crypt_bdev stays 
the same between PLOOP_IOC_FREEZE/THAW. I could easily get:

[  636.600190] BUG: unable to handle kernel NULL pointer dereference at 
0068
[  636.600934] IP: [] up_read+0x13/0x30
[  636.601446] PGD 222f5a067 PUD 224bc6067 PMD 0
[  636.601915] Oops: 0002 [#1] SMP
...

[  636.623264] Call Trace:
[  636.623597]  [] drop_super+0x16/0x30
[  636.624104]  [] freeze_bdev+0x60/0xe0
[  636.624627]  [] ploop_ioctl+0x2b8e/0x2c80 [ploop]
[  636.625237]  [] ? handle_mm_fault+0x5b4/0xf50
[  636.625822]  [] blkdev_ioctl+0x2df/0x770
[  636.626439]  [] block_ioctl+0x41/0x50
[  636.627162]  [] do_vfs_ioctl+0x255/0x4f0
[  636.627827]  [] SyS_ioctl+0x54/0xa0

due to this problem.

I don't see a direct relation between this crash and changing
plo->dm_crypt_bdev in between freeze/thaw.


In an ideal world freeze+thaw+freeze+thaw must change bd_fsfreeze_count
like this: 0 --> 1 --> 0 --> 1 --> 0. But if plo->dm_crypt_bdev changes
in between, the first increment may be applied to one bdev, then the
decrement to another bdev, then the second increment again to the first
bdev.




The way I read this

struct super_block *freeze_bdev(struct block_device *bdev)
...
if (++bdev->bd_fsfreeze_count > 1) {

sb = get_super(bdev);  // this returned NULL, so we don't have 
active super block on this device.
drop_super(sb); // NULL ptr deref
mutex_unlock(&bdev->bd_fsfreeze_mutex);
return sb;


AFAIU, this can happen if we call freeze_bdev() twice on a block device which
doesn't have a mounted fs.
Isn't this a bug in freeze_bdev()?


No. The bug is to call freeze_bdev a second time if the first time it
returned sb == NULL.



And how did you get this crash?


1) freeze (increments ploop bdev counter)
2) cryptsetup luksOpen
3) thaw (do nothing)
4) cryptsetup luksClose
5) freeze (attempts to increment ploop bdev counter again)
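
In terms of per-bdev counters that plays out roughly as follows (bdevA = the
ploop bdev, bdevB = the dm-crypt bdev; the names are illustrative):

	/*
	 * 1) freeze:    freeze_bdev(bdevA)  bdevA->bd_fsfreeze_count: 0 -> 1
	 * 2) luksOpen:  plo->dm_crypt_bdev = bdevB
	 * 3) thaw:      goes to bdevB, whose bd_fsfreeze_count is 0, so
	 *               thaw_bdev() returns early; bdevA stays frozen
	 * 4) luksClose: plo->dm_crypt_bdev = NULL
	 * 5) freeze:    freeze_bdev(bdevA)  bdevA->bd_fsfreeze_count: 1 -> 2,
	 *               get_super() returns NULL (nothing mounted), and
	 *               drop_super(NULL) oopses
	 */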




Similar might exist in ploop_snapshot(): you bdput plo->dm_crypt_bdev in 
find_and_freeze_bdev(), so it can disappear by the time we need it for thaw_bdev 
after calling complete_snapshot.


Btw, it's usually a good idea to give a patch some simple testing prior to
sending it for review ;)


http://devopsreactions.tumblr.com/post/88260308392/testing-my-own-code


:)




Thanks,

Maxim







Re: [Devel] [PATCH rh7 v2 1/2] ploop: drop bdev refcounter on freeze_bdev() failure

2016-08-18 Thread Maxim Patlasov

Acked-by: Maxim Patlasov 


On 08/18/2016 09:51 AM, Andrey Ryabinin wrote:

If freeze_bdev() called in find_and_freeze_bdev() fails we should
drop a reference counter grabbed by bdget_disk() call.

Signed-off-by: Andrey Ryabinin 
---
  drivers/block/ploop/dev.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 83b0e32..453d36e 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3339,7 +3339,10 @@ static struct super_block *find_and_freeze_bdev(struct 
gendisk *disk,
bdev = NULL;
}
  
-	*bdev_pp = bdev;

+   if (IS_ERR(sb))
+   bdput(bdev);
+   else
+   *bdev_pp = bdev;
return sb;
  }
  




Re: [Devel] [PATCH rh7 v2 2/2] ploop: add support for dm-crypted ploops

2016-08-18 Thread Maxim Patlasov

Andrey,


A simple freeze+thaw leads to a kernel panic due to a null pointer dereference
in ploop_thaw() because plo->sb may be NULL (if nothing is mounted on the
ploop):



static int ploop_thaw(struct ploop_device *plo)
{
struct super_block *sb = plo->sb;
struct block_device *bdev = sb->s_bdev;



Again, please, give a patch some simple testing before sending it for 
review.



Thanks,

Maxim


On 08/18/2016 09:51 AM, Andrey Ryabinin wrote:

On a dm-crypted ploop the fs is mounted not on the ploop but on the dm-crypt
device. Thus the freeze/thaw used by some of ploop's ioctls doesn't freeze/thaw
the filesystem. To fix that, we store a pointer to the dm-crypt block device
inside the ploop_device struct, and use it to freeze/thaw the filesystem.

https://jira.sw.ru/browse/PSBM-50858

Signed-off-by: Andrey Ryabinin 
---

Changes since v1:
   - fixed deadlock in ploop_freeze()
   - use bdgrab()/bdput() to keep bdev alive
   - use sb->s_bdev in ploop_thaw() instead of plo->dm_crypt_bdev in case
 it changed after freeze

  drivers/block/ploop/dev.c   | 26 +-
  drivers/block/ploop/io_direct.c | 12 
  drivers/md/dm-crypt.c   |  8 +++-
  drivers/md/dm.c |  6 ++
  drivers/md/dm.h |  2 ++
  include/linux/ploop/ploop.h | 32 
  6 files changed, 80 insertions(+), 6 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 453d36e..5271c47 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3318,13 +3318,20 @@ void ploop_relax(struct ploop_device * plo)
  }
  
  /* search disk for first partition bdev with mounted fs and freeze it */

-static struct super_block *find_and_freeze_bdev(struct gendisk *disk,
+static struct super_block *find_and_freeze_bdev(struct ploop_device *plo,
struct block_device ** bdev_pp)
  {
struct super_block  * sb   = NULL;
struct block_device * bdev = NULL;
+   struct gendisk *disk = plo->disk;
int i;
  
+	bdev = ploop_get_dm_crypt_bdev(plo);

+   if (bdev) {
+   sb = freeze_bdev(bdev);
+   goto out;
+   }
+
for (i = 0; i <= (*bdev_pp)->bd_part_count; i++) {
bdev = bdget_disk(disk, i);
if (!bdev)
@@ -3339,6 +3346,7 @@ static struct super_block *find_and_freeze_bdev(struct 
gendisk *disk,
bdev = NULL;
}
  
+out:

if (IS_ERR(sb))
bdput(bdev);
else
@@ -3401,7 +3409,7 @@ static int ploop_snapshot(struct ploop_device * plo, 
unsigned long arg,
/* freeze_bdev() may trigger ploop_bd_full() */
plo->maintenance_type = PLOOP_MNTN_SNAPSHOT;
mutex_unlock(&plo->ctl_mutex);
-   sb = find_and_freeze_bdev(plo->disk, &bdev);
+   sb = find_and_freeze_bdev(plo, &bdev);
mutex_lock(&plo->ctl_mutex);
plo->maintenance_type = PLOOP_MNTN_OFF;
if (IS_ERR(sb)) {
@@ -4929,18 +4937,25 @@ static int ploop_freeze(struct ploop_device *plo, 
struct block_device *bdev)
if (plo->freeze_state == PLOOP_F_THAWING)
return -EBUSY;
  
+	if (plo->dm_crypt_bdev)

+   bdev = plo->dm_crypt_bdev;
+
+   bdgrab(bdev);
sb = freeze_bdev(bdev);
-   if (sb && IS_ERR(sb))
+   if (sb && IS_ERR(sb)) {
+   bdput(bdev);
return PTR_ERR(sb);
+   }
  
  	plo->sb = sb;

plo->freeze_state = PLOOP_F_FROZEN;
return 0;
  }
  
-static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev)

+static int ploop_thaw(struct ploop_device *plo)
  {
struct super_block *sb = plo->sb;
+   struct block_device *bdev = sb->s_bdev;
int err;
  
  	if (!test_bit(PLOOP_S_RUNNING, &plo->state))

@@ -4957,6 +4972,7 @@ static int ploop_thaw(struct ploop_device *plo, struct 
block_device *bdev)
  
  	mutex_unlock(&plo->ctl_mutex);

err = thaw_bdev(bdev, sb);
+   bdput(bdev);
mutex_lock(&plo->ctl_mutex);
  
  	BUG_ON(plo->freeze_state != PLOOP_F_THAWING);

@@ -5086,7 +5102,7 @@ static int ploop_ioctl(struct block_device *bdev, fmode_t 
fmode, unsigned int cm
err = ploop_freeze(plo, bdev);
break;
case PLOOP_IOC_THAW:
-   err = ploop_thaw(plo, bdev);
+   err = ploop_thaw(plo);
break;
default:
err = -EINVAL;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index c12e3c8..6663964 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -871,13 +871,25 @@ static int dio_invalidate_cache(struct address_space * 
mapping,
  retry:
err = invalidate_inode_pages2(mapping);
if (err) {
+   struct ploop_device *plo = bdev->bd_disk->private_data;
+   struct block_device *dm_crypt_bdev;
+
  

Re: [Devel] [PATCH rh7] ploop: add support for dm-crypted ploops

2016-08-19 Thread Maxim Patlasov

Andrey,


On 08/19/2016 03:44 AM, Andrey Ryabinin wrote:



On 08/19/2016 02:40 AM, Maxim Patlasov wrote:

On 08/18/2016 09:49 AM, Andrey Ryabinin wrote:


On 08/18/2016 04:21 AM, Maxim Patlasov wrote:

Andrey,


ploop_freeze() must use __ploop_get_dm_crypt_bdev(), not 
ploop_get_dm_crypt_bdev()!


YUp.


Another problem is that the patch does not ensure that plo->dm_crypt_bdev stays 
the same between PLOOP_IOC_FREEZE/THAW. I could easily get:

[  636.600190] BUG: unable to handle kernel NULL pointer dereference at 
0068
[  636.600934] IP: [] up_read+0x13/0x30
[  636.601446] PGD 222f5a067 PUD 224bc6067 PMD 0
[  636.601915] Oops: 0002 [#1] SMP
...

[  636.623264] Call Trace:
[  636.623597]  [] drop_super+0x16/0x30
[  636.624104]  [] freeze_bdev+0x60/0xe0
[  636.624627]  [] ploop_ioctl+0x2b8e/0x2c80 [ploop]
[  636.625237]  [] ? handle_mm_fault+0x5b4/0xf50
[  636.625822]  [] blkdev_ioctl+0x2df/0x770
[  636.626439]  [] block_ioctl+0x41/0x50
[  636.627162]  [] do_vfs_ioctl+0x255/0x4f0
[  636.627827]  [] SyS_ioctl+0x54/0xa0

due to this problem.

I don't see a direct relation between this crash and changing plo->dm_crypt_bdev
in between freeze/thaw.

In an ideal world freeze+thaw+freeze+thaw must change bd_fsfreeze_count like this: 0 --> 1
--> 0 --> 1 --> 0. But if plo->dm_crypt_bdev changes in between, the first
increment may be applied to one bdev, then the decrement to another bdev, then the second
increment again to the first bdev.


The way I read this

struct super_block *freeze_bdev(struct block_device *bdev)
...
 if (++bdev->bd_fsfreeze_count > 1) {

 sb = get_super(bdev);  // this returned NULL, so we don't have active 
super block on this device.
 drop_super(sb); // NULL ptr deref
 mutex_unlock(&bdev->bd_fsfreeze_mutex);
 return sb;
 


AFAIU, this can happen if we call freeze_bdev() twice on a block device which
doesn't have a mounted fs.
Isn't this a bug in freeze_bdev()?

No. The bug is to call freeze_bdev a second time if the first time it returned
sb == NULL.


Disagreed. This would mean that all callers are supposed to synchronize the
freeze_bdev()/thaw_bdev() sequence.
Otherwise we can get:

CPU 0:  CPU1:
freeze_bdev() //return NULL
freeze_bdev() //NULL-ptr deref
thaw_bdev()
thaw_bdev()

So, how do you propose to fix that case? If freeze_bdev() on CPU1 is illegal,
that would mean that every freeze_bdev()...thaw_bdev() would have to be in a
critical section, iow surrounded by some per-bdev lock. While the comment near
freeze_bdev() says that it is totally OK if multiple freeze requests arrive
simultaneously.
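
For reference, the comment in question (3.10-era fs/block_dev.c, quoted
approximately -- the exact wording may differ):

	/*
	 * freeze_bdev  --  lock a filesystem and force it into a consistent state
	 *
	 * If a superblock is found on this device, we take the s_umount semaphore
	 * on it to make sure nobody unmounts until the snapshot creation is done.
	 * The reference counter (bd_fsfreeze_count) guarantees that only the last
	 * unfreeze process can unfreeze the frozen filesystem actually when
	 * multiple freeze requests arrive simultaneously.
	 */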


Makes sense. Will you send the patch upstream?



Note that this patch is also still affected by this bug, e.g.:
1) cryptsetup open /dev/ploop0p1 test
2) freeze ploop
3) dmsetup suspend test //suspend calls lock_fs()->freeze_bdev() -> 
NULL ptr deref


So the only sane way to fix this is to add a NULL check before drop_super().


"freeze ploop" might "thaw" immediately if !sb. Yeah, this is racy, but 
the chances of such a race are negligibly slim. On the other hand, your 
fix for freeze_bdev() is so simple and straightforward, that I'd vote 
for that (even if the fix is rejected upstream for some reasons).

Thanks,
Maxim


Re: [Devel] [PATCH rh7 v3 3/4] fs/block_dev: fix NULL ptr deref in freeze_bdev()

2016-08-19 Thread Maxim Patlasov

Acked-by: Maxim Patlasov 


On 08/19/2016 06:00 AM, Andrey Ryabinin wrote:

freeze_bdev() called twice on the same block device without a
mounted filesystem will lead to a NULL-ptr deref:

  BUG: unable to handle kernel NULL pointer dereference at 0068
  IP: [] up_read+0x29/0x40

  Call Trace:
   [] drop_super+0x16/0x30
   [] freeze_bdev+0x4b/0xd0
   [] __dm_suspend+0xeb/0x220
   [] ? table_load+0x390/0x390
   [] dm_suspend+0xda/0x100
   [] ? up_read+0x1f/0x40
   [] dev_suspend+0x190/0x250
   [] ctl_ioctl+0x247/0x520
   [] dm_ctl_ioctl+0x13/0x20
   [] do_vfs_ioctl+0x27e/0x550
   [] SyS_ioctl+0x54/0xa0
   [] system_call_fastpath+0x16/0x1b

Check get_super() result to fix that.

https://jira.sw.ru/browse/PSBM-50858

Signed-off-by: Andrey Ryabinin 
---
  fs/block_dev.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4575c62..325ee71 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -227,7 +227,8 @@ struct super_block *freeze_bdev(struct block_device *bdev)
 * thaw_bdev drops it.
 */
sb = get_super(bdev);
-   drop_super(sb);
+   if (sb)
+   drop_super(sb);
mutex_unlock(&bdev->bd_fsfreeze_mutex);
return sb;
}




Re: [Devel] [PATCH rh7 v3 1/4] ploop: drop bdev refcounter on freeze_bdev() failure

2016-08-19 Thread Maxim Patlasov

Acked-by: Maxim Patlasov 


On 08/19/2016 06:00 AM, Andrey Ryabinin wrote:

If freeze_bdev() called in find_and_freeze_bdev() fails, we should
drop the reference grabbed by the bdget_disk() call.

Signed-off-by: Andrey Ryabinin 
Acked-by: Maxim Patlasov 
---
  drivers/block/ploop/dev.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 83b0e32..453d36e 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3339,7 +3339,10 @@ static struct super_block *find_and_freeze_bdev(struct gendisk *disk,
bdev = NULL;
}
  
-	*bdev_pp = bdev;

+   if (IS_ERR(sb))
+   bdput(bdev);
+   else
+   *bdev_pp = bdev;
return sb;
  }
  




Re: [Devel] [PATCH rh7 v3 2/4] ploop: keep frozen block device pointer instead of super_block pointer

2016-08-19 Thread Maxim Patlasov

Acked-by: Maxim Patlasov 


On 08/19/2016 06:00 AM, Andrey Ryabinin wrote:

For encrypted ploop we will need to know what block_device was frozen
in ploop_freeze(), so we can thaw() it. We don't need to store the super_block,
because we should be able to get it from the frozen block device.

https://jira.sw.ru/browse/PSBM-50858

Signed-off-by: Andrey Ryabinin 
---
  drivers/block/ploop/dev.c   | 13 +++--
  include/linux/ploop/ploop.h |  2 +-
  2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 453d36e..8ed402f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4918,7 +4918,7 @@ static int ploop_push_backup_stop(struct ploop_device *plo, unsigned long arg)
  
  static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev)

  {
-   struct super_block *sb = plo->sb;
+   struct super_block *sb;
  
  	if (!test_bit(PLOOP_S_RUNNING, &plo->state))

return -EINVAL;
@@ -4933,14 +4933,15 @@ static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev)
if (sb && IS_ERR(sb))
return PTR_ERR(sb);
  
-	plo->sb = sb;

+   plo->frozen_bdev = bdev;
plo->freeze_state = PLOOP_F_FROZEN;
return 0;
  }
  
-static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev)

+static int ploop_thaw(struct ploop_device *plo)
  {
-   struct super_block *sb = plo->sb;
+   struct block_device *bdev = plo->frozen_bdev;
+   struct super_block *sb = bdev->bd_super;
int err;
  
  	if (!test_bit(PLOOP_S_RUNNING, &plo->state))

@@ -4952,7 +4953,7 @@ static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev)
if (plo->freeze_state == PLOOP_F_THAWING)
return -EBUSY;
  
-	plo->sb = NULL;

+   plo->frozen_bdev = NULL;
plo->freeze_state = PLOOP_F_THAWING;
  
  	mutex_unlock(&plo->ctl_mutex);

@@ -5086,7 +5087,7 @@ static int ploop_ioctl(struct block_device *bdev, fmode_t fmode, unsigned int cm
err = ploop_freeze(plo, bdev);
break;
case PLOOP_IOC_THAW:
-   err = ploop_thaw(plo, bdev);
+   err = ploop_thaw(plo);
break;
default:
err = -EINVAL;
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index b2ef6bd..8262a50 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -419,7 +419,7 @@ struct ploop_device
struct block_device *bdev;
struct request_queue*queue;
struct task_struct  *thread;
-   struct super_block  *sb;
+   struct block_device *frozen_bdev;
int freeze_state;
struct rb_node  link;
  




Re: [Devel] [PATCH rh7 v3 4/4] ploop: add support for dm-crypted ploops

2016-08-19 Thread Maxim Patlasov

Andrey,


The patch leads to a kernel panic if someone (mistakenly) tries to "thaw" a 
ploop device without a prior "freeze". Before the patch, the code was 
immune to this because ploop_thaw() does nothing if state != FROZEN. The 
rest of the patch looks fine, so



Acked-by: Maxim Patlasov 


for your patch with the following trivial fix on top:


@@ -4941,7 +4941,7 @@ static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev)

 static int ploop_thaw(struct ploop_device *plo)
 {
struct block_device *bdev = plo->frozen_bdev;
-   struct super_block *sb = bdev->bd_super;
+   struct super_block *sb = bdev ? bdev->bd_super : NULL;
int err;

if (!test_bit(PLOOP_S_RUNNING, &plo->state))


Thanks,

Maxim





On 08/19/2016 06:00 AM, Andrey Ryabinin wrote:

On a dm-crypted ploop, the fs is mounted not on the ploop device but on the dm-crypt device.
Thus the freeze/thaw used by some ploop ioctls doesn't freeze/thaw the filesystem.
To fix that, we store a pointer to the dm-crypt block device inside the ploop_device
struct, and use it to freeze/thaw the filesystem.

https://jira.sw.ru/browse/PSBM-50858

Signed-off-by: Andrey Ryabinin 
---
  drivers/block/ploop/dev.c   | 21 ++---
  drivers/block/ploop/io_direct.c | 12 
  drivers/md/dm-crypt.c   |  8 +++-
  drivers/md/dm.c |  6 ++
  drivers/md/dm.h |  2 ++
  include/linux/ploop/ploop.h | 32 
  6 files changed, 77 insertions(+), 4 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 8ed402f..44b5e5e 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3318,13 +3318,20 @@ void ploop_relax(struct ploop_device * plo)
  }
  
  /* search disk for first partition bdev with mounted fs and freeze it */

-static struct super_block *find_and_freeze_bdev(struct gendisk *disk,
+static struct super_block *find_and_freeze_bdev(struct ploop_device *plo,
struct block_device ** bdev_pp)
  {
struct super_block  * sb   = NULL;
struct block_device * bdev = NULL;
+   struct gendisk *disk = plo->disk;
int i;
  
+	bdev = ploop_get_dm_crypt_bdev(plo);

+   if (bdev) {
+   sb = freeze_bdev(bdev);
+   goto out;
+   }
+
for (i = 0; i <= (*bdev_pp)->bd_part_count; i++) {
bdev = bdget_disk(disk, i);
if (!bdev)
@@ -3339,6 +3346,7 @@ static struct super_block *find_and_freeze_bdev(struct gendisk *disk,
bdev = NULL;
}
  
+out:

if (IS_ERR(sb))
bdput(bdev);
else
@@ -3401,7 +3409,7 @@ static int ploop_snapshot(struct ploop_device * plo, unsigned long arg,
/* freeze_bdev() may trigger ploop_bd_full() */
plo->maintenance_type = PLOOP_MNTN_SNAPSHOT;
mutex_unlock(&plo->ctl_mutex);
-   sb = find_and_freeze_bdev(plo->disk, &bdev);
+   sb = find_and_freeze_bdev(plo, &bdev);
mutex_lock(&plo->ctl_mutex);
plo->maintenance_type = PLOOP_MNTN_OFF;
if (IS_ERR(sb)) {
@@ -4929,9 +4937,15 @@ static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev)
if (plo->freeze_state == PLOOP_F_THAWING)
return -EBUSY;
  
+	if (plo->dm_crypt_bdev)

+   bdev = plo->dm_crypt_bdev;
+
+   bdgrab(bdev);
sb = freeze_bdev(bdev);
-   if (sb && IS_ERR(sb))
+   if (sb && IS_ERR(sb)) {
+   bdput(bdev);
return PTR_ERR(sb);
+   }
  
  	plo->frozen_bdev = bdev;

plo->freeze_state = PLOOP_F_FROZEN;
@@ -4958,6 +4972,7 @@ static int ploop_thaw(struct ploop_device *plo)
  
  	mutex_unlock(&plo->ctl_mutex);

err = thaw_bdev(bdev, sb);
+   bdput(bdev);
mutex_lock(&plo->ctl_mutex);
  
  	BUG_ON(plo->freeze_state != PLOOP_F_THAWING);

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index c12e3c8..6663964 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -871,13 +871,25 @@ static int dio_invalidate_cache(struct address_space * mapping,
  retry:
err = invalidate_inode_pages2(mapping);
if (err) {
+   struct ploop_device *plo = bdev->bd_disk->private_data;
+   struct block_device *dm_crypt_bdev;
+
printk("PLOOP: failed to invalidate page cache %d/%d\n", err, 
attempt2);
if (attempt2)
return err;
attempt2 = 1;
  
  		mutex_unlock(&mapping->host->i_mutex);

+
+   dm_crypt_bdev = ploop_get_dm_crypt_bdev(plo);
+   if (dm_crypt_bdev)
+   b

[Devel] [PATCH rh7 0/2] overlayfs: fix handling MNT_NOATIME

2016-09-13 Thread Maxim Patlasov
The series fixes a bug revealed by generic/120 from xfstests: overlayfs ignores
the noatime mount option; it always uses the options of the underlying fs instead.

The series fixes it similarly to mainline. See the per-patch descriptions
for details.

https://jira.sw.ru/browse/PSBM-51009

---

Maxim Patlasov (2):
  ovl: update atime on upper
  fs: use original vfsmount for touch_atime


 fs/open.c|3 +++
 fs/overlayfs/dir.c   |1 +
 fs/overlayfs/inode.c |   29 ++---
 fs/overlayfs/overlayfs.h |4 
 fs/overlayfs/super.c |8 ++--
 include/linux/fs.h   |4 +++-
 6 files changed, 43 insertions(+), 6 deletions(-)



[Devel] [PATCH rh7 1/2] ovl: update atime on upper

2016-09-13 Thread Maxim Patlasov
Backport d719e8f268 from mainline:

ovl: update atime on upper

Fix atime update logic in overlayfs.

This patch adds an i_op->update_time() handler to overlayfs inodes.  This
forwards atime updates to the upper layer only.  No atime updates are done
on lower layers.

Remove implicit atime updates to underlying files and directories with
O_NOATIME.  Remove explicit atime update in ovl_readlink().

Clear atime related mnt flags from cloned upper mount.  This means atime
updates are controlled purely by overlayfs mount options.

Reported-by: Konstantin Khlebnikov 
Signed-off-by: Miklos Szeredi 

https://jira.sw.ru/browse/PSBM-51009

Signed-off-by: Maxim Patlasov 
---
 fs/overlayfs/dir.c   |1 +
 fs/overlayfs/inode.c |   29 ++---
 fs/overlayfs/overlayfs.h |4 
 fs/overlayfs/super.c |8 ++--
 4 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 5402b9b..881987c 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -966,6 +966,7 @@ const struct inode_operations_wrapper ovl_dir_inode_operations = {
.getxattr   = ovl_getxattr,
.listxattr  = ovl_listxattr,
.removexattr= ovl_removexattr,
+   .update_time= ovl_update_time,
},
.rename2= ovl_rename2,
 };
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 5fe7acf..77f2da4 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -196,8 +196,6 @@ static int ovl_readlink(struct dentry *dentry, char __user *buf, int bufsiz)
if (!realinode->i_op->readlink)
return -EINVAL;
 
-   touch_atime(&realpath);
-
return realinode->i_op->readlink(realpath.dentry, buf, bufsiz);
 }
 
@@ -383,6 +381,29 @@ struct inode *ovl_d_select_inode(struct dentry *dentry)
return d_backing_inode(realpath.dentry);
 }
 
+int ovl_update_time(struct inode *inode, struct timespec *ts, int flags)
+{
+   struct dentry *alias;
+   struct path upperpath;
+
+   if (!(flags & S_ATIME))
+   return 0;
+
+   alias = d_find_any_alias(inode);
+   if (!alias)
+   return 0;
+
+   ovl_path_upper(alias, &upperpath);
+   if (upperpath.dentry) {
+   touch_atime(&upperpath);
+   inode->i_atime = d_inode(upperpath.dentry)->i_atime;
+   }
+
+   dput(alias);
+
+   return 0;
+}
+
 static const struct inode_operations_wrapper ovl_file_inode_operations = {
.ops = {
.setattr= ovl_setattr,
@@ -392,6 +413,7 @@ static const struct inode_operations_wrapper ovl_file_inode_operations = {
.getxattr   = ovl_getxattr,
.listxattr  = ovl_listxattr,
.removexattr= ovl_removexattr,
+   .update_time= ovl_update_time,
},
.dentry_open= ovl_dentry_open,
 };
@@ -406,6 +428,7 @@ static const struct inode_operations ovl_symlink_inode_operations = {
.getxattr   = ovl_getxattr,
.listxattr  = ovl_listxattr,
.removexattr= ovl_removexattr,
+   .update_time= ovl_update_time,
 };
 
 struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
@@ -421,7 +444,7 @@ struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
 
inode->i_ino = get_next_ino();
inode->i_mode = mode;
-   inode->i_flags |= S_NOATIME | S_NOCMTIME;
+   inode->i_flags |= S_NOCMTIME;
 
switch (mode) {
case S_IFDIR:
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 8da9684..61ba0d5 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -174,6 +174,7 @@ ssize_t ovl_getxattr(struct dentry *dentry, const char *name,
 ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size);
 int ovl_removexattr(struct dentry *dentry, const char *name);
 struct inode *ovl_d_select_inode(struct dentry *dentry);
+int ovl_update_time(struct inode *inode, struct timespec *ts, int flags);
 
 struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
struct ovl_entry *oe);
@@ -181,6 +182,9 @@ static inline void ovl_copyattr(struct inode *from, struct inode *to)
 {
to->i_uid = from->i_uid;
to->i_gid = from->i_gid;
+   to->i_atime = from->i_atime;
+   to->i_mtime = from->i_mtime;
+   to->i_ctime = from->i_ctime;
 }
 
 /* dir.c */
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 24ec90b..e633d0f 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -533,7 +533,7 @@ out:
 
 struct file *ovl_path_open(struct path *path, int flags)
 {
-   return dentry_open(path, flags, current_cred());
+   return dentry_open(path, flags | O_NOATIME, current_cred());
 }
 
 static void ovl_put_super(struct super_block *sb)
@@ -1013,6 +1013,

[Devel] [PATCH rh7 2/2] fs: use original vfsmount for touch_atime

2016-09-13 Thread Maxim Patlasov
In the case of overlayfs, vfs_open() is called recursively, filling filp->f_path
with pointers to the real dentry and vfsmount (upper or lower). Hence, touch_atime()
has no access to the mnt_flags of the original (overlayfs) vfsmount. The patch fixes
the problem by saving the original path in a new field of struct file.

The patch is to be reverted when RHEL picks up 4bacc9c92 from mainline:

>Make file->f_path always point to the overlay dentry so that the path in
>/proc/pid/fd is correct and to ensure that label-based LSMs have access to 
> the
>overlay as well as the underlay (path-based LSMs probably don't need it).

Picking it now is premature because it introduced a lot of bugs (outside 
overlay)
and the chances are high that we would overlook some related fixes in mainline.

https://jira.sw.ru/browse/PSBM-51009

Signed-off-by: Maxim Patlasov 
---
 fs/open.c  |3 +++
 include/linux/fs.h |4 +++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/open.c b/fs/open.c
index bc60c05..8c066b1 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -894,6 +894,9 @@ int vfs_open(const struct path *path, struct file *filp,
struct inode *inode = path->dentry->d_inode;
iop_dentry_open_t dentry_open = get_dentry_open_iop(inode);
 
+   if (!filp->f_original_path.mnt)
+   filp->f_original_path = *path;
+
if (dentry_open)
return dentry_open(path->dentry, filp, cred);
else {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f1c3d5b..7b84d49 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -961,6 +961,7 @@ struct file {
struct rcu_head fu_rcuhead;
} f_u;
struct path f_path;
+   struct path f_original_path;
 #define f_dentry   f_path.dentry
struct inode*f_inode;   /* cached value */
const struct file_operations*f_op;
@@ -2095,7 +2096,8 @@ extern void touch_atime(struct path *);
 static inline void file_accessed(struct file *file)
 {
if (!(file->f_flags & O_NOATIME))
-   touch_atime(&file->f_path);
+   touch_atime(file->f_original_path.mnt ?
+   &file->f_original_path : &file->f_path);
 }
 
 int sync_inode(struct inode *inode, struct writeback_control *wbc);



[Devel] [PATCH rh7] ext4: fix filtering trusted xattr

2016-09-20 Thread Maxim Patlasov
Commit 4f7ce4dd4741cb65df018028aaefedb298915aa6:

Author: Pavel Tikhomirov 
ve/xattr: allow to set trusted.xxx for container admin

relaxed the capability check on the setxattr path, but neglected
to do the same on the getxattr path. Hence, a container admin
became able to set trusted xattrs, but not to see them:

# setfattr -h -n trusted.name file
# echo $?
0
# getfattr -dm- file


This broke generic/062 from xfstests.

https://jira.sw.ru/browse/PSBM-51009

Signed-off-by: Maxim Patlasov 
---
 fs/ext4/xattr_trusted.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/xattr_trusted.c b/fs/ext4/xattr_trusted.c
index 95f1f4a..49dd83f 100644
--- a/fs/ext4/xattr_trusted.c
+++ b/fs/ext4/xattr_trusted.c
@@ -19,7 +19,7 @@ ext4_xattr_trusted_list(struct dentry *dentry, char *list, size_t list_size,
const size_t prefix_len = XATTR_TRUSTED_PREFIX_LEN;
const size_t total_len = prefix_len + name_len + 1;
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!ve_capable(CAP_SYS_ADMIN))
return 0;
 
if (list && total_len <= list_size) {



[Devel] [PATCH rh7] fs: hold reference on original path

2016-09-21 Thread Maxim Patlasov
struct file holds references on its f_path.mnt and f_path.dentry by calling
path_get(&f->f_path) from do_dentry_open(). Let's use the same technique
for f->f_original_path. Otherwise, f_original_path.dentry can be deleted while the
file still references it, leading to a NULL-ptr-deref on 
f->f_original_path.dentry->d_inode.

https://jira.sw.ru/browse/PSBM-52373

Signed-off-by: Maxim Patlasov 
---
 fs/file_table.c |6 ++
 fs/open.c   |   18 +++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 957c476..b8982d8 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -242,6 +242,8 @@ static void __fput(struct file *file)
struct dentry *dentry = file->f_path.dentry;
struct vfsmount *mnt = file->f_path.mnt;
struct inode *inode = dentry->d_inode;
+   struct dentry *original_dentry = file->f_original_path.dentry;
+   struct vfsmount *original_mnt = file->f_original_path.mnt;
 
might_sleep();
 
@@ -273,10 +275,14 @@ static void __fput(struct file *file)
drop_file_write_access(file);
file->f_path.dentry = NULL;
file->f_path.mnt = NULL;
+   file->f_original_path.dentry = NULL;
+   file->f_original_path.mnt = NULL;
file->f_inode = NULL;
file_free(file);
dput(dentry);
mntput(mnt);
+   dput(original_dentry);
+   mntput(original_mnt);
 }
 
 static DEFINE_SPINLOCK(delayed_fput_lock);
diff --git a/fs/open.c b/fs/open.c
index 8c066b1..25dbc85 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -893,16 +893,28 @@ int vfs_open(const struct path *path, struct file *filp,
 {
struct inode *inode = path->dentry->d_inode;
iop_dentry_open_t dentry_open = get_dentry_open_iop(inode);
+   int do_cleanup = 0;
+   int ret;
 
-   if (!filp->f_original_path.mnt)
+   if (!filp->f_original_path.mnt) {
filp->f_original_path = *path;
+   path_get(&filp->f_original_path);
+   do_cleanup = 1;
+   }
 
if (dentry_open)
-   return dentry_open(path->dentry, filp, cred);
+   ret = dentry_open(path->dentry, filp, cred);
else {
filp->f_path = *path;
-   return do_dentry_open(filp, NULL, cred);
+   ret = do_dentry_open(filp, NULL, cred);
}
+
+   if (ret && do_cleanup) {
+   path_put(&filp->f_original_path);
+   filp->f_original_path.mnt = NULL;
+   filp->f_original_path.dentry = NULL;
+   }
+   return ret;
 }
 EXPORT_SYMBOL(vfs_open);
 



Re: [Devel] [PATCH] ext4: Discard preallocated block before swap_extents

2016-09-23 Thread Maxim Patlasov

Dima,


The patch looks fine and it works in my tests, but it slightly changes 
user-visible behavior: before the patch, if ioctl(MOVE) failed, the user 
always saw moved_len=0. Now, with the patch applied, it can be !=0 
(because ext4_ioctl() copies "me" back to the user even if err != 0).



I understand that it is a matter of taste, but the code is more readable 
(imho) if all functions keep their output args intact on failure. Hence, it 
would be nice to accumulate "cur_len" in some local var and then, at the 
end of ext4_move_extents(), assign *moved_len only if ret == 0.
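
Something like this minimal sketch of the convention (hypothetical names, 
not the actual ext4 code):

static int move_one_chunk(void)
{
	/* hypothetical helper: move one cluster, 0 on success */
	return 0;
}

static int move_blocks(unsigned long len, unsigned long *moved_len)
{
	unsigned long done = 0;
	int ret = 0;

	while (done < len) {
		ret = move_one_chunk();
		if (ret)
			break;			/* *moved_len untouched so far */
		done++;
	}
	if (!ret)
		*moved_len = done;		/* publish progress only on success */
	return ret;
}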



The above is pretty minor, so:

Reviewed-by: Maxim Patlasov 

Thanks,
Maxim


On 09/20/2016 10:41 AM, Dmitry Monakhov wrote:

Inode preallocation consists of two parts (used and unused), fully controlled
by the inode, so it must be discarded before swapping extents.
Currently we may skip dropping preallocations if the file is sparse.

This patch:
- Moves ext4_discard_preallocations() to ext4_swap_extents().
   This makes the code more readable and reliable for future changes.
- Cleans up the main move_extent loop

xfstests:ext4/024 (pended: 
https://github.com/dmonakhov/xfstests/commit/7a4763963f73ea5d5bba45eefa484494aa3df7cf)
Signed-off-by: Dmitry Monakhov 
---
  fs/ext4/extents.c |  2 ++
  fs/ext4/move_extent.c | 17 +
  2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index d7ccb7f..757ffb8 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5799,9 +5799,11 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1,
BUG_ON(!inode_is_locked(inode1));
BUG_ON(!inode_is_locked(inode2));
  
+	ext4_discard_preallocations(inode1);

*erp = ext4_es_remove_extent(inode1, lblk1, count);
if (unlikely(*erp))
return 0;
+   ext4_discard_preallocations(inode2);
*erp = ext4_es_remove_extent(inode2, lblk2, count);
if (unlikely(*erp))
return 0;
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 6fc14de..24a9586 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -632,7 +632,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
  
  		ret = get_ext_path(orig_inode, o_start, &path);

if (ret)
-   goto out;
+   break;
ex = path[path->p_depth].p_ext;
next_blk = ext4_ext_next_allocated_block(path);
cur_blk = le32_to_cpu(ex->ee_block);
@@ -642,7 +642,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
if (next_blk == EXT_MAX_BLOCKS) {
o_start = o_end;
ret = -ENODATA;
-   goto out;
+   break;
}
d_start += next_blk - o_start;
o_start = next_blk;
@@ -654,7 +654,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
o_start = cur_blk;
/* Extent inside requested range ?*/
if (cur_blk >= o_end)
-   goto out;
+   break;
} else { /* in_range(o_start, o_blk, o_len) */
cur_len += cur_blk - o_start;
}
@@ -687,17 +687,10 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
break;
o_start += cur_len;
d_start += cur_len;
+   *moved_len += cur_len;
}
-   *moved_len = o_start - orig_blk;
-   if (*moved_len > len)
-   *moved_len = len;
-
  out:
-   if (*moved_len) {
-   ext4_discard_preallocations(orig_inode);
-   ext4_discard_preallocations(donor_inode);
-   }
-
+   WARN_ON(*moved_len > len);
ext4_ext_drop_refs(path);
kfree(path);
ext4_double_up_write_data_sem(orig_inode, donor_inode);




[Devel] [PATCH rh7] fs: avoid holding extra reference on original path

2016-09-30 Thread Maxim Patlasov
When vfs_open() opens an ordinary file (not an overlayfs one), do_dentry_open() will
call path_get(&f->f_path) anyway, so it doesn't make sense to acquire extra
references by another path_get().

The above is enough to satisfy LTP syscalls/fcntl/fcntl24 when called on top
of an ordinary fs (not overlayfs), but for overlayfs it's also necessary to ensure
that fs/locks.c::generic_add_lease() passes the original (overlayfs) dentry to
check_conflicting_open(). This goes in accordance with mainstream:

> commit 6343a2120862f7023006c8091ad95c1f16a32077
> Author: Miklos Szeredi 
> Date:   Fri Jul 1 14:56:07 2016 +0200
>
> locks: use file_inode()
>
> (Another one for the f_path debacle.)
>
> ltp fcntl33 testcase caused an Oops in selinux_file_send_sigiotask.
>
> The reason is that generic_add_lease() used filp->f_path.dentry->inode
> while all the others use file_inode().  This makes a difference for files
> opened on overlayfs since the former will point to the overlay inode the
> latter to the underlying inode.
>
> So generic_add_lease() added the lease to the overlay inode and
> generic_delete_lease() removed it from the underlying inode.  When the 
> file
> was released the lease remained on the overlay inode's lock list, 
> resulting
> in use after free.
>
> Reported-by: Eryu Guan 
> Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay 
> and f_inode to the underlay")
> Cc: 
> Signed-off-by: Miklos Szeredi 
> Reviewed-by: Jeff Layton 
> Signed-off-by: J. Bruce Fields 
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 7c5f91b..ee1b15f 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1628,7 +1628,7 @@ generic_add_lease(struct file *filp, long arg, struct file_lock **flp, void **pr
>  {
> struct file_lock *fl, *my_fl = NULL, *lease;
> struct dentry *dentry = filp->f_path.dentry;
> -   struct inode *inode = dentry->d_inode;
> +   struct inode *inode = file_inode(filp);
>     struct file_lock_context *ctx;
> bool is_deleg = (*flp)->fl_flags & FL_DELEG;
> int error;

https://jira.sw.ru/browse/PSBM-52817

Signed-off-by: Maxim Patlasov 
---
 fs/locks.c |5 +++--
 fs/open.c  |2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index a5ab0c0..f6c89d7 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1560,8 +1560,9 @@ static int
 generic_add_lease(struct file *filp, long arg, struct file_lock **flp, void **priv)
 {
struct file_lock *fl, **before, **my_before = NULL, *lease;
-   struct dentry *dentry = filp->f_path.dentry;
-   struct inode *inode = dentry->d_inode;
+   struct dentry *dentry = filp->f_original_path.mnt ?
+   filp->f_original_path.dentry: filp->f_path.dentry;
+   struct inode *inode = filp->f_path.dentry->d_inode;
bool is_deleg = (*flp)->fl_flags & FL_DELEG;
int error;
 
diff --git a/fs/open.c b/fs/open.c
index 25dbc85..84eb289 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -896,7 +896,7 @@ int vfs_open(const struct path *path, struct file *filp,
int do_cleanup = 0;
int ret;
 
-   if (!filp->f_original_path.mnt) {
+   if (!filp->f_original_path.mnt && dentry_open) {
filp->f_original_path = *path;
path_get(&filp->f_original_path);
do_cleanup = 1;



Re: [Devel] [PATCH RH7 v2] pfcache: hide trusted.pfcache from listxattr

2016-10-04 Thread Maxim Patlasov

The patch itself looks fine, so:

Reviewed-by: Maxim Patlasov 


As for the patch description, its last paragraph looks misleading. We seem 
to be safe only until RHEL reuses name_index == 9 (which was 
EXT4_XATTR_INDEX_TRUSTED_CSUM in v1 of the patch). I think it would 
be more honest to state it clearly: we do not support pfcache-ed ploop 
images from rh7-3.10.0-327.28.2.vz7.17.10.


Thanks,
Maxim

On 09/27/2016 08:31 AM, Pavel Tikhomirov wrote:

Need it to be able to rsync xattrs for encrypted containers which
have pfcache_csum disabled on the superblock.

When there is no PFCACHE_CSUM on the superblock or we are not
capable(CAP_SYS_ADMIN), we do not allow get/set of trusted.pfcache.
So also hide trusted.pfcache from the list in those two cases.

Tested: list/get of the xattr "trusted.pfcache" is OK on a file
setxattr-ed on a vz7.17.11 kernel, whose xattr entry had a wrong
e_name_index (reverted EXT4_XATTR_INDEX_TRUSTED_CSUM); it works as if
there is no such entry at all, as in ext4_xattr_list_entries
-> ext4_xattr_handler, where there is a special check for it.

v2: do checks in ext4_xattr_trusted_list which is used for
listing trusted.xxx xattrs

https://jira.sw.ru/browse/PSBM-52180
Signed-off-by: Pavel Tikhomirov 
---
  fs/ext4/xattr_trusted.c | 5 +
  1 file changed, 5 insertions(+)

diff --git a/fs/ext4/xattr_trusted.c b/fs/ext4/xattr_trusted.c
index 49dd83f..131b6b8 100644
--- a/fs/ext4/xattr_trusted.c
+++ b/fs/ext4/xattr_trusted.c
@@ -19,6 +19,11 @@ ext4_xattr_trusted_list(struct dentry *dentry, char *list, size_t list_size,
const size_t prefix_len = XATTR_TRUSTED_PREFIX_LEN;
const size_t total_len = prefix_len + name_len + 1;
  
+	if (!strcmp(name, EXT4_DATA_CSUM_NAME) &&

+   (!capable(CAP_SYS_ADMIN) ||
+!test_opt2(dentry->d_inode->i_sb, PFCACHE_CSUM)))
+   return 0;
+
if (!ve_capable(CAP_SYS_ADMIN))
return 0;
  




Re: [Devel] [PATCH 0/2] fuse: fix signals handling while processing request

2016-10-13 Thread Maxim Patlasov

Stas,


The series look fine, so:

Acked-by: Maxim Patlasov 


But, please, refine the description of the second patch. It must explain 
clearly why the patch fixes the problem:


block_sigs() blocks ordinary non-fatal signals as expected, but 
surprisingly SIGTRAP is special: it does not matter whether it comes 
before or after block_sigs(), the latter does not affect SIGTRAP at all! 
And in contrast, wait_event_killable() is immune to it -- only a fatal signal 
can wake it up.


Thanks,
Maxim

On 10/13/2016 03:03 AM, Stanislav Kinsburskiy wrote:

This patch fixes a wrong SIGBUS result in the page fault handler for a fuse file
when the process has received a signal.

https://jira.sw.ru/browse/PSBM-53581

---

Stanislav Kinsburskiy (2):
   new helper: wait_event_killable_exclusive()
   fuse: handle only fatal signals while waiting request answer


  fs/fuse/dev.c|   42 --
  include/linux/wait.h |   26 ++
  2 files changed, 42 insertions(+), 26 deletions(-)
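
For reference, a minimal sketch of what the new helper could look like on a
3.10-era include/linux/wait.h (an assumption for illustration -- the actual
patch body is not quoted here); it mirrors wait_event_killable() but queues
the waiter exclusively:

#define __wait_event_killable_exclusive(wq, condition, ret)		\
do {									\
	DEFINE_WAIT(__wait);						\
									\
	for (;;) {							\
		prepare_to_wait_exclusive(&wq, &__wait, TASK_KILLABLE);	\
		if (condition)						\
			break;						\
		if (!fatal_signal_pending(current)) {			\
			schedule();					\
			continue;					\
		}							\
		ret = -ERESTARTSYS;					\
		break;							\
	}								\
	finish_wait(&wq, &__wait);					\
} while (0)

#define wait_event_killable_exclusive(wq, condition)			\
({									\
	int __ret = 0;							\
	if (!(condition))						\
		__wait_event_killable_exclusive(wq, condition, __ret);	\
	__ret;								\
})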

--




[Devel] [PATCH rh7] fuse: process small sync direct reads synchronously

2016-10-13 Thread Maxim Patlasov
It is useless to process small sync direct reads asynchronously,
because that optimization works only if we send more than one
request to userspace fused concurrently.

On the other hand, the patch works around a problem reported by AK:

> If a cluster hangs (for any reason), all max_background fuse
> requests are usually consumed and async io is impossible.
> Unfortunately it also makes impossible to read .vstorage.info
> that is necessary to investigate why the cluster hanged.
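
For scale: with FUSE_MAX_PAGES_PER_REQ = 32 and 4K pages (the usual x86_64
values; stated here as an assumption), the cutoff in the hunk below is
32 << 12 = 128KB -- a sync read of up to 128KB fits into a single fuse
request anyway, so submitting it through the async path buys nothing.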

Signed-off-by: Maxim Patlasov 
---
 fs/fuse/file.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6b9e4ea..49ee3de 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3307,8 +3307,11 @@ fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 * We cannot asynchronously extend the size of a file. We have no method
 * to wait on real async I/O requests, so we must submit this request
 * synchronously.
+* And it's useless to process small sync READs asynchronously.
 */
-   if (!is_sync_kiocb(iocb) && (offset + count > i_size) && rw == WRITE)
+   if ((!is_sync_kiocb(iocb) && (offset + count > i_size) && rw == WRITE) 
||
+   (rw != WRITE && is_sync_kiocb(iocb) &&
+count <= (FUSE_MAX_PAGES_PER_REQ << PAGE_SHIFT)))
io->async = false;
 
if (rw == WRITE)



Re: [Devel] [PATCH] ext4: fix mkdir operations with overlayfs

2016-10-14 Thread Maxim Patlasov

Thanks! You may be interested in searching the devel@openvz.org archives for:

Subject: [PATCH rh7] ext4: ext4_mkdir must set S_IOPS_WRAPPER bit

Date: Mon, 25 Jul 2016 14:01:16 -0700


On 10/14/2016 09:47 AM, Vladimir Meshkov wrote:

ext4 supports extended operations like rename2, but the
inode isn't correctly marked after mkdir.

Signed-off-by: Alexey Lyashkov 

---
 fs/ext4/namei.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 0adc6df..bebe698 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2413,6 +2413,7 @@ retry:

  inode->i_op = &ext4_dir_inode_operations.ops;
  inode->i_fop = &ext4_dir_operations;
+ inode->i_flags |= S_IOPS_WRAPPER;
  err = ext4_init_new_dir(handle, dir, inode);
  if (err)
  goto out_clear_inode;
--
1.8.3.1




Re: [Devel] [PATCH 0/2] fuse: fix signals handling while processing request

2016-10-15 Thread Maxim Patlasov

Stas,


On 10/14/2016 03:30 AM, Stanislav Kinsburskiy wrote:




On 14.10.2016 02:23, Maxim Patlasov wrote:

Stas,


The series look fine, so:

Acked-by: Maxim Patlasov 


But, please, refine the description of the second patch. It must 
explain clearly why the patch fixes the problem:


block_sigs() blocks ordinary non-fatal signals as expected, but 
surprisingly SIGTRAP is special: it does not matter whether it comes 
before or after block_sigs(), the latter does not affect SIGTRAP at 
all! And in contrast, wait_event_killable() is immune to it -- only 
a fatal signal can wake it up.




No, Maxim. You are making a mistake here.


Yes, I agree.



There is nothing special about SIGTRAP (although it's sometimes 
sent via force_sig_info()).


OK.



The problem is described as it is: block_sigs() doesn't (!) clear the 
TIF_SIGPENDING flag. All it does is block future signals from arriving.


OK. But I disagree with your explanation why it doesn't clear the flag.



Moreover, __set_task_blocked() calls recalc_sigpending(), which checks 
whether any of the signals to block is present in the process pending 
mask, and if so sets (!) TIF_SIGPENDING on the task.


Only if the correspondent bit is NOT set in blocked->sig[]:

> case 1: ready  = signal->sig[0] &~ blocked->sig[0];

That's definitely not the case for block_sigs(), which sets all bits in 
blocked->sig[] except sigmask(SIGKILL). So, in our case 
recalc_sigpending() can only clear the TIF_SIGPENDING flag, not set it.



IOW, any pending signal remains pending after a call to block_sigs().


No. Conversely: non-fatal signals do NOT remain pending after a call 
to block_sigs(). You can ascertain it yourself by debugging how 
block_sigs() reacts to "kill -USR1".
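
For reference, block_sigs() in fuse looks roughly like this (reproduced 
from memory, so treat it as a sketch):

static void block_sigs(sigset_t *oldset)
{
	sigset_t mask;

	siginitsetinv(&mask, sigmask(SIGKILL));	/* every bit set except SIGKILL */
	sigprocmask(SIG_BLOCK, &mask, oldset);	/* ends up in recalc_sigpending() */
}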


And that is the root of the issue (as described in the patch 
comment).


No. The root of the issue is in ptrace(2) calling ptrace_request(), 
calling task_set_jobctl_pending(), setting JOBCTL_TRAP_STOP in 
task->jobctl. So, when fuse calls block_sigs(), it eventually calls 
recalc_sigpending() which calls recalc_sigpending_tsk() which looks like 
this:


>if ((t->jobctl & JOBCTL_PENDING_MASK) ||
>PENDING(&t->pending, &t->blocked) ||
>PENDING(&t->signal->shared_pending, &t->blocked)) {
>set_tsk_thread_flag(t, TIF_SIGPENDING);
>return 1;

but as we know ptrace(2) already set JOBCTL_TRAP_STOP in task->jobctl:

> #define JOBCTL_TRAP_MASK(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
> #define JOBCTL_PENDING_MASK(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)

To sum it up, the patch from Al Viro that you backported doesn't change 
fuse behavior w.r.t. signals, but it nicely replaces signal_pending() with 
fatal_signal_pending(), and the latter solves our case because it checks 
for SIGKILL explicitly:


> static inline int fatal_signal_pending(struct task_struct *p)
> {
> return signal_pending(p) && __fatal_signal_pending(p);
> }

> static inline int __fatal_signal_pending(struct task_struct *p)
> {
> return unlikely(sigismember(&p->pending.signal, SIGKILL));
> }

Thanks,
Maxim




Thanks,
Maxim

On 10/13/2016 03:03 AM, Stanislav Kinsburskiy wrote:
This patch fixes a wrong SIGBUS result in the page fault handler for a fuse 
file when the process has received a signal.

https://jira.sw.ru/browse/PSBM-53581

---

Stanislav Kinsburskiy (2):
   new helper: wait_event_killable_exclusive()
   fuse: handle only fatal signals while waiting request answer


  fs/fuse/dev.c|   42 
--

  include/linux/wait.h |   26 ++
  2 files changed, 42 insertions(+), 26 deletions(-)

--








Re: [Devel] [PATCH 0/2] fuse: fix signals handling while processing request

2016-10-17 Thread Maxim Patlasov

Stas,


Now that we fully understand the patch, will you fix the description of 
98cbcb14d? The following does not seem correct:


>IOW, any signal, arrived to the process, which does page fault 
handling on fuse

>file, _before_ request_wait_answer() is called, will lead to request
>interruption, producing SIGBUS error in page fault handler 
(filemap_fault).



Thanks,

Maxim


On 10/17/2016 04:59 AM, Stanislav Kinsburskiy wrote:




On 16.10.2016 05:21, Maxim Patlasov wrote:

Stas,


On 10/14/2016 03:30 AM, Stanislav Kinsburskiy wrote:




On 14.10.2016 02:23, Maxim Patlasov wrote:

Stas,


The series look fine, so:

Acked-by: Maxim Patlasov 


But, please, refine the description of the second patch. It must 
explain clearly why the patch fixes the problem:


block_sigs() blocks ordinary non-fatal signals as expected, but 
surprisingly SIGTRAP is special: it does not matter whether it 
comes before or after block_sigs(), the latter does not affect 
SIGTRAP at all! And in contrast, wait_event_killable() is immune to 
it -- only a fatal signal can wake it up.




No, Maxim. You are making a mistake here.


Yes, I agree.



There is nothing special about SIGTRAP (although it's sometimes 
sent via force_sig_info()).


OK.



The problem is described as it is: block_sigs() doesn't (!) clear the 
TIF_SIGPENDING flag. All it does is block future signals from arriving.


OK. But I disagree with your explanation why it doesn't clear the flag.



Moreover, __set_task_blocked() calls recalc_sigpending(), which checks 
whether any of the signals to block is present in the process pending 
mask, and if so sets (!) TIF_SIGPENDING on the task.


Only if the correspondent bit is NOT set in blocked->sig[]:

> case 1: ready  = signal->sig[0] &~ blocked->sig[0];

That's definitely not the case for block_sigs(), which sets all bits in 
blocked->sig[] except sigmask(SIGKILL). So, in our case 
recalc_sigpending() can only clear the TIF_SIGPENDING flag, not set it.



Agreed.


IOW, any pending signal remains pending after a call to block_sigs().


No. Conversely: non-fatal signals do NOT remain pending after a call 
to block_sigs(). You can ascertain it yourself by debugging how 
block_sigs() reacts to "kill -USR1".


And that is the root of the issue (as described in the patch 
comment).


No. The root of the issue is in ptrace(2) calling ptrace_request(), 
calling task_set_jobctl_pending(), setting JOBCTL_TRAP_STOP in 
task->jobctl. So, when fuse calls block_sigs(), it eventually calls 
recalc_sigpending() which calls recalc_sigpending_tsk() which looks 
like this:


>if ((t->jobctl & JOBCTL_PENDING_MASK) ||
>PENDING(&t->pending, &t->blocked) ||
>PENDING(&t->signal->shared_pending, &t->blocked)) {
>set_tsk_thread_flag(t, TIF_SIGPENDING);
>return 1;

but as we know ptrace(2) already set JOBCTL_TRAP_STOP in task->jobctl:

> #define JOBCTL_TRAP_MASK(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
> #define JOBCTL_PENDING_MASK(JOBCTL_STOP_PENDING | 
JOBCTL_TRAP_MASK)




Nice catch, thanks.

To sum it up, the patch from Al Viro that you backported doesn't 
change fuse behavior w.r.t. signals, but it nicely replaces 
signal_pending() with fatal_signal_pending(), and the latter solves our 
case because it checks for SIGKILL explicitly:


> static inline int fatal_signal_pending(struct task_struct *p)
> {
> return signal_pending(p) && __fatal_signal_pending(p);
> }

> static inline int __fatal_signal_pending(struct task_struct *p)
> {
> return unlikely(sigismember(&p->pending.signal, SIGKILL));
> }

Thanks,
Maxim




Thanks,
Maxim

On 10/13/2016 03:03 AM, Stanislav Kinsburskiy wrote:
This patch fixes a wrong SIGBUS result in the page fault handler for 
a fuse file when the process has received a signal.

https://jira.sw.ru/browse/PSBM-53581

---

Stanislav Kinsburskiy (2):
   new helper: wait_event_killable_exclusive()
   fuse: handle only fatal signals while waiting request answer


  fs/fuse/dev.c|   42 
--

  include/linux/wait.h |   26 ++
  2 files changed, 42 insertions(+), 26 deletions(-)

--












[Devel] [PATCH rh7] md: add support for dm-crypted ploops

2016-10-18 Thread Maxim Patlasov
The previous patch adding support for dm-crypt ploops naively assumed
that every time we build a dm-crypt device, crypt_ctr() is called, and
vice versa - every time we dismantle it, crypt_dtr() is called. But
in practice, crypt_ctr/dtr is called more than once because md->map
is RCU-protected. For example, during resize, a new dm target is
constructed and registered in md->map, then the former md->map is
released.

Only dm-crypt knows how to find the underlying ploop device. That's why
the patch implements the ploop_modify method of the dm-crypt target. And
only the general dm code (dm.c, dm-table.c, dm-ioctl.c and friends) knows
which instance of the dm target is current and which is obsolete. That's
why the patch orchestrates calling this new method from the general code,
close to __bind/__unbind managing the md->map pointer.
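
Roughly, the resume path then looks like this (a simplified sketch of the
ordering; see the do_resume() hunk below):

/*
 * do_resume(), simplified:
 *
 *   old_map = dm_swap_table(md, new_map);            // __bind(): md->map = new_map
 *   dm_table_ploop_modify(old_map, DM_PLOOP_DETACH); // old targets let go of ploop
 *   dm_table_ploop_modify(new_map, DM_PLOOP_ATTACH); // new targets take over
 *   ...
 *   dm_table_destroy(old_map);                       // crypt_dtr() for stale targets
 */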

https://jira.sw.ru/browse/PSBM-53386

Signed-off-by: Maxim Patlasov 
---
 drivers/md/dm-crypt.c |   25 -
 drivers/md/dm-ioctl.c |3 +++
 drivers/md/dm-table.c |   15 +++
 drivers/md/dm.c   |4 +++-
 drivers/md/dm.h   |1 +
 include/linux/device-mapper.h |9 +
 6 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index bcdd794..41019b8 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1647,10 +1647,8 @@ static void crypt_dtr(struct dm_target *ti)
if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
cc->iv_gen_ops->dtr(cc);
 
-   if (cc->dev) {
-   ploop_set_dm_crypt_bdev(cc->dev->bdev, NULL);
+   if (cc->dev)
dm_put_device(ti, cc->dev);
-   }
 
kzfree(cc->cipher);
kzfree(cc->cipher_string);
@@ -1919,8 +1917,6 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
goto bad;
}
 
-   ploop_set_dm_crypt_bdev(cc->dev->bdev, dm_md_get_bdev(dm_table_get_md(ti->table)));
-
if (sscanf(argv[4], "%llu%c", &tmpll, &dummy) != 1) {
ti->error = "Invalid device sector";
goto bad;
@@ -2173,6 +2169,24 @@ static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
limits->max_segment_size = PAGE_SIZE;
 }
 
+static void crypt_ploop_modify(struct dm_target *ti, int action)
+{
+   struct crypt_config *cc = ti->private;
+
+   if (cc && cc->dev)
+   switch (action) {
+   case DM_PLOOP_ATTACH:
+   ploop_set_dm_crypt_bdev(cc->dev->bdev,
+   dm_md_get_bdev(dm_table_get_md(ti->table)));
+   break;
+   case DM_PLOOP_DETACH:
+   ploop_set_dm_crypt_bdev(cc->dev->bdev, NULL);
+   break;
+   default:
+   BUG();
+   }
+}
+
 static struct target_type crypt_target = {
.name   = "crypt",
.version = {1, 14, 1},
@@ -2188,6 +2202,7 @@ static struct target_type crypt_target = {
.merge  = crypt_merge,
.iterate_devices = crypt_iterate_devices,
.io_hints = crypt_io_hints,
+   .ploop_modify = crypt_ploop_modify,
 };
 
 static int __init dm_crypt_init(void)
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 0fe4233..fd4b2cc 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1029,6 +1029,9 @@ static int do_resume(struct dm_ioctl *param)
return PTR_ERR(old_map);
}
 
+   dm_table_ploop_modify(old_map, DM_PLOOP_DETACH);
+   dm_table_ploop_modify(new_map, DM_PLOOP_ATTACH);
+
if (dm_table_get_mode(new_map) & FMODE_WRITE)
set_disk_ro(dm_disk(md), 0);
else
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 16ba55a..e910fbe 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1705,3 +1705,18 @@ void dm_table_run_md_queue_async(struct dm_table *t)
 }
 EXPORT_SYMBOL(dm_table_run_md_queue_async);
 
+void dm_table_ploop_modify(struct dm_table *t, int action)
+{
+   unsigned int i;
+
+   if (!t)
+   return;
+
+   /* attach or detach the targets */
+   for (i = 0; i < t->num_targets; i++) {
+   struct dm_target *tgt = t->targets + i;
+
+   if (tgt->type->ploop_modify)
+   tgt->type->ploop_modify(tgt, action);
+   }
+}
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a7993cf..210221e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -3088,7 +3088,9 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
   dm_device_name(md), atomic_read(&md->holders));
 
dm_sysfs_exit(md);
-   dm_table_destroy(__un

[Devel] [PATCH rh7] ploop: push_backup: fix pbd->ppb_lock deadlock

2016-10-19 Thread Maxim Patlasov
Ploop push_backup must use spin_lock_irq[save] for pbd->ppb_lock.
Otherwise a classic deadlock is possible:

1) vz_backup_client acquires ppb_lock:

  ploop_ioctl -->
ploop_push_backup_io -->
  ploop_push_backup_io_read -->
ploop_push_backup_io_get -->
  ploop_pb_get_pending -->
spin_lock(&pbd->ppb_lock);

2) ploop_thread spins on ppb_lock while holding plo->lock:

  ploop_thread calls "spin_lock_irq(&plo->lock);", then -->
process_bio_queue_main -->
  process_bio_queue_one -->
ploop_pb_bio_detained -->
  ploop_pb_check_and_clear_bit -->
spin_lock(&pbd->ppb_lock);

3) vz_backup_client is interrupted by bio completion:

  bio_endio -->
bio->bi_end_io (== dio_endio_async) -->
  ploop_complete_io_request -->
ploop_complete_io_state -->
  spin_lock_irqsave(&plo->lock, flags);

From now on, the interrupt handler cannot proceed because ploop_thread
holds plo->lock, and ploop_thread cannot proceed because vz_backup_client
holds ppb_lock, and vz_backup_client cannot proceed because it's
interrupted by that interrupt handler. Classic deadlock.
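
The rule the fix follows, shown as a minimal generic sketch (illustrative
code, not ploop): a lock that sits in a dependency chain with a lock taken
from IRQ context must itself be taken with interrupts disabled:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(outer);	/* stands for plo->lock */
static DEFINE_SPINLOCK(inner);	/* stands for ppb_lock  */

static void thread_ctx(void)	/* ploop_thread */
{
	spin_lock_irq(&outer);
	spin_lock(&inner);	/* spins while client holds "inner" */
	spin_unlock(&inner);
	spin_unlock_irq(&outer);
}

static void client_ctx(void)	/* vz_backup_client */
{
	spin_lock_irq(&inner);	/* was spin_lock(): an IRQ arriving here
				   could spin on "outer" forever */
	spin_unlock_irq(&inner);
}

static void irq_ctx(void)	/* bio completion handler */
{
	unsigned long flags;

	spin_lock_irqsave(&outer, flags);
	spin_unlock_irqrestore(&outer, flags);
}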

Signed-off-by: Maxim Patlasov 
---
 drivers/block/ploop/push_backup.c |   60 +++--
 1 file changed, 31 insertions(+), 29 deletions(-)

diff --git a/drivers/block/ploop/push_backup.c b/drivers/block/ploop/push_backup.c
index 525576d..f825575 100644
--- a/drivers/block/ploop/push_backup.c
+++ b/drivers/block/ploop/push_backup.c
@@ -349,7 +349,7 @@ static int ploop_pb_health_monitor(void * data)
struct ploop_pushbackup_desc *pbd = data;
struct ploop_device  *plo = pbd->plo;
 
-   spin_lock(&pbd->ppb_lock);
+   spin_lock_irq(&pbd->ppb_lock);
while (!kthread_should_stop() || pbd->ppb_state == PLOOP_PB_STOPPING) {
 
DEFINE_WAIT(_wait);
@@ -359,21 +359,21 @@ static int ploop_pb_health_monitor(void * data)
kthread_should_stop())
break;
 
-   spin_unlock(&pbd->ppb_lock);
+   spin_unlock_irq(&pbd->ppb_lock);
schedule();
-   spin_lock(&pbd->ppb_lock);
+   spin_lock_irq(&pbd->ppb_lock);
}
finish_wait(&pbd->ppb_waitq, &_wait);
 
if (pbd->ppb_state == PLOOP_PB_STOPPING) {
-   spin_unlock(&pbd->ppb_lock);
+   spin_unlock_irq(&pbd->ppb_lock);
mutex_lock(&plo->ctl_mutex);
ploop_pb_stop(pbd, true);
mutex_unlock(&plo->ctl_mutex);
-   spin_lock(&pbd->ppb_lock);
+   spin_lock_irq(&pbd->ppb_lock);
}
}
-   spin_unlock(&pbd->ppb_lock);
+   spin_unlock_irq(&pbd->ppb_lock);
return 0;
 }
 
@@ -633,21 +633,21 @@ int ploop_pb_preq_add_pending(struct ploop_pushbackup_desc *pbd,
 {
BUG_ON(!pbd);
 
-   spin_lock(&pbd->ppb_lock);
+   spin_lock_irq(&pbd->ppb_lock);
 
if (pbd->ppb_state != PLOOP_PB_ALIVE) {
-   spin_unlock(&pbd->ppb_lock);
+   spin_unlock_irq(&pbd->ppb_lock);
return -ESTALE;
}
 
if (!test_bit(PLOOP_S_PUSH_BACKUP, &pbd->plo->state)) {
-   spin_unlock(&pbd->ppb_lock);
+   spin_unlock_irq(&pbd->ppb_lock);
return -EINTR;
}
 
if (check_bit_in_map(pbd->reported_map, pbd->ppb_block_max,
 preq->req_cluster)) {
-   spin_unlock(&pbd->ppb_lock);
+   spin_unlock_irq(&pbd->ppb_lock);
return -EALREADY;
}
 
@@ -656,7 +656,7 @@ int ploop_pb_preq_add_pending(struct ploop_pushbackup_desc *pbd,
if (pbd->ppb_waiting)
complete(&pbd->ppb_comp);
 
-   spin_unlock(&pbd->ppb_lock);
+   spin_unlock_irq(&pbd->ppb_lock);
return 0;
 }
 
@@ -708,20 +708,20 @@ unsigned long ploop_pb_stop(struct ploop_pushbackup_desc *pbd, bool do_merge)
if (pbd == NULL)
return 0;
 
-   spin_lock(&pbd->ppb_lock);
+   spin_lock_irq(&pbd->ppb_lock);
if (pbd->ppb_state == PLOOP_PB_DEAD) {
-   spin_unlock(&pbd->ppb_lock);
+   spin_unlock_irq(&pbd->ppb_lock);
return 0;
}
pbd->ppb_state = PLOOP_PB_DEAD;
-   spin_unlock(&pbd->ppb_lock);
+   spin_unlock_irq(&pbd->ppb_lock);
 
ploop_pbs_fini(&pbd->pending_set);
ploop_pbs_fini(&pbd

[Devel] [PATCH rh7 1/2] vfs: make guard_bh_eod() more generic

2016-10-21 Thread Maxim Patlasov
The patch backports commit 59d43914ed7b96255271ad6b7b735344beffa3c0 from 
mainline:

vfs: make guard_bh_eod() more generic

This patchset implements readpages() operation for block device by using
mpage_readpages() which can create multipage BIOs instead of BIOs for each
page and reduce system CPU time consumption.

This patch (of 3):

guard_bh_eod() is used in submit_bh() to allow us to do IO even on the odd
last sectors of a device, even if the block size is some multiple of the
physical sector size.  This makes guard_bh_eod() more generic and renames
it guard_bio_eod() so that we can use it without struct buffer_head
argument.

The reason for this change is that using mpage_readpages() for block
device requires to add this guard check in mpage code.

Signed-off-by: Akinobu Mita 
Cc: Jens Axboe 
Cc: Alexander Viro 
Cc: Jeff Moyer 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

Signed-off-by: Maxim Patlasov 
---
 fs/buffer.c |   26 --
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 2b709d4..a7cb15c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2972,7 +2972,7 @@ static void end_bio_bh_io_sync(struct bio *bio, int err)
 
 /*
  * This allows us to do IO even on the odd last sectors
- * of a device, even if the bh block size is some multiple
+ * of a device, even if the block size is some multiple
  * of the physical sector size.
  *
  * We'll just truncate the bio to the size of the device,
@@ -2982,10 +2982,11 @@ static void end_bio_bh_io_sync(struct bio *bio, int err)
  * errors, this only handles the "we need to be able to
  * do IO at the final sector" case.
  */
-static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
+static void guard_bio_eod(int rw, struct bio *bio)
 {
sector_t maxsector;
-   unsigned bytes;
+   struct bio_vec *bvec = &bio->bi_io_vec[bio->bi_vcnt - 1];
+   unsigned truncated_bytes;
 
maxsector = i_size_read(bio->bi_bdev->bd_inode) >> 9;
if (!maxsector)
@@ -3000,23 +3001,20 @@ static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh)
return;
 
maxsector -= bio->bi_sector;
-   bytes = bio->bi_size;
-   if (likely((bytes >> 9) <= maxsector))
+   if (likely((bio->bi_size >> 9) <= maxsector))
return;
 
-   /* Uhhuh. We've got a bh that straddles the device size! */
-   bytes = maxsector << 9;
+   /* Uhhuh. We've got a bio that straddles the device size! */
+   truncated_bytes = bio->bi_size - (maxsector << 9);
 
/* Truncate the bio.. */
-   bio->bi_size = bytes;
-   bio->bi_io_vec[0].bv_len = bytes;
+   bio->bi_size -= truncated_bytes;
+   bvec->bv_len -= truncated_bytes;
 
/* ..and clear the end of the buffer for reads */
if ((rw & RW_MASK) == READ) {
-   void *kaddr = kmap_atomic(bh->b_page);
-   memset(kaddr + bh_offset(bh) + bytes, 0, bh->b_size - bytes);
-   kunmap_atomic(kaddr);
-   flush_dcache_page(bh->b_page);
+   zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
+   truncated_bytes);
}
 }
 
@@ -3057,7 +3055,7 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
bio->bi_flags |= bio_flags;
 
/* Take care of bh's that straddle the end of the device */
-   guard_bh_eod(rw, bio, bh);
+   guard_bio_eod(rw, bio);
 
if (buffer_meta(bh))
rw |= REQ_META;



[Devel] [PATCH rh7 2/2] vfs: guard end of device for mpage interface

2016-10-21 Thread Maxim Patlasov
The patch backports 4db96b71e3caea5bb39053d57683129e0682c66f from mainline:

vfs: guard end of device for mpage interface

Add guard_bio_eod() check for mpage code in order to allow us to do IO
even on the odd last sectors of a device, even if the block size is some
multiple of the physical sector size.

Using mpage_readpages() for block device requires this guard check.

Signed-off-by: Akinobu Mita 
Cc: Jens Axboe 
Cc: Alexander Viro 
Cc: Jeff Moyer 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

Signed-off-by: Maxim Patlasov 
---
 fs/buffer.c   |2 +-
 fs/internal.h |5 +
 fs/mpage.c|2 ++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index a7cb15c..c45200d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2982,7 +2982,7 @@ static void end_bio_bh_io_sync(struct bio *bio, int err)
  * errors, this only handles the "we need to be able to
  * do IO at the final sector" case.
  */
-static void guard_bio_eod(int rw, struct bio *bio)
+void guard_bio_eod(int rw, struct bio *bio)
 {
sector_t maxsector;
struct bio_vec *bvec = &bio->bi_io_vec[bio->bi_vcnt - 1];
diff --git a/fs/internal.h b/fs/internal.h
index b538b3d..6f4120e 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -38,6 +38,11 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
 #endif
 
 /*
+ * buffer.c
+ */
+extern void guard_bio_eod(int rw, struct bio *bio);
+
+/*
  * char_dev.c
  */
 extern void __init chrdev_init(void);
diff --git a/fs/mpage.c b/fs/mpage.c
index 0face1c..c4f7bf6c 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include "internal.h"
 
 /*
  * I/O completion handler for multipage BIOs.
@@ -74,6 +75,7 @@ static void mpage_end_io(struct bio *bio, int err)
 static struct bio *mpage_bio_submit(int rw, struct bio *bio)
 {
bio->bi_end_io = mpage_end_io;
+   guard_bio_eod(rw, bio);
submit_bio(rw, bio);
return NULL;
 }



[Devel] [PATCH rh7 0/2] vfs: avoid attempts to access beyond end of device

2016-10-21 Thread Maxim Patlasov
The series backports two commits from mainline fixing
attempts to read beyond the end of the device. This is useful
because, since we began to use dm-crypt over ploop,
the message "attempt to access beyond end of device"
has been printed (to dmesg and /var/log/messages) every time
we start an encrypted container.
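
For intuition, a worked example of what guard_bio_eod() from patch 1 does
(the numbers are illustrative, assuming 512-byte sectors and a 4K block size):

/*
 * device size = 7 sectors                  -> maxsector = 7
 * bio: bi_sector = 4, bi_size = 4096       -> wants sectors 4..11
 * maxsector -= bi_sector                   -> only 3 sectors available
 * truncated_bytes = 4096 - (3 << 9) = 2560
 * bi_size -> 1536 (3 sectors); on READ the last 2560 bytes
 * of the page are zeroed instead of being read.
 */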

---

Maxim Patlasov (2):
  vfs: make guard_bh_eod() more generic
  vfs: guard end of device for mpage interface


 fs/buffer.c   |   26 --
 fs/internal.h |5 +
 fs/mpage.c|2 ++
 3 files changed, 19 insertions(+), 14 deletions(-)


