Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

2016-06-16 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Dima,
>
> I agree that the ploop barrier code is broken in many ways, but I don't
> think the patch actually fixes it. I hope you would agree that
> completion of REQ_FUA guarantees only the landing of that particular bio on
> the disk; it says nothing about flushing previously submitted (and
> completed) bio-s, and it is also possible that a power outage may catch us
> when this REQ_FUA has already landed on the disk but previous bio-s have
> not yet.
Actually it does (but implicitly): Linux handles FUA as FLUSH,W,FLUSH.
So yes, it would be more correct to tag WBI with FLUSH_FUA.
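For reference, this is roughly the per-write policy the block layer applies
(illustrative sketch only, not the real blk-flush.c; REQ_FSEQ_* are the
flush-sequence steps, and q->flush_flags says what the device supports
natively):

static unsigned int flush_policy_sketch(struct request_queue *q,
					unsigned long rw)
{
	unsigned int policy = 0;

	if (rw & REQ_FLUSH)
		policy |= REQ_FSEQ_PREFLUSH;
	if (rw & REQ_FUA) {
		policy |= REQ_FSEQ_DATA;
		/* No native FUA: emulate it with a post-flush, which as a
		 * side effect also makes previously completed writes
		 * durable -- the behaviour relied upon above. */
		if (!(q->flush_flags & REQ_FUA))
			policy |= REQ_FSEQ_POSTFLUSH;
	}
	return policy;
}
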
> Hence, for RELOC_{A|S} requests we actually need something like that:
>
>   RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
>   RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA
>
> (i.e. we do need to flush all previously submitted data before starting 
> to update BAT on disk)
>
Correct sequence (R1: read data from the old cluster, W2: write it to the new
location, WBI: write-back of the BAT index, W1:NULLIFY: zero the old cluster):
RELOC_S: R1, W2, WBI:FLUSH_FUA
RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA
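Concretely, tagging the WBI this way ends up in ploop_index_update(); this is
roughly what the v3 patch later in the thread does:

	/* Relocation requires consistent writes: flush previously written
	 * data and make the BAT page itself durable. */
	if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
	    test_bit(PLOOP_REQ_RELOC_S, &preq->state))
		fua = 1;	/* io_direct emits this page as REQ_FLUSH|REQ_FUA */
	top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, fua);
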

> not simply:
>
>> RELOC_S: R1, W2, WBI:FUA
>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>
> Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and 
> PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we 
> could remove them completely (along with that optimization delaying 
> incoming FUA) and re-implement all this stuff from scratch:
>
> 1) The final "NULLIFY:FUA" is a piece of cake -- it's enough to set 
> REQ_FUA in preq->req_rw before calling ->submit(preq)
>
> 2) For "FLUSH:WB, WBI:FUA" it is actually enough to send bio updating 
> BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for 
> RELOC_A|S in ploop_index_update and map_wb_complete
>
> 3) For that optimization delaying incoming FUA (what we do now if 
> ploop_req_delay_fua_possible() returns true) we could introduce new 
> ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update 
> and map_wb_complete (the same thing as 2) above). And, yes, let's 
> WARN_ON if we somehow missed its processing.
Yes. This was one of my ideas.
1) FORCE_FLUSH and FORCE_FUA are redundant states which simply mirror
RELOC_{A,S} semantics. Let's get rid of that crap and simply introduce
PLOOP_IO_FLUSH_DELAYED.
2) Fix ->write_page to handle flush as it does with fua.
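For (2), ->write_page() would have to pass the flush bits through to the bio
it builds; a minimal sketch for io_direct (the bdev field and completion hook
names are assumptions, not taken from the actual code):

static void
dio_write_page(struct ploop_io *io, struct ploop_request *preq,
	       struct page *page, sector_t sec, unsigned long rw)
{
	struct bio *bio = bio_alloc(GFP_NOIO, 1);

	bio->bi_bdev    = io->files.bdev;	/* assumed field name */
	bio->bi_sector  = sec;
	bio_add_page(bio, page, PAGE_SIZE, 0);
	bio->bi_end_io  = dio_endio_async;	/* assumed completion hook */
	bio->bi_private = preq;

	atomic_inc(&preq->io_count);
	/* REQ_FLUSH orders this write after previously completed data;
	 * REQ_FUA makes the index page itself durable. */
	submit_bio(WRITE | (rw & (REQ_FLUSH | REQ_FUA)), bio);
}
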
>
> The only complication I foresee is about how to teach kaio to pre-flush 
> in kaio_write_page -- it's doable, but involves kaio_resubmit that's 
> already pretty convoluted.
>
Yes. kaio_submit is correct, but kaio_write_page does not care about REQ_FLUSH.
> Btw, I accidentally noticed an awful silly bug in kaio_complete_io_state(): 
> we check for REQ_FUA after clearing it! This makes all FUA-s on the 
> ordinary kaio_submit path silently lost...
>
> Thanks,
> Maxim
>
>
> On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:
>> The barrier code is broken in many ways:
>> currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly,
>> but requests can also go through ->dio_submit_alloc()->dio_submit_pad and
>> write_page (for indexes).
>> So in case of grow_dev we have the following sequence:
>>
>> E_RELOC_DATA_READ:
>>   ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>->delta->allocate
>>   ->io->submit_alloc: dio_submit_alloc
>> ->dio_submit_pad
>> E_DATA_WBI : data written, time to update index
>>->delta->allocate_complete:ploop_index_update
>>  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>  ->write_page
>>  ->ploop_map_wb_complete
>>->ploop_wb_complete_post_process
>>  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> E_RELOC_NULLIFY:
>>
>> ->submit()
>>
>> This patch unifies barrier handling as follows:
>> - Add an assertion to ploop_complete_request for the FORCE_{FLUSH,FUA} state
>> - Perform an explicit FUA inside index_update for RELOC requests.
>>
>> This makes the reloc sequence optimal:
>> RELOC_S: R1, W2, WBI:FUA
>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>>
>> https://jira.sw.ru/browse/PSBM-47107
>> Signed-off-by: Dmitry Monakhov 
>> ---
>>   drivers/block/ploop/dev.c | 10 +++---
>>   drivers/block/ploop/map.c | 29 -
>>   2 files changed, 19 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
>> index 96f7850..998fe71 100644
>> --- a/drivers/block/ploop/dev.c
>> +++ b/drivers/block/ploop/dev.c
>>

Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

2016-06-16 Thread Dmitry Monakhov
Dmitry Monakhov  writes:

> Maxim Patlasov  writes:
>
>> Dima,
>>
>> I agree that the ploop barrier code is broken in many ways, but I don't
>> think the patch actually fixes it. I hope you would agree that
>> completion of REQ_FUA guarantees only the landing of that particular bio on
>> the disk; it says nothing about flushing previously submitted (and
>> completed) bio-s, and it is also possible that a power outage may catch us
>> when this REQ_FUA has already landed on the disk but previous bio-s have
>> not yet.
> Actually it does (but implicitly): Linux handles FUA as FLUSH,W,FLUSH.
> So yes, it would be more correct to tag WBI with FLUSH_FUA.
>> Hence, for RELOC_{A|S} requests we actually need something like that:
>>
>>   RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
>>   RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA
>>
>> (i.e. we do need to flush all previously submitted data before starting 
>> to update BAT on disk)
>>
> Correct sequence:
> RELOC_S: R1, W2, WBI:FLUSH_FUA
> RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA
>
>> not simply:
>>
>>> RELOC_S: R1, W2, WBI:FUA
>>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>>
>> Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and 
>> PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we 
>> could remove them completely (along with that optimization delaying 
>> incoming FUA) and re-implement all this stuff from scratch:
>>
>> 1) The final "NULLIFY:FUA" is a piece of cake -- it's enough to set 
>> REQ_FUA in preq->req_rw before calling ->submit(preq)
>>
>> 2) For "FLUSH:WB, WBI:FUA" it is actually enough to send bio updating 
>> BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for 
>> RELOC_A|S in ploop_index_update and map_wb_complete
>>
>> 3) For that optimization delaying incoming FUA (what we do now if 
>> ploop_req_delay_fua_possible() returns true) we could introduce new 
>> ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update 
>> and map_wb_complete (the same thing as 2) above). And, yes, let's 
>> WARN_ON if we somehow missed its processing.
> Yes. This was one of my ideas.
> 1) FORCE_FLUSH and FORCE_FUA are redundant states which simply mirror
> RELOC_{A,S} semantics. Let's get rid of that crap and simply introduce
> PLOOP_IO_FLUSH_DELAYED.
> 2) Fix ->write_page to handle flush as it does with fua.
>>
>> The only complication I foresee is about how to teach kaio to pre-flush 
>> in kaio_write_page -- it's doable, but involves kaio_resubmit that's 
>> already pretty convoluted.
>>
> Yes. kaio_submit is correct, but kaio_write_page does not care about REQ_FLUSH.
Crap. Currently kaio can handle fsync only via kaio_queue_fsync_req,
which is async and not suitable for page_write.
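Something like the following shape would be needed in kaio_write_page (a
sketch only; the synchronous io->ops->sync() placement here is hypothetical,
and is exactly the problematic part):

static void kaio_write_page(struct ploop_io *io, struct ploop_request *preq,
			    struct page *page, sector_t sec, int fua)
{
	/* A REQ_FLUSH-like preflush must complete *before* the page write;
	 * kaio_queue_fsync_req() is async, so it cannot give that ordering. */
	if (preq->req_rw & REQ_FLUSH)
		io->ops->sync(io);	/* hypothetical placement */

	/* No FUA in kaio, convert it to fsync (existing behaviour) */
	if (fua)
		set_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state);

	/* ... then submit the page write as the current code does ... */
}
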
Max, let's make an agreement about terminology.
The reason I wrote this is that Linux internally interprets FUA as
preflush,write,postflush, which is wrong from an academic point of view, but
it is the world we live in with Linux. This is the reason I read the code
differently from the way it was designed.
Let's state that ploop is an ideal world where:
FLUSH ==> preflush
FUA   ==> WRITE,postflush
For that reason we can perform the reloc scheme as:

RELOC_A: R1, W2:FUA, WBI:FUA, W1:NULLIFY|FUA
RELOC_S: R1, W2:FUA, WBI:FUA

This allows us to handle FUA effectively and convert it to DELAYED_FLUSH where
possible. Also, let's pin the may_fua_delay semantics to the exact eng_state:

static int may_fua_delay(struct ploop_request *preq)
{
	int may_delay = 1;

	/* Effectively this is equivalent to
	 *   preq->eng_state != PLOOP_E_COMPLETE
	 * but it is more readable, and less error prone in the future.
	 */
	if (preq->eng_state != PLOOP_E_DATA_WBI)
		may_delay = 0;
	if (test_bit(PLOOP_REQ_RELOC_S, &preq->state) ||
	    test_bit(PLOOP_REQ_RELOC_A, &preq->state))
		may_delay = 0;
	return may_delay;
}
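
In dio_submit() this would be wired roughly as the v3 patch later in the
thread does (sketch, with may_fua_delay() substituted for
ploop_req_delay_fua_possible()):

	fua = !!(rw & REQ_FUA);
	if (fua && may_fua_delay(preq)) {
		/* Mark req that a delayed flush is required */
		set_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state);
		fua = 0;
	}
	rw &= ~(REQ_FLUSH | REQ_FUA);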





>> Btw, I accidentally noticed an awful silly bug in kaio_complete_io_state(): 
>> we check for REQ_FUA after clearing it! This makes all FUA-s on the 
>> ordinary kaio_submit path silently lost...
>>
>> Thanks,
>> Maxim
>>
>>
>> On 06/15/2016 07:49 AM, Dmitry Monakhov wrote:
>>> The barrier code is broken in many ways:
>>> currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly,
>>> but requests can also go through ->dio_submit_alloc()->dio_submit_pad and
>>> write_page (for indexes).
>>> So in case of grow_dev we have the following sequence:
>>>
>>> E_RELOC_DATA_READ:
>>>   ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->sta

Re: [Devel] [vzlin-dev] [PATCH rh7] ploop: fix counting bio_qlen

2016-06-17 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> The commit ec1eeb868 (May 22 2015) ported "separate queue for discard bio"
> patch from RHEL6-based kernel incorrectly. Original patch stated clearly
> that if we want to decrement bio_discard_qlen, bio_qlen must not change:
>
> @@ -500,7 +502,7 @@ ploop_bio_queue(struct ploop_device * pl
> (err = ploop_discard_add_bio(plo->fbd, bio))) {
> BIO_ENDIO(bio, err);
> list_add(&preq->list, &plo->free_list);
> -   plo->bio_qlen--;
> +   plo->bio_discard_qlen--;
> plo->bio_total--;
> return;
> }
>
> but that port did the opposite:
>
> @@ -521,6 +523,7 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * 
> bio,
> BIO_ENDIO(plo->queue, bio, err);
> list_add(&preq->list, &plo->free_list);
> plo->bio_qlen--;
> +   plo->bio_discard_qlen--;
> plo->bio_total--;
> return;
> }
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c |1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index db55be3..e1fbfcf 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -523,7 +523,6 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * 
> bio,
>   }
>   BIO_ENDIO(plo->queue, bio, err);
>   list_add(&preq->list, &plo->free_list);
> - plo->bio_qlen--;
>   plo->bio_discard_qlen--;
>   plo->bio_total--;
>   return;
ACK




Re: [Devel] [vzlin-dev] [PATCH rh7] ploop: io_kaio: fix silly bug in kaio_complete_io_state()

2016-06-17 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> It's useless to check for preq->req_rw & REQ_FUA after:
> preq->req_rw &= ~REQ_FUA;
ACK :) But in order to make it clear for others, let's post the original code
here!
...
	preq->req_rw &= ~REQ_FUA;

	/* Convert requested fua to fsync */
	if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
	    test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state))
		post_fsync = 1;

	if (!post_fsync &&
	    !ploop_req_delay_fua_possible(preq->req_rw, preq) &&
	    (preq->req_rw & REQ_FUA))	/* dead: REQ_FUA was cleared above */
		post_fsync = 1;

	preq->req_rw &= ~REQ_FUA;
...


>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/io_kaio.c |2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
> index 79aa9af..de26319 100644
> --- a/drivers/block/ploop/io_kaio.c
> +++ b/drivers/block/ploop/io_kaio.c
> @@ -71,8 +71,6 @@ static void kaio_complete_io_state(struct ploop_request * 
> preq)
>   return;
>   }
>  
> - preq->req_rw &= ~REQ_FUA;
> -
>   /* Convert requested fua to fsync */
>   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
>   test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state))




Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling

2016-06-19 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> On 06/16/2016 09:30 AM, Dmitry Monakhov wrote:
>> Dmitry Monakhov  writes:
>>
>>> Maxim Patlasov  writes:
>>>
>>>> Dima,
>>>>
>>>> I agree that the ploop barrier code is broken in many ways, but I don't
>>>> think the patch actually fixes it. I hope you would agree that
>>>> completion of REQ_FUA guarantees only the landing of that particular bio on
>>>> the disk; it says nothing about flushing previously submitted (and
>>>> completed) bio-s, and it is also possible that a power outage may catch us
>>>> when this REQ_FUA has already landed on the disk but previous bio-s have
>>>> not yet.
>>> Actually it does (but implicitly): Linux handles FUA as FLUSH,W,FLUSH.
>>> So yes, it would be more correct to tag WBI with FLUSH_FUA.
>>>> Hence, for RELOC_{A|S} requests we actually need something like that:
>>>>
>>>>RELOC_S: R1, W2, FLUSH:WB, WBI:FUA
>>>>RELOC_A: R1, W2, FLUSH:WB, WBI:FUA, W1:NULLIFY:FUA
>>>>
>>>> (i.e. we do need to flush all previously submitted data before starting
>>>> to update BAT on disk)
>>>>
>>> Correct sequence:
>>> RELOC_S: R1, W2, WBI:FLUSH_FUA
>>> RELOC_A: R1, W2, WBI:FLUSH_FUA, W1:NULLIFY:FUA
>>>
>>>> not simply:
>>>>
>>>>> RELOC_S: R1, W2, WBI:FUA
>>>>> RELOC_A: R1, W2, WBI:FUA, W1:NULLIFY:FUA
>>>> Also, the patch makes the meaning of PLOOP_REQ_FORCE_FUA and
>>>> PLOOP_REQ_FORCE_FLUSH even more obscure than it used to be. I think we
>>>> could remove them completely (along with that optimization delaying
>>>> incoming FUA) and re-implement all this stuff from scratch:
>>>>
>>>> 1) The final "NULLIFY:FUA" is a piece of cake -- it's enough to set
>>>> REQ_FUA in preq->req_rw before calling ->submit(preq)
>>>>
>>>> 2) For "FLUSH:WB, WBI:FUA" it is actually enough to send bio updating
>>>> BAT on disk as REQ_FLUSH|REQ_FUA -- we can specify it explicitly for
>>>> RELOC_A|S in ploop_index_update and map_wb_complete
>>>>
>>>> 3) For that optimization delaying incoming FUA (what we do now if
>>>> ploop_req_delay_fua_possible() returns true) we could introduce new
>>>> ad-hoc PLOOP_IO_FLUSH_DELAYED enforcing REQ_FLUSH in ploop_index_update
>>>> and map_wb_complete (the same thing as 2) above). And, yes, let's
>>>> WARN_ON if we somehow missed its processing.
>>> Yes. This was one of my ideas.
>>> 1) FORCE_FLUSH and FORCE_FUA are redundant states which simply mirror
>>> RELOC_{A,S} semantics. Let's get rid of that crap and simply introduce
>>> PLOOP_IO_FLUSH_DELAYED.
>>> 2) Fix ->write_page to handle flush as it does with fua.
>>>> The only complication I foresee is about how to teach kaio to pre-flush
>>>> in kaio_write_page -- it's doable, but involves kaio_resubmit that's
>>>> already pretty convoluted.
>>>>
>>> Yes. kaio_submit is correct, but kaio_write_page does not care about REQ_FLUSH.
>> Crap. Currently kaio can handle fsync only via kaio_queue_fsync_req,
>> which is async and not suitable for page_write.
>
> I think it's doable to process page_write via kaio_fsync_thread, but 
> it's tricky.
>
>> Max, let's make an agreement about terminology.
>> The reason I wrote this is that Linux internally interprets FUA as
>> preflush,write,postflush, which is wrong from an academic point of view, but
>> it is the world we live in with Linux.
>
> Are you sure that this  (FUA == preflush,write,postflush) is universally 
> true (i.e. no exceptions)? What about bio-based block-device drivers?
>
>> This is the reason I read the code
>> differently from the way it was designed.
>> Let's state that ploop is an ideal world where:
>> FLUSH ==> preflush
>> FUA   ==> WRITE,postflush
>
> In an ideal world FUA is not obliged to be handled by a postflush: it's enough 
> to guarantee that *this* particular request went to the platter; other 
> requests may remain not-flushed-yet. 
> Documentation/block/writeback_cache_control.txt is absolutely clear 
> about it:
>
>> The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted 
>> from the
>> filesystem and will make sure that I/O completion for this request is only
>> signaled after the data has been committed to non-volatile storage.
>> ...
>> If the FUA bit is not natively sup

[Devel] [PATCH 2/3] ploop: deadcode cleanup

2016-06-20 Thread Dmitry Monakhov
The (rw & REQ_FUA) branch is impossible because REQ_FUA was cleared a line above.
The logic was moved to ploop_req_delay_fua_possible() a long time ago.

Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/io_direct.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 58d7580..a6d83fe 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -108,15 +108,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * 
preq,
rw &= ~(REQ_FLUSH | REQ_FUA);
 
 
-   /* In case of eng_state != COMPLETE, we'll do FUA in
-* ploop_index_update(). Otherwise, we should mark
-* last bio as FUA here. */
-   if (rw & REQ_FUA) {
-   rw &= ~REQ_FUA;
-   if (preq->eng_state == PLOOP_E_COMPLETE)
-   postfua = 1;
-   }
-
bio_list_init(&bl);
 
if (iblk == PLOOP_ZERO_INDEX)
-- 
1.8.3.1



[Devel] [PATCH 1/3] ploop: skip redundant fsync for REQ_FUA in post_submit v2

2016-06-20 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/io_direct.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index b844a80..58d7580 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -517,27 +517,31 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
struct ploop_device *plo = preq->plo;
sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
loff_t clu_siz = 1 << (preq->plo->cluster_log + 9);
+   int force_sync = preq->req_rw & REQ_FUA;
int err;
 
file_start_write(io->files.file);
 
-   /* Here io->io_count is even ... */
-   spin_lock_irq(&plo->lock);
-   io->io_count++;
-   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
-   spin_unlock_irq(&plo->lock);
-
+   if (!force_sync) {
+   /* Here io->io_count is even ... */
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
+   spin_unlock_irq(&plo->lock);
+   }
err = io->files.file->f_op->fallocate(io->files.file,
  FALLOC_FL_CONVERT_UNWRITTEN,
  (loff_t)sec << 9, clu_siz);
 
/* highly unlikely case: FUA coming to a block not provisioned yet */
-   if (!err && (preq->req_rw & REQ_FUA))
+   if (!err && force_sync)
err = io->ops->sync(io);
 
-   spin_lock_irq(&plo->lock);
-   io->io_count++;
-   spin_unlock_irq(&plo->lock);
+   if (!force_sync) {
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   spin_unlock_irq(&plo->lock);
+   }
/* and here io->io_count is even (+2) again. */
 
file_end_write(io->files.file);
-- 
1.8.3.1



[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling v2

2016-06-20 Thread Dmitry Monakhov
The barrier code is broken in many ways:
currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly,
but requests can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
  ->delta->allocate
    ->io->submit_alloc: dio_submit_alloc
      ->dio_submit_pad
E_DATA_WBI: data written, time to update index
  ->delta->allocate_complete: ploop_index_update
    ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
    ->write_page
    ->ploop_map_wb_complete
      ->ploop_wb_complete_post_process
        ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:
  ->submit()

BUG#2: currently kaio write_page silently ignores REQ_FUA
BUG#3: in io_direct:dio_submit, if fua_delay is not possible we MUST tag all
bios with REQ_FUA, not just the latest one.
This patch unifies barrier handling as follows:
- Get rid of FORCE_{FLUSH,FUA}
- Introduce DELAYED_FLUSH; currently it is supported only by io_direct
- Fix up fua handling for dio_submit

This makes the reloc sequence optimal:
io_direct
RELOC_S: R1, W2, WBI:FLUSH|FUA
RELOC_A: R1, W2, WBI:FLUSH|FUA, W1:NULLIFY|FUA
io_kaio
RELOC_S: R1, W2:FUA, WBI:FUA
RELOC_A: R1, W2:FUA, WBI:FUA, W1:NULLIFY|FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/dev.c   |  8 +---
 drivers/block/ploop/io_direct.c | 29 +-
 drivers/block/ploop/io_kaio.c   | 17 ++--
 drivers/block/ploop/map.c   | 45 ++---
 include/linux/ploop/ploop.h |  8 
 5 files changed, 54 insertions(+), 53 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..fbc5f2f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,9 @@ static void ploop_complete_request(struct ploop_request * 
preq)
 
__TRACE("Z %p %u\n", preq, preq->req_cluster);
 
+   if (!preq->error) {
+   WARN_ON(test_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state));
+   }
while (preq->bl.head) {
struct bio * bio = preq->bl.head;
preq->bl.head = bio->bi_next;
@@ -2530,9 +2533,8 @@ restart:
top_delta = ploop_top_delta(plo);
sbl.head = sbl.tail = preq->aux_bio;
 
-   /* Relocated data write required sync before BAT updatee */
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+   /* Relocated data write required sync before BAT updatee
+* this will happen inside index_update */
if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
plo->st.bio_out++;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index a6d83fe..d7ecd4a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -90,21 +90,12 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
trace_submit(preq);
 
preflush = !!(rw & REQ_FLUSH);
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state))
-   preflush = 1;
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
-   postfua = 1;
-
-   if (!postfua && ploop_req_delay_fua_possible(rw, preq)) {
-
+   postfua = !!(rw & REQ_FUA);
+   if (ploop_req_delay_fua_possible(rw, preq)) {
/* Mark req that delayed flush required */
-   set_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
-   } else if (rw & REQ_FUA) {
-   postfua = 1;
+   set_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state);
+   postfua = 0;
}
-
rw &= ~(REQ_FLUSH | REQ_FUA);
 
 
@@ -238,14 +229,15 @@ flush_bio:
rw2 |= REQ_FLUSH;
preflush = 0;
}
-   if (unlikely(postfua && !bl.head))
-   rw2 |= (REQ_FUA | ((bio_num) ? REQ_FLUSH : 0));
+   /* Very unlikely, but correct.
+* TODO: Optimize postfua via DELAY_FLUSH for any req state */
+   if (unlikely(!postfua))
+   rw2 |= REQ_FUA;
 
ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
submit_bio(rw2, b);
bio_num++;
}
-
ploop_complete_io_request(preq);
return;
 
@@ -1520,15 +1512,14 @@ dio_read_page(struct ploop_io * io, struct 
ploop_request * preq,
 
 static void
 dio_write_page(struct ploo

Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling v2

2016-06-21 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Dima,
>
> I agree with the general approach of this patch, but there are some 
> (easy-to-fix) issues. See, please, inline comments below...
>
> On 06/20/2016 11:58 AM, Dmitry Monakhov wrote:
>> The barrier code is broken in many ways:
>> currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly,
>> but requests can also go through ->dio_submit_alloc()->dio_submit_pad and
>> write_page (for indexes).
>> So in case of grow_dev we have the following sequence:
>>
>> E_RELOC_DATA_READ:
>>   ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>->delta->allocate
>>   ->io->submit_alloc: dio_submit_alloc
>> ->dio_submit_pad
>> E_DATA_WBI : data written, time to update index
>>->delta->allocate_complete:ploop_index_update
>>  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>>  ->write_page
>>  ->ploop_map_wb_complete
>>->ploop_wb_complete_post_process
>>  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> E_RELOC_NULLIFY:
>>
>> ->submit()
>>
>> BUG#2: currently kaio write_page silently ignores REQ_FUA
>
> Sorry, I can't agree, it actually does not ignore:
I mistyped. I meant to say REQ_FLUSH.
>
>> static void
>> kaio_write_page(struct ploop_io * io, struct ploop_request * preq,
>>  struct page * page, sector_t sec, int fua)
>> {
>> /* No FUA in kaio, convert it to fsync */
>> if (fua)
>> set_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state);
>
>
>> BUG#3: in io_direct:dio_submit, if fua_delay is not possible we MUST tag all
>> bios with REQ_FUA,
>> not just the latest one.
>
> No need to tag *all*. See inline comments below.
>
>> This patch unifies barrier handling as follows:
>> - Get rid of FORCE_{FLUSH,FUA}
>> - Introduce DELAYED_FLUSH; currently it is supported only by io_direct
>> - Fix up fua handling for dio_submit
>>
>> This makes the reloc sequence optimal:
>> io_direct
>> RELOC_S: R1, W2, WBI:FLUSH|FUA
>> RELOC_A: R1, W2, WBI:FLUSH|FUA, W1:NULLIFY|FUA
>> io_kaio
>> RELOC_S: R1, W2:FUA, WBI:FUA
>> RELOC_A: R1, W2:FUA, WBI:FUA, W1:NULLIFY|FUA
>>
>> https://jira.sw.ru/browse/PSBM-47107
>> Signed-off-by: Dmitry Monakhov 
>> ---
>>   drivers/block/ploop/dev.c   |  8 +---
>>   drivers/block/ploop/io_direct.c | 29 +-
>>   drivers/block/ploop/io_kaio.c   | 17 ++--
>>   drivers/block/ploop/map.c   | 45 
>> ++---
>>   include/linux/ploop/ploop.h |  8 
>>   5 files changed, 54 insertions(+), 53 deletions(-)
>>
>> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
>> index 96f7850..fbc5f2f 100644
>> --- a/drivers/block/ploop/dev.c
>> +++ b/drivers/block/ploop/dev.c
>> @@ -1224,6 +1224,9 @@ static void ploop_complete_request(struct 
>> ploop_request * preq)
>>   
>>  __TRACE("Z %p %u\n", preq, preq->req_cluster);
>>   
>> +if (!preq->error) {
>> +WARN_ON(test_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state));
>> +}
>>  while (preq->bl.head) {
>>  struct bio * bio = preq->bl.head;
>>  preq->bl.head = bio->bi_next;
>> @@ -2530,9 +2533,8 @@ restart:
>>  top_delta = ploop_top_delta(plo);
>>  sbl.head = sbl.tail = preq->aux_bio;
>>   
>> -/* Relocated data write required sync before BAT updatee */
>> -set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
>> -
>> +/* Relocated data write required sync before BAT updatee
>> + * this will happen inside index_update */
>>  if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
>>  preq->eng_state = PLOOP_E_DATA_WBI;
>>  plo->st.bio_out++;
>> diff --git a/drivers/block/ploop/io_direct.c 
>> b/drivers/block/ploop/io_direct.c
>> index a6d83fe..d7ecd4a 100644
>> --- a/drivers/block/ploop/io_direct.c
>> +++ b/drivers/block/ploop/io_direct.c
>> @@ -90,21 +90,12 @@ dio_submit(struct ploop_io *io, struct ploop_request * 
>> preq,
>>  trace_submit(preq);
>>   
>>  preflush = !!(rw & REQ_FLUSH);
>> -
>> -if (test_and_clear_bit(PLOOP_REQ_F

[Devel] [PATCH 2/3] ploop: deadcode cleanup

2016-06-21 Thread Dmitry Monakhov
The (rw & REQ_FUA) branch is impossible because REQ_FUA was cleared a line above.
The logic was moved to ploop_req_delay_fua_possible() a long time ago.

Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/io_direct.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 58d7580..a6d83fe 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -108,15 +108,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * 
preq,
rw &= ~(REQ_FLUSH | REQ_FUA);
 
 
-   /* In case of eng_state != COMPLETE, we'll do FUA in
-* ploop_index_update(). Otherwise, we should mark
-* last bio as FUA here. */
-   if (rw & REQ_FUA) {
-   rw &= ~REQ_FUA;
-   if (preq->eng_state == PLOOP_E_COMPLETE)
-   postfua = 1;
-   }
-
bio_list_init(&bl);
 
if (iblk == PLOOP_ZERO_INDEX)
-- 
1.8.3.1



[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling v3

2016-06-21 Thread Dmitry Monakhov
The barrier code is broken in many ways:
currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly,
but requests can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
  ->delta->allocate
    ->io->submit_alloc: dio_submit_alloc
      ->dio_submit_pad
E_DATA_WBI: data written, time to update index
  ->delta->allocate_complete: ploop_index_update
    ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
    ->write_page
    ->ploop_map_wb_complete
      ->ploop_wb_complete_post_process
        ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:
  ->submit()

BUG#2: currently kaio write_page silently ignores REQ_FLUSH
BUG#3: in io_direct:dio_submit, if fua_delay is not possible we MUST tag all
bios with REQ_FUA, not just the latest one.
This patch unifies barrier handling as follows:
- Get rid of FORCE_{FLUSH,FUA}
- Introduce DELAYED_FLUSH
- Fix fua handling for dio_submit
- BUG_ON for REQ_FLUSH in kaio_page_write

This makes the reloc sequence optimal:
io_direct
RELOC_S: R1, W2, WBI:FLUSH|FUA
RELOC_A: R1, W2, WBI:FLUSH|FUA, W1:NULLIFY|FUA
io_kaio
RELOC_S: R1, W2:FUA, WBI:FUA
RELOC_A: R1, W2:FUA, WBI:FUA, W1:NULLIFY|FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/dev.c   |  8 +---
 drivers/block/ploop/io_direct.c | 30 ++-
 drivers/block/ploop/io_kaio.c   | 23 +
 drivers/block/ploop/map.c   | 45 ++---
 include/linux/ploop/ploop.h | 19 +
 5 files changed, 60 insertions(+), 65 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..fbc5f2f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,9 @@ static void ploop_complete_request(struct ploop_request * 
preq)
 
__TRACE("Z %p %u\n", preq, preq->req_cluster);
 
+   if (!preq->error) {
+   WARN_ON(test_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state));
+   }
while (preq->bl.head) {
struct bio * bio = preq->bl.head;
preq->bl.head = bio->bi_next;
@@ -2530,9 +2533,8 @@ restart:
top_delta = ploop_top_delta(plo);
sbl.head = sbl.tail = preq->aux_bio;
 
-   /* Relocated data write required sync before BAT updatee */
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+   /* Relocated data write required sync before BAT updatee
+* this will happen inside index_update */
if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
plo->st.bio_out++;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index a6d83fe..303eb70 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -83,28 +83,19 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
int err;
struct bio_list_walk bw;
int preflush;
-   int postfua = 0;
+   int fua = 0;
int write = !!(rw & REQ_WRITE);
int bio_num;
 
trace_submit(preq);
 
preflush = !!(rw & REQ_FLUSH);
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state))
-   preflush = 1;
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
-   postfua = 1;
-
-   if (!postfua && ploop_req_delay_fua_possible(rw, preq)) {
-
+   fua = !!(rw & REQ_FUA);
+   if (fua && ploop_req_delay_fua_possible(rw, preq)) {
/* Mark req that delayed flush required */
-   set_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
-   } else if (rw & REQ_FUA) {
-   postfua = 1;
+   set_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state);
+   fua = 0;
}
-
rw &= ~(REQ_FLUSH | REQ_FUA);
 
 
@@ -238,8 +229,10 @@ flush_bio:
rw2 |= REQ_FLUSH;
preflush = 0;
}
-   if (unlikely(postfua && !bl.head))
-   rw2 |= (REQ_FUA | ((bio_num) ? REQ_FLUSH : 0));
+   /* Very unlikely, but correct.
+* TODO: Optimize postfua via DELAY_FLUSH for any req state */
+   if (unlikely(fua))
+   rw2 |= REQ_FUA;
 
ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
submit_bio(rw2, b);
@@ -1520,15 +1513,14

[Devel] [PATCH 1/3] ploop: skip redundant fsync for REQ_FUA in post_submit

2016-06-21 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/io_direct.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index b844a80..58d7580 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -517,27 +517,31 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
struct ploop_device *plo = preq->plo;
sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
loff_t clu_siz = 1 << (preq->plo->cluster_log + 9);
+   int force_sync = preq->req_rw & REQ_FUA;
int err;
 
file_start_write(io->files.file);
 
-   /* Here io->io_count is even ... */
-   spin_lock_irq(&plo->lock);
-   io->io_count++;
-   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
-   spin_unlock_irq(&plo->lock);
-
+   if (!force_sync) {
+   /* Here io->io_count is even ... */
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
+   spin_unlock_irq(&plo->lock);
+   }
err = io->files.file->f_op->fallocate(io->files.file,
  FALLOC_FL_CONVERT_UNWRITTEN,
  (loff_t)sec << 9, clu_siz);
 
/* highly unlikely case: FUA coming to a block not provisioned yet */
-   if (!err && (preq->req_rw & REQ_FUA))
+   if (!err && force_sync)
err = io->ops->sync(io);
 
-   spin_lock_irq(&plo->lock);
-   io->io_count++;
-   spin_unlock_irq(&plo->lock);
+   if (!force_sync) {
+   spin_lock_irq(&plo->lock);
+   io->io_count++;
+   spin_unlock_irq(&plo->lock);
+   }
/* and here io->io_count is even (+2) again. */
 
file_end_write(io->files.file);
-- 
1.8.3.1



Re: [Devel] [PATCH rh7] ploop: fix barriers for ordinary requests

2016-06-22 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> The way io_direct.c handles FLUSH|FUA: b1:FLUSH,b2,b3,b4,b5:FLUSH|FUA
> is completely wrong: to make sure that b1:FLUSH took effect we have to
> wait for its completion. Similarly, even if we're sure that FUA will be
> processed as post-FLUSH (also dubious!), we have to wait for the completion
> of b1..b4 to make sure that the flush will cover them.
>
> The patch fixes all these issues pretty simply: let's mark outgoing
> bio-s with FLUSH|FUA based on those flags in the *corresponding* incoming
> bio-s.
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c   |1 -
>  drivers/block/ploop/io_direct.c |   47 
> ---
>  2 files changed, 15 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index 2ef1449..6b5702f 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -498,7 +498,6 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * 
> bio,
>   preq->req_sector = bio->bi_sector;
>   preq->req_size = bio->bi_size >> 9;
>   preq->req_rw = bio->bi_rw;
> - bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
Wow. I can't even imagine that we clear barrier flags from the original bios.
>   preq->eng_state = PLOOP_E_ENTRY;
>   preq->state = 0;
>   preq->error = 0;
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 6ef9cd8..84c9a48 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -92,7 +92,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
>   int preflush;
>   int postfua = 0;
>   int write = !!(rw & REQ_WRITE);
> - int bio_num;
>  
>   trace_submit(preq);
>  
> @@ -233,13 +232,13 @@ flush_bio:
>   goto flush_bio;
>   }
>  
> + bio->bi_rw |= bw.cur->bi_rw & (REQ_FLUSH | REQ_FUA);
>   bw.bv_off += copy;
>   size -= copy >> 9;
>   sec += copy >> 9;
>   }
>   ploop_extent_put(em);
>  
> - bio_num = 0;
>   while (bl.head) {
>   struct bio * b = bl.head;
>   unsigned long rw2 = rw;
> @@ -255,11 +254,10 @@ flush_bio:
>   preflush = 0;
>   }
>   if (unlikely(postfua && !bl.head))
> - rw2 |= (REQ_FUA | ((bio_num) ? REQ_FLUSH : 0));
> + rw2 |= REQ_FUA;
>  
>   ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
> - submit_bio(rw2, b);
> - bio_num++;
> + submit_bio(rw2 | b->bi_rw, b);
>   }
>  
>   ploop_complete_io_request(preq);
> @@ -567,7 +565,6 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request 
> * preq,
>   sector_t sec, end_sec, nsec, start, end;
>   struct bio_list_walk bw;
>   int err;
> - int preflush = !!(preq->req_rw & REQ_FLUSH);
>  
>   bio_list_init(&bl);
>  
> @@ -598,14 +595,17 @@ dio_submit_pad(struct ploop_io *io, struct 
> ploop_request * preq,
>   while (sec < end_sec) {
>   struct page * page;
>   unsigned int poff, plen;
> + bool zero_page;
>  
>   if (sec < start) {
> + zero_page = true;
>   page = ZERO_PAGE(0);
>   poff = 0;
>   plen = start - sec;
>   if (plen > (PAGE_SIZE>>9))
>   plen = (PAGE_SIZE>>9);
>   } else if (sec >= end) {
> + zero_page = true;
>   page = ZERO_PAGE(0);
>   poff = 0;
>   plen = end_sec - sec;
> @@ -614,6 +614,7 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request 
> * preq,
>   } else {
>   /* sec >= start && sec < end */
>   struct bio_vec * bv;
> + zero_page = false;
>  
>   if (sec == start) {
>   bw.cur = sbl->head;
> @@ -672,6 +673,10 @@ flush_bio:
>   goto flush_bio;
>   }
>  
> + /* Handle FLUSH here, dio_post_submit will handle FUA */
> + if (!zero_page)
> + bio->bi_rw |= bw.cur->bi_rw & REQ_FLUSH;
> +
>   bw.bv_off += (plen<<9);
>   BUG_ON(plen == 0);
>   sec += plen;
> @@ -688,13 +693,9 @@ flush_bio:
>   b->bi_private = preq;
>   b->bi_end_io = dio_endio_async;
>  
> - rw = sbl->head->bi_rw | WRITE;
> - if (unlikely(preflush)) {
> - rw |= REQ_FLUSH;
> - preflush = 0;
> - }
> + rw = preq->req_rw & ~(REQ_FLUSH | REQ_FUA);
>   ploop_acc_ff_out(preq->plo, rw | b->bi_rw);
> - submit_bio(rw, b);
> + submit_bio(rw | b->bi_rw, b);
This is a useless statement: submit_bio() already ORs its rw argument into
b->bi_rw.

Re: [Devel] [PATCH rh7] ploop: fix barriers for ordinary requests

2016-06-22 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> The way io_direct.c handles FLUSH|FUA: b1:FLUSH,b2,b3,b4,b5:FLUSH|FUA
> is completely wrong: to make sure that b1:FLUSH took effect we have to
> wait for its completion. Similarly, even if we're sure that FUA will be
> processed as post-FLUSH (also dubious!), we have to wait for the completion
> of b1..b4 to make sure that the flush will cover them.
>
> The patch fixes all these issues pretty simply: let's mark outgoing
> bio-s with FLUSH|FUA based on those flags in the *corresponding* incoming
> bio-s.
One more thing; please see below.
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c   |1 -
>  drivers/block/ploop/io_direct.c |   47 
> ---
>  2 files changed, 15 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index 2ef1449..6b5702f 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -498,7 +498,6 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * 
> bio,
>   preq->req_sector = bio->bi_sector;
>   preq->req_size = bio->bi_size >> 9;
>   preq->req_rw = bio->bi_rw;
> - bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
>   preq->eng_state = PLOOP_E_ENTRY;
>   preq->state = 0;
>   preq->error = 0;
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 6ef9cd8..84c9a48 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -92,7 +92,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
>   int preflush;
>   int postfua = 0;
>   int write = !!(rw & REQ_WRITE);
> - int bio_num;
>  
>   trace_submit(preq);
>  
> @@ -233,13 +232,13 @@ flush_bio:
>   goto flush_bio;
>   }
>  
> + bio->bi_rw |= bw.cur->bi_rw & (REQ_FLUSH | REQ_FUA);
>   bw.bv_off += copy;
>   size -= copy >> 9;
>   sec += copy >> 9;
>   }
>   ploop_extent_put(em);
>  
> - bio_num = 0;
>   while (bl.head) {
>   struct bio * b = bl.head;
>   unsigned long rw2 = rw;
> @@ -255,11 +254,10 @@ flush_bio:
>   preflush = 0;
>   }
>   if (unlikely(postfua && !bl.head))
> - rw2 |= (REQ_FUA | ((bio_num) ? REQ_FLUSH : 0));
> + rw2 |= REQ_FUA;
>  
>   ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
> - submit_bio(rw2, b);
> - bio_num++;
> + submit_bio(rw2 | b->bi_rw, b);
>   }
>  
>   ploop_complete_io_request(preq);
> @@ -567,7 +565,6 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request 
> * preq,
>   sector_t sec, end_sec, nsec, start, end;
>   struct bio_list_walk bw;
>   int err;
> - int preflush = !!(preq->req_rw & REQ_FLUSH);
>  
>   bio_list_init(&bl);
>  
> @@ -598,14 +595,17 @@ dio_submit_pad(struct ploop_io *io, struct 
> ploop_request * preq,
>   while (sec < end_sec) {
>   struct page * page;
>   unsigned int poff, plen;
> + bool zero_page;
>  
>   if (sec < start) {
> + zero_page = true;
>   page = ZERO_PAGE(0);
>   poff = 0;
>   plen = start - sec;
>   if (plen > (PAGE_SIZE>>9))
>   plen = (PAGE_SIZE>>9);
>   } else if (sec >= end) {
> + zero_page = true;
>   page = ZERO_PAGE(0);
>   poff = 0;
>   plen = end_sec - sec;
> @@ -614,6 +614,7 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request 
> * preq,
>   } else {
>   /* sec >= start && sec < end */
>   struct bio_vec * bv;
> + zero_page = false;
>  
>   if (sec == start) {
>   bw.cur = sbl->head;
> @@ -672,6 +673,10 @@ flush_bio:
>   goto flush_bio;
>   }
>  
> + /* Handle FLUSH here, dio_post_submit will handle FUA */

dio_submit_pad may be called w/o the post_submit flag from here:
->dio_submit_alloc
  if (io->files.em_tree->_get_extent) {
    ->dio_fallocate
    ->dio_submit_pad
    ...
  }
> + if (!zero_page)
> + bio->bi_rw |= bw.cur->bi_rw & REQ_FLUSH;
> +
>   bw.bv_off += (plen<<9);
>   BUG_ON(plen == 0);
>   sec += plen;
> @@ -688,13 +693,9 @@ flush_bio:
>   b->bi_private = preq;
>   b->bi_end_io = dio_endio_async;
>  
> - rw = sbl->head->bi_rw | WRITE;
> - if (unlikely(preflush)) {
> - rw |= REQ_FLUSH;
> - preflush = 0;
> - }
> + rw = preq->req_rw & ~(REQ_FLUSH | REQ_FUA);
>   ploop_acc_ff_out(preq->plo, rw | b->bi_

[Devel] [RH7 PATCH 0/6] RFC ploop: Barrier fix patch set v3

2016-06-23 Thread Dmitry Monakhov

Here is the 3rd version of the barrier fix patches, based on recent fixes.
This is an RFC version: I do not have time to test it before tomorrow;
Max, please review it briefly and tell me your opinion about the general idea.
The basic idea is to use the post_submit state to issue an empty FLUSH barrier
in order to complete FUA requests. This allows us to unify all engines
(direct and kaio).

This makes FUA processing optimal:
SUBMIT:FUA   :W1{b1,b2,b3,b4..},WAIT,post_submit:FLUSH
SUBMIT_ALLOC:FUA :W1{b1,b2,b3,b4..},WAIT,post_submit:FLUSH, WBI:FUA
RELOC_S: R1, W2,WAIT,post_submit:FLUSH, WBI:FUA
RELOC_A: R1, W2,WAIT,post_submit:FLUSH, WBI:FUA, W1:NULLIFY,WAIT,post_submit:FLUSH


#POST_SUBMIT CHANGES:
ploop-generalize-post_submit-stage.patch
ploop-generalize-issue_flush.patch
ploop-add-delayed-flush-support.patch
ploop-io_kaio-support-PLOOP_REQ_DEL_FLUSH.patch
#RELOC_XXX FIXES
ploop-fixup-barrier-handling-during-relocation.patch
patch-ploop_state_debugging.patch.patch




[Devel] [RH7 PATCH 2/6] ploop: generalize issue_flush

2016-06-23 Thread Dmitry Monakhov
Currently io->ops->issue_flush is called from only a single place,
but it has the potential to be generic. The patch does not change the actual
logic, but allows ->issue_flush to be called from various places.

Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/dev.c   | 1 +
 drivers/block/ploop/io_direct.c | 1 -
 drivers/block/ploop/io_kaio.c   | 1 -
 3 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index e8b0304..95e3067 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1989,6 +1989,7 @@ ploop_entry_request(struct ploop_request * preq)
if (preq->req_size == 0) {
if (preq->req_rw & REQ_FLUSH &&
!test_bit(PLOOP_REQ_FSYNC_DONE, &preq->state)) {
+   preq->eng_state = PLOOP_E_COMPLETE;
if (top_io->ops->issue_flush) {
top_io->ops->issue_flush(top_io, preq);
return;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index ec905b4..195d318 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -1836,7 +1836,6 @@ static void dio_issue_flush(struct ploop_io * io, struct 
ploop_request *preq)
bio->bi_private = preq;
 
atomic_inc(&preq->io_count);
-   preq->eng_state = PLOOP_E_COMPLETE;
ploop_acc_ff_out(io->plo, preq->req_rw | bio->bi_rw);
submit_bio(preq->req_rw, bio);
ploop_complete_io_request(preq);
diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index de26319..bee2cee 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -951,7 +951,6 @@ static void kaio_issue_flush(struct ploop_io * io, struct 
ploop_request *preq)
 {
struct ploop_delta *delta = container_of(io, struct ploop_delta, io);
 
-   preq->eng_state = PLOOP_E_COMPLETE;
preq->req_rw &= ~REQ_FLUSH;
 
spin_lock_irq(&io->plo->lock);
-- 
1.8.3.1



[Devel] [RH7 PATCH 6/6] patch ploop_state_debugging.patch

2016-06-23 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/dev.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 090cd2d..9bf8592 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1232,6 +1232,12 @@ static void ploop_complete_request(struct ploop_request 
* preq)
}
preq->bl.tail = NULL;
 
+   if (!preq->error) {
+   unsigned long state = READ_ONCE(preq->state);
+   WARN_ON(state & (PLOOP_REQ_POST_SUBMIT_FL|
+PLOOP_REQ_DEL_CONV_FL |
+PLOOP_REQ_DEL_FLUSH_FL ));
+   }
if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
if (preq->error)
-- 
1.8.3.1



[Devel] [RH7 PATCH 1/6] ploop: generalize post_submit stage

2016-06-23 Thread Dmitry Monakhov
Currently post_submit() is used only for convert_unwritten_extents.
But post_submit() is a good transition point: all submitted data has been
completed by the lower layer, and a new state is about to be processed.
It is an ideal point where we can perform transition actions, for example:
 io_direct: convert unwritten extents
 io_direct: issue an empty barrier bio in order to simulate a postflush
 io_direct,io_kaio: queue to the fsync queue
 etc.

This patch does not change anything, but prepares post_submit for
more logic which will be added later.
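
For orientation, once the later patches in this series (3/6 and 4/6) land on
top of this one, the io_direct dispatcher assembles to the following (taken
from the diffs in this series):

static int
dio_post_submit(struct ploop_io *io, struct ploop_request *preq)
{
	/* All submitted bios have already completed when we get here. */
	if (test_and_clear_bit(PLOOP_REQ_DEL_CONV, &preq->state))
		dio_convert_extent(io, preq);	/* unwritten -> written */

	if (test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state)) {
		io->ops->issue_flush(io, preq);	/* empty barrier bio */
		return 1;	/* state machine resumes on flush completion */
	}
	return 0;
}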

Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/dev.c   | 10 ++
 drivers/block/ploop/io_direct.c | 15 ---
 include/linux/ploop/ploop.h | 12 +++-
 3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index e405232..e8b0304 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2351,10 +2351,12 @@ static void ploop_req_state_process(struct 
ploop_request * preq)
preq->prealloc_size = 0; /* only for sanity */
}
 
-   if (test_bit(PLOOP_REQ_POST_SUBMIT, &preq->state)) {
-   preq->eng_io->ops->post_submit(preq->eng_io, preq);
-   clear_bit(PLOOP_REQ_POST_SUBMIT, &preq->state);
+   if (test_and_clear_bit(PLOOP_REQ_POST_SUBMIT, &preq->state)) {
+   struct ploop_io *io = preq->eng_io;
+
preq->eng_io = NULL;
+   if (preq->eng_io->ops->post_submit(io, preq))
+   goto out;
}
 
 restart:
@@ -2633,7 +2635,7 @@ restart:
default:
BUG();
}
-
+out:
if (release_ioc) {
struct io_context * ioc = current->io_context;
current->io_context = saved_ioc;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index f1812fe..ec905b4 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -416,8 +416,8 @@ try_again:
}
 
preq->iblock = iblk;
-   preq->eng_io = io;
-   set_bit(PLOOP_REQ_POST_SUBMIT, &preq->state);
+   set_bit(PLOOP_REQ_DEL_CONV, &preq->state);
+   ploop_add_post_submit(io, preq);
dio_submit_pad(io, preq, sbl, size, em);
err = 0;
goto end_write;
@@ -501,7 +501,7 @@ end_write:
 }
 
 static void
-dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
+dio_convert_extent(struct ploop_io *io, struct ploop_request * preq)
 {
struct ploop_device *plo = preq->plo;
sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
@@ -540,6 +540,15 @@ dio_post_submit(struct ploop_io *io, struct ploop_request 
* preq)
}
 }
 
+static int
+dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
+{
+   if (test_and_clear_bit(PLOOP_REQ_DEL_CONV, &preq->state))
+   dio_convert_extent(io, preq);
+
+   return 0;
+}
+
 /* Submit the whole cluster. If preq contains only partial data
  * within the cluster, pad the rest of cluster with zeros.
  */
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index 0fba25e..4c52a40 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -148,7 +148,7 @@ struct ploop_io_ops
  struct bio_list *sbl, iblock_t iblk, unsigned int 
size);
void(*submit_alloc)(struct ploop_io *, struct ploop_request *,
struct bio_list *sbl, unsigned int size);
-   void(*post_submit)(struct ploop_io *, struct ploop_request *);
+   int (*post_submit)(struct ploop_io *, struct ploop_request *);
 
int (*disable_merge)(struct ploop_io * io, sector_t isector, 
unsigned int len);
int (*fastmap)(struct ploop_io * io, struct bio *orig_bio,
@@ -471,6 +471,7 @@ enum
PLOOP_REQ_POST_SUBMIT, /* preq needs post_submit processing */
PLOOP_REQ_PUSH_BACKUP, /* preq was ACKed by userspace push_backup */
PLOOP_REQ_ALLOW_READS, /* READs are allowed for given req_cluster */
+   PLOOP_REQ_DEL_CONV,/* post_submit: conversion required */
PLOOP_REQ_FSYNC_DONE,  /* fsync_thread() performed f_op->fsync() */
 };
 
@@ -479,6 +480,8 @@ enum
 #define PLOOP_REQ_RELOC_S_FL (1 << PLOOP_REQ_RELOC_S)
 #define PLOOP_REQ_DISCARD_FL (1 << PLOOP_REQ_DISCARD)
 #define PLOOP_REQ_ZERO_FL (1 << PLOOP_REQ_ZERO)
+#define PLOOP_REQ_POST_SUBMIT_FL (1 << PLOOP_REQ_POST_SUBMIT)
+#define PLOOP_REQ_DEL_CONV_FL (1 << PLOOP_REQ_DEL_CONV)
 
 enum
 {
@@ -767,6 +770,13 @@ static inline void ploop_entry_qlen_dec(struct 
ploop_request * preq)
preq->plo->read_sync_reqs--;
}
 }
+static inline
+void ploop_add_post_submit(struct ploop_io *io, struct ploop_request 

[Devel] [RH7 PATCH 5/6] ploop: fixup barrier handling during relocation

2016-06-23 Thread Dmitry Monakhov
The barrier code is broken in many ways:
currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly,
but requests can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
  ->delta->allocate
    ->io->submit_alloc: dio_submit_alloc
      ->dio_submit_pad
E_DATA_WBI: data written, time to update index
  ->delta->allocate_complete: ploop_index_update
    ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
    ->write_page
    ->ploop_map_wb_complete
      ->ploop_wb_complete_post_process
        ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:
  ->submit()

Once we have the delayed_flush engine it is easy to implement the correct
scheme for both engines.

E_RELOC_DATA_READ ->submit_alloc => wait->post_submit->issue_flush
E_DATA_WBI ->ploop_index_update with FUA
E_RELOC_NULLIFY ->submit: => wait->post_submit->issue_flush

This makes the reloc sequence optimal:
RELOC_S: R1, W2,WAIT,FLUSH, WBI:FUA
RELOC_A: R1, W2,WAIT,FLUSH, WBI:FUA, W1:NULLIFY,WAIT, FLUSH

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/dev.c |  2 +-
 drivers/block/ploop/io_kaio.c |  3 +--
 drivers/block/ploop/map.c | 28 ++--
 3 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 95e3067..090cd2d 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2533,7 +2533,7 @@ restart:
sbl.head = sbl.tail = preq->aux_bio;
 
/* Relocated data write required sync before BAT updatee */
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
+   preq->req_rw |= REQ_FUA;
 
if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index 5341fd5..5217ab4 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -72,8 +72,7 @@ static void kaio_complete_io_state(struct ploop_request * 
preq)
}
 
/* Convert requested fua to fsync */
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
-   test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state) ||
+   if (test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state) ||
test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state))
post_fsync = 1;
 
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 3a6365d..ef351fb 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -901,6 +901,8 @@ void ploop_index_update(struct ploop_request * preq)
int old_level;
struct page * page;
sector_t sec;
+   int fua = !!(preq->req_rw & REQ_FUA);
+   unsigned long state = READ_ONCE(preq->state);
 
/* No way back, we are going to initiate index write. */
 
@@ -954,12 +956,11 @@ void ploop_index_update(struct ploop_request * preq)
plo->st.map_single_writes++;
top_delta->ops->map_index(top_delta, m->mn_start, &sec);
/* Relocate requires consistent writes, mark such reqs appropriately */
-   if (test_bit(PLOOP_REQ_RELOC_A, &preq->state) ||
-   test_bit(PLOOP_REQ_RELOC_S, &preq->state))
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
-   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec,
- !!(preq->req_rw & REQ_FUA));
+   if (state & (PLOOP_REQ_RELOC_A_FL | PLOOP_REQ_RELOC_S_FL)) {
+   WARN_ON(state & PLOOP_REQ_DEL_FLUSH_FL);
+   fua = 1;
+   }
+   top_delta->io.ops->write_page(&top_delta->io, preq, page, sec, fua);
put_page(page);
return;
 
@@ -1063,7 +1064,7 @@ static void map_wb_complete_post_process(struct ploop_map 
*map,
 * (see dio_submit()). So fsync of EXT4 image doesnt help us.
 * We need to force sync of nullified blocks.
 */
-   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
+   preq->req_rw |= REQ_FUA;
top_delta->io.ops->submit(&top_delta->io, preq, preq->req_rw,
  &sbl, preq->iblock, 1<cluster_log);
 }
@@ -1153,8 +1154,10 @@ static void map_wb_complete(struct map_node * m, int err)
 
list_for_each_safe(cursor, tmp, &m->io_queue) {
struct ploop_request * preq;
+   unsigned long state;
 

[Devel] [RH7 PATCH 4/6] ploop: io_kaio support PLOOP_REQ_DEL_FLUSH

2016-06-23 Thread Dmitry Monakhov
Currently no one tags preqs with such a bit, but let it be here for symmetry.

Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/io_kaio.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
index bee2cee..5341fd5 100644
--- a/drivers/block/ploop/io_kaio.c
+++ b/drivers/block/ploop/io_kaio.c
@@ -73,6 +73,7 @@ static void kaio_complete_io_state(struct ploop_request * 
preq)
 
/* Convert requested fua to fsync */
if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
+   test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state) ||
test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state))
post_fsync = 1;
 
-- 
1.8.3.1



[Devel] [RH7 PATCH 3/6] ploop: add delayed flush support

2016-06-23 Thread Dmitry Monakhov
dio_submit and dio_submit_pad may produce several bios. This makes
processing of REQ_FUA complicated because, in order to preserve correctness,
we have to tag each bio with the FUA flag, which is suboptimal.
Obviously there is room for optimization here: once all bios have been
acknowledged by the lower layer, we may issue an empty barrier aka
->issue_flush(). The post_submit callback is the place where all bios have
completed already.

b1:FUA, b2:FUA, b3:FUA  =>  b1,b2,b3,wait_for_bios,bX:FLUSH

This allows us to remove all the REQ_FORCE_{FLUSH,FUA} crap.
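
The empty barrier itself is just a payload-less FLUSH bio; a minimal sketch of
what ->issue_flush() amounts to in io_direct (cf. dio_issue_flush in patch
2/6; the bdev field name is an assumption):

static void dio_issue_flush_sketch(struct ploop_io *io,
				   struct ploop_request *preq)
{
	struct bio *bio = bio_alloc(GFP_NOIO, 0);	/* no payload */

	bio->bi_bdev    = io->files.bdev;	/* assumed field name */
	bio->bi_end_io  = dio_endio_async;
	bio->bi_private = preq;

	atomic_inc(&preq->io_count);
	submit_bio(WRITE_FLUSH, bio);	/* WRITE | REQ_FLUSH, empty barrier */
	ploop_complete_io_request(preq);
}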

Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/io_direct.c | 48 +
 include/linux/ploop/ploop.h |  2 ++
 2 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 195d318..752a9c3e 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -82,31 +82,13 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
sector_t sec, nsec;
int err;
struct bio_list_walk bw;
-   int preflush;
-   int postfua = 0;
+   int preflush = !!(rw & REQ_FLUSH);
+   int postflush = !!(rw & REQ_FUA);
int write = !!(rw & REQ_WRITE);
 
trace_submit(preq);
 
-   preflush = !!(rw & REQ_FLUSH);
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state))
-   preflush = 1;
-
-   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
-   postfua = 1;
-
-   if (!postfua && ploop_req_delay_fua_possible(rw, preq)) {
-
-   /* Mark req that delayed flush required */
-   set_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
-   } else if (rw & REQ_FUA) {
-   postfua = 1;
-   }
-
rw &= ~(REQ_FLUSH | REQ_FUA);
-
-
bio_list_init(&bl);
 
if (iblk == PLOOP_ZERO_INDEX)
@@ -237,13 +219,14 @@ flush_bio:
rw2 |= REQ_FLUSH;
preflush = 0;
}
-   if (unlikely(postfua && !bl.head))
-   rw2 |= REQ_FUA;
-
ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
submit_bio(rw2, b);
}
-
+   /* TODO: minor optimization is possible for single bio case */
+   if (postflush) {
+   set_bit(PLOOP_REQ_DEL_FLUSH, &preq->state);
+   ploop_add_post_submit(io, preq);
+   }
ploop_complete_io_request(preq);
return;
 
@@ -523,9 +506,10 @@ dio_convert_extent(struct ploop_io *io, struct ploop_request * preq)
  (loff_t)sec << 9, clu_siz);
 
/* highly unlikely case: FUA coming to a block not provisioned yet */
-   if (!err && force_sync)
+   if (!err && force_sync) {
+   clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state);
err = io->ops->sync(io);
-
+   }
if (!force_sync) {
spin_lock_irq(&plo->lock);
io->io_count++;
@@ -546,7 +530,12 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
if (test_and_clear_bit(PLOOP_REQ_DEL_CONV, &preq->state))
dio_convert_extent(io, preq);
 
+   if (test_and_clear_bit(PLOOP_REQ_DEL_FLUSH, &preq->state)) {
+   io->ops->issue_flush(io, preq);
+   return 1;
+   }
return 0;
+
 }
 
 /* Submit the whole cluster. If preq contains only partial data
@@ -562,7 +551,6 @@ dio_submit_pad(struct ploop_io *io, struct ploop_request * preq,
sector_t sec, end_sec, nsec, start, end;
struct bio_list_walk bw;
int err;
-
bio_list_init(&bl);
 
/* sec..end_sec is the range which we are going to write */
@@ -694,7 +682,11 @@ flush_bio:
ploop_acc_ff_out(preq->plo, rw | b->bi_rw);
submit_bio(rw, b);
}
-
+   /* TODO: minor optimization is possible for single bio case */
+   if (preq->req_rw &  REQ_FUA) {
+   set_bit(PLOOP_REQ_DEL_FLUSH, &preq->state);
+   ploop_add_post_submit(io, preq);
+   }
ploop_complete_io_request(preq);
return;
 
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index 4c52a40..5076f16 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -472,6 +472,7 @@ enum
PLOOP_REQ_PUSH_BACKUP, /* preq was ACKed by userspace push_backup */
PLOOP_REQ_ALLOW_READS, /* READs are allowed for given req_cluster */
PLOOP_REQ_DEL_CONV,/* post_submit: conversion required */
+   PLOOP_REQ_DEL_FLUSH,   /* post_submit: REQ_FLUSH required */
PLOOP_REQ_FSYNC_DONE,  /* fsync_thread() performed f_op->fsync() */
 };
 
@@ -482,6 +483,7 @@ enum
 

Re: [Devel] [PATCH rh7 0/9] ploop: fix barriers for reloc requests

2016-06-24 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> The series firstly fixes a few issues in handling
> barriers in ordinary requests (what was overlooked
> in previous patch -- see commit c2247f3745).
>
> Then there are a few minor rework w/o functional
> changes that alleviate main patches (last two ones).
>
> And finally the series fixes handling barriers
> for RELOC_A|S requests.
>
> The main complexity comes from the following bug:
> for direct_io it's not enough to send FUA to flush
> all nullified cluster block. See details in
> "fix barriers for PLOOP_E_RELOC_NULLIFY" patch.
>
Ok, Max. I cannot fully agree with the way you organize the fix for the RELOC
bug (especially for kaio). But it does all the major things:
1) Removes the _FORCE_XXX crap
2) Cleans up the barrier stuff
3) Fixes the RELOC_XXX code flow.

Let's keep style things aside for now, and commit that fix.
So ACK for the whole series, and let's optimize/fix the stylistic stuff later.
> ---
>
> Dmitry Monakhov (3):
>   ploop: deadcode cleanup
>   ploop: minor rework of ->write_page() io method
>   ploop: generalize issue_flush
>
> Maxim Patlasov (6):
>   ploop: minor rework of ploop_req_delay_fua_possible
>   ploop: resurrect delayed_fua for io_kaio
>   ploop: resurrect delay_fua for io_direct
>   ploop: remove preflush from dio_submit
>   ploop: fix barriers for PLOOP_E_RELOC_NULLIFY
>   ploop: fixup barrier handling during relocation
>
>
>  drivers/block/ploop/dev.c   |   16 ++--
>  drivers/block/ploop/io_direct.c |   48 -
>  drivers/block/ploop/io_kaio.c   |   26 ++--
>  drivers/block/ploop/map.c   |   50 ---
>  include/linux/ploop/ploop.h |   20 +++-
>  5 files changed, 71 insertions(+), 89 deletions(-)
>
> --
> Signature




Re: [Devel] [PATCH rh7 6/9] ploop: remove preflush from dio_submit

2016-06-24 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> After commit c2247f3745 fixing barriers for ordinary
> requests and previous patch fixing delay_fua,
> that legacy code in dio_submit processing
> (preq->req_rw & REQ_FLUSH) by setting REQ_FLUSH in
> the first outgoing bio must die: it is incorrect
> anyway (we don't wait for completion of the first
> bio before sending others).
Wow. This is so true. BTW: a reasonable way to handle FLUSH
is to queue such preqs to a preflush_queue, similar to the fsync_queue in the
fsync_thread infrastructure.

>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/io_direct.c |7 ---
>  1 file changed, 7 deletions(-)
>
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 1ea2008..ee3cd5c 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -89,15 +89,12 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
>   sector_t sec, nsec;
>   int err;
>   struct bio_list_walk bw;
> - int preflush;
>   int postfua = 0;
>   int write = !!(rw & REQ_WRITE);
>   int delayed_fua = 0;
>  
>   trace_submit(preq);
>  
> - preflush = !!(rw & REQ_FLUSH);
> -
>   if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
>   postfua = 1;
>  
> @@ -236,10 +233,6 @@ flush_bio:
>   b->bi_private = preq;
>   b->bi_end_io = dio_endio_async;
>  
> - if (unlikely(preflush)) {
> - rw2 |= REQ_FLUSH;
> - preflush = 0;
> - }
>   if (unlikely(postfua && !bl.head))
>   rw2 |= REQ_FUA;
>  




Re: [Devel] [PATCH rh7 9/9] ploop: fixup barrier handling during relocation

2016-06-24 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Rebase Dima's patch on top of rh7-3.10.0-327.18.2.vz7.14.19,
> but without help of delayed_flush engine:
>
> To ensure consistency on crash/power outage/hard reboot
> events, ploop must implement the following barrier logic
> for RELOC_A|S requests:
>
> 1) After we store data to new place, but before updating
> BAT on disk, we have FLUSH everything (in fact, flushing
> those data would be enough, but it is simplier to flush
> everything).
>
> 2) We should not proceed handling RELOC_A|S until we
> 100% sure new BAT value went to disk platters. So far as
> new BAT is only one page, it's OK to mark corresponding
> bio with FUA flag for io_direct case. For io_kaio, not
> having FUA api, we have to post_fsync BAT update.
>
> PLOOP_REQ_FORCE_FLUSH/PLOOP_REQ_FORCE_FUA introduced
> long time ago probably were intended to ensure the
> logic above, but they actually didn't.
>
> The patch removes PLOOP_REQ_FORCE_FLUSH/PLOOP_REQ_FORCE_FUA,
> and implements barriers in a straightforward and simple way:
> check for RELOC_A|S explicitly and make FLUSH/FUA where
> needed.
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c   |4 ++--
>  drivers/block/ploop/io_direct.c |7 ---
>  drivers/block/ploop/io_kaio.c   |8 +---
>  drivers/block/ploop/map.c   |   22 ++
>  include/linux/ploop/ploop.h |1 -
>  5 files changed, 17 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index 2b60dfa..40768b6 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -2610,8 +2610,8 @@ restart:
>   top_delta = ploop_top_delta(plo);
>   sbl.head = sbl.tail = preq->aux_bio;
>  
> - /* Relocated data write required sync before BAT updatee */
> - set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
> + /* Relocated data write required sync before BAT update
> +  * this will happen inside index_update */
>  
>   if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
>   preq->eng_state = PLOOP_E_DATA_WBI;
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index c4d0f63..266f041 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -89,15 +89,11 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
>   sector_t sec, nsec;
>   int err;
>   struct bio_list_walk bw;
> - int postfua = 0;
>   int write = !!(rw & REQ_WRITE);
>   int delayed_fua = 0;
>  
>   trace_submit(preq);
>  
> - if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
> - postfua = 1;
> -
>   if ((rw & REQ_FUA) && ploop_req_delay_fua_possible(preq)) {
>   /* Mark req that delayed flush required */
>   preq->req_rw |= (REQ_FLUSH | REQ_FUA);
> @@ -233,9 +229,6 @@ flush_bio:
>   b->bi_private = preq;
>   b->bi_end_io = dio_endio_async;
>  
> - if (unlikely(postfua && !bl.head))
> - rw2 |= REQ_FUA;
> -
>   ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
>   submit_bio(rw2, b);
>   }
> diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c
> index ed550f4..85863df 100644
> --- a/drivers/block/ploop/io_kaio.c
> +++ b/drivers/block/ploop/io_kaio.c
> @@ -69,6 +69,8 @@ static void kaio_complete_io_state(struct ploop_request * preq)
>   unsigned long flags;
>   int post_fsync = 0;
>   int need_fua = !!(preq->req_rw & REQ_FUA);
> + unsigned long state = READ_ONCE(preq->state);
> + int reloc = !!(state & (PLOOP_REQ_RELOC_A_FL|PLOOP_REQ_RELOC_S_FL));
>  
>   if (preq->error || !(preq->req_rw & REQ_FUA) ||
>   preq->eng_state == PLOOP_E_INDEX_READ ||
> @@ -80,9 +82,9 @@ static void kaio_complete_io_state(struct ploop_request * preq)
>   }
>  
>   /* Convert requested fua to fsync */
> - if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state) ||
> - test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state) ||
> - (need_fua && !ploop_req_delay_fua_possible(preq))) {
This is the change I dislike the most. io_XXX should not care whether it is a
reloc request or not. The caller should rule whether PREFLUSH/POSTFLUSH should
happen before the preq completes. So IMHO this is a crutch, but a correct one.

> + if (test_and_clear_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state) ||
> + (need_fua && !ploop_req_delay_fua_possible(preq)) ||
> + (reloc && ploop_req_delay_fua_possible(preq))) {
>   post_fsync = 1;
>   preq->req_rw &= ~REQ_FUA;
>   }
> diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
> index 915a216..1883674 100644
> --- a/drivers/block/ploop/map.c
> +++ b/drivers/block/ploop/map.c
> @@ -909,6 +909,7 @@ void ploop_index_update(struct ploop_request * preq)
>   struct page * page;

[Devel] [RH7 PATCH] ploop: reloc vs extent_conversion race fix

2016-06-30 Thread Dmitry Monakhov
We have fixed most relocation bugs while working on
https://jira.sw.ru/browse/PSBM-47107

Currently reloc_a looks as follows:

 1->read_data_from_old_pos
 2->write_to_new_pos
->submit_alloc
  ->submit_pad
  ->post_submit->convert_unwritten
 3->update_index ->write_page with FLUSH|FUA
 4->nullify_old_pos
 5->issue_flush

But at step 3 the extent conversion is not yet stable because it belongs to an
uncommitted transaction. We MUST call ->fsync inside ->post_submit as we do
for REQ_FUA requests. Let's tag relocation requests as FUA from the very
beginning in order to ensure sync semantics.
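
To make the window concrete, a rough sketch of the required ordering
(function names as used elsewhere in this thread; how the FUA tag propagates
into the actual sync call is simplified here):

	dio_convert_extent(io, preq);	/* extent conversion: still part of
					 * an uncommitted fs transaction */
	io->ops->sync(io);		/* fsync the image file: makes the
					 * conversion stable on disk */
	ploop_index_update(preq);	/* only now write the BAT page
					 * with FLUSH|FUA */

A power failure between an unsynced conversion and the BAT update could leave
the new BAT entry pointing at blocks whose extents are still unwritten in the
image file.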

https://jira.sw.ru/browse/PSBM-49143
Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/dev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 40768b6..e5f010b 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device * plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = 0;
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC;
+   preq->req_rw = WRITE_SYNC|REQ_FUA;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
preq->error = 0;
@@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct ploop_device *plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = ~0U; /* uninitialized */
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC;
+   preq->req_rw = WRITE_SYNC|REQ_FUA;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
preq->error = 0;
-- 
1.8.3.1



Re: [Devel] [PATCH rh7] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests

2016-07-06 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Commit 9f860e606 introduced an engine to delay fsync: doing
> fallocate(FALLOC_FL_CONVERT_UNWRITTEN) dio_post_submit marks
> io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens
> later, when incoming FLUSH|FUA comes.
>
> That was deemed as important because (PSBM-47026):
>
>> This optimization becomes more important due to the fact that customers tend 
>> to use pcompact heavily => ploop images grow each day.
>
> Now, we can easily re-use the engine to delay fsync for reloc
> requests as well. As explained in the description of commit
> 5aa3fe09:
>
>> 1->read_data_from_old_post
>> 2->write_to_new_pos
>>   ->sumbit_alloc
>>  ->submit_pad
>>  ->post_submit->convert_unwritten
>> 3->update_index ->write_page with FLUSH|FUA
>> 4->nullify_old_pos
>>5->issue_flush
>
> by the time of step 3 extent coversion is not yet stable because
> belongs to uncommitted transaction. But instead of doing fsync
> inside ->post_submit, we can fsync later, as the very first step
> of write_page for index_update.
NAK from me. What is the advantage of this patch?
Does it make the code more optimal? No.
Does it make main ploop more asynchronous? No.

If you want to make an optimization, then it is reasonable to
queue a preq with PLOOP_IO_FSYNC_DELAYED to top_io->fsync_queue
before processing the PLOOP_E_DATA_WBI state for a preq with FUA.
So the sequence would look as follows:
->submit_alloc
  ->submit_pad
  ->post_submit->convert_unwritten -> tag PLOOP_IO_FSYNC_DELAYED
->ploop_req_state_process
  case PLOOP_E_DATA_WBI:
  if (preq->state & PLOOP_IO_FSYNC_DELAYED_FL) {
      preq->state &= ~PLOOP_IO_FSYNC_DELAYED_FL;
      list_add_tail(&preq->list, &top_io->fsync_queue);
      return;
  }
##Let fsync_thread do its work
->ploop_req_state_process
   case PLOOP_E_DATA_WBI:
   update_index->write_page with FUA (FLUSH is not required because we have
already done the fsync)

>
> https://jira.sw.ru/browse/PSBM-47026
>
> Signed-off-by: Maxim Patlasov 
> ---
>  drivers/block/ploop/dev.c   |4 ++--
>  drivers/block/ploop/io_direct.c |   25 -
>  drivers/block/ploop/io_kaio.c   |3 ++-
>  drivers/block/ploop/map.c   |   17 -
>  include/linux/ploop/ploop.h |3 ++-
>  5 files changed, 42 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index e5f010b..40768b6 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device * plo)
>   preq->bl.tail = preq->bl.head = NULL;
>   preq->req_cluster = 0;
>   preq->req_size = 0;
> - preq->req_rw = WRITE_SYNC|REQ_FUA;
> + preq->req_rw = WRITE_SYNC;
>   preq->eng_state = PLOOP_E_ENTRY;
>   preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
>   preq->error = 0;
> @@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct ploop_device 
> *plo)
>   preq->bl.tail = preq->bl.head = NULL;
>   preq->req_cluster = ~0U; /* uninitialized */
>   preq->req_size = 0;
> - preq->req_rw = WRITE_SYNC|REQ_FUA;
> + preq->req_rw = WRITE_SYNC;
>   preq->eng_state = PLOOP_E_ENTRY;
>   preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
>   preq->error = 0;
> diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
> index 1086850..0a5fb15 100644
> --- a/drivers/block/ploop/io_direct.c
> +++ b/drivers/block/ploop/io_direct.c
> @@ -1494,13 +1494,36 @@ dio_read_page(struct ploop_io * io, struct ploop_request * preq,
>  
>  static void
>  dio_write_page(struct ploop_io * io, struct ploop_request * preq,
> -struct page * page, sector_t sec, unsigned long rw)
> +struct page * page, sector_t sec, unsigned long rw,
> +int do_fsync_if_delayed)
>  {
>   if (!(io->files.file->f_mode & FMODE_WRITE)) {
>   PLOOP_FAIL_REQUEST(preq, -EBADF);
>   return;
>   }
>  
> + if (do_fsync_if_delayed &&
> + test_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state)) {
> + struct ploop_device * plo = io->plo;
> + u64 io_count;
> + int err;
> +
> + spin_lock_irq(&plo->lock);
> + io_count = io->io_count;
> + spin_unlock_irq(&plo->lock);
> +
> + err = io->ops->sync(io);
> + if (err) {
> + PLOOP_FAIL_REQUEST(preq, -EBADF);
> + return;
> + }
> +
> + spin_lock_irq(&plo->lock);
> + if (io_count == io->io_count && !(io_count & 1))
> + clear_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
> + spin_unlock_irq(&plo->lock);
> + }
> +
>   dio_io_page(io, rw | WRITE | REQ_SYNC, preq, page, sec);
>  }
>  
> diff --git a/drivers/block/ploop/io_kaio.c b/drivers/block/ploop/io_kaio.c

[Devel] [PATCH] e2fsprogs: fixup resize issues (PSBM #49322)

2016-07-08 Thread Dmitry Monakhov
Backport mainstream commits:
c82815e resize2fs: disable the meta_bg feature if necessary
7a4352d e2fsck: fix file systems with an overly large s_first_meta_bg

TODO: update changelog
Signed-off-by: Dmitry Monakhov 
---
 ...size2fs-disable-the-meta_bg-feature-if-ne.patch | 63 +++
 ...file-systems-with-an-overly-large-s_first.patch | 70 ++
 e2fsprogs.spec |  6 +-
 3 files changed, 138 insertions(+), 1 deletion(-)
 create mode 100644 e2fsprogs-1.42.9-backport-resize2fs-disable-the-meta_bg-feature-if-ne.patch
 create mode 100644 e2fsprogs-1.42.9-e2fsck-fix-file-systems-with-an-overly-large-s_first.patch

diff --git a/e2fsprogs-1.42.9-backport-resize2fs-disable-the-meta_bg-feature-if-ne.patch b/e2fsprogs-1.42.9-backport-resize2fs-disable-the-meta_bg-feature-if-ne.patch
new file mode 100644
index 000..e1ef136
--- /dev/null
+++ b/e2fsprogs-1.42.9-backport-resize2fs-disable-the-meta_bg-feature-if-ne.patch
@@ -0,0 +1,63 @@
+From 21045fee7b031db004aba818cc803e92937dbac0 Mon Sep 17 00:00:00 2001
+From: Theodore Ts'o 
+Date: Sat, 9 Aug 2014 12:33:11 -0400
+Subject: [PATCH 2/2] backport resize2fs: disable the meta_bg feature if
+ necessary From c82815e5097f130c8b926b3303a1e063a19dcdd0 Mon Sep 17 00:00:00
+ 2001 [PATCH] resize2fs: disable the meta_bg feature if necessary
+
+When shrinking a file system, if the number block groups drops below
+the point where we started using the meta_bg layout, disable the
+meta_bg feature and set s_first_meta_bg to zero.  This is necessary to
+avoid creating an invalid/corrupted file system after the shrink.
+
+Addresses-Debian-Bug: #756922
+
+Signed-off-by: Theodore Ts'o 
+Reported-by: Marcin Wolcendorf 
+Tested-by: Marcin Wolcendorf 
+Signed-off-by: Dmitry Monakhov 
+---
+ resize/resize2fs.c | 17 +
+ 1 file changed, 13 insertions(+), 4 deletions(-)
+
+diff --git a/resize/resize2fs.c b/resize/resize2fs.c
+index a8bbd7c..2dc16b8 100644
+--- a/resize/resize2fs.c
++++ b/resize/resize2fs.c
+@@ -462,6 +462,13 @@ retry:
+   fs->super->s_reserved_gdt_blocks = new;
+   }
+ 
++  if ((fs->super->s_feature_incompat & EXT2_FEATURE_INCOMPAT_META_BG) &&
++  (fs->super->s_first_meta_bg > fs->desc_blocks)) {
++  fs->super->s_feature_incompat &=
++  ~EXT2_FEATURE_INCOMPAT_META_BG;
++  fs->super->s_first_meta_bg = 0;
++  }
++
+   /*
+* If we are shrinking the number of block groups, we're done
+* and can exit now.
+@@ -947,13 +954,15 @@ static errcode_t blocks_to_move(ext2_resize_t rfs)
+   ext2fs_mark_block_bitmap2(rfs->reserve_blocks, blk);
+   }
+ 
+-  if (fs->super->s_feature_incompat & EXT2_FEATURE_INCOMPAT_META_BG) {
++  if (old_fs->super->s_feature_incompat & EXT2_FEATURE_INCOMPAT_META_BG)
+   old_blocks = old_fs->super->s_first_meta_bg;
+-  new_blocks = fs->super->s_first_meta_bg;
+-  } else {
++  else
+-  old_blocks = old_fs->desc_blocks + old_fs->super->s_reserved_gdt_blocks;
++
++  if (fs->super->s_feature_incompat & EXT2_FEATURE_INCOMPAT_META_BG)
++  new_blocks = fs->super->s_first_meta_bg;
++  else
+   new_blocks = fs->desc_blocks + fs->super->s_reserved_gdt_blocks;
+-  }
+ 
+   if (old_blocks == new_blocks) {
+   retval = 0;
+-- 
+1.8.3.1
+
diff --git a/e2fsprogs-1.42.9-e2fsck-fix-file-systems-with-an-overly-large-s_first.patch b/e2fsprogs-1.42.9-e2fsck-fix-file-systems-with-an-overly-large-s_first.patch
new file mode 100644
index 000..cdf2524
--- /dev/null
+++ b/e2fsprogs-1.42.9-e2fsck-fix-file-systems-with-an-overly-large-s_first.patch
@@ -0,0 +1,70 @@
+From 26a16ea9c97460711f1cbaf9e0a7333b8b27884d Mon Sep 17 00:00:00 2001
+From: Theodore Ts'o 
+Date: Thu, 7 Jul 2016 19:17:49 +0300
+Subject: [PATCH 1/2] e2fsck: fix file systems with an overly large
+ s_first_meta_bg
+
+Signed-off-by: Theodore Ts'o 
+Signed-off-by: Dmitry Monakhov 
+---
+ e2fsck/problem.c |  5 +
+ e2fsck/problem.h |  3 +++
+ e2fsck/super.c   | 12 
+ 3 files changed, 20 insertions(+)
+
+diff --git a/e2fsck/problem.c b/e2fsck/problem.c
+index 83584a0..431d7e7 100644
+--- a/e2fsck/problem.c
++++ b/e2fsck/problem.c
+@@ -438,6 +438,11 @@ static struct e2fsck_problem problem_table[] = {
+ N_("@S 64bit filesystems needs extents to access the whole disk.  "),
+ PROMPT_FIX, PR_PREEN_OK | PR_NO_OK},
+ 
++  /* The first_meta_bg is too big */
++  { PR_0_FIRST_META_BG_TOO_BIG,
++N_("First_meta_bg is too big.  (%N, max value %g).  "),
++PROMPT_CLEAR, 0 },
++
+   /* Pass 1 errors */
+ 
+   /* Pass 1: Checking inodes, blocks, and sizes */
+diff --git a/e2fsck/problem.h b/e2fsck/problem.h

[Devel] [PATCH] ext4: improve ext4lazyinit scalability

2016-07-15 Thread Dmitry Monakhov
ext4lazyinit is a global thread. This thread performs itable initialization
under ->li_list_mtx.

It basically does the following:
ext4_lazyinit_thread
  ->mutex_lock(&eli->li_list_mtx);
  ->ext4_run_li_request(elr)
    ->ext4_init_inode_table -> Does a lot of IO if the list is large

And when new mounts/umounts arrive, they have to block on ->li_list_mtx
because the lazy thread holds it during the full walk:
ext4_fill_super
 ->ext4_register_li_request
   ->mutex_lock(&ext4_li_info->li_list_mtx);
   ->list_add(&elr->lr_request, &ext4_li_info->li_request_list);
In my case mount takes 40 minutes on a server with 36 * 4Tb HDDs.
An ordinary user may face this with a very slow device (/dev/mmcblkXXX).
Even worse: if one of the filesystems is frozen, lazyinit_thread will simply
block on sb_start_write(), so other mounts/umounts will hang forever.

This patch changes the logic as follows:
- grab the ->s_umount read semaphore before processing a new li_request;
  after that it is safe to drop li_list_mtx, because all callers of
  ext4_remove_li_request() hold ->s_umount for write.
- li_thread skips frozen SBs.

Locking:
Locking order is asserted by the umount path as follows: s_umount ->
li_list_mtx, so the only way to grab ->s_umount inside li_thread is via
down_read_trylock().

https://jira.sw.ru/browse/PSBM-49658

Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/super.c | 53 -
 1 file changed, 36 insertions(+), 17 deletions(-)
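
The heart of the change is the trylock nesting in the lazyinit loop; a
condensed sketch of it, with names as in the diff below:

	mutex_lock(&eli->li_list_mtx);
	/* for each request on the list: */
	if (down_read_trylock(&elr->lr_super->s_umount)) {
		if (sb_start_write_trylock(elr->lr_super)) {
			/* s_umount pins the sb on the list, so it is
			 * safe to drop li_list_mtx for the slow walk */
			mutex_unlock(&eli->li_list_mtx);
			err = ext4_run_li_request(elr);
			sb_end_write(elr->lr_super);
			mutex_lock(&eli->li_list_mtx);
		}
		up_read(&elr->lr_super->s_umount);
	}
	mutex_unlock(&eli->li_list_mtx);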

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3822a5a..0ee193f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2635,7 +2635,6 @@ static int ext4_run_li_request(struct ext4_li_request *elr)
sb = elr->lr_super;
ngroups = EXT4_SB(sb)->s_groups_count;
 
-   sb_start_write(sb);
for (group = elr->lr_next_group; group < ngroups; group++) {
gdp = ext4_get_group_desc(sb, group, NULL);
if (!gdp) {
@@ -2662,8 +2661,6 @@ static int ext4_run_li_request(struct ext4_li_request *elr)
elr->lr_next_sched = jiffies + elr->lr_timeout;
elr->lr_next_group = group + 1;
}
-   sb_end_write(sb);
-
return ret;
 }
 
@@ -2713,9 +2710,9 @@ static struct task_struct *ext4_lazyinit_task;
 static int ext4_lazyinit_thread(void *arg)
 {
struct ext4_lazy_init *eli = (struct ext4_lazy_init *)arg;
-   struct list_head *pos, *n;
struct ext4_li_request *elr;
unsigned long next_wakeup, cur;
+   LIST_HEAD(request_list);
 
BUG_ON(NULL == eli);
 
@@ -2728,21 +2725,43 @@ cont_thread:
mutex_unlock(&eli->li_list_mtx);
goto exit_thread;
}
-
-   list_for_each_safe(pos, n, &eli->li_request_list) {
-   elr = list_entry(pos, struct ext4_li_request,
-lr_request);
-
-   if (time_after_eq(jiffies, elr->lr_next_sched)) {
-   if (ext4_run_li_request(elr) != 0) {
-   /* error, remove the lazy_init job */
-   ext4_remove_li_request(elr);
-   continue;
+   list_splice_init(&eli->li_request_list, &request_list);
+   while (!list_empty(&request_list)) {
+   int err = 0;
+   int progress = 0;
+
+   elr = list_entry(request_list.next,
+struct ext4_li_request, lr_request);
+   list_move(request_list.next, &eli->li_request_list);
+   if (time_before(jiffies, elr->lr_next_sched)) {
+   if (time_before(elr->lr_next_sched, next_wakeup))
+   next_wakeup = elr->lr_next_sched;
+   continue;
+   }
+   if (down_read_trylock(&elr->lr_super->s_umount)) {
+   if (sb_start_write_trylock(elr->lr_super)) {
+   progress = 1;
+   /* We holds sb->s_umount, sb can not
+* be removed from the list, it is
+* now safe to drop li_list_mtx
+*/
+   mutex_unlock(&eli->li_list_mtx);
+   err = ext4_run_li_request(elr);
+   sb_end_write(elr->lr_super);
+   mutex_lock(&eli->li_list_mtx);
}
+   up_read((&elr->lr_super->s_umount));

Re: [Devel] [PATCH rh7] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests

2016-07-20 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Dima,
>
>
> I have not heard from you since 07/06/2016. Do you agree with that 
> reasoning I provided in last email? What's your objection against the 
> patch now?
Max, this patch looks ugly because it mixes many things in one place.
To do this the right way, let's introduce an fsync-pended eng_state
where we can queue our requests and let fsync_thread handle this.
There are several places where we need such functionality:

ENTRY: for reqs with FUA and IO_FSYNC_PENDED
PLOOP_E_DATA_WBI: for reqs with FUA
PLOOP_E_NULLIFY:
PLOOP_E_COMPLETE: for reqs with FUA

Let's do it once and it will work fine for all cases.
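
A sketch of what such a hook could look like at one of those points
(PLOOP_E_FSYNC_PENDED and the queueing helper are named after the v3 patch
later in this thread):

	/* e.g. in ploop_index_update(), before issuing the index write */
	if (test_bit(PLOOP_IO_FSYNC_DELAYED, &top_io->io_state)) {
		preq->eng_state = PLOOP_E_FSYNC_PENDED;
		ploop_add_req_to_fsync_queue(preq); /* fsync_thread resumes us */
		return;
	}
	/* reached only with the fsync already done: FUA alone is enough */
	ploop_index_wb_proceed(preq);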
>
>
> Thanks,
>
> Maxim
>
>
> On 07/06/2016 11:10 AM, Maxim Patlasov wrote:
>> Dima,
>>
>> On 07/06/2016 04:58 AM, Dmitry Monakhov wrote:
>>
>>> Maxim Patlasov  writes:
>>>
>>>> Commit 9f860e606 introduced an engine to delay fsync: doing
>>>> fallocate(FALLOC_FL_CONVERT_UNWRITTEN) dio_post_submit marks
>>>> io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens
>>>> later, when incoming FLUSH|FUA comes.
>>>>
>>>> That was deemed as important because (PSBM-47026):
>>>>
>>>>> This optimization becomes more important due to the fact that 
>>>>> customers tend to use pcompact heavily => ploop images grow each day.
>>>> Now, we can easily re-use the engine to delay fsync for reloc
>>>> requests as well. As explained in the description of commit
>>>> 5aa3fe09:
>>>>
>>>>>  1->read_data_from_old_post
>>>>>  2->write_to_new_pos
>>>>>->sumbit_alloc
>>>>>   ->submit_pad
>>>>>   ->post_submit->convert_unwritten
>>>>>  3->update_index ->write_page with FLUSH|FUA
>>>>>  4->nullify_old_pos
>>>>> 5->issue_flush
>>>> by the time of step 3 extent coversion is not yet stable because
>>>> belongs to uncommitted transaction. But instead of doing fsync
>>>> inside ->post_submit, we can fsync later, as the very first step
>>>> of write_page for index_update.
>>> NAK from me. What is advantage of this patch?
>>
>> The advantage is the following: in case of BAT multi-updates, instead 
>> of doing many fsync-s (one per dio_post_submit), we'll do only one 
>> (when final ->write_page is called).
>>
>>> Does it makes code more optimal? No
>>
>> Yes, it does. In the same sense as 9f860e606: saving some fsync-s.
>>
>>> Does it makes main ploop more asynchronous? No.
>>
>> Correct, the patch optimizes ploop in the other way. It's not about 
>> making ploop more asynchronous.
>>
>>
>>>
>>> If you want to make optimization then it is reasonable to
>>> queue preq with PLOOP_IO_FSYNC_DELAYED to top_io->fsync_queue
>>> before processing PLOOP_E_DATA_WBI  state for  preq with FUA
>>> So sequence will looks like follows:
>>> ->sumbit_alloc
>>>->submit_pad
>>>->post_submit->convert_unwritten-> tag PLOOP_IO_FSYNC_DELAYED
>>> ->ploop_req_state_process
>>>case PLOOP_E_DATA_WBI:
>>>if (preq->start & PLOOP_IO_FSYNC_DELAYED_FL) {
>>>preq->start &= ~PLOOP_IO_FSYNC_DELAYED_FL
>>>list_add_tail(&preq->list, &top_io->fsync_queue)
>>>return;
>>> }
>>> ##Let fsync_thread do it's work
>>> ->ploop_req_state_process
>>> case LOOP_E_DATA_WBI:
>>> update_index->write_page with FUA (FLUSH is not required because 
>>> we  already done fsync)
>>
>> That's another type of optimization: making ploop more asynchronous. I 
>> thought about it, but didn't come to conclusion whether it's worthy 
>> w.r.t. adding more complexity to ploop-state-machine and possible bugs 
>> introduced with that.
>>
>> Thanks,
>> Maxim
>>
>>>
>>>> https://jira.sw.ru/browse/PSBM-47026
>>>>
>>>> Signed-off-by: Maxim Patlasov 
>>>> ---
>>>>   drivers/block/ploop/dev.c   |4 ++--
>>>>   drivers/block/ploop/io_direct.c |   25 -
>>>>   drivers/block/ploop/io_kaio.c   |3 ++-
>>>>   drivers/block/ploop/map.c   |   17 -
>>>>   include/linux/ploop/ploop.h |3 ++-
>>>>   5 files changed, 42 insertions(+), 10 deletions(-)

[Devel] [PATCH] ext4: fix broken fsync for dirs/symlink

2016-07-20 Thread Dmitry Monakhov
Fixes bad commit 6a63db16da84fe: a misplaced closing brace made
ext4_force_commit() return 0 unconditionally before forcing the journal
commit, so fsync on directories and symlinks became a no-op.

xfstests: generic/321 generic/335 generic/348
Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
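
For clarity, the intended control flow after the fix, reconstructed from the
hunk (the surrounding read-only check is assumed from typical ext4 code and
is not part of the hunk itself):

	int ext4_force_commit(struct super_block *sb)
	{
		journal_t *journal;

		if (sb->s_flags & MS_RDONLY) {
			/* nothing to commit on an RO mount,
			 * but still report an aborted fs */
			smp_rmb();
			if (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
				return -EROFS;
			return 0;	/* previously sat after the brace,
					 * so every caller bailed out here */
		}

		journal = EXT4_SB(sb)->s_journal;
		return ext4_journal_force_commit(journal);
	}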

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c0e7acd..7e44850 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4919,8 +4919,8 @@ int ext4_force_commit(struct super_block *sb)
smp_rmb();
if (EXT4_SB(sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
return -EROFS;
-   }
return 0;
+   }
 
journal = EXT4_SB(sb)->s_journal;
return ext4_journal_force_commit(journal);
-- 
1.8.3.1



[Devel] [PATCH] ext4: fix broken mfsync_ioctl

2016-07-21 Thread Dmitry Monakhov
Fix an obvious user->kernel memcpy typo.

https://jira.sw.ru/browse/PSBM-49885
Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/ioctl.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 4ef2876..7260d99 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -775,6 +775,7 @@ resize_out:
struct ext4_ioc_mfsync_info mfsync;
struct file **filpp;
unsigned int *flags;
+   __u32 __user *usr_fd;
int i, err;
 
if (copy_from_user(&mfsync, (struct ext4_ioc_mfsync_info *)arg,
@@ -784,6 +785,8 @@ resize_out:
}
if (mfsync.size == 0)
return 0;
+   usr_fd = (__u32 __user *) (arg + sizeof(__u32));
+
filpp = kzalloc(mfsync.size * sizeof(*filp), GFP_KERNEL);
if (!filpp)
return -ENOMEM;
@@ -797,12 +800,9 @@ resize_out:
int ret;
 
err = -EFAULT;
-   ret = get_user(fd, mfsync.fd + i);
-   if (ret) {
-   printk("%s:%d i:%d p:%p", __FUNCTION__, __LINE__,
-  i, mfsync.fd + i);
+   ret = get_user(fd, usr_fd + i);
+   if (ret)
goto mfsync_fput;
-   }
 
/* negative fd means fdata_sync */
flags[i] = (fd & (1<< 31)) != 0;
@@ -810,10 +810,8 @@ resize_out:
 
err = -EBADF;
filpp[i] = fget(fd);
-   if (!filpp[i]) {
-   printk("%s:%d", __FUNCTION__, __LINE__);
+   if (!filpp[i])
goto mfsync_fput;
-   }
}
err = ext4_sync_files(filpp, flags, mfsync.size);
 mfsync_fput:
-- 
1.8.3.1



Re: [Devel] Bug 124651 - ext4 bugon panic when I mmap a file

2016-07-25 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Dima,
>
>
> Just in case, does this:
>
>
> https://bugzilla.kernel.org/show_bug.cgi?id=124651
>
>
> affect us?
No. His testcase does not trigger the bug on 3.10.0-327.18.2.vz7.14.21.
I've tested it like this:


#! /bin/bash

#Testcase for https://bugzilla.kernel.org/show_bug.cgi?id=124651

echo Install debug info
yum install -y systemtap systemtap-runtime
yum install --enablerepo=virtuozzo-updates-debuginfo \
--enablerepo=virtuozzo-os-debuginfo fedora-source -y \
vzkernel-devel-$(uname -r) vzkernel-debuginfo-$(uname -r) \
vzkernel-debuginfo-common-$(uname -m)-$(uname -r) || exit 1

# Fetch source
curl https://bugzilla.kernel.org/attachment.cgi?id=224251 > /tmp/test.c || exit 1

# The original stap file does not detect the sb by default, so I've modified it.
base64 -d >/tmp/fail_ext4.stp <


Re: [Devel] [PATCH rh7 3/3] ploop: io_direct: delay f_op->fsync() until index_update for reloc requests (v3)

2016-07-29 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Dima,
>
>
> One week elapsed, still no feedback from you. Do you have something 
> against this patch?
Sorry for the delay, Max. I was overloaded by the pending crap I had collected
before vacations, and lost your email. Again, sorry.

The whole patch looks good. Thank you for your redesign.

BTW: We definitely need regression testing for the original bug (broken
barriers and others). I'm working on that.
>
>
> Thanks,
>
> Maxim
>
>
> On 07/20/2016 11:21 PM, Maxim Patlasov wrote:
>> Commit 9f860e606 introduced an engine to delay fsync: doing
>> fallocate(FALLOC_FL_CONVERT_UNWRITTEN) dio_post_submit marks
>> io as PLOOP_IO_FSYNC_DELAYED to ensure that fsync happens
>> later, when incoming FLUSH|FUA comes.
>>
>> That was deemed as important because (PSBM-47026):
>>
>>> This optimization becomes more important due to the fact that customers 
>>> tend to use pcompact heavily => ploop images grow each day.
>> Now, we can easily re-use the engine to delay fsync for reloc
>> requests as well. As explained in the description of commit
>> 5aa3fe09:
>>
>>>  1->read_data_from_old_post
>>>  2->write_to_new_pos
>>>->sumbit_alloc
>>>   ->submit_pad
>>>   ->post_submit->convert_unwritten
>>>  3->update_index ->write_page with FLUSH|FUA
>>>  4->nullify_old_pos
>>> 5->issue_flush
>> by the time of step 3 extent coversion is not yet stable because
>> belongs to uncommitted transaction. But instead of doing fsync
>> inside ->post_submit, we can fsync later, as the very first step
>> of write_page for index_update.
>>
>> Changed in v2:
>>   - process delayed fsync asynchronously, via PLOOP_E_FSYNC_PENDED eng_state
>>
>> Changed in v3:
>>   - use extra arg for ploop_index_wb_proceed_or_delay() instead of ad-hoc PLOOP_REQ_FSYNC_IF_DELAYED
>>
>> https://jira.sw.ru/browse/PSBM-47026
>>
>> Signed-off-by: Maxim Patlasov 
>> ---
>>   drivers/block/ploop/dev.c   |9 +++--
>>   drivers/block/ploop/map.c   |   32 
>>   include/linux/ploop/ploop.h |1 +
>>   3 files changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
>> index df3eec9..ed60b1f 100644
>> --- a/drivers/block/ploop/dev.c
>> +++ b/drivers/block/ploop/dev.c
>> @@ -2720,6 +2720,11 @@ restart:
>>  ploop_index_wb_complete(preq);
>>  break;
>>   
>> +case PLOOP_E_FSYNC_PENDED:
>> +/* fsync done */
>> +ploop_index_wb_proceed(preq);
>> +break;
>> +
>>  default:
>>  BUG();
>>  }
>> @@ -4106,7 +4111,7 @@ static void ploop_relocate(struct ploop_device * plo)
>>  preq->bl.tail = preq->bl.head = NULL;
>>  preq->req_cluster = 0;
>>  preq->req_size = 0;
>> -preq->req_rw = WRITE_SYNC|REQ_FUA;
>> +preq->req_rw = WRITE_SYNC;
>>  preq->eng_state = PLOOP_E_ENTRY;
>>  preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
>>  preq->error = 0;
>> @@ -4410,7 +4415,7 @@ static void ploop_relocblks_process(struct ploop_device *plo)
>>  preq->bl.tail = preq->bl.head = NULL;
>>  preq->req_cluster = ~0U; /* uninitialized */
>>  preq->req_size = 0;
>> -preq->req_rw = WRITE_SYNC|REQ_FUA;
>> +preq->req_rw = WRITE_SYNC;
>>  preq->eng_state = PLOOP_E_ENTRY;
>>  preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
>>  preq->error = 0;
>> diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
>> index 5f7fd66..715dc15 100644
>> --- a/drivers/block/ploop/map.c
>> +++ b/drivers/block/ploop/map.c
>> @@ -915,6 +915,24 @@ void ploop_index_wb_proceed(struct ploop_request * preq)
>>  put_page(page);
>>   }
>>   
>> +static void ploop_index_wb_proceed_or_delay(struct ploop_request * preq,
>> +int do_fsync_if_delayed)
>> +{
>> +if (do_fsync_if_delayed) {
>> +struct map_node * m = preq->map;
>> +struct ploop_delta * top_delta = map_top_delta(m->parent);
>> +struct ploop_io * top_io = &top_delta->io;
>> +
>> +if (test_bit(PLOOP_IO_FSYNC_DELAYED, &top_io->io_state)) {
>> +preq->eng_state = PLOOP_E_FSYNC_PENDED;
>> +ploop_add_req_to_fsync_queue(preq);
>> +return;
>> +}
>> +}
>> +
>> +ploop_index_wb_proceed(preq);
>> +}
>> +
>>   /* Data write is commited. Now we need to update index. */
>>   
>>   void ploop_index_update(struct ploop_request * preq)
>> @@ -927,6 +945,7 @@ void ploop_index_update(struct ploop_request * preq)
>>  int old_level;
>>  struct page * page;
>>  unsigned long state = READ_ONCE(preq->state);
>> +int do_fsync_if_delayed = 0;
>>   
>>  /* No way back, we are going to initiate index write. */
>>   
>> @@ -985,10 +1004,12 @@ void ploop_index_update(struct ploop_request * preq)
>>  preq->req_rw &= ~REQ_FLUSH;
>>   

[Devel] [PATCH] ext4: Discard preallocated block before swap_extents

2016-09-20 Thread Dmitry Monakhov
Inode preallocation consists of two parts (used and unused), fully controlled
by the inode, so it must be discarded before swapping extents.
Currently we may skip discarding preallocations if the file is sparse.

This patch:
- Moves ext4_discard_preallocations() to ext4_swap_extents().
  This makes the code more readable and reliable for future changes.
- Cleans up the main move_extent loop.

xfstests:ext4/024 (pended: 
https://github.com/dmonakhov/xfstests/commit/7a4763963f73ea5d5bba45eefa484494aa3df7cf)
Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/extents.c |  2 ++
 fs/ext4/move_extent.c | 17 +
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index d7ccb7f..757ffb8 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5799,9 +5799,11 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1,
BUG_ON(!inode_is_locked(inode1));
BUG_ON(!inode_is_locked(inode2));
 
+   ext4_discard_preallocations(inode1);
*erp = ext4_es_remove_extent(inode1, lblk1, count);
if (unlikely(*erp))
return 0;
+   ext4_discard_preallocations(inode2);
*erp = ext4_es_remove_extent(inode2, lblk2, count);
if (unlikely(*erp))
return 0;
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 6fc14de..24a9586 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -632,7 +632,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
 
ret = get_ext_path(orig_inode, o_start, &path);
if (ret)
-   goto out;
+   break;
ex = path[path->p_depth].p_ext;
next_blk = ext4_ext_next_allocated_block(path);
cur_blk = le32_to_cpu(ex->ee_block);
@@ -642,7 +642,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
if (next_blk == EXT_MAX_BLOCKS) {
o_start = o_end;
ret = -ENODATA;
-   goto out;
+   break;
}
d_start += next_blk - o_start;
o_start = next_blk;
@@ -654,7 +654,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
o_start = cur_blk;
/* Extent inside requested range ?*/
if (cur_blk >= o_end)
-   goto out;
+   break;
} else { /* in_range(o_start, o_blk, o_len) */
cur_len += cur_blk - o_start;
}
@@ -687,17 +687,10 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
break;
o_start += cur_len;
d_start += cur_len;
+   *moved_len += cur_len;
}
-   *moved_len = o_start - orig_blk;
-   if (*moved_len > len)
-   *moved_len = len;
-
 out:
-   if (*moved_len) {
-   ext4_discard_preallocations(orig_inode);
-   ext4_discard_preallocations(donor_inode);
-   }
-
+   WARN_ON(*moved_len > len);
ext4_ext_drop_refs(path);
kfree(path);
ext4_double_up_write_data_sem(orig_inode, donor_inode);
-- 
2.7.4



Re: [Devel] [PATCH] ext4: Discard preallocated block before swap_extents

2016-09-20 Thread Dmitry Monakhov
Dmitry Monakhov  writes:

TEST_LOG: 
http://autotest.qa.sw.ru/avocado/bob.qa.sw.ru/job-results/job-2016-09-20T20.37-9107ed4/html/results.html

> Inode preallocation consists of two parts (used and unused) fully controlled
> by inode, so it must be discarded before swap extents.
> Currently we may skip drop_preallocation if file is sparse.
>
> This patch does:
> - Moves ext4_discard_preallocations to ext4_swap_extents.
>   This makes more readable and reliable for future changes.
> - Cleanup main move_extent loop
>
> xfstests:ext4/024 (pended: 
> https://github.com/dmonakhov/xfstests/commit/7a4763963f73ea5d5bba45eefa484494aa3df7cf)
> Signed-off-by: Dmitry Monakhov 
> ---
>  fs/ext4/extents.c |  2 ++
>  fs/ext4/move_extent.c | 17 +
>  2 files changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index d7ccb7f..757ffb8 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -5799,9 +5799,11 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1,
>   BUG_ON(!inode_is_locked(inode1));
>   BUG_ON(!inode_is_locked(inode2));
>  
> + ext4_discard_preallocations(inode1);
>   *erp = ext4_es_remove_extent(inode1, lblk1, count);
>   if (unlikely(*erp))
>   return 0;
> + ext4_discard_preallocations(inode2);
>   *erp = ext4_es_remove_extent(inode2, lblk2, count);
>   if (unlikely(*erp))
>   return 0;
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index 6fc14de..24a9586 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -632,7 +632,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
>  
>   ret = get_ext_path(orig_inode, o_start, &path);
>   if (ret)
> - goto out;
> + break;
>   ex = path[path->p_depth].p_ext;
>   next_blk = ext4_ext_next_allocated_block(path);
>   cur_blk = le32_to_cpu(ex->ee_block);
> @@ -642,7 +642,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
>   if (next_blk == EXT_MAX_BLOCKS) {
>   o_start = o_end;
>   ret = -ENODATA;
> - goto out;
> + break;
>   }
>   d_start += next_blk - o_start;
>   o_start = next_blk;
> @@ -654,7 +654,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
>   o_start = cur_blk;
>   /* Extent inside requested range ?*/
>   if (cur_blk >= o_end)
> - goto out;
> + break;
>   } else { /* in_range(o_start, o_blk, o_len) */
>   cur_len += cur_blk - o_start;
>   }
> @@ -687,17 +687,10 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
>   break;
>   o_start += cur_len;
>   d_start += cur_len;
> + *moved_len += cur_len;
>   }
> - *moved_len = o_start - orig_blk;
> - if (*moved_len > len)
> - *moved_len = len;
> -
>  out:
> - if (*moved_len) {
> - ext4_discard_preallocations(orig_inode);
> - ext4_discard_preallocations(donor_inode);
> - }
> -
> + WARN_ON(*moved_len > len);
>   ext4_ext_drop_refs(path);
>   kfree(path);
>   ext4_double_up_write_data_sem(orig_inode, donor_inode);
> -- 
> 2.7.4




Re: [Devel] [PATCH RH7] pfcache: hide trusted.pfcache from listxattr

2016-09-23 Thread Dmitry Monakhov
Pavel Tikhomirov  writes:

> In SyS_listxattr -> listxattr -> ext4_listxattr ->
> ext4_xattr_list_entries we choose list handler for
> each ext4_xattr_entry based on e_name_index, and as
> for trusted.pfcache index is EXT4_XATTR_INDEX_TRUSTED,
> we chouse ext4_xattr_trusted_list which prints xattr
> to the list.
>
> To hide our trusted.pfcache from list change e_name_index
> to new EXT4_XATTR_INDEX_TRUSTED_CSUM and thus use
> ext4_xattr_trusted_csum_list instead which won't put
> xattr to the returned list.
Why do we want to hide it?
>
> Test:
>
> TEST_FILE=/vz/root/101/testfile
> TEST_SHA1=`sha1sum $TEST_FILE | awk '{print $1}'`
> setfattr -n trusted.pfcache -v $TEST_SHA1 $TEST_FILE
> setfattr -n trusted.test -v test $TEST_FILE
> getfattr -d -m trusted $TEST_FILE
>
> before patch it was listed:
>
> trusted.pfcache="da39a3ee5e6b4b0d3255bfef95601890afd80709"
> trusted.test="test"
>
> after - not:
>
> trusted.test="test"
>
> https://jira.sw.ru/browse/PSBM-52180
> Signed-off-by: Pavel Tikhomirov 
> ---
>  fs/ext4/pfcache.c | 28 ++--
>  fs/ext4/xattr.c   |  1 +
>  fs/ext4/xattr.h   |  1 +
>  3 files changed, 16 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ext4/pfcache.c b/fs/ext4/pfcache.c
> index ff2300b..5fc6d9f 100644
> --- a/fs/ext4/pfcache.c
> +++ b/fs/ext4/pfcache.c
> @@ -441,8 +441,8 @@ int ext4_load_data_csum(struct inode *inode)
>  {
>   int ret;
>  
> - ret = ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED,
> - EXT4_DATA_CSUM_NAME, EXT4_I(inode)->i_data_csum,
> + ret = ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> + "", EXT4_I(inode)->i_data_csum,
>   EXT4_DATA_CSUM_SIZE);
>   if (ret < 0)
>   return ret;
> @@ -482,8 +482,8 @@ static int ext4_save_data_csum(struct inode *inode, u8 *csum)
>   if (ret)
>   return ret;
>  
> - return ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED,
> - EXT4_DATA_CSUM_NAME, EXT4_I(inode)->i_data_csum,
> + return ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> + "", EXT4_I(inode)->i_data_csum,
>   EXT4_DATA_CSUM_SIZE, 0);
>  }
>  
> @@ -492,8 +492,8 @@ void ext4_load_dir_csum(struct inode *inode)
>   char value[EXT4_DIR_CSUM_VALUE_LEN];
>   int ret;
>  
> - ret = ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED,
> -  EXT4_DATA_CSUM_NAME, value, sizeof(value));
> + ret = ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> +  "", value, sizeof(value));
>   if (ret == EXT4_DIR_CSUM_VALUE_LEN &&
>   !strncmp(value, EXT4_DIR_CSUM_VALUE, sizeof(value)))
>   ext4_set_inode_state(inode, EXT4_STATE_PFCACHE_CSUM);
> @@ -502,8 +502,8 @@ void ext4_load_dir_csum(struct inode *inode)
>  void ext4_save_dir_csum(struct inode *inode)
>  {
>   ext4_set_inode_state(inode, EXT4_STATE_PFCACHE_CSUM);
> - ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED,
> - EXT4_DATA_CSUM_NAME,
> + ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> + "",
>   EXT4_DIR_CSUM_VALUE,
>   EXT4_DIR_CSUM_VALUE_LEN, 0);
>  }
> @@ -516,8 +516,8 @@ void ext4_truncate_data_csum(struct inode *inode, loff_t pos)
>  
>   if (EXT4_I(inode)->i_data_csum_end < 0) {
>   WARN_ON(journal_current_handle());
> - ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED,
> - EXT4_DATA_CSUM_NAME, NULL, 0, 0);
> + ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> + "", NULL, 0, 0);
>   ext4_close_pfcache(inode);
>   }
>   spin_lock(&inode->i_lock);
> @@ -658,8 +658,8 @@ static int ext4_xattr_trusted_csum_get(struct dentry *dentry, const char *name,
>   return -EPERM;
>  
>   if (S_ISDIR(inode->i_mode))
> - return ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED,
> -   EXT4_DATA_CSUM_NAME, buffer, size);
> + return ext4_xattr_get(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> +   "", buffer, size);
>  
>   if (!S_ISREG(inode->i_mode))
>   return -ENODATA;
> @@ -717,8 +717,8 @@ static int ext4_xattr_trusted_csum_set(struct dentry *dentry, const char *name,
>   else
>   return -EINVAL;
>  
> - return ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED,
> -   EXT4_DATA_CSUM_NAME, value, size, flags);
> + return ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED_CSUM,
> +   "", value, size, flags);
>   }
>  
>   if (!S_ISREG(inode->i_mode))
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 5dabf58..81b5534 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -102,6 +102,7 @@ static const struct

[Devel] [PATCH 2/2] xfs: compile for 661c0b9b3

2016-11-10 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov 
---
 fs/xfs/xfs_buf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e379876..28ad0bf 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1582,7 +1582,7 @@ xfs_buftarg_wait_rele(
 
 {
struct xfs_buf  *bp = container_of(item, struct xfs_buf, b_lru);
-
+   struct xfs_buftarg  *btp = bp->b_target;
/*
 * First wait on the buftarg I/O count for all in-flight buffers to be
 * released. This is critical as new buffers do not make the LRU until
-- 
2.7.4



[Devel] [PATCH 1/2] ms/xfs: convert dquot cache lru to list_lru part2

2016-11-10 Thread Dmitry Monakhov
Modify the patch according to mainstream changeset ff6d6af2351, which requires
that XFS_STATS_XXX() take two arguments.

Signed-off-by: Dmitry Monakhov 
---
 fs/xfs/xfs_qm.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 1b383f5..a0518a8 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -478,11 +478,11 @@ xfs_qm_dquot_isolate(
 */
if (dqp->q_nrefs) {
xfs_dqunlock(dqp);
-   XFS_STATS_INC(xs_qm_dqwants);
+   XFS_STATS_INC(dqp->q_mount, xs_qm_dqwants);
 
trace_xfs_dqreclaim_want(dqp);
list_lru_isolate(lru, &dqp->q_lru);
-   XFS_STATS_DEC(xs_qm_dquot_unused);
+   XFS_STATS_DEC(dqp->q_mount, xs_qm_dquot_unused);
return LRU_REMOVED;
}
 
@@ -526,19 +526,19 @@ xfs_qm_dquot_isolate(
 
ASSERT(dqp->q_nrefs == 0);
list_lru_isolate_move(lru, &dqp->q_lru, &isol->dispose);
-   XFS_STATS_DEC(xs_qm_dquot_unused);
+   XFS_STATS_DEC(dqp->q_mount, xs_qm_dquot_unused);
trace_xfs_dqreclaim_done(dqp);
-   XFS_STATS_INC(xs_qm_dqreclaims);
+   XFS_STATS_INC(dqp->q_mount, xs_qm_dqreclaims);
return LRU_REMOVED;
 
 out_miss_busy:
trace_xfs_dqreclaim_busy(dqp);
-   XFS_STATS_INC(xs_qm_dqreclaim_misses);
+   XFS_STATS_INC(dqp->q_mount, xs_qm_dqreclaim_misses);
return LRU_SKIP;
 
 out_unlock_dirty:
trace_xfs_dqreclaim_busy(dqp);
-   XFS_STATS_INC(xs_qm_dqreclaim_misses);
+   XFS_STATS_INC(dqp->q_mount, xs_qm_dqreclaim_misses);
xfs_dqunlock(dqp);
spin_lock(lru_lock);
return LRU_RETRY;
-- 
2.7.4



[Devel] [PATCH] scsi-DBG: make scsi errors loud

2016-11-11 Thread Dmitry Monakhov
This patch is not for release, for testing purposes only.
We need it in order to investigate #PSBM-54665.

Signed-off-by: Dmitry Monakhov 

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 287045b..7364d86 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -141,12 +141,13 @@ int scsi_host_set_state(struct Scsi_Host *shost, enum scsi_host_state state)
return 0;
 
  illegal:
-   SCSI_LOG_ERROR_RECOVERY(1,
-   shost_printk(KERN_ERR, shost,
-"Illegal host state transition"
-"%s->%s\n",
-scsi_host_state_name(oldstate),
-scsi_host_state_name(state)));
+   shost_printk(KERN_ERR, shost,
+"Illegal host state transition"
+"%s->%s\n",
+scsi_host_state_name(oldstate),
+scsi_host_state_name(state));
+   dump_stack();
+
return -EINVAL;
 }
 EXPORT_SYMBOL(scsi_host_set_state);
-- 
2.7.4



[Devel] [PATCH] scsi: make scsi errors loud

2016-11-11 Thread Dmitry Monakhov
This patch is not for release, for testing purposes only.
We need it in order to investigate #PSBM-54665.

Signed-off-by: Dmitry Monakhov 

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 287045b..7364d86 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -141,12 +141,13 @@ int scsi_host_set_state(struct Scsi_Host *shost, enum scsi_host_state state)
return 0;
 
  illegal:
-   SCSI_LOG_ERROR_RECOVERY(1,
-   shost_printk(KERN_ERR, shost,
-"Illegal host state transition"
-"%s->%s\n",
-scsi_host_state_name(oldstate),
-scsi_host_state_name(state)));
+   shost_printk(KERN_ERR, shost,
+"Illegal host state transition"
+"%s->%s\n",
+scsi_host_state_name(oldstate),
+scsi_host_state_name(state));
+   dump_stack();
+
return -EINVAL;
 }
 EXPORT_SYMBOL(scsi_host_set_state);
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 573574b..c2e3307 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -61,6 +61,13 @@ struct virtio_scsi_vq {
struct virtqueue *vq;
 };
 
+#define __check_ret(val) do {  \
+   if (val == FAILED) {\
+   printk("virtscsi_failure"); \
+   dump_stack();   \
+   }   \
+   } while(0)
+
 /*
  * Per-target queue state.
  *
@@ -489,6 +496,7 @@ static int virtscsi_add_cmd(struct virtqueue *vq,
return virtqueue_add_sgs(vq, sgs, out_num, in_num, cmd, GFP_ATOMIC);
 }
 
+
 static int virtscsi_kick_cmd(struct virtio_scsi_vq *vq,
 struct virtio_scsi_cmd *cmd,
 size_t req_size, size_t resp_size)
@@ -633,6 +641,7 @@ static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
virtscsi_poll_requests(vscsi);
 
 out:
+   __check_ret(ret);
mempool_free(cmd, virtscsi_cmd_pool);
return ret;
 }
@@ -644,8 +653,10 @@ static int virtscsi_device_reset(struct scsi_cmnd *sc)
 
sdev_printk(KERN_INFO, sc->device, "device reset\n");
cmd = mempool_alloc(virtscsi_cmd_pool, GFP_NOIO);
-   if (!cmd)
+   if (!cmd) {
+   __check_ret(FAILED);
return FAILED;
+   }
 
memset(cmd, 0, sizeof(*cmd));
cmd->sc = sc;
@@ -666,11 +677,12 @@ static int virtscsi_abort(struct scsi_cmnd *sc)
struct virtio_scsi *vscsi = shost_priv(sc->device->host);
struct virtio_scsi_cmd *cmd;
 
-   scmd_printk(KERN_INFO, sc, "abort\n");
+   scmd_printk(KERN_INFO, sc, "%s abort\n", __FUNCTION__);
cmd = mempool_alloc(virtscsi_cmd_pool, GFP_NOIO);
-   if (!cmd)
+   if (!cmd) {
+   __check_ret(FAILED);
return FAILED;
-
+   }
memset(cmd, 0, sizeof(*cmd));
cmd->sc = sc;
cmd->req.tmf = (struct virtio_scsi_ctrl_tmf_req){
-- 
2.7.4



[Devel] drop: ext4: resplit block_page_mkwrite: fix get-host convention

2016-11-18 Thread Dmitry Monakhov

We no longer need the vzfs crutches.
Please drop this patch:
ext4: resplit block_page_mkwrite: fix get-host convention
commit c97eaffbf6c9b909e324c59380962158185639bf


Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: relax i_mutex coverage in fuse_fsync

2016-12-01 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Alexey,
>
>
> You're right. And while composing the patch I well understood that it's 
> possible to rework fuse_sync_writes() using a counter instead of 
> negative bias. But the problem with flush_mtime still exists anyway. 
> Think about it: we firstly acquire local mtime from local inode, then 
> fill and submit mtime-update-request. Since then, we don't know when 
> exactly fuse daemon will apply that new mtime to its metadata 
> structures. If another mtime-update is generated in-between (e.g. "touch 
> -d  file", or even simplier -- just a single direct write 
> implicitly updating mtime), we wouldn't know which of those two 
> mtime-update-requests are processed by fused first. That comes from a 
> general FUSE protocol limitation: when kernel fuse queues request A, 
> then request B, it cannot be sure if they will be processed by userspace 
> as  or .
>
>
> The big advantage of the patch I sent is that it's very simple, 
> straightforward and presumably will remove 99% of contention between 
> fsync and io_submit (assuming we spend most of time waiting for 
> userspace ACK for FUSE_FSYNC request. There are actually three questions 
> to answer:

>
>
> 1) Do we really must honor a crazy app who mixes a lot of fsyncs with a 
> lot of io_submits? The goal of fsync is to ensure that some state is 
> actually went to platters. An app who races io_submit-s with fsync-s 
> actually doesn't care which state will come to platters. I'm not sure 
> that it's reasonable to work very hard to achieve the best possible 
> performance for such a marginal app.
Obviously any filesystem behaves like this.
Task A (mail-server) may perform write/fsync, task B (mysql) may do a lot of
io_submit-s. All that IO may happen in parallel; the fs guarantees only that
metadata will be serialized. So all that concurrent IO flows to the block
device, which has no i_mutex, so all the IO indeed happens concurrently.
But when we deal with fs-in-file (loop/ploop/qemu-nbd) we face the i_mutex
on the file. For a general filesystem (xfs/ext4) we grab i_mutex only on the
write path; fsync is lockless. But in the case of fuse we artificially
introduce i_mutex inside fsync, which basically kills concurrency for the
upper FS. As a result we get the SMP scalability we had in Linux v2.2, with a
single mutex in the VFS.

BTW: I'm wondering why we care about mtime at all. For fs-in-file we can
relax that, for example flush mtime only on fsync, and not for fdatasync.
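
That relaxation is exactly what the "fuse: no mtime flush on fdatasync" patch
later in this archive implements; the gist of it:

	/* in fuse_fsync_common(): mtime is pure metadata, so a
	 * fdatasync may skip flushing it */
	if (!datasync && test_bit(FUSE_I_MTIME_UPDATED,
				  &get_fuse_inode(inode)->state)) {
		err = fuse_flush_mtime(file, false);
		if (err)
			goto out;
	}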

>
>
> 2) Will the patch (in the form I sent it) break something? I think no. 
> If you know some usecase that can be broken, let's discuss it in more 
> details.
>
>
> 3) Should we expect some noticeable (or significant) improvement in 
> performance comparing fuse_fsync with no locking at all vs. the locking 
> we have with that patch applied? I tend to think that the answer is "no" 
> because handling FUSE_FSYNC is notoriously heavy-weight operation. If 
> you disagree, let's firstly measure that difference in performance 
> (simply commenting out lock/unlock(i_mutex) in fuse_fsync) and then 
> start to think if it's really worthy to fully re-work locking scheme to 
> preserve flush_mtime correctness w/o i_mutex.
>
>
> Thanks,
>
> Maxim
>
>
> On 11/30/2016 05:09 AM, Alexey Kuznetsov wrote:
>> Sorry, missed that pair fuse_set_nowrite/fuse_release_writes
>> can be done only under i_mutex.
>>
>> IMHO it is only due to bad implementation.
>> If fuse_set_nowrite would be done with separate
>> count instead of adding negative bias, it would
>> be possible.
>>
>>
>> On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov  
>> wrote:
>>> Hello!
>>>
>>> I do not think you got it right.
>>>
>>> i_mutex in fsync is not about some atomicity,
>>> it is about stopping data feed while fsync is executed
>>> to prevent livelock.
>>>
>>> I cannot tell anything about mtime update, it is just some voodoo
>>> magic for me.
>>>
>>> What's about fsync semantics, I see two different ways:
>>>
>>> A.
>>>
>>> 1. Remove useless write_inode_now. Its work is done
>>>  by filemap_write_and_wait_range(), there is no need to repeat it
>>> under mutex.
>>> 2. move mutex_lock _after_  fuse_sync_writes(), which is essentially
>>>  fuse continuation for filemap_write_and_wait_range().
>>> 3. i_mutex is preserved only around fsync call.
>>>
>>> B.
>>> 1. Remove write_inode_now as well.
>>> 2. Remove i_mutex _completely_. (No idea about the mtime voodoo though)
>>> 3. Replace fuse_sync_writes() with fuse_set_nowrite()
>>>  and add a release after the call to FSYNC.
>>>
>>> Both prevent livelock. B is obviously optimal.
>>>
>>> But A preserves historic fuse protocol semantics.
>>> F.e. I have no idea whether user space would survive a truncate
>>> racing with fsync. pstorage should survive, though this
>>> path was never tested.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 30, 2016 at 4:02 AM, Maxim Patlasov  
>>> wrote:
 fuse_fsync_common() does need i_mutex for fuse_sync_writes() and
 fuse_flush_mtime(). But when those operations are done, it's actual

Re: [Devel] [PATCH vz7] fuse: no mtime flush on fdatasync

2016-12-02 Thread Dmitry Monakhov

Maxim Patlasov  writes:

> fuse_fsync_common() may skip fuse_flush_mtime() if datasync=1 because
> mtime is pure metadata and the content of file doesn't depend on it.
>
> https://jira.sw.ru/browse/PSBM-55919
>
> Signed-off-by: Maxim Patlasov 
ACK.
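For context, a tiny userspace sketch of the contract this relies on
(illustration only, not part of the patch):

    #include <unistd.h>

    /* fd is an open, writable file descriptor */
    void durable_write(int fd, const void *buf, size_t len)
    {
            write(fd, buf, len);    /* error handling elided */
            fdatasync(fd);          /* data durable; pure mtime flush may be skipped */
            fsync(fd);              /* everything durable, including mtime */
    }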
> ---
>  fs/fuse/file.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 559dfd9..e5c4778 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -684,8 +684,8 @@ int fuse_fsync_common(struct file *file, loff_t start, 
> loff_t end,
>   if (err)
>   goto out;
>  
> - if (test_bit(FUSE_I_MTIME_UPDATED,
> -  &get_fuse_inode(inode)->state)) {
> + if (!datasync && test_bit(FUSE_I_MTIME_UPDATED,
> +   &get_fuse_inode(inode)->state)) {
>   err = fuse_flush_mtime(file, false);
>   if (err)
>   goto out;


Re: [Devel] [PATCH RH7] vfs: add warning in guard_bio_eod() if truncated_bytes > bvec->bv_len

2016-12-03 Thread Dmitry Monakhov

Pavel Tikhomirov  writes:

> https://jira.sw.ru/browse/PSBM-55105
>
> In bug we crashed in zero_fill_bio when trying to zero memset bio_vec:
>
> struct bio_vec {
>   bv_page = 0xea0004437500,
>   bv_len = 4294948864,
>   bv_offset = 0
> }
>
> which is bigger than its bio->bi_size = 104448. guard_bio_eod might
> lead to this bv_len overflow and is suspicious, as quite recently
> in vz7.19.4 we ported commit 2573b2539875 ("vfs: make guard_bh_eod()
> more generic"), which adds the bv_len reduction; before that there
> were no crashes.
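
For reference, the reported bv_len is exactly an unsigned wrap-around; a
standalone demonstration with hypothetical numbers chosen to reproduce the
value from the crash (the real sizes are not known from the report):

    #include <stdio.h>

    int main(void)
    {
            unsigned int bv_len = 4096;             /* hypothetical */
            unsigned int truncated_bytes = 22528;   /* hypothetical, > bv_len */

            bv_len -= truncated_bytes;              /* wraps modulo 2^32 */
            printf("%u\n", bv_len);                 /* prints 4294948864 */
            return 0;
    }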
>
> Signed-off-by: Pavel Tikhomirov 
> ---
>  fs/buffer.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index c45200d..b820080 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3009,6 +3009,7 @@ void guard_bio_eod(int rw, struct bio *bio)
>  
>   /* Truncate the bio.. */
>   bio->bi_size -= truncated_bytes;
> + WARN_ON(truncated_bytes > bvec->bv_len);
BUG_ON would be more appropriate here.
>   bvec->bv_len -= truncated_bytes;
>  
>   /* ..and clear the end of the buffer for reads */
> -- 
> 2.9.3



[Devel] [PATCH 2/2] fs/ceph: honor kernel direct aio changes v2

2016-12-05 Thread Dmitry Monakhov
Base patches:
fs/ceph: honor kernel direct aio changes
fs/ceph: add BUG_ON to iov_iter access

Changes: replace the open-coded iter-to-iovec conversion with the proper helper.
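
For reference, the coalescing rule dio_get_pagev_size() implements, as a
plain C model (userspace sketch under assumed 4K pages; the kernel function
in the diff below differs in details):

    #include <stdint.h>
    #include <sys/uio.h>

    #define PAGE_SZ 4096UL

    /* Bytes from 'off' into iov[0] that form one virtually page-contiguous
     * run: each segment must end on a page boundary and the next segment
     * must start on one. */
    static size_t pagev_size(const struct iovec *iov, int nr, size_t off)
    {
            const struct iovec *end = iov + nr;
            size_t size = iov->iov_len - off;

            while ((uintptr_t)((char *)iov->iov_base + iov->iov_len) % PAGE_SZ == 0 &&
                   ++iov < end &&
                   (uintptr_t)iov->iov_base % PAGE_SZ == 0)
                    size += iov->iov_len;
            return size;
    }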
Signed-off-by: Dmitry Monakhov 
---
 fs/ceph/file.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 82676fa..0b72417 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -40,8 +40,8 @@
  */
 static size_t dio_get_pagev_size(const struct iov_iter *it)
 {
-const struct iovec *iov = it->iov;
-const struct iovec *iovend = iov + it->nr_segs;
+const struct iovec *iov = iov_iter_iovec(it);
+size_t total = iov_iter_count(it);
 size_t size;
 
 size = iov->iov_len - it->iov_offset;
@@ -50,8 +50,10 @@ static size_t dio_get_pagev_size(const struct iov_iter *it)
  * and the next base are page aligned.
  */
 while (PAGE_ALIGNED((iov->iov_base + iov->iov_len)) &&
-   (++iov < iovend && PAGE_ALIGNED((iov->iov_base {
-size += iov->iov_len;
+   PAGE_ALIGNED(((iov++)->iov_base))) {
+   size_t n =  min(iov->iov_len, total);
+   size += n;
+   total -= n;
 }
 dout("dio_get_pagevlen len = %zu\n", size);
 return size;
@@ -71,7 +73,7 @@ dio_get_pages_alloc(const struct iov_iter *it, size_t nbytes,
struct page **pages;
int ret = 0, idx, npages;
 
-   align = (unsigned long)(it->iov->iov_base + it->iov_offset) &
+   align = (unsigned long)(iov_iter_iovec(it)->iov_base + it->iov_offset) &
(PAGE_SIZE - 1);
npages = calc_pages_for(align, nbytes);
pages = kmalloc(sizeof(*pages) * npages, GFP_KERNEL);
@@ -82,10 +84,11 @@ dio_get_pages_alloc(const struct iov_iter *it, size_t 
nbytes,
}
 
for (idx = 0; idx < npages; ) {
-   void __user *data = tmp_it.iov->iov_base + tmp_it.iov_offset;
+   struct iovec *tmp_iov = iov_iter_iovec(&tmp_it);
+   void __user *data = tmp_iov->iov_base + tmp_it.iov_offset;
size_t off = (unsigned long)data & (PAGE_SIZE - 1);
size_t len = min_t(size_t, nbytes,
-  tmp_it.iov->iov_len - tmp_it.iov_offset);
+  tmp_iov->iov_len - tmp_it.iov_offset);
int n = (len + off + PAGE_SIZE - 1) >> PAGE_SHIFT;
ret = get_user_pages_fast((unsigned long)data, n, write,
   pages + idx);
@@ -522,10 +525,9 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct 
iov_iter *i,
size_t left = len = ret;
 
while (left) {
-   void __user *data = i->iov[0].iov_base +
-   i->iov_offset;
-   l = min(i->iov[0].iov_len - i->iov_offset,
-   left);
+   struct iovec *iov = (struct iovec *)i->data;
+   void __user *data = iov->iov_base + i->iov_offset;
+   l = min(iov->iov_len - i->iov_offset, left);
 
ret = ceph_copy_page_vector_to_user(&pages[k],
data, off, l);
@@ -1121,7 +1123,7 @@ static ssize_t inline_to_iov(struct kiocb *iocb, struct 
iov_iter *i,
 
while (left) {
struct iovec *iov = iov_iter_iovec(i);
-   void __user *udata = iov->iov_base + i->iov_offset;
+   void __user *udata = iov->iov_base;
size_t n = min(iov->iov_len - i->iov_offset, left);
 
if (__copy_to_user(udata, kdata, n)) {
@@ -1139,8 +1141,8 @@ static ssize_t inline_to_iov(struct kiocb *iocb, struct 
iov_iter *i,
size_t left = min_t(loff_t, iocb->ki_pos + len, i_size) - pos;
 
while (left) {
-   struct iovec *iov = iov_iter_iovec(i);
-   void __user *udata = iov->iov_base + i->iov_offset;
+   struct iovec *iov = (struct iovec *)i->data;
+   void __user *udata = iov->iov_base;
size_t n = min(iov->iov_len - i->iov_offset, left);
 
if (__clear_user(udata, n)) {
-- 
2.7.4



[Devel] [PATCH 1/2] fs: constify iov_iter_count/iov_iter_iovec helpers

2016-12-05 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov 
---
 include/linux/fs.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e30e8a1..a27bd15 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -448,13 +448,13 @@ static inline int iov_iter_has_iovec(const struct 
iov_iter *i)
 {
return i->ops == &ii_iovec_ops;
 }
-static inline struct iovec *iov_iter_iovec(struct iov_iter *i)
+static inline struct iovec *iov_iter_iovec(const struct iov_iter *i)
 {
BUG_ON(!iov_iter_has_iovec(i));
return (struct iovec *)i->data;
 }
 
-static inline size_t iov_iter_count(struct iov_iter *i)
+static inline size_t iov_iter_count(const struct iov_iter *i)
 {
return i->count;
 }
-- 
2.7.4



[Devel] [PATCH 0/4] [7.3] rebase xfs lru patches

2016-12-06 Thread Dmitry Monakhov
rh7-3.10.0-514 already has 'fs-xfs-rework-buffer-dispose-list-tracking', but
originally it depends on ms/xfs-convert-buftarg-LRU-to-generic, so
in order to preserve the original logic I've reverted rhel's patch (the 1st one)
and reapplied it later in natural order:
TOC:
0001-Revert-fs-xfs-rework-buffer-dispose-list-tracking.patch

0002-ms-xfs-convert-buftarg-LRU-to-generic-code.patch
0003-From-c70ded437bb646ace0dcbf3c7989d4edeed17f7e-Mon-Se.patch [not changed]
0004-ms-xfs-rework-buffer-dispose-list-tracking.patch


[Devel] [PATCH 4/4] ms/xfs: rework buffer dispose list tracking

2016-12-06 Thread Dmitry Monakhov
In converting the buffer lru lists to use the generic code, the locking
for marking the buffers as on the dispose list was lost.  This results in
confusion in LRU buffer tracking and accounting, resulting in reference
counts being mucked up and the filesystem being unmountable.

To fix this, introduce an internal buffer spinlock to protect the state
field that holds the dispose list information.  Because there is now
locking needed around xfs_buf_lru_add/del, and they are used in exactly
one place each two lines apart, get rid of the wrappers and code the logic
directly in place.

Further, the LRU emptying code used on unmount is less than optimal.
Convert it to use a dispose list as per a normal shrinker walk, and repeat
the walk that fills the dispose list until the LRU is empty.  This avoids
needing to drop and regain the LRU lock for every item being freed, and
allows the same logic as the shrinker isolate call to be used.  Simpler,
easier to understand.
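
Distilled, the isolation pattern this introduces looks like (sketch only,
names as in the diff below; not the complete callback):

    spin_lock(&bp->b_lock);                     /* b_lock protects b_state */
    if (!(bp->b_state & XFS_BSTATE_DISPOSE)) {
            bp->b_state |= XFS_BSTATE_DISPOSE;  /* mark as isolated */
            list_move(item, dispose);           /* onto the private list */
    }
    spin_unlock(&bp->b_lock);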

Signed-off-by: Dave Chinner 
Signed-off-by: Glauber Costa 
Cc: "Theodore Ts'o" 
Cc: Adrian Hunter 
Cc: Al Viro 
Cc: Artem Bityutskiy 
Cc: Arve Hjønnevåg 
Cc: Carlos Maiolino 
Cc: Christoph Hellwig 
Cc: Chuck Lever 
Cc: Daniel Vetter 
Cc: David Rientjes 
Cc: Gleb Natapov 
Cc: Greg Thelen 
Cc: J. Bruce Fields 
Cc: Jan Kara 
Cc: Jerome Glisse 
Cc: John Stultz 
Cc: KAMEZAWA Hiroyuki 
Cc: Kent Overstreet 
Cc: Kirill A. Shutemov 
Cc: Marcelo Tosatti 
Cc: Mel Gorman 
Cc: Steven Whitehouse 
Cc: Thomas Hellstrom 
Cc: Trond Myklebust 
Signed-off-by: Andrew Morton 
Signed-off-by: Al Viro 
(cherry picked from commit a408235726aa82c0358c9ec68124b6f4bc0a79df)
Signed-off-by: Dmitry Monakhov 
---
 fs/xfs/xfs_buf.c | 147 +++
 fs/xfs/xfs_buf.h |   8 ++-
 2 files changed, 78 insertions(+), 77 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index bf933d5..8d8c9ce 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -80,37 +80,6 @@ xfs_buf_vmap_len(
 }
 
 /*
- * xfs_buf_lru_add - add a buffer to the LRU.
- *
- * The LRU takes a new reference to the buffer so that it will only be freed
- * once the shrinker takes the buffer off the LRU.
- */
-static void
-xfs_buf_lru_add(
-   struct xfs_buf  *bp)
-{
-   if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
-   bp->b_lru_flags &= ~_XBF_LRU_DISPOSE;
-   atomic_inc(&bp->b_hold);
-   }
-}
-
-/*
- * xfs_buf_lru_del - remove a buffer from the LRU
- *
- * The unlocked check is safe here because it only occurs when there are not
- * b_lru_ref counts left on the inode under the pag->pag_buf_lock. it is there
- * to optimise the shrinker removing the buffer from the LRU and calling
- * xfs_buf_free().
- */
-static void
-xfs_buf_lru_del(
-   struct xfs_buf  *bp)
-{
-   list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
-}
-
-/*
  * Bump the I/O in flight count on the buftarg if we haven't yet done so for
  * this buffer. The count is incremented once per buffer (per hold cycle)
  * because the corresponding decrement is deferred to buffer release. Buffers
@@ -181,12 +150,14 @@ xfs_buf_stale(
 */
xfs_buf_ioacct_dec(bp);
 
-   atomic_set(&(bp)->b_lru_ref, 0);
-   if (!(bp->b_lru_flags & _XBF_LRU_DISPOSE) &&
+   spin_lock(&bp->b_lock);
+   atomic_set(&bp->b_lru_ref, 0);
+   if (!(bp->b_state & XFS_BSTATE_DISPOSE) &&
(list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
atomic_dec(&bp->b_hold);
 
ASSERT(atomic_read(&bp->b_hold) >= 1);
+   spin_unlock(&bp->b_lock);
 }
 
 static int
@@ -987,10 +958,28 @@ xfs_buf_rele(
/* the last reference has been dropped ... */
xfs_buf_ioacct_dec(bp);
if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
-   xfs_buf_lru_add(bp);
+   /*
+* If the buffer is added to the LRU take a new
+* reference to the buffer for the LRU and clear the
+* (now stale) dispose list state flag
+*/
+   if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
+   bp->b_state &= ~XFS_BSTATE_DISPOSE;
+   atomic_inc(&bp->b_hold);
+   }
spin_unlock(&pag->pag_buf_lock);
} else {
-   xfs_buf_lru_del(bp);
+   /*
+* most of the time buffers will already be removed from
+* the LRU, so optimise that case by checking for the
+* XFS_BSTATE_DISPOSE flag indicating the last list the
+* buffer was on was the disposal list
+*/
+   if (!(bp->b_state & XFS_BSTATE_DISPOSE)) {
+   list_lru_del(&

[Devel] [PATCH 2/4] ms/xfs: convert buftarg LRU to generic code

2016-12-06 Thread Dmitry Monakhov
Convert the buftarg LRU to use the new generic LRU list and take advantage
of the functionality it supplies to make the buffer cache shrinker node
aware.

Signed-off-by: Glauber Costa 
Signed-off-by: Dave Chinner 
Cc: "Theodore Ts'o" 
Cc: Adrian Hunter 
Cc: Al Viro 
Cc: Artem Bityutskiy 
Cc: Arve Hjønnevåg 
Cc: Carlos Maiolino 
Cc: Christoph Hellwig 
Cc: Chuck Lever 
Cc: Daniel Vetter 
Cc: David Rientjes 
Cc: Gleb Natapov 
Cc: Greg Thelen 
Cc: J. Bruce Fields 
Cc: Jan Kara 
Cc: Jerome Glisse 
Cc: John Stultz 
Cc: KAMEZAWA Hiroyuki 
Cc: Kent Overstreet 
Cc: Kirill A. Shutemov 
Cc: Marcelo Tosatti 
Cc: Mel Gorman 
Cc: Steven Whitehouse 
Cc: Thomas Hellstrom 
Cc: Trond Myklebust 
Signed-off-by: Andrew Morton 
Signed-off-by: Al Viro 
(cherry picked from commit e80dfa19976b884db1ac2bc5d7d6ca0a4027bd1c)
Signed-off-by: Dmitry Monakhov 
---
 fs/xfs/xfs_buf.c | 170 ++-
 fs/xfs/xfs_buf.h |   5 +-
 2 files changed, 81 insertions(+), 94 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index c0de0e2..87a314a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -85,20 +85,14 @@ xfs_buf_vmap_len(
  * The LRU takes a new reference to the buffer so that it will only be freed
  * once the shrinker takes the buffer off the LRU.
  */
-STATIC void
+static void
 xfs_buf_lru_add(
struct xfs_buf  *bp)
 {
-   struct xfs_buftarg *btp = bp->b_target;
-
-   spin_lock(&btp->bt_lru_lock);
-   if (list_empty(&bp->b_lru)) {
-   atomic_inc(&bp->b_hold);
-   list_add_tail(&bp->b_lru, &btp->bt_lru);
-   btp->bt_lru_nr++;
+   if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
bp->b_lru_flags &= ~_XBF_LRU_DISPOSE;
+   atomic_inc(&bp->b_hold);
}
-   spin_unlock(&btp->bt_lru_lock);
 }
 
 /*
@@ -107,24 +101,13 @@ xfs_buf_lru_add(
  * The unlocked check is safe here because it only occurs when there are not
  * b_lru_ref counts left on the inode under the pag->pag_buf_lock. it is there
  * to optimise the shrinker removing the buffer from the LRU and calling
- * xfs_buf_free(). i.e. it removes an unnecessary round trip on the
- * bt_lru_lock.
+ * xfs_buf_free().
  */
-STATIC void
+static void
 xfs_buf_lru_del(
struct xfs_buf  *bp)
 {
-   struct xfs_buftarg *btp = bp->b_target;
-
-   if (list_empty(&bp->b_lru))
-   return;
-
-   spin_lock(&btp->bt_lru_lock);
-   if (!list_empty(&bp->b_lru)) {
-   list_del_init(&bp->b_lru);
-   btp->bt_lru_nr--;
-   }
-   spin_unlock(&btp->bt_lru_lock);
+   list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
 }
 
 /*
@@ -199,18 +182,10 @@ xfs_buf_stale(
xfs_buf_ioacct_dec(bp);
 
atomic_set(&(bp)->b_lru_ref, 0);
-   if (!list_empty(&bp->b_lru)) {
-   struct xfs_buftarg *btp = bp->b_target;
-
-   spin_lock(&btp->bt_lru_lock);
-   if (!list_empty(&bp->b_lru) &&
-   !(bp->b_lru_flags & _XBF_LRU_DISPOSE)) {
-   list_del_init(&bp->b_lru);
-   btp->bt_lru_nr--;
-   atomic_dec(&bp->b_hold);
-   }
-   spin_unlock(&btp->bt_lru_lock);
-   }
+   if (!(bp->b_lru_flags & _XBF_LRU_DISPOSE) &&
+   (list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
+   atomic_dec(&bp->b_hold);
+
ASSERT(atomic_read(&bp->b_hold) >= 1);
 }
 
@@ -1597,11 +1572,14 @@ xfs_buf_iomove(
  * returned. These buffers will have an elevated hold count, so wait on those
  * while freeing all the buffers only held by the LRU.
  */
-void
-xfs_wait_buftarg(
-   struct xfs_buftarg  *btp)
+static enum lru_status
+xfs_buftarg_wait_rele(
+   struct list_head*item,
+   spinlock_t  *lru_lock,
+   void*arg)
+
 {
-   struct xfs_buf  *bp;
+   struct xfs_buf  *bp = container_of(item, struct xfs_buf, b_lru);
 
/*
 * First wait on the buftarg I/O count for all in-flight buffers to be
@@ -1619,23 +1597,18 @@ xfs_wait_buftarg(
delay(100);
flush_workqueue(btp->bt_mount->m_buf_workqueue);
 
-restart:
-   spin_lock(&btp->bt_lru_lock);
-   while (!list_empty(&btp->bt_lru)) {
-   bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
-   if (atomic_read(&bp->b_hold) > 1) {
-   trace_xfs_buf_wait_buftarg(bp, _RET_IP_);
-   list_move_tail(&bp->b_lru, &btp->bt_lru);
-   spin_unlock(&

[Devel] [PATCH 3/4] From c70ded437bb646ace0dcbf3c7989d4edeed17f7e Mon Sep 17 00:00:00 2001 [PATCH 2/3] ms/xfs-convert-buftarg-lru-to-generic-code-fix

2016-12-06 Thread Dmitry Monakhov
From: Andrew Morton 

fix warnings

Cc: Dave Chinner 
Cc: Glauber Costa 
Signed-off-by: Andrew Morton 
Signed-off-by: Al Viro 
(cherry picked from commit addbda40bed47d8942658fca93e14b5f1cbf009a)

Signed-off-by: Vladimir Davydov 
Signed-off-by: Dmitry Monakhov 
---
 fs/xfs/xfs_buf.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 87a314a..bf933d5 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1654,7 +1654,7 @@ xfs_buftarg_isolate(
return LRU_REMOVED;
 }
 
-static long
+static unsigned long
 xfs_buftarg_shrink_scan(
struct shrinker *shrink,
struct shrink_control   *sc)
@@ -1662,7 +1662,7 @@ xfs_buftarg_shrink_scan(
struct xfs_buftarg  *btp = container_of(shrink,
struct xfs_buftarg, bt_shrinker);
LIST_HEAD(dispose);
-   longfreed;
+   unsigned long   freed;
unsigned long   nr_to_scan = sc->nr_to_scan;
 
freed = list_lru_walk_node(&btp->bt_lru, sc->nid, xfs_buftarg_isolate,
@@ -1678,7 +1678,7 @@ xfs_buftarg_shrink_scan(
return freed;
 }
 
-static long
+static unsigned long
 xfs_buftarg_shrink_count(
struct shrinker *shrink,
struct shrink_control   *sc)
-- 
2.7.4



[Devel] [PATCH 1/4] Revert: [fs] xfs: rework buffer dispose list tracking

2016-12-06 Thread Dmitry Monakhov
From: Dave Chinner 

35c0abc0c70cfb3b37505ec137beae7fabca6b79 Mon Sep 17 00:00:00 2001
Message-id: <1472129410-4267-1-git-send-email-bfos...@redhat.com>
Patchwork-id: 157287
O-Subject: [RHEL7 PATCH] xfs: rework buffer dispose list tracking
Bugzilla: 1349175
RH-Acked-by: Dave Chinner 
RH-Acked-by: Eric Sandeen 

- Retain the buffer lru helpers as rhel7 does not include built-in
  list_lru infrastructure.
- Some b_lock bits dropped as they were introduced by a previous
  selective backport.
- Backport use of dispose list from upstream list_lru-based
  xfs_wait_buftarg[_rele]() to downstream variant.

commit a408235726aa82c0358c9ec68124b6f4bc0a79df
Author: Dave Chinner 
Date:   Wed Aug 28 10:18:06 2013 +1000

xfs: rework buffer dispose list tracking

In converting the buffer lru lists to use the generic code, the locking
for marking the buffers as on the dispose list was lost.  This results in
confusion in LRU buffer tracking and accounting, resulting in reference
counts being mucked up and the filesystem being unmountable.

To fix this, introduce an internal buffer spinlock to protect the state
field that holds the dispose list information.  Because there is now
locking needed around xfs_buf_lru_add/del, and they are used in exactly
one place each two lines apart, get rid of the wrappers and code the logic
directly in place.

Further, the LRU emptying code used on unmount is less than optimal.
Convert it to use a dispose list as per a normal shrinker walk, and repeat
the walk that fills the dispose list until the LRU is empty.  This avoids
needing to drop and regain the LRU lock for every item being freed, and
allows the same logic as the shrinker isolate call to be used.  Simpler,
easier to understand.

Signed-off-by: Dave Chinner 
Signed-off-by: Glauber Costa 
Cc: "Theodore Ts'o" 
Cc: Adrian Hunter 
Cc: Al Viro 
Cc: Artem Bityutskiy 
Cc: Arve Hjonnevag 
Cc: Carlos Maiolino 
Cc: Christoph Hellwig 
Cc: Chuck Lever 
Cc: Daniel Vetter 
Cc: David Rientjes 
Cc: Gleb Natapov 
Cc: Greg Thelen 
Cc: J. Bruce Fields 
Cc: Jan Kara 
Cc: Jerome Glisse 
Cc: John Stultz 
Cc: KAMEZAWA Hiroyuki 
Cc: Kent Overstreet 
Cc: Kirill A. Shutemov 
Cc: Marcelo Tosatti 
Cc: Mel Gorman 
Cc: Steven Whitehouse 
Cc: Thomas Hellstrom 
Cc: Trond Myklebust 
Signed-off-by: Andrew Morton 
Signed-off-by: Al Viro 

Signed-off-by: Brian Foster 
Signed-off-by: Dmitry Monakhov 
---
 fs/xfs/xfs_buf.c | 57 
 fs/xfs/xfs_buf.h |  8 +++-
 2 files changed, 11 insertions(+), 54 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e380398..c0de0e2 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -96,7 +96,7 @@ xfs_buf_lru_add(
atomic_inc(&bp->b_hold);
list_add_tail(&bp->b_lru, &btp->bt_lru);
btp->bt_lru_nr++;
-   bp->b_state &= ~XFS_BSTATE_DISPOSE;
+   bp->b_lru_flags &= ~_XBF_LRU_DISPOSE;
}
spin_unlock(&btp->bt_lru_lock);
 }
@@ -198,21 +198,19 @@ xfs_buf_stale(
 */
xfs_buf_ioacct_dec(bp);
 
-   spin_lock(&bp->b_lock);
-   atomic_set(&bp->b_lru_ref, 0);
+   atomic_set(&(bp)->b_lru_ref, 0);
if (!list_empty(&bp->b_lru)) {
struct xfs_buftarg *btp = bp->b_target;
 
spin_lock(&btp->bt_lru_lock);
if (!list_empty(&bp->b_lru) &&
-   !(bp->b_state & XFS_BSTATE_DISPOSE)) {
+   !(bp->b_lru_flags & _XBF_LRU_DISPOSE)) {
list_del_init(&bp->b_lru);
btp->bt_lru_nr--;
atomic_dec(&bp->b_hold);
}
spin_unlock(&btp->bt_lru_lock);
}
-   spin_unlock(&bp->b_lock);
ASSERT(atomic_read(&bp->b_hold) >= 1);
 }
 
@@ -1014,26 +1012,10 @@ xfs_buf_rele(
/* the last reference has been dropped ... */
xfs_buf_ioacct_dec(bp);
if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
-   /*
-* If the buffer is added to the LRU take a new
-* reference to the buffer for the LRU and clear the
-* (now stale) dispose list state flag
-*/
xfs_buf_lru_add(bp);
spin_unlock(&pag->pag_buf_lock);
} else {
-   /*
-* most of the time buffers will already be removed from
-* the LRU, so optimise that case by checking for the
-* XFS_BSTATE_DISPOSE flag indicating the last list the
-* buffer was on was the disposal lis

Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: trust server file size unless opened

2016-12-15 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> Before the patch, the only way to pick up updated file size from server (in a
> scenario when local inode was created earlier, then the file was updated
> from another node) was in fuse_open_common():
>
>>  atomic_inc(&fi->num_openers);
>>
>>  if (atomic_read(&fi->num_openers) == 1) {
>>  err = fuse_getattr_size(inode, file, &size);
>>  ...
>>  spin_lock(&fc->lock);
>>  i_size_write(inode, size);
>>  spin_unlock(&fc->lock);
>>  }
>
> This is correct, but someone may ask about i_size w/o open, e.g.: ls -l foo.
> The patch ensures that every time the server reports us some file size, if no
> open-s happened yet (num_openers=0), fuse stores that server size in local
> inode->i_size. This resolves the following problem:
>
> # pstorage-mount -c test -l /var/log/f1.log /pcs1
> # pstorage-mount -c test -l /var/log/f2.log /pcs2
>
> # date > /pcs1/foo; ls -l /pcs1/foo /pcs2/foo
> -rwx-- 1 root root 29 Dec 14 16:31 /pcs1/foo
> -rwx-- 1 root root 29 Dec 14 16:31 /pcs2/foo
>
> # date >> /pcs1/foo; ls -l /pcs1/foo /pcs2/foo
> -rwx-- 1 root root 58 Dec 14 16:31 /pcs1/foo
> -rwx-- 1 root root 29 Dec 14 16:31 /pcs2/foo
>
> https://jira.sw.ru/browse/PSBM-57047
>
> Signed-off-by: Maxim Patlasov 
Ok. But IMHO fi->num_openers is redundant: it protects special metadata,
but there are other cases where we may get client/server metadata out of sync.

> ---
>  fs/fuse/file.c   |   12 +++-
>  fs/fuse/fuse_i.h |3 +++
>  fs/fuse/inode.c  |4 +++-
>  3 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 9cad8c5..62967d2 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -296,12 +296,20 @@ int fuse_open_common(struct inode *inode, struct file 
> *file, bool isdir)
>   u64 size;
>  
>   mutex_lock(&inode->i_mutex);
> +
> + spin_lock(&fc->lock);
>   atomic_inc(&fi->num_openers);
>  
>   if (atomic_read(&fi->num_openers) == 1) {
> + fi->i_size_unstable = 1;
> + spin_unlock(&fc->lock);
>   err = fuse_getattr_size(inode, file, &size);
>   if (err) {
> + spin_lock(&fc->lock);
>   atomic_dec(&fi->num_openers);
> + fi->i_size_unstable = 0;
> + spin_unlock(&fc->lock);
> +
>   mutex_unlock(&inode->i_mutex);
>   fuse_release_common(file, FUSE_RELEASE);
>   return err;
> @@ -309,8 +317,10 @@ int fuse_open_common(struct inode *inode, struct file 
> *file, bool isdir)
>  
>   spin_lock(&fc->lock);
>   i_size_write(inode, size);
> + fi->i_size_unstable = 0;
> + spin_unlock(&fc->lock);
> + } else
>   spin_unlock(&fc->lock);
> - }
>  
>   mutex_unlock(&inode->i_mutex);
>   }
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 1d24bf6..22eb9c9 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -124,6 +124,9 @@ struct fuse_inode {
>  
>   /** Mostly to detect very first open */
>   atomic_t num_openers;
> +
> + /** Even though num_openers>0, trust server i_size */
> + int i_size_unstable;
>  };
>  
>  /** FUSE inode state bits */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 5ccecae..f606deb 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -97,6 +97,7 @@ static struct inode *fuse_alloc_inode(struct super_block 
> *sb)
>   fi->writectr = 0;
>   fi->orig_ino = 0;
>   fi->state = 0;
> + fi->i_size_unstable = 0;
>   INIT_LIST_HEAD(&fi->write_files);
>   INIT_LIST_HEAD(&fi->rw_files);
>   INIT_LIST_HEAD(&fi->queued_writes);
> @@ -226,7 +227,8 @@ void fuse_change_attributes(struct inode *inode, struct 
> fuse_attr *attr,
>* extend local i_size without keeping userspace server in sync. So,
>* attr->size coming from server can be stale. We cannot trust it.
>*/
> - if (!is_wb || !S_ISREG(inode->i_mode))
> + if (!is_wb || !S_ISREG(inode->i_mode) ||
> + !atomic_read(&fi->num_openers) || fi->i_size_unstable)
>   i_size_write(inode, attr->size);
>   spin_unlock(&fc->lock);
>  


Re: [Devel] [PATCH vz7] fuse: fuse_writepage_locked must check for FUSE_INVALIDATE_FILES (v2)

2017-01-12 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> The patch fixes another race dealing with fuse_invalidate_files,
> this time when it races with truncate(2):
>
> Thread A: the flusher performs writeback as usual:
>
>   fuse_writepages -->
> fuse_send_writepages -->
>   end_page_writeback
>
> but before fuse_send_writepages acquires fc->lock and calls 
> fuse_flush_writepages,
> some innocent user process re-dirty-es the page.
>
> Thread B: truncate(2) attempts to truncate (shrink) file as usual:
>
>   fuse_do_setattr -->
> invalidate_inode_pages2
>
> (This is possible because Thread A has not incremented fi->writectr yet.) But
> invalidate_inode_pages2 finds that re-dirty-ed page and sticks in:
>
>   invalidate_inode_pages2 -->
> fuse_launder_page -->
>   fuse_writepage_locked -->
>   fuse_wait_on_page_writeback
>
> Thread A: the flusher proceeds with fuse_flush_writepages, sends write request
> to userspace fuse daemon, but the daemon is not obliged to fulfill it 
> immediately.
> So, thread B waits now for thread A, while thread A waits for userspace.
>
> Now fuse_invalidate_files steps in sticking in filemap_write_and_wait on the
> page locked by Thread B (launder_page always work on a locked page). Deadlock.
>
> The patch fixes deadlock by waking up fuse_writepage_locked after marking
> files with FAIL_IMMEDIATELY flag.
>
> Changed in v2:
>   - instead of flagging "fail_immediately", let fuse_writepage_locked return
> fuse_file pointer, then the caller (fuse_launder_page) can use it for
> conditional wait on __fuse_wait_on_page_writeback_or_invalidate. This is
> important because otherwise fuse_invalidate_files may deadlock when
> launder waits for fuse writeback.
ACK-by: dmonak...@openvz.org
>
> Signed-off-by: Maxim Patlasov 
> ---
>  fs/fuse/file.c |   51 +--
>  1 file changed, 45 insertions(+), 6 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 0ffc806..34e75c2 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1963,7 +1963,8 @@ static struct fuse_file *fuse_write_file(struct 
> fuse_conn *fc,
>  }
>  
>  static int fuse_writepage_locked(struct page *page,
> -  struct writeback_control *wbc)
> +  struct writeback_control *wbc,
> +  struct fuse_file **ff_pp)
>  {
>   struct address_space *mapping = page->mapping;
>   struct inode *inode = mapping->host;
> @@ -1971,13 +1972,30 @@ static int fuse_writepage_locked(struct page *page,
>   struct fuse_inode *fi = get_fuse_inode(inode);
>   struct fuse_req *req;
>   struct page *tmp_page;
> + struct fuse_file *ff;
> + int err = 0;
>  
>   if (fuse_page_is_writeback(inode, page->index)) {
>   if (wbc->sync_mode != WB_SYNC_ALL) {
>   redirty_page_for_writepage(wbc, page);
>   return 0;
>   }
> - fuse_wait_on_page_writeback(inode, page->index);
> +
> + /* we can acquire ff here because we do have locked pages here! 
> */
> + ff = fuse_write_file(fc, get_fuse_inode(inode));
> + if (!ff)
> + goto dummy_end_page_wb_err;
> +
> + /* FUSE_NOTIFY_INVAL_FILES must be able to wake us up */
> + __fuse_wait_on_page_writeback_or_invalidate(inode, ff, 
> page->index);
> +
> + if (test_bit(FUSE_S_FAIL_IMMEDIATELY, &ff->ff_state)) {
> + if (ff_pp)
> + *ff_pp = ff;
> + goto dummy_end_page_wb;
> + }
> +
> + fuse_release_ff(inode, ff);
>   }
>  
>   if (test_set_page_writeback(page))
> @@ -1995,6 +2013,8 @@ static int fuse_writepage_locked(struct page *page,
>   req->ff = fuse_write_file(fc, fi);
>   if (!req->ff)
>   goto err_nofile;
> + if (ff_pp)
> + *ff_pp = fuse_file_get(req->ff);
>   fuse_write_fill(req, req->ff, page_offset(page), 0);
>   fuse_account_request(fc, PAGE_CACHE_SIZE);
>  
> @@ -2029,13 +2049,23 @@ err_free:
>  err:
>   end_page_writeback(page);
>   return -ENOMEM;
> +
> +dummy_end_page_wb_err:
> + printk("FUSE: page under fwb dirtied on dead file\n");
> + err = -EIO;
> + /* fall through ... */
> +dummy_end_page_wb:
> + if (test_set_page_writeback(page))
> + BUG();
> + end_page_writeback(page);
> + return err;
>  }
>  
>  static int fuse_writepage(struct page *page, struct writeback_control *wbc)
>  {
>   int err;
>  
> - err = fuse_writepage_locked(page, wbc);
> + err = fuse_writepage_locked(page, wbc, NULL);
>   unlock_page(page);
>  
>   return err;
> @@ -2423,9 +2453,18 @@ static int fuse_launder_page(struct page *page)
>   struct writeback_control wbc = {
>   .sync_mode = WB_SYNC_ALL,
>   };
> - err = fuse_writepage_locked(

[Devel] [PATCH] ms/xfs: rework buffer dispose list tracking B

2017-01-27 Thread Dmitry Monakhov
Add lost hunks from original a408235726
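
The point of using the helper instead of a bare list_move() is the LRU's
per-node item accounting; its assumed shape (per mainline mm/list_lru.c) is
roughly:

    void list_lru_isolate_move(struct list_lru_one *list,
                               struct list_head *item,
                               struct list_head *head)
    {
            list_move(item, head);  /* splice onto the dispose list */
            list->nr_items--;       /* keep the LRU item count in sync */
    }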

https://jira.sw.ru/browse/PSBM-58492
Signed-off-by: Dmitry Monakhov 
---
 fs/xfs/xfs_buf.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8d8c9ce..47a6cb0 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1585,7 +1585,7 @@ xfs_buftarg_wait_rele(
 */
atomic_set(&bp->b_lru_ref, 0);
bp->b_state |= XFS_BSTATE_DISPOSE;
-   list_move(item, dispose);
+   list_lru_isolate_move(lru, item, dispose);
spin_unlock(&bp->b_lock);
return LRU_REMOVED;
 }
@@ -1646,7 +1646,7 @@ xfs_buftarg_isolate(
}
 
bp->b_state |= XFS_BSTATE_DISPOSE;
-   list_move(item, dispose);
+   list_lru_isolate_move(lru, item, dispose);
spin_unlock(&bp->b_lock);
return LRU_REMOVED;
 }
-- 
2.9.3



Re: [Devel] [vzlin-dev] [PATCH vz7] fuse: fuse_prepare_write() cannot handle page from killed request

2017-02-14 Thread Dmitry Monakhov
Maxim Patlasov  writes:

> After fuse_prepare_write() called __fuse_readpage(file, page, ...),
> the page might be already unlocked by fuse_kill_requests():
>
>>  for (i = 0; i < req->num_pages; i++) {
>>  struct page *page = req->pages[i];
>>  SetPageError(page);
>>  unlock_page(page);
ACK.
>
> so it is incorrect to touch it at all. The problem can be easily
> fixed the same way is it was done in fuse_readpage() checking "killed"
> flag.
>
> Another minor complication is that there are three different use-cases
> for that snippet from fuse_kill_requests() above: fuse_readpages(),
> fuse_readpage() and fuse_prepare_write(). Among them only the latter
> needs explicit page_cache_release() call. That's why the patch introduces
> ad-hoc request flag "page_needs_release".
>
> https://jira.sw.ru/browse/PSBM-54547
> Signed-off-by: Maxim Patlasov 
> ---
>  fs/fuse/file.c   |   15 ++-
>  fs/fuse/fuse_i.h |3 +++
>  fs/fuse/inode.c  |2 ++
>  3 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a514748..41ed6f0 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1008,7 +1008,7 @@ static void fuse_short_read(struct fuse_req *req, 
> struct inode *inode,
>  
>  static int __fuse_readpage(struct file *file, struct page *page, size_t 
> count,
>  int *err, struct fuse_req **req_pp, u64 *attr_ver_p,
> -bool *killed_p)
> +bool page_needs_release, bool *killed_p)
>  {
>   struct fuse_io_priv io = { .async = 0, .file = file };
>   struct inode *inode = page->mapping->host;
> @@ -1040,6 +1040,7 @@ static int __fuse_readpage(struct file *file, struct 
> page *page, size_t count,
>   req->pages[0] = page;
>   req->page_descs[0].length = count;
>   req->page_cache = 1;
> + req->page_needs_release = page_needs_release;
>  
>   num_read = fuse_send_read(req, &io, page_offset(page), count, NULL);
>   killed = req->killed;
> @@ -1071,7 +1072,7 @@ static int fuse_readpage(struct file *file, struct page 
> *page)
>   goto out;
>  
>   num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver,
> -&killed);
> +false, &killed);
>   if (!err) {
>   /*
>* Short read means EOF.  If file size is larger, truncate it
> @@ -1153,6 +1154,7 @@ static void fuse_send_readpages(struct fuse_req *req, 
> struct file *file)
>   req->out.page_zeroing = 1;
>   req->out.page_replace = 1;
>   req->page_cache = 1;
> + req->page_needs_release = false;
>   fuse_read_fill(req, file, pos, count, FUSE_READ);
>   fuse_account_request(fc, count);
>   req->misc.read.attr_ver = fuse_get_attr_version(fc);
> @@ -2368,6 +2370,7 @@ static int fuse_prepare_write(struct fuse_conn *fc, 
> struct file *file,
>   unsigned num_read;
>   unsigned page_len;
>   int err;
> + bool killed = false;
>  
>   if (fuse_file_fail_immediately(file)) {
>   unlock_page(page);
> @@ -2385,12 +2388,14 @@ static int fuse_prepare_write(struct fuse_conn *fc, 
> struct file *file,
>   }
>  
>   num_read = __fuse_readpage(file, page, page_len, &err, &req, NULL,
> -NULL);
> +true, &killed);
>   if (req)
>   fuse_put_request(fc, req);
>   if (err) {
> - unlock_page(page);
> - page_cache_release(page);
> + if (!killed) {
> + unlock_page(page);
> + page_cache_release(page);
> + }
>   } else if (num_read != PAGE_CACHE_SIZE) {
>   zero_user_segment(page, num_read, PAGE_CACHE_SIZE);
>   }
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 22eb9c9..fefa8ff 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -330,6 +330,9 @@ struct fuse_req {
>   /** Request contains pages from page-cache */
>   unsigned page_cache:1;
>  
> + /** Request pages need page_cache_release() */
> + unsigned page_needs_release:1;
> +
>   /** Request was killed -- pages were released */
>   unsigned killed:1;
>  
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index b63aae2..ddd858c 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -378,6 +378,8 @@ static void fuse_kill_requests(struct fuse_conn *fc, 
> struct inode *inode,
>   struct page *page = req->pages[i];
>   SetPageError(page);
>   unlock_page(page);
> + if (req->page_needs_release)
> + page_cache_release(page);
>   req->pages[i] = NULL;
>   }
>  

[Devel] [PATCH] vz6 ext4: Discard preallocated block before swap_extents v2

2017-02-27 Thread Dmitry Monakhov
Inode preallocation consists of two parts (used and unused) fully controlled
by the inode, so it must be discarded before swapping extents.
Currently we may skip dropping preallocations if the file is sparse.

This patch does:
- Moves ext4_discard_preallocations to ext4_swap_extents.
  This makes the code more readable and reliable for future changes.
- Cleanup main move_extent loop

https://jira.sw.ru/browse/PSBM-57003
xfstests:ext4/024 (pended: 
https://github.com/dmonakhov/xfstests/commit/7a4763963f73ea5d5bba45eefa484494aa3df7cf)
Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/extents.c |  3 +++
 fs/ext4/move_extent.c | 17 +++--
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 85c4d4e..fd49ab0 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4371,6 +4371,9 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1,
BUG_ON(!mutex_is_locked(&inode1->i_mutex));
BUG_ON(!mutex_is_locked(&inode2->i_mutex));
 
+   ext4_discard_preallocations(inode1);
+   ext4_discard_preallocations(inode2);
+
while (count) {
struct ext4_extent *ex1, *ex2, tmp_ex;
ext4_lblk_t e1_blk, e2_blk;
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 39eaa8f..df904aa 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -628,6 +628,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
ext4_lblk_t o_end, o_start = orig_blk;
ext4_lblk_t d_start = donor_blk;
int ret;
+   __u64 m_len = *moved_len;
 
if (orig_inode->i_sb != donor_inode->i_sb) {
ext4_debug("ext4 move extent: The argument files "
@@ -696,7 +697,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
if (next_blk == EXT_MAX_BLOCKS) {
o_start = o_end;
ret = -ENODATA;
-   goto out;
+   break;
}
d_start += next_blk - o_start;
o_start = next_blk;
@@ -708,7 +709,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
o_start = cur_blk;
/* Extent inside requested range ?*/
if (cur_blk >= o_end)
-   goto out;
+   break;
} else { /* in_range(o_start, o_blk, o_len) */
cur_len += cur_blk - o_start;
}
@@ -743,6 +744,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
break;
o_start += cur_len;
d_start += cur_len;
+   m_len += cur_len;
repeat:
if (path) {
ext4_ext_drop_refs(path);
@@ -750,15 +752,10 @@ ext4_move_extents(struct file *o_filp, struct file 
*d_filp, __u64 orig_blk,
path = NULL;
}
}
-   *moved_len = o_start - orig_blk;
-   if (*moved_len > len)
-   *moved_len = len;
-
 out:
-   if (*moved_len) {
-   ext4_discard_preallocations(orig_inode);
-   ext4_discard_preallocations(donor_inode);
-   }
+   WARN_ON(m_len > len);
+   if (ret == 0)
+   *moved_len = m_len;
 
if (path) {
ext4_ext_drop_refs(path);
-- 
2.9.3



[Devel] [PATCH:vz7] ext4: fix seek_data soft lockup on sparse files

2017-02-27 Thread Dmitry Monakhov
A good fix requires an optimal implementation of next_extent like it was
done in 14516bb or 2d90c160, but this makes the patch huge; let's
just break the loop when necessary.

https://jira.sw.ru/browse/PSBM-55818
Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/file.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index c63d937..167e262 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -612,7 +612,17 @@ static loff_t ext4_seek_data(struct file *file, loff_t 
offset, loff_t maxsize)
if (unwritten)
break;
}
-
+   if (signal_pending(current)) {
+   mutex_unlock(&inode->i_mutex);
+   return -EINTR;
+   }
+   if (need_resched()) {
+   mutex_unlock(&inode->i_mutex);
+   cond_resched();
+   mutex_lock(&inode->i_mutex);
+   isize = inode->i_size;
+   end = isize >> blkbits;
+   }
last++;
dataoff = (loff_t)last << blkbits;
} while (last <= end);
-- 
2.9.3



[Devel] [PATCH] vz6 ext4: Discard preallocated block before swap_extents

2017-02-27 Thread Dmitry Monakhov
Inode preallocation consists of two parts (used and unused) fully controlled
by the inode, so it must be discarded before swapping extents.
Currently we may skip dropping preallocations if the file is sparse.

This patch does:
- Moves ext4_discard_preallocations to ext4_swap_extents.
  This makes the code more readable and reliable for future changes.
- Cleanup main move_extent loop

https://jira.sw.ru/browse/PSBM-57003
xfstests:ext4/024 (pended: 
https://github.com/dmonakhov/xfstests/commit/7a4763963f73ea5d5bba45eefa484494aa3df7cf)
Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/extents.c |  3 +++
 fs/ext4/move_extent.c | 17 +++--
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 85c4d4e..fd49ab0 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4371,6 +4371,9 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1,
BUG_ON(!mutex_is_locked(&inode1->i_mutex));
BUG_ON(!mutex_is_locked(&inode2->i_mutex));
 
+   ext4_discard_preallocations(inode1);
+   ext4_discard_preallocations(inode2);
+
while (count) {
struct ext4_extent *ex1, *ex2, tmp_ex;
ext4_lblk_t e1_blk, e2_blk;
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 39eaa8f..97a7db5 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -628,6 +628,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
ext4_lblk_t o_end, o_start = orig_blk;
ext4_lblk_t d_start = donor_blk;
int ret;
+   __u64 m_len = *moved_len;
 
if (orig_inode->i_sb != donor_inode->i_sb) {
ext4_debug("ext4 move extent: The argument files "
@@ -696,7 +697,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
if (next_blk == EXT_MAX_BLOCKS) {
o_start = o_end;
ret = -ENODATA;
-   goto out;
+   break;
}
d_start += next_blk - o_start;
o_start = next_blk;
@@ -708,7 +709,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
o_start = cur_blk;
/* Extent inside requested range ?*/
if (cur_blk >= o_end)
-   goto out;
+   break;
} else { /* in_range(o_start, o_blk, o_len) */
cur_len += cur_blk - o_start;
}
@@ -743,6 +744,7 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, 
__u64 orig_blk,
break;
o_start += cur_len;
d_start += cur_len;
+   m_len += cur_len;
repeat:
if (path) {
ext4_ext_drop_refs(path);
@@ -755,15 +757,10 @@ ext4_move_extents(struct file *o_filp, struct file 
*d_filp, __u64 orig_blk,
*moved_len = len;
 
 out:
-   if (*moved_len) {
-   ext4_discard_preallocations(orig_inode);
-   ext4_discard_preallocations(donor_inode);
-   }
+   WARN_ON(m_len > len);
+   if (ret == 0)
+   *moved_len = m_len;
 
-   if (path) {
-   ext4_ext_drop_refs(path);
-   kfree(path);
-   }
up_write(&EXT4_I(orig_inode)->i_data_sem);
up_write(&EXT4_I(donor_inode)->i_data_sem);
up_write(&orig_inode->i_alloc_sem);
-- 
2.9.3



Re: [Devel] [PATCH] vz6 ext4: Discard preallocated block before swap_extents

2017-02-27 Thread Dmitry Monakhov
Vasily Averin  writes:

> Dima,
> please take a look at the comment below.
>
> On 2017-02-25 18:16, Dmitry Monakhov wrote:
>> Inode preallocation consists of two parts (used and unused) fully controlled
>> by inode, so it must be discarded before swap extents.
>> Currently we may skip drop_preallocation if file is sparse.
>> 
>> This patch does:
>> - Moves ext4_discard_preallocations to ext4_swap_extents.
>>   This makes more readable and reliable for future changes.
>> - Cleanup main move_extent loop
>> 
>> https://jira.sw.ru/browse/PSBM-57003
>> xfstests:ext4/024 (pended: 
>> https://github.com/dmonakhov/xfstests/commit/7a4763963f73ea5d5bba45eefa484494aa3df7cf)
>> Signed-off-by: Dmitry Monakhov 
>> ---
>>  fs/ext4/extents.c |  3 +++
>>  fs/ext4/move_extent.c | 17 +++--
>>  2 files changed, 10 insertions(+), 10 deletions(-)
>> 
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index 85c4d4e..fd49ab0 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -4371,6 +4371,9 @@ ext4_swap_extents(handle_t *handle, struct inode 
>> *inode1,
>>  BUG_ON(!mutex_is_locked(&inode1->i_mutex));
>>  BUG_ON(!mutex_is_locked(&inode2->i_mutex));
>>  
>> +ext4_discard_preallocations(inode1);
>> +ext4_discard_preallocations(inode2);
>> +
>>  while (count) {
>>  struct ext4_extent *ex1, *ex2, tmp_ex;
>>  ext4_lblk_t e1_blk, e2_blk;
>> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
>> index 39eaa8f..97a7db5 100644
>> --- a/fs/ext4/move_extent.c
>> +++ b/fs/ext4/move_extent.c
>> @@ -628,6 +628,7 @@ ext4_move_extents(struct file *o_filp, struct file 
>> *d_filp, __u64 orig_blk,
>>  ext4_lblk_t o_end, o_start = orig_blk;
>>  ext4_lblk_t d_start = donor_blk;
>>  int ret;
>> +__u64 m_len = *moved_len;
>>  
>>  if (orig_inode->i_sb != donor_inode->i_sb) {
>>  ext4_debug("ext4 move extent: The argument files "
>> @@ -696,7 +697,7 @@ ext4_move_extents(struct file *o_filp, struct file 
>> *d_filp, __u64 orig_blk,
>>  if (next_blk == EXT_MAX_BLOCKS) {
>>  o_start = o_end;
>>  ret = -ENODATA;
>> -goto out;
>> +break;
>>  }
>>  d_start += next_blk - o_start;
>>  o_start = next_blk;
>> @@ -708,7 +709,7 @@ ext4_move_extents(struct file *o_filp, struct file 
>> *d_filp, __u64 orig_blk,
>>  o_start = cur_blk;
>>  /* Extent inside requested range ?*/
>>  if (cur_blk >= o_end)
>> -goto out;
>> +break;
>>  } else { /* in_range(o_start, o_blk, o_len) */
>>  cur_len += cur_blk - o_start;
>>  }
>> @@ -743,6 +744,7 @@ ext4_move_extents(struct file *o_filp, struct file 
>> *d_filp, __u64 orig_blk,
>>  break;
>>  o_start += cur_len;
>>  d_start += cur_len;
>> +m_len += cur_len;
>>  repeat:
>>  if (path) {
>>  ext4_ext_drop_refs(path);
>> @@ -755,15 +757,10 @@ ext4_move_extents(struct file *o_filp, struct file 
>> *d_filp, __u64 orig_blk,
>>  *moved_len = len;
>>  
>>  out:
>> -if (*moved_len) {
>> -    ext4_discard_preallocations(orig_inode);
>> -ext4_discard_preallocations(donor_inode);
>> -}
>> +WARN_ON(m_len > len);
>> +if (ret == 0)
>> +*moved_len = m_len;
>>  
>> -if (path) {
>> -ext4_ext_drop_refs(path);
>> -kfree(path);
>> -}
>
> I do not understand why kfree for path is dropped here.
> The rest of the places look reasonable to me,
> but this one looks like a mistake.
Yes, this is a copy-paste mistake. Please see the updated version:

From: Dmitry Monakhov 
To: devel@openvz.org
Cc: dmonak...@openvz.org,
v...@virtuozzo.com
Subject: [PATCH] vz6 ext4: Discard preallocated block before swap_extents v2
Date: Mon, 27 Feb 2017 15:33:07 +0400
Message-Id: <1488195187-26606-1-git-send-email-dmonak...@openvz.org>

>
> Take a look -- path was still freed inside the cycle;
> why should it not be freed at the finish too?
>
>>  up_write(&EXT4_I(orig_inode)->i_data_sem);
>>  up_write(&EXT4_I(donor_inode)->i_data_sem);
>>  up_write(&orig_inode->i_alloc_sem);
>> 


[Devel] [PATCH 2/2] ext4/mfsync: Prevent resource abuse

2017-03-15 Thread Dmitry Monakhov
- mfsync is not a standard interface; let's hide it from VEs.
- Limit the number of files in a single request.


https://jira.sw.ru/browse/PSBM-59965
https://jira.sw.ru/browse/PSBM-59966
Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/ioctl.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index cd831d5..9232330 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -783,12 +783,17 @@ resize_out:
__u32 __user *usr_fd;
int i, err;
 
+   if (!ve_is_super(get_exec_env()))
+   return -ENOTSUPP;
if (copy_from_user(&mfsync, (struct ext4_ioc_mfsync_info *)arg,
   sizeof(mfsync)))
return -EFAULT;
 
if (mfsync.size == 0)
return 0;
+   if (mfsync.size > NR_FILE)
+   return -ENFILE;
+
usr_fd = (__u32 __user *) (arg + sizeof(__u32));
 
filpp = kzalloc(mfsync.size * sizeof(*filp), GFP_KERNEL);
-- 
1.8.3.1



[Devel] [PATCH 1/2] mfsync: cleanup

2017-03-15 Thread Dmitry Monakhov
A long time ago this printk was used for debug purposes only and was merged
by accident; it was cleaned up in b4d7159537296b but resurrected after a
rebase. Let's kill it completely.

Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/ioctl.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index bb372fa..cd831d5 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -784,10 +784,9 @@ resize_out:
int i, err;
 
if (copy_from_user(&mfsync, (struct ext4_ioc_mfsync_info *)arg,
-  sizeof(mfsync))) {
-   printk("%s:%d", __FUNCTION__, __LINE__);
+  sizeof(mfsync)))
return -EFAULT;
-   }
+
if (mfsync.size == 0)
return 0;
usr_fd = (__u32 __user *) (arg + sizeof(__u32));
-- 
1.8.3.1



Re: [Devel] WARNING at mm/slub.c

2017-03-20 Thread Dmitry Monakhov

Denis Kirjanov  writes:

> On 3/16/17, Denis Kirjanov  wrote:
>> Hi guys,
>>
>> with the kernel rh7-3.10.0-327.36.1.vz7.18.7 we're seeing the
>> following WARNING while running LTP test suite:
>>
>> [11796.576981] WARNING: at mm/slub.c:1252
>> slab_pre_alloc_hook.isra.42.part.43+0x15/0x17()
>>
>> [11796.591008] Call Trace:
>> [11796.592065]  [] dump_stack+0x19/0x1b
>> [11796.593076]  [] warn_slowpath_common+0x70/0xb0
>> [11796.594228]  [] warn_slowpath_null+0x1a/0x20
>> [11796.595442]  []
>> slab_pre_alloc_hook.isra.42.part.43+0x15/0x17
>> [11796.596686]  [] kmem_cache_alloc_trace+0x58/0x230
>> [11796.597965]  [] ? kmapset_new+0x1e/0x50
>> [11796.599224]  [] kmapset_new+0x1e/0x50
>> [11796.600433]  [] __sysfs_add_one+0x4a/0xb0
>> [11796.601431]  [] sysfs_add_one+0x1b/0xd0
>> [11796.602451]  [] sysfs_add_file_mode+0xb7/0x100
>> [11796.603449]  [] sysfs_create_file+0x2a/0x30
>> [11796.604461]  [] kobject_add_internal+0x16c/0x2f0
>> [11796.605503]  [] kobject_add+0x75/0xd0
>> [11796.606627]  [] ? kmem_cache_alloc_trace+0x207/0x230
>> [11796.607655]  [] __link_block_group+0xe1/0x120 [btrfs]
>> [11796.608634]  [] btrfs_make_block_group+0x150/0x270
>> [btrfs]
>> [11796.609701]  [] __btrfs_alloc_chunk+0x67f/0x8a0
>> [btrfs]
>> [11796.610756]  [] btrfs_alloc_chunk+0x34/0x40 [btrfs]
>> [11796.611800]  [] do_chunk_alloc+0x23f/0x410 [btrfs]
>> [11796.612954]  []
>> btrfs_check_data_free_space+0xea/0x280 [btrfs]
>> [11796.614008]  [] __btrfs_buffered_write+0x151/0x5c0
>> [btrfs]
>> [11796.615153]  [] btrfs_file_aio_write+0x246/0x560
>> [btrfs]
>> [11796.616141]  [] ?
>> __mem_cgroup_commit_charge+0x152/0x350
>> [11796.617220]  [] do_sync_write+0x90/0xe0
>> [11796.618253]  [] vfs_write+0xbd/0x1e0
>> [11796.619224]  [] SyS_write+0x7f/0xe0
>> [11796.620185]  [] system_call_fastpath+0x16/0x1b
>> [11796.621145] ---[ end trace 1437311f89b9e3c6 ]---
>>
>
> Guys, I've found your commit:
>
> commit 149819fef38230c95f4d6c644061bc8b0dcdd51d
> Author: Vladimir Davydov 
> Date:   Fri Jun 5 13:20:02 2015 +0400
>
> mm/fs: Port diff-mm-debug-memallocation-caused-fs-reentrance
>
> Enable the debug once again, as the issue it found has been fixed:
> https://jira.sw.ru/browse/PSBM-34112
>
> Previous commit: 255427905323ac97a3c9b2d5acb2bf21ea2b31f6.
>
> Author: Dmitry Monakhov
> Email: dmonak...@openvz.org
> Subject: mm: debug memallocation caused fs reentrance
> Date: Sun, 9 Nov 2014 11:53:14 +0400
>
> But I can't open a link to figure out the original reason for the patch.
Originally we found this:
 [] dump_stack+0x19/0x1b
 [] warn_slowpath_common+0x61/0x80
 [] warn_slowpath_null+0x1a/0x20
 [] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17
 [] kmem_cache_alloc+0x55/0x210
 [] ? ext4_mb_add_groupinfo+0xe1/0x230 [ext4]
 [] ext4_mb_add_groupinfo+0xe1/0x230 [ext4]
 [] ext4_flex_group_add+0xba6/0x14b0 [ext4]
 [] ? ext4_bg_num_gdb+0x79/0x90 [ext4]
 [] ext4_resize_fs+0x76d/0xe40 [ext4]
 [] ext4_ioctl+0xded/0x1110 [ext4]
 [] ? do_filp_open+0x4b/0xb0
 [] do_vfs_ioctl+0x255/0x4f0
 [] ? __fd_install+0x47/0x60
 [] SyS_ioctl+0x54/0xa0
 [] system_call_fastpath+0x16/0x1b

This is a pure bug, which results in deadlock or fs corruption, and which
I've fixed here:
https://github.com/torvalds/linux/commit/4fdb5543183d027a19805b72025b859af73d0863
I've realized that this is a whole class of locking issues which should be
detected at runtime; that is why I've added this warning. I also sent the
patch to mainstream: http://www.spinics.net/lists/linux-btrfs/msg39034.html
which notes that btrfs definitely has fs-reentrance issues:
http://www.spinics.net/lists/linux-btrfs/msg39035.html

Dave did not like the way I did the detection, so the patch was not
committed, but it exists in our tree. It is reasonable to replace
WARN_ON with WARN_ON_ONCE to prevent spamming. I'll send a patch.
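
The rule the warning enforces, sketched with a hypothetical ext4-style
caller (current->journal_info is non-NULL while a transaction is open):

    handle_t *handle = ext4_journal_start(inode, EXT4_HT_MISC, credits);
    void *buf = kmalloc(size, GFP_NOFS);    /* ok: reclaim cannot re-enter the fs */
    void *bad = kmalloc(size, GFP_KERNEL);  /* __GFP_FS set -> fires the WARN_ON */
    ext4_journal_stop(handle);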



>
>
>
>> Thanks!
>>


[Devel] [PATCH] fs: Prevent massive warn spamming

2017-03-20 Thread Dmitry Monakhov
Even if the detection spots a potential bug, it is not good to bloat kmsg.
WARN_ON_ONCE is enough to capture the exact calltrace.

Signed-off-by: Dmitry Monakhov 
---
 mm/page_alloc.c | 2 +-
 mm/slab.c   | 4 ++--
 mm/slub.c   | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b799171..d6a04f5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3150,7 +3150,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
lockdep_trace_alloc(gfp_mask);
 
might_sleep_if(gfp_mask & __GFP_WAIT);
-   WARN_ON((gfp_mask & __GFP_FS) && current->journal_info);
+   WARN_ON_ONCE((gfp_mask & __GFP_FS) && current->journal_info);
 
if (should_fail_alloc_page(gfp_mask, order))
return NULL;
diff --git a/mm/slab.c b/mm/slab.c
index f0e4b79..4f0c22e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3343,7 +3343,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, 
int nodeid,
flags &= gfp_allowed_mask;
 
lockdep_trace_alloc(flags);
-   WARN_ON((flags & __GFP_FS) && current->journal_info);
+   WARN_ON_ONCE((flags & __GFP_FS) && current->journal_info);
 
if (slab_should_failslab(cachep, flags))
return NULL;
@@ -3433,7 +3433,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, 
unsigned long caller)
flags &= gfp_allowed_mask;
 
lockdep_trace_alloc(flags);
-   WARN_ON((flags & __GFP_FS) && current->journal_info);
+   WARN_ON_ONCE((flags & __GFP_FS) && current->journal_info);
 
if (slab_should_failslab(cachep, flags))
return NULL;
diff --git a/mm/slub.c b/mm/slub.c
index fcebd14..280adf6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1266,7 +1266,7 @@ static inline int slab_pre_alloc_hook(struct kmem_cache 
*s, gfp_t flags)
flags &= gfp_allowed_mask;
lockdep_trace_alloc(flags);
might_sleep_if(flags & __GFP_WAIT);
-   WARN_ON((flags & __GFP_FS) && current->journal_info);
+   WARN_ON_ONCE((flags & __GFP_FS) && current->journal_info);
 
return should_failslab(s->object_size, flags, s->flags);
 }
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7 v2 1/3] fs/cleancache: fix data invalidation in the cleancache during direct_io

2017-04-12 Thread Dmitry Monakhov
Andrey Ryabinin  writes:

> Currently some direct_io fs hooks call invalidate_inode_pages2_range()
> only if mapping->nrpages is not zero. So if nrpages is zero,
> data in the cleancache won't be invalidated, and the next buffered read
> may get stale data from the cleancache.

>
> Fix this by calling invalidate_inode_pages2_range() regardless of nrpages
> value. And if nrpages is zero, bail out from invalidate_inode_pages2_range()
> only after cleancache_invalidate_inode(), so that we invalidate cleancache
> but still avoid pointless page cache lookups.
BTW, can we please make tcache pluggable, so that one who does not want fancy
caching features can simply disable it, as we do with pfcache?
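
For clarity, the resulting ordering inside invalidate_inode_pages2_range()
is roughly the following (a sketch of the mm/truncate.c change, whose hunk
is not quoted below):

	int invalidate_inode_pages2_range(struct address_space *mapping,
					  pgoff_t start, pgoff_t end)
	{
		...
		cleancache_invalidate_inode(mapping); /* always drop the cleancache copy */
		if (mapping->nrpages == 0)
			return 0; /* page cache empty: skip the pointless lookups */
		/* the regular page cache invalidation walk follows */
		...
	}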

>
> https://jira.sw.ru/browse/PSBM-63908
> Signed-off-by: Andrey Ryabinin 
> ---
>  fs/9p/vfs_file.c  |  4 ++--
>  fs/nfs/direct.c   | 16 ++--
>  fs/nfs/inode.c|  7 ---
>  fs/xfs/xfs_file.c | 30 ++
>  mm/filemap.c  | 28 
>  mm/truncate.c |  4 
>  6 files changed, 42 insertions(+), 47 deletions(-)
>
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index 7da03f8..afe0036 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -482,7 +482,7 @@ v9fs_file_write_internal(struct inode *inode, struct 
> p9_fid *fid,
>   if (invalidate && (total > 0)) {
>   pg_start = origin >> PAGE_CACHE_SHIFT;
>   pg_end = (origin + total - 1) >> PAGE_CACHE_SHIFT;
> - if (inode->i_mapping && inode->i_mapping->nrpages)
> + if (inode->i_mapping)
>   invalidate_inode_pages2_range(inode->i_mapping,
> pg_start, pg_end);
>   *offset += total;
> @@ -688,7 +688,7 @@ v9fs_direct_write(struct file *filp, const char __user * 
> data,
>* about to write.  We do this *before* the write so that if we fail
>* here we fall back to buffered write
>*/
> - if (mapping->nrpages) {
> + {
>   pgoff_t pg_start = offset >> PAGE_CACHE_SHIFT;
>   pgoff_t pg_end   = (offset + count - 1) >> PAGE_CACHE_SHIFT;
>  
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index ab96f01..963 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -1132,12 +1132,10 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, 
> const struct iovec *iov,
>   if (result)
>   goto out_unlock;
>  
> - if (mapping->nrpages) {
> - result = invalidate_inode_pages2_range(mapping,
> - pos >> PAGE_CACHE_SHIFT, end);
> - if (result)
> - goto out_unlock;
> - }
> + result = invalidate_inode_pages2_range(mapping,
> + pos >> PAGE_CACHE_SHIFT, end);
> + if (result)
> + goto out_unlock;
>  
>   task_io_account_write(count);
>  
> @@ -1161,10 +1159,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, 
> const struct iovec *iov,
>  
>   result = nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos, uio);
>  
> - if (mapping->nrpages) {
> - invalidate_inode_pages2_range(mapping,
> -   pos >> PAGE_CACHE_SHIFT, end);
> - }
> + invalidate_inode_pages2_range(mapping,
> + pos >> PAGE_CACHE_SHIFT, end);
>  
>   mutex_unlock(&inode->i_mutex);
>  
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 8c06aed..779b05c 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -1065,10 +1065,11 @@ static int nfs_invalidate_mapping(struct inode 
> *inode, struct address_space *map
>   if (ret < 0)
>   return ret;
>   }
> - ret = invalidate_inode_pages2(mapping);
> - if (ret < 0)
> - return ret;
>   }
> + ret = invalidate_inode_pages2(mapping);
> + if (ret < 0)
> + return ret;
> +
>   if (S_ISDIR(inode->i_mode)) {
>   spin_lock(&inode->i_lock);
>   memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 9a2193b..0b7a35b 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -346,7 +346,7 @@ xfs_file_aio_read(
>* serialisation.
>*/
>   xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
> - if ((ioflags & XFS_IO_ISDIRECT) && inode->i_mapping->nrpages) {
> + if ((ioflags & XFS_IO_ISDIRECT)) {
>   xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
>   xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
>  
> @@ -361,22 +361,20 @@ xfs_file_aio_read(
>* flush and reduce the chances of repeated iolock cycles going
>* forward.
>*/
> - if (inode->i_mapping->nrpages) {
> - ret = filemap_write_and_wait(VFS_I(ip)->i_mapping);
> - if (ret) {
> - 

Re: [Devel] [PATCH rh7 v3] ext4: add generic uevent infrastructure

2017-06-16 Thread Dmitry Monakhov
Andrey Ryabinin  writes:

> From: Dmitry Monakhov 
>
> *Purpose:
> It is reasonable to announce fs-related events via the uevent infrastructure.
> This patch implements only the ext4 part, but IMHO this should be useful for
> any generic filesystem.
>
> Example: a runtime fs error is a purely asynchronous event. Currently there
> is no good way to handle this situation and inform user space about it.
>
> *Implementation:
>  Add uevent infrastructure similar to dm uevent
>  FS_ACTION = {MOUNT|UMOUNT|REMOUNT|ERROR|FREEZE|UNFREEZE}
>  FS_UUID
>  FS_NAME
>  FS_TYPE
>
> Signed-off-by: Dmitry Monakhov 
Only one note about mem allocation context, see below. Otherwise looks good.
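
For testing, the events can be observed from user space with udevadm (an
illustrative session; the devpath and values below are examples, not real
output):

	# udevadm monitor --kernel --property

	KERNEL[123.456] change   /fs/ext4/sda1 (ext4)
	ACTION=change
	FS_TYPE=ext4
	FS_NAME=sda1
	UUID=...
	FS_ACTION=ERROR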
>
> https://jira.sw.ru/browse/PSBM-66618
> Signed-off-by: Andrey Ryabinin 
> ---
> Changes since v2:
>   - Report error event only once per superblock
>
>  fs/ext4/ext4.h  | 11 
>  fs/ext4/super.c | 88 
> -
>  2 files changed, 98 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 1cd964870da3..ce60718c7143 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1356,6 +1356,8 @@ struct ext4_sb_info {
>   /* Precomputed FS UUID checksum for seeding other checksums */
>   __u32 s_csum_seed;
>  
> + bool s_err_event_sent;
> +
>   /* Reclaim extents from extent status tree */
>   struct shrinker s_es_shrinker;
>   struct list_head s_es_lru;
> @@ -2758,6 +2760,15 @@ extern int ext4_check_blockref(const char *, unsigned 
> int,
>  struct ext4_ext_path;
>  struct ext4_extent;
>  
> +enum ext4_event_type {
> + EXT4_UA_MOUNT,
> + EXT4_UA_UMOUNT,
> + EXT4_UA_REMOUNT,
> + EXT4_UA_ERROR,
> + EXT4_UA_FREEZE,
> + EXT4_UA_UNFREEZE,
> +};
> +
>  /*
>   * Maximum number of logical blocks in a file; ext4_extent's ee_block is
>   * __le32.
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index ee065861b62a..088313b6333f 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -301,6 +301,79 @@ void ext4_itable_unused_set(struct super_block *sb,
>   bg->bg_itable_unused_hi = cpu_to_le16(count >> 16);
>  }
>  
> +static int ext4_uuid_valid(const u8 *uuid)
> +{
> + int i;
> +
> + for (i = 0; i < 16; i++) {
> + if (uuid[i])
> + return 1;
> + }
> + return 0;
> +}
> +
> +/**
> + * ext4_send_uevent - prepare and send uevent
> + *
> + * @sb:  super_block
> + * @action:  action type
> + *
> + */
> +int ext4_send_uevent(struct super_block *sb, enum ext4_event_type action)
> +{
> + int ret;
> + struct kobj_uevent_env *env;
> + const u8 *uuid = sb->s_uuid;
> + enum kobject_action kaction = KOBJ_CHANGE;
> +
> + env = kzalloc(sizeof(struct kobj_uevent_env), GFP_KERNEL);
Please change GFP_KERNEL to GFP_NOFS, otherwise it may deadlock.
> + if (!env)
> + return -ENOMEM;
> +
> + ret = add_uevent_var(env, "FS_TYPE=%s", sb->s_type->name);
> + if (ret)
> + goto out;
> + ret = add_uevent_var(env, "FS_NAME=%s", sb->s_id);
> + if (ret)
> + goto out;
> +
> + if (ext4_uuid_valid(uuid)) {
> + ret = add_uevent_var(env, "UUID=%pUB", uuid);
> + if (ret)
> + goto out;
> + }
> +
> + switch (action) {
> + case EXT4_UA_MOUNT:
> + kaction = KOBJ_ONLINE;
> + ret = add_uevent_var(env, "FS_ACTION=%s", "MOUNT");
> + break;
> + case EXT4_UA_UMOUNT:
> + kaction = KOBJ_OFFLINE;
> + ret = add_uevent_var(env, "FS_ACTION=%s", "UMOUNT");
> + break;
> + case EXT4_UA_REMOUNT:
> + ret = add_uevent_var(env, "FS_ACTION=%s", "REMOUNT");
> + break;
> + case EXT4_UA_ERROR:
> + ret = add_uevent_var(env, "FS_ACTION=%s", "ERROR");
> + break;
> + case EXT4_UA_FREEZE:
> + ret = add_uevent_var(env, "FS_ACTION=%s", "FREEZE");
> + break;
> + case EXT4_UA_UNFREEZE:
> + ret = add_uevent_var(env, "FS_ACTION=%s", "UNFREEZE");
> + break;
> + default:
> + ret = -EINVAL;
> + }
> + if (ret)
> + goto out;
> + ret = kobject_uevent_env(&(EXT4_SB(sb)->s_kobj), kaction, env->envp);
> +out:
> + kfree(env);
> + return ret;
> 

[Devel] [PATCH] ext4: send abort uevent on ext4 journal abort.

2017-07-25 Thread Dmitry Monakhov
Currently an error from the device results in ext4_abort(), but the uevent is
not generated because the ext4_abort() caller's context does not allow
GFP_KERNEL memory allocation.
Let's relax the submission context requirement and defer the actual uevent
submission to a workqueue. It can be any workqueue; I've picked
rsv_conversion_wq because it already exists.

New uevent "ABORT"
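
An administrator could react to it with an ordinary udev rule, e.g. (a
sketch; the rule file name and the SUBSYSTEM value are assumptions):

	# /etc/udev/rules.d/99-ext4-abort.rules
	SUBSYSTEM=="ext4", ENV{FS_ACTION}=="ABORT", \
		RUN+="/usr/bin/logger ext4 journal abort on %k"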

https://jira.sw.ru/browse/PSBM-68848
Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/ext4.h  |  2 ++
 fs/ext4/super.c | 69 -
 2 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ce60718..1633538 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1357,6 +1357,7 @@ struct ext4_sb_info {
__u32 s_csum_seed;
 
bool s_err_event_sent;
+   bool s_abrt_event_sent;
 
/* Reclaim extents from extent status tree */
struct shrinker s_es_shrinker;
@@ -2765,6 +2766,7 @@ enum ext4_event_type {
  EXT4_UA_UMOUNT,
  EXT4_UA_REMOUNT,
  EXT4_UA_ERROR,
+ EXT4_UA_ABORT,
  EXT4_UA_FREEZE,
  EXT4_UA_UNFREEZE,
 };
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 088313b..0016c94 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -312,6 +312,12 @@ static int ext4_uuid_valid(const u8 *uuid)
return 0;
 }
 
+struct ext4_uevent {
+   struct super_block *sb;
+   enum ext4_event_type action;
+   struct work_struct work;
+};
+
 /**
  * ext4_send_uevent - prepare and send uevent
  *
@@ -319,17 +325,20 @@ static int ext4_uuid_valid(const u8 *uuid)
  * @action:action type
  *
  */
-int ext4_send_uevent(struct super_block *sb, enum ext4_event_type action)
+static void ext4_send_uevent_work(struct work_struct *w)
 {
-   int ret;
+   struct ext4_uevent *e = container_of(w, struct ext4_uevent, work);
+   struct super_block *sb = e->sb;
struct kobj_uevent_env *env;
const u8 *uuid = sb->s_uuid;
enum kobject_action kaction = KOBJ_CHANGE;
+   int ret;
 
env = kzalloc(sizeof(struct kobj_uevent_env), GFP_KERNEL);
-   if (!env)
-   return -ENOMEM;
-
+   if (!env){
+   kfree(e);
+   return;
+   }
ret = add_uevent_var(env, "FS_TYPE=%s", sb->s_type->name);
if (ret)
goto out;
@@ -343,7 +352,7 @@ int ext4_send_uevent(struct super_block *sb, enum 
ext4_event_type action)
goto out;
}
 
-   switch (action) {
+   switch (e->action) {
case EXT4_UA_MOUNT:
kaction = KOBJ_ONLINE;
ret = add_uevent_var(env, "FS_ACTION=%s", "MOUNT");
@@ -358,6 +367,9 @@ int ext4_send_uevent(struct super_block *sb, enum 
ext4_event_type action)
case EXT4_UA_ERROR:
ret = add_uevent_var(env, "FS_ACTION=%s", "ERROR");
break;
+   case EXT4_UA_ABORT:
+   ret = add_uevent_var(env, "FS_ACTION=%s", "ABORT");
+   break;
case EXT4_UA_FREEZE:
ret = add_uevent_var(env, "FS_ACTION=%s", "FREEZE");
break;
@@ -372,7 +384,33 @@ int ext4_send_uevent(struct super_block *sb, enum 
ext4_event_type action)
ret = kobject_uevent_env(&(EXT4_SB(sb)->s_kobj), kaction, env->envp);
 out:
kfree(env);
-   return ret;
+   kfree(e);
+}
+
+/**
+ * ext4_send_uevent - prepare and schedule event submission
+ *
+ * @sb:super_block
+ * @action:action type
+ *
+ */
+int ext4_send_uevent(struct super_block *sb, enum ext4_event_type action)
+{
+   struct ext4_uevent *e;
+
+   smp_rmb();
+   if (!EXT4_SB(sb)->rsv_conversion_wq)
+   return -EPROTO;
+   
+   e = kzalloc(sizeof(*e), GFP_NOIO);
+   if (!e)
+   return -ENOMEM;
+
+   e->sb = sb;
+   e->action = action;
+   INIT_WORK(&e->work, ext4_send_uevent_work);
+   queue_work(EXT4_SB(sb)->rsv_conversion_wq, &e->work);
+   return 0;
 }
 
 static void __save_error_info(struct super_block *sb, const char *func,
@@ -470,9 +508,13 @@ static void ext4_handle_error(struct super_block *sb)
 
if (!test_opt(sb, ERRORS_CONT)) {
journal_t *journal = EXT4_SB(sb)->s_journal;
+   
+   if (!xchg(&EXT4_SB(sb)->s_abrt_event_sent, 1))
+   ext4_send_uevent(sb, EXT4_UA_ABORT);
 
EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED;
-   if (journal)
+   
+   if (journal) 
jbd2_journal_abort(journal, -EIO);
}
if (test_opt(sb, ERRORS_RO)) {
@@ -664,6 +706,10 @@ void __ext4_abort(struct super_block *sb, const char 
*function,
 
if ((sb->s_flags & MS_RDONLY) == 0) {
 

[Devel] [PATCH] fused: save logrotate option to fstab

2017-08-17 Thread Dmitry Monakhov
Currently one may run 'vstorage-mount -s' with the -L option, but
it affects only the current mount, without being reflected in the fstab options.
In fact, mount.fuse.vstorage already has a parser for the logrotate option, so
this patch makes the feature fully supported.
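
With the defaults (10 files of 100MB each) the emitted option would look like
this (illustrative, produced by the "logrotate=%lux%llu" format below):

	logrotate=10x104857600

i.e. presumably the same NxSIZE form that parse_logrotate_diskspace()
already accepts for -L.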

Signed-off-by: Dmitry Monakhov 
---
 pcs/clients/fused/fused.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/pcs/clients/fused/fused.c b/pcs/clients/fused/fused.c
index b73da80..5c4a84e 100644
--- a/pcs/clients/fused/fused.c
+++ b/pcs/clients/fused/fused.c
@@ -176,6 +176,8 @@ static char *make_fstab_options(
int timeout,
char *logfile,
int loglevel,
+   unsigned long rotate_num,
+   unsigned long long rotate_size,
unsigned long mntflags,
char *username,
char *groupname,
@@ -197,6 +199,9 @@ static char *make_fstab_options(
res += fstab_add_option(&out, "logfile=%s", logfile);
if (loglevel != LOG_LEVEL_SRV_DEFAULT)
res += fstab_add_option(&out, "loglevel=%u", 
(unsigned)loglevel);
+   if (rotate_num || rotate_size)
+   res += fstab_add_option(&out, "logrotate=%lux%llu", rotate_num, 
rotate_size);
+
if (g_read_cache.params.pathname)
res += fstab_add_option(&out, "cache=%s", 
g_read_cache.params.pathname);
if (g_read_cache.params.total_sz_mb > 0)
@@ -501,6 +506,7 @@ int main(int argc, char** argv)
unsigned long mntflags = 0;
int ch, res = -1;
int pipefd[2];
+   int rotate_opt = 0;
unsigned long rotate_num = 10;
unsigned long long rotate_size = 100LL * 1024LL * 1024LL;
int after_exec = 0;
@@ -595,6 +601,7 @@ int main(int argc, char** argv)
case 'L':
if (parse_logrotate_diskspace(optarg, &rotate_num, 
&rotate_size) < 0)
usage(NULL);
+   rotate_opt = 1;
break;
case 'd':
pcs_log_level = strtoul(optarg, &p, 10);
@@ -678,8 +685,11 @@ int main(int argc, char** argv)
usage("Invalid read cache parameters");
 
if (fstab_modify) {
-   fstab_options = make_fstab_options(timeout, logfile, 
pcs_log_level, mntflags,
-   username, groupname, mode, nodef, mntparams);
+   fstab_options = make_fstab_options(timeout, logfile, 
pcs_log_level,
+  rotate_opt ? rotate_num : 0,
+  rotate_opt ? rotate_size : 0,
+  mntflags,  username, 
+  groupname, mode, nodef, 
mntparams);
if (!fstab_options) {
pcs_log(LOG_ERR, PCS_FUSED_MSG_PREFIX"failed to make 
fstab options");
exit(252);
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH] fs-writeback: add endless writeback debug

2017-08-25 Thread Dmitry Monakhov
https://jira.sw.ru/browse/PSBM-69587
Signed-off-by: Dmitry Monakhov 
---
 fs/fs-writeback.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f34ae6c..9df1573 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -787,11 +787,15 @@ static long __writeback_inodes_wb(struct bdi_writeback 
*wb,
 {
unsigned long start_time = jiffies;
long wrote = 0;
-
+   int trace = 0;
+   
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
struct super_block *sb = inode->i_sb;
 
+   if (time_is_before_jiffies(start_time + 15* HZ))
+   trace = 1;
+
if (!grab_super_passive(sb)) {
/*
 * grab_super_passive() may fail consistently due to
@@ -799,6 +803,9 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 * requeue_io() to avoid busy retrying the inode/sb.
 */
redirty_tail(inode, wb);
+   if (trace)
+   printk("%s:%d writeback is taking too long 
ino:%ld sb(%p):%s\n",
+  __FUNCTION__, __LINE__, inode->i_ino, 
sb, sb->s_id);
continue;
}
wrote += writeback_sb_inodes(sb, wb, work);
@@ -890,6 +897,7 @@ static long wb_writeback(struct bdi_writeback *wb,
unsigned long oldest_jif;
struct inode *inode;
long progress;
+   int trace = 0;
 
oldest_jif = jiffies;
work->older_than_this = &oldest_jif;
@@ -902,6 +910,9 @@ static long wb_writeback(struct bdi_writeback *wb,
if (work->nr_pages <= 0)
break;
 
+   if (time_is_before_jiffies(wb_start + 15* HZ))
+   trace = 1;
+
/*
 * Background writeout and kupdate-style writeback may
 * run forever. Stop them if there is other work to do
@@ -973,6 +984,10 @@ static long wb_writeback(struct bdi_writeback *wb,
inode = wb_inode(wb->b_more_io.prev);
spin_lock(&inode->i_lock);
spin_unlock(&wb->list_lock);
+   if (trace)
+   printk("%s:%d writeback is taking too long 
ino:%ld st:%ld sb(%p):%s\n",
+  __FUNCTION__, __LINE__, inode->i_ino,
+  inode->i_state, inode->i_sb, 
inode->i_sb->s_id);
/* This function drops i_lock... */
inode_sleep_on_writeback(inode);
spin_lock(&wb->list_lock);
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RFC] mm: Limit number of busy-looped shrinking processes

2017-09-05 Thread Dmitry Monakhov
Kirill Tkhai  writes:

> When a FUSE process is making shrink, it must not wait
> on page writeback. Otherwise, it may meet a page,
> that is being writebacked by him, and the process will stall.
>
> So, our kernel does not wait writeback after commit a9707947010d
> "mm: vmscan: never wait on writeback pages".
>
> But in case of huge number of writebacked pages and
> memory pressure, this lead to busy loop: many process
> in the system are trying to shrink memory and have
> no success. And the node shows high time, spent in kernel.
>
> This patch reduces the number of processes, which may
> busy looping on shrink. Only one userspace process --
> vstorage -- will be allowed not to sleep on writeback.
> Other processes will sleep up to 5 seconds to wait
> writeback completion on every page.
>
> The detection of vstorage is very simple and it based
> on process name. It seems, there is no a way to detect
NAK. Detection by name is very, very bad design style.
fused and others should mark themselves as writeback-proof explicitly
via an API similar to ioctl/madvise/ionice/ulimit;
maybe it is reasonable to place such an app into a specific cgroup.
You may pick any recipe you like, but please do not do comm-name
matching.
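
Something along these lines would do (purely a sketch -- the prctl constant
and the task flag are made-up names, not an existing API):

	/* kernel/sys.c: explicit per-task opt-out, set by fused itself */
	case PR_SET_WB_PROOF:				/* hypothetical prctl */
		if (arg2)
			current->flags |= PF_WB_PROOF;	/* hypothetical flag */
		else
			current->flags &= ~PF_WB_PROOF;
		break;

	/* mm/vmscan.c would then test the flag instead of the name: */
	if ((current->flags & PF_WB_PROOF) ||
	    wait_on_page_bit_killable_timeout(page, PG_writeback, 5 * HZ) != 0) {
		nr_immediate++;
		goto keep_locked;
	}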

> all FUSE processes, especially from !ve0, because FUSE
> mount is tricky, and a process doing mount may not be
> a FUSE daemon. So, we remain the vanila kernel behaviour,
> but we don't wait forever, just 5 second. This will save
> us from lookup messages from kernel and will allow
> to kill FUSE daemon if necessary.
>
> https://jira.sw.ru/browse/PSBM-69296
>
> Signed-off-by: Kirill Tkhai 
> ---
>  mm/vmscan.c |   19 ++-
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a5db5940bb1..e72d515c111 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -959,8 +959,16 @@ static unsigned long shrink_page_list(struct list_head 
> *page_list,
>  
>   /* Case 3 above */
>   } else {
> - nr_immediate++;
> - goto keep_locked;
> + /*
> +  * Currently, vstorage is the only fuse process,
> +  * exercising writeback; it mustn't sleep to 
> avoid
> +  * deadlocks.
> +  */
> + if (!strncmp(current->comm, "vstorage", 8) ||
> + wait_on_page_bit_killable_timeout(page, 
> PG_writeback, 5 * HZ) != 0) {
> + nr_immediate++;
> + goto keep_locked;
> + }
>   }
>   }
>  
> @@ -1592,9 +1600,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
> lruvec *lruvec,
>   if (nr_writeback && nr_writeback == nr_taken)
>   zone_set_flag(zone, ZONE_WRITEBACK);
>  
> - if (!global_reclaim(sc) && nr_immediate)
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> -
> + /*
> +  * memcg will stall in page writeback so only consider forcibly
> +  * stalling for global reclaim
> +  */
>   if (global_reclaim(sc)) {
>   /*
>* Tag a zone as congested if all the dirty pages scanned were
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH 1/2] fuse: add a new async operation to unmap regions

2018-02-07 Thread Dmitry Monakhov
Andrei Vagin  writes:

> On Tue, Feb 06, 2018 at 11:49:30PM +0300, Konstantin Khorenko wrote:
>> Andrey, this seems to be a feature and it should be tested.
>> 
>> Please post here a jira id with the feature description, QA task, etc.
>
> 1. Feature
>
> Add support for discard requests via punch-holes for plain ploops
> https://pmc.acronis.com/browse/VSTOR-6962
>
> 2. Description
>
> When ploop receives a discard request, it calls fallocate() to punch a
> hole in the ploop image file. This allows dropping useless data from the
> storage.
>
> 4. Testing
>
> [root@localhost ploop]# cat test/ploop-fdiscard.sh
> set -e -x
>
> path=$1
> mkdir -p $path
> ploop init $path/root -s 1G -f raw --sparse -t none
> out=$(ploop mount $path/DiskDescriptor.xml)
> echo $out
> dev=$(echo $out | sed "s/.*dev=\(\S*\).*/\1/")
> echo $dev
> filefrag -sv $path/root
> dd if=/dev/urandom of=$dev bs=1M count=1
> dd if=/dev/urandom of=$dev bs=1M count=1 seek=512
> fout1="$(filefrag -sv $path/root | wc -l)"
> filefrag -sv $path/root
> blkdiscard -l 1M -o 512M $dev
> filefrag -sv $path/root
> fout2="$(filefrag -sv $path/root | wc -l)"
> if [ "$fout1" -le "$fout2" ]; then
>   echo FAIL
>   exit 1
> fi
> blkdiscard $dev
> filefrag -sv $path/root
> fout3="$(filefrag -sv $path/root | wc -l)"
> if [ "$fout2" -le "$fout3" ]; then
>   echo FAIL
>   exit 1
> fi
> ploop umount -d $dev
> rm -rf $path
>
> 5. Known issues
>
> Works only for raw images on a fuse file system (vstorage)
>
> 7. Feature owner
> Andrei Vagin (avagin@)
>
>
>> 
>> And whom to review?
>
> Dima, could you review this patch set?
Ack, with a minor request:
this is a good moment to add a stress test for rw-io vs discard
via fio. I can imagine two types of tests (see the sketch below):
1) simple read/write/trim stress
2) an integrity test via trimwrite, with read-verify afterwards
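
A job file along these lines would cover both (a sketch; the device path is
an example, and it needs a fio built with trim support):

	[global]
	filename=/dev/ploop12345p1
	ioengine=libaio
	direct=1

	; 1) stress: random reads/writes racing with random discards
	[rw-stress]
	rw=randrw
	bs=64k
	size=1G
	iodepth=16

	[trim-stress]
	rw=randtrim
	bs=64k
	size=1G
	iodepth=4

	; 2) integrity: trim+write the same blocks, verify contents on read-back
	[trim-verify]
	stonewall
	rw=trimwrite
	bs=1M
	size=512M
	verify=crc32c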
>
>> 
>> --
>> Best regards,
>> 
>> Konstantin Khorenko,
>> Virtuozzo Linux Kernel Team
>> 
>> On 02/06/2018 03:25 AM, Andrei Vagin wrote:
>> > The fuse interface allows running any operation asynchronously, because
>> > the kernel redirects all operations to a user daemon and then waits for
>> > an answer.
>> > 
>> > In ploop, we want to handle discard requests via fallocate and
>> > a simplest way to do this is to run fallocate(FALLOC_FL_PUNCH_HOLE)
>> > asynchronously like the write command.
>> > 
>> > This patch adds a new async command IOCB_CMD_UNMAP_ITER, which sends
>> > fallocate(FALLOC_FL_PUNCH_HOLE) to a fuse user daemon.
>> > 
>> > Signed-off-by: Andrei Vagin 
>> > ---
>> >  fs/aio.c |  1 +
>> >  fs/fuse/file.c   | 63 
>> > ++--
>> >  fs/fuse/fuse_i.h |  3 +++
>> >  include/uapi/linux/aio_abi.h |  1 +
>> >  4 files changed, 60 insertions(+), 8 deletions(-)
>> > 
>> > diff --git a/fs/aio.c b/fs/aio.c
>> > index 3a6a9b0..cdc7558 100644
>> > --- a/fs/aio.c
>> > +++ b/fs/aio.c
>> > @@ -1492,6 +1492,7 @@ rw_common:
>> >ret = aio_read_iter(req);
>> >break;
>> > 
>> > +  case IOCB_CMD_UNMAP_ITER:
>> >case IOCB_CMD_WRITE_ITER:
>> >ret = aio_write_iter(req);
>> >break;
>> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> > index 877c41f..83ea9da 100644
>> > --- a/fs/fuse/file.c
>> > +++ b/fs/fuse/file.c
>> > @@ -920,6 +920,19 @@ static void fuse_aio_complete_req(struct fuse_conn 
>> > *fc, struct fuse_req *req)
>> >if (!req->bvec)
>> >fuse_release_user_pages(req, !io->write);
>> > 
>> > +  if (req->in.h.opcode == FUSE_FALLOCATE) {
>> > +  if (req->out.h.error)
>> > +  printk("fuse_aio_complete_req: request (fallocate 
>> > fh=0x%llx "
>> > + "offset=%lld length=%lld mode=%x) completed with 
>> > err=%d\n",
>> > + req->misc.fallocate.in.fh,
>> > + req->misc.fallocate.in.offset,
>> > + req->misc.fallocate.in.length,
>> > + req->misc.fallocate.in.mode,
>> > + req->out.h.error);
>> > +  fuse_aio_complete(io, req->out.h.error, -1);
>> > +  return;
>> > +  }
>> > +
>> >if (io->write) {
>> >if (req->misc.write.in.size != req->misc.write.out.size)
>> >pos = req->misc.write.in.offset - io->offset +
>> > @@ -1322,6 +1335,33 @@ static void fuse_write_fill(struct fuse_req *req, 
>> > struct fuse_file *ff,
>> >req->out.args[0].value = outarg;
>> >  }
>> > 
>> > +static size_t fuse_send_unmap(struct fuse_req *req, struct fuse_io_priv 
>> > *io,
>> > +loff_t pos, size_t count, fl_owner_t owner)
>> > +{
>> > +  struct file *file = io->file;
>> > +  struct fuse_file *ff = file->private_data;
>> > +  struct fuse_conn *fc = ff->fc;
>> > +  struct fuse_fallocate_in *inarg = &req->misc.fallocate.in;
>> > +
>> > +  inarg->fh = ff->fh;
>> > +  inarg->offset = pos;
>> > +  inarg->length = count;
>> > +  inarg->mode = FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE;

Re: [Devel] [PATCH] ext4: release leaked posix acl in ext4_xattr_set_acl

2018-02-07 Thread Dmitry Monakhov
Stanislav Kinsburskiy  writes:

> A posix_acl object is used to convert an extended attribute, provided by the
> user, to ext4 attributes -- in particular to i_mode in case of an
> ACL_TYPE_ACCESS request.
> IOW, this object is allocated, used for conversion, not stored anywhere, and
> must be freed.
> However posix_acl_update_mode() can zero the pointer to support the
> ext4_set_acl() logic, and then the object is leaked.
> So, fix it by releasing a new temporary pointer with the same value instead
> of the acl pointer.
So you are telling me that:
ext4_xattr_set_acl
L1 acl = posix_acl_from_xattr()
L2 -> ext4_set_acl(handle, inode, type, acl)
L3 -> posix_acl_update_mode(inode, &inode->i_mode, &acl)
      *acl = NULL;
You are saying that the instruction above can affect the value at L1?
HOW? acl is passed to ext4_set_acl() by value, so
posix_acl_update_mode() can affect the value only at L2 and L3, but not at L1.
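
The point is plain C pass-by-value semantics; a trivial user-space
illustration of the argument above:

	#include <stdio.h>

	static void update_mode(void **pacl)
	{
		*pacl = NULL;	/* what posix_acl_update_mode() does at L3 */
	}

	int main(void)
	{
		void *acl = (void *)0x1;	/* "acl" assigned at L1 */
		void *arg = acl;		/* the copy passed by value at L2 */

		update_mode(&arg);		/* only the copy is zeroed */
		printf("acl=%p arg=%p\n", acl, arg);	/* acl is unchanged */
		return 0;
	}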

Stas, have you drunk a lousy beer today?
>
> https://jira.sw.ru/browse/PSBM-81384
>
> Signed-off-by: Stanislav Kinsburskiy 
> ---
>  fs/ext4/acl.c |8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> index 917e819..2640d7b 100644
> --- a/fs/ext4/acl.c
> +++ b/fs/ext4/acl.c
> @@ -403,7 +403,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>  {
>   struct inode *inode = dentry->d_inode;
>   handle_t *handle;
> - struct posix_acl *acl;
> + struct posix_acl *acl, *tmp;
>   int error, retries = 0;
>   int update_mode = 0;
>   umode_t mode = inode->i_mode;
> @@ -416,7 +416,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>   return -EPERM;
>  
>   if (value) {
> - acl = posix_acl_from_xattr(&init_user_ns, value, size);
> + acl = tmp = posix_acl_from_xattr(&init_user_ns, value, size);
>   if (IS_ERR(acl))
>   return PTR_ERR(acl);
>   else if (acl) {
> @@ -425,7 +425,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>   goto release_and_out;
>   }
>   } else
> - acl = NULL;
> + acl = tmp = NULL;
>  
>  retry:
>   handle = ext4_journal_start(inode, EXT4_HT_XATTR,
> @@ -452,7 +452,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>   goto retry;
>  
>  release_and_out:
> - posix_acl_release(acl);
> + posix_acl_release(tmp);
>   return error;
>  }
>  


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v2] ext4: release leaked posix acl in ext4_xattr_set_acl

2018-02-07 Thread Dmitry Monakhov
Stanislav Kinsburskiy  writes:

> Note: only rh7-3.10.0-693.17.1.el7-based kernels are affcted.
> I.e. starting from rh7-3.10.0-693.17.1.vz7.43.1.
>
> A posix_acl object is used to convert an extended attribute, provided by the
> user, to ext4 attributes -- in particular to i_mode in case of an
> ACL_TYPE_ACCESS request.
> IOW, this object is allocated, used for conversion, not stored anywhere, and
> must be freed.
> However posix_acl_update_mode() can zero the pointer to support the
> ext4_set_acl() logic, and then the object is leaked.
> So, fix it by releasing a new temporary pointer with the same value instead
> of the acl pointer.
>
> https://jira.sw.ru/browse/PSBM-81384
>
> RHEL bug URL: https://bugzilla.redhat.com/show_bug.cgi?id=1543020
>
> v2: Added affected kernel version + RHEL bug URL
ACK.
>
> Signed-off-by: Stanislav Kinsburskiy 
> ---
>  fs/ext4/acl.c |8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> index 917e819..f8a38a2 100644
> --- a/fs/ext4/acl.c
> +++ b/fs/ext4/acl.c
> @@ -403,7 +403,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>  {
>   struct inode *inode = dentry->d_inode;
>   handle_t *handle;
> - struct posix_acl *acl;
> + struct posix_acl *acl, *real_acl;
>   int error, retries = 0;
>   int update_mode = 0;
>   umode_t mode = inode->i_mode;
> @@ -416,7 +416,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>   return -EPERM;
>  
>   if (value) {
> - acl = posix_acl_from_xattr(&init_user_ns, value, size);
> + acl = real_acl = posix_acl_from_xattr(&init_user_ns, value, 
> size);
>   if (IS_ERR(acl))
>   return PTR_ERR(acl);
>   else if (acl) {
> @@ -425,7 +425,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>   goto release_and_out;
>   }
>   } else
> - acl = NULL;
> + acl = real_acl = NULL;
>  
>  retry:
>   handle = ext4_journal_start(inode, EXT4_HT_XATTR,
> @@ -452,7 +452,7 @@ ext4_xattr_set_acl(struct dentry *dentry, const char 
> *name, const void *value,
>   goto retry;
>  
>  release_and_out:
> - posix_acl_release(acl);
> + posix_acl_release(real_acl);
>   return error;
>  }
>  


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH] ext4: release leaked posix acl in ext4_acl_chmod

2018-02-07 Thread Dmitry Monakhov
Stanislav Kinsburskiy  writes:

> Note: only rh7-3.10.0-693.17.1.el7-based kernels are affected.
> I.e. starting from rh7-3.10.0-693.17.1.vz7.43.1.
>
> A posix_acl object is used to convert an extended attribute, provided by the
> user, to ext4 attributes -- in particular to i_mode in case of an
> ACL_TYPE_ACCESS request.
> IOW, this object is allocated, used for conversion, not stored anywhere, and
> must be freed.
> However posix_acl_update_mode() can zero the pointer to support the
> ext4_set_acl() logic, and then the object is leaked.
> So, fix it by releasing a new temporary pointer with the same value instead
> of the acl pointer.
>
> In scope of https://jira.sw.ru/browse/PSBM-81384
>
> RHEL bug URL: https://bugzilla.redhat.com/show_bug.cgi?id=1543020
ACK.
>
> Signed-off-by: Stanislav Kinsburskiy 
> ---
>  fs/ext4/acl.c |6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> index f8a38a2..046b338 100644
> --- a/fs/ext4/acl.c
> +++ b/fs/ext4/acl.c
> @@ -297,7 +297,7 @@ ext4_init_acl(handle_t *handle, struct inode *inode, 
> struct inode *dir)
>  int
>  ext4_acl_chmod(struct inode *inode)
>  {
> - struct posix_acl *acl;
> + struct posix_acl *acl, *real_acl;
>   handle_t *handle;
>   int retries = 0;
>   int error;
> @@ -315,6 +315,8 @@ ext4_acl_chmod(struct inode *inode)
>   error = posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
>   if (error)
>   return error;
> +
> + real_acl = acl;
>  retry:
>   handle = ext4_journal_start(inode, EXT4_HT_XATTR,
>   ext4_jbd2_credits_xattr(inode));
> @@ -341,7 +343,7 @@ ext4_acl_chmod(struct inode *inode)
>   ext4_should_retry_alloc(inode->i_sb, &retries))
>   goto retry;
>  out:
> - posix_acl_release(acl);
> + posix_acl_release(real_acl);
>   return error;
>  }
>  


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7 0/3] ext4: speedup shrinking non-delay extents

2018-04-13 Thread Dmitry Monakhov
Konstantin Khorenko  writes:

> We faced a situation where all (32) cpus on a node contend on sbi->s_es_lock
> while shrinking extents on a single superblock, and
> shrinking extents goes very slowly (180 sec on average!).
>
> crash> struct ext4_sb_info 0x882fcb7ca800 -p
>
>   s_es_nr_inode = 3173832,
>   s_es_stats = {
> es_stats_shrunk = 70,
> es_stats_cache_hits = 35182748,
> es_stats_cache_misses = 2622931,
> es_stats_scan_time = 182642303461,
> es_stats_max_scan_time = 276290979674,
>
> This patchset speeds up parallel shrinking a bit.
> If we find out this is not enough, the next step is to limit the number of
> shrinkers working on a single superblock in parallel.
>
> https://jira.sw.ru/browse/PSBM-83335
>
> Jan Kara (1):
>   ms/ext4: move handling of list of shrinkable inodes into extent status
> code
>
> Konstantin Khorenko (1):
>   ext4: don't iterate over sbi->s_es_list more than the number of
> elements
>
> Waiman Long (1):
>   ext4: Make cache hits/misses per-cpu counts
ACK.
>
>  fs/ext4/extents.c|  2 --
>  fs/ext4/extents_status.c | 56 
> +---
>  fs/ext4/extents_status.h |  6 ++
>  fs/ext4/inode.c  |  2 --
>  fs/ext4/ioctl.c  |  2 --
>  fs/ext4/super.c  |  1 -
>  6 files changed, 45 insertions(+), 24 deletions(-)
>
> -- 
> 2.15.1


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [vzlin-dev] [RH7 PATCH 1/2] port diff-ext4-in-containers-treat-panic_on_errors-as-remount-ro_on_errors

2015-06-10 Thread Dmitry Monakhov
Konstantin Khorenko  writes:

> Dima,
>
> 1) why do we need this patch now?
AFAIU the primary usage for these patches is the second ploop. Am I right?
>
> Currently we have devmnt->allowed_options options which are configured via
> userspace, and currently vzctl provides an empty list.
> So how is it possible that the errors=panic option works around this check?
OK. If this is true, this patch is a noop for case (a), but we still need
it for case (b).
> 2) if the patch is still needed, then why are 2 places required:
>a) handle_mount_opt()
>b) ext4_fill_super() - can it be called without previously calling 
> handle_mount_opt() ?
  The second one reads options directly from disk. The user can modify them
  via tune2fs $DEV (the device should be accessible for write inside the CT)
>
>
>
> Original patch comment:
>
> Author: Konstantin Khlebnikov
> Email: khlebni...@openvz.org
> Subject: ext4: in containers treat errors=panic as
> Date: Fri, 01 Mar 2013 17:08:48 +0400
>
> Container can explode whole node if it remounts its ploop
> with option 'errors=panic' and triggers abort after that.
>
> Signed-off-by: Konstantin Khlebnikov 
> Acked-by: Maxim V. Patlasov 
>
> --
> Best regards,
>
> Konstantin Khorenko,
> Virtuozzo Linux Kernel Team
>
> On 06/07/2015 09:20 PM, Dmitry Monakhov wrote:
>> 
>> Signed-off-by: Dmitry Monakhov 
>> ---
>>  fs/ext4/super.c |   14 +++---
>>  1 files changed, 11 insertions(+), 3 deletions(-)
>> 
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index cbcc684..1ce2932 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -1366,6 +1366,7 @@ static int clear_qf_name(struct super_block *sb, int 
>> qtype)
>>  #define MOPT_NO_EXT20x0100
>>  #define MOPT_NO_EXT30x0200
>>  #define MOPT_EXT4_ONLY  (MOPT_NO_EXT2 | MOPT_NO_EXT3)
>> +#define MOPT_WANT_SYS_ADMIN 0x0400
>>  
>>  static const struct mount_opts {
>>  int token;
>> @@ -1394,7 +1395,7 @@ static const struct mount_opts {
>>  EXT4_MOUNT_JOURNAL_CHECKSUM),
>>   MOPT_EXT4_ONLY | MOPT_SET},
>>  {Opt_noload, EXT4_MOUNT_NOLOAD, MOPT_NO_EXT2 | MOPT_SET},
>> -{Opt_err_panic, EXT4_MOUNT_ERRORS_PANIC, MOPT_SET | MOPT_CLEAR_ERR},
>> +{Opt_err_panic, EXT4_MOUNT_ERRORS_PANIC, MOPT_SET | 
>> MOPT_CLEAR_ERR|MOPT_WANT_SYS_ADMIN},
>>  {Opt_err_ro, EXT4_MOUNT_ERRORS_RO, MOPT_SET | MOPT_CLEAR_ERR},
>>  {Opt_err_cont, EXT4_MOUNT_ERRORS_CONT, MOPT_SET | MOPT_CLEAR_ERR},
>>  {Opt_data_err_abort, EXT4_MOUNT_DATA_ERR_ABORT,
>> @@ -1535,6 +1536,9 @@ static int handle_mount_opt(struct super_block *sb, 
>> char *opt, int token,
>>  set_opt2(sb, EXPLICIT_DELALLOC);
>>  if (m->flags & MOPT_CLEAR_ERR)
>>  clear_opt(sb, ERRORS_MASK);
>> +if (m->flags & MOPT_WANT_SYS_ADMIN && !capable(CAP_SYS_ADMIN))
>> +return 1;
>> +
>>  if (token == Opt_noquota && sb_any_quota_loaded(sb)) {
>>  ext4_msg(sb, KERN_ERR, "Cannot change quota "
>>   "options when quota turned on");
>> @@ -3575,8 +3579,12 @@ static int ext4_fill_super(struct super_block *sb, 
>> void *data, int silent)
>>  else if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_WBACK)
>>  set_opt(sb, WRITEBACK_DATA);
>>  
>> -if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC)
>> -set_opt(sb, ERRORS_PANIC);
>> +if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC) {
>> +if (capable(CAP_SYS_ADMIN))
>> +set_opt(sb, ERRORS_PANIC);
>> +else
>> +set_opt(sb, ERRORS_RO);
>> +}
>>  else if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_CONTINUE)
>>  set_opt(sb, ERRORS_CONT);
>>  else
>> 


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] ub: resurrect sync accounting

2015-06-10 Thread Dmitry Monakhov
Konstantin Khorenko  writes:

Acked-by 
> Dima, please review.
>
> --
> Best regards,
>
> Konstantin Khorenko,
> Virtuozzo Linux Kernel Team
>
> On 06/10/2015 11:16 AM, Vladimir Davydov wrote:
>> Related to https://jira.sw.ru/browse/PSBM-34007
>> 
>> Signed-off-by: Vladimir Davydov 
>> ---
>>  fs/sync.c | 34 +++---
>>  1 file changed, 31 insertions(+), 3 deletions(-)
>> 
>> diff --git a/fs/sync.c b/fs/sync.c
>> index abad041f52a4..45649b617bae 100644
>> --- a/fs/sync.c
>> +++ b/fs/sync.c
>> @@ -229,9 +229,13 @@ int ve_fsync_behavior(void)
>>  SYSCALL_DEFINE0(sync)
>>  {
>>  struct ve_struct *ve = get_exec_env();
>> +struct user_beancounter *ub;
>>  int nowait = 0, wait = 1;
>>  unsigned long start = jiffies;
>>  
>> +ub = get_exec_ub();
>> +ub_percpu_inc(ub, sync);
>> +
>>  if (!ve_is_super(ve)) {
>>  int fsb;
>>  /*
>> @@ -258,6 +262,7 @@ SYSCALL_DEFINE0(sync)
>>  if (unlikely(laptop_mode))
>>  laptop_sync_completion();
>>  skip:
>> +ub_percpu_inc(ub, sync_done);
>>  return 0;
>>  }
>>  
>> @@ -358,9 +363,26 @@ skip:
>>   */
>>  int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int 
>> datasync)
>>  {
>> +struct user_beancounter *ub;
>> +int ret;
>> +
>>  if (!file->f_op || !file->f_op->fsync)
>>  return -EINVAL;
>> -return file->f_op->fsync(file, start, end, datasync);
>> +
>> +ub = get_exec_ub();
>> +if (datasync)
>> +ub_percpu_inc(ub, fdsync);
>> +else
>> +ub_percpu_inc(ub, fsync);
>> +
>> +ret = file->f_op->fsync(file, start, end, datasync);
>> +
>> +if (datasync)
>> +ub_percpu_inc(ub, fdsync_done);
>> +else
>> +ub_percpu_inc(ub, fsync_done);
>> +
>> +return ret;
>>  }
>>  EXPORT_SYMBOL(vfs_fsync_range);
>>  
>> @@ -473,6 +495,7 @@ EXPORT_SYMBOL(generic_write_sync);
>>  SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
>>  unsigned int, flags)
>>  {
>> +struct user_beancounter *ub;
>>  int ret;
>>  struct fd f;
>>  struct address_space *mapping;
>> @@ -534,22 +557,27 @@ SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, 
>> offset, loff_t, nbytes,
>>  goto out_put;
>>  }
>>  
>> +ub = get_exec_ub();
>> +ub_percpu_inc(ub, frsync);
>> +
>>  ret = 0;
>>  if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
>>  ret = filemap_fdatawait_range(mapping, offset, endbyte);
>>  if (ret < 0)
>> -goto out_put;
>> +goto out_acct;
>>  }
>>  
>>  if (flags & SYNC_FILE_RANGE_WRITE) {
>>  ret = filemap_fdatawrite_range(mapping, offset, endbyte);
>>  if (ret < 0)
>> -goto out_put;
>> +goto out_acct;
>>  }
>>  
>>  if (flags & SYNC_FILE_RANGE_WAIT_AFTER)
>>  ret = filemap_fdatawait_range(mapping, offset, endbyte);
>>  
>> +out_acct:
>> +ub_percpu_inc(ub, frsync_done);
>>  out_put:
>>  fdput(f);
>>  out:
>> 


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7 0/7] ub: resurrect blkio prio/stats

2015-06-10 Thread Dmitry Monakhov
Konstantin Khorenko  writes:

> Dima, please review the patchset.
>
> --
> Best regards,
>
> Konstantin Khorenko,
> Virtuozzo Linux Kernel Team
>
> On 06/09/2015 05:06 PM, Vladimir Davydov wrote:
>> https://jira.sw.ru/browse/PSBM-34007
>> 
>> Vladimir Davydov (7):
>>   ioprio: move IOPRIO_WHO_UBC handling out of rcu section
>>   ub: zap ub_{init,fini}_ioprio
>>   ub: export ub_get_{mem,blkio}_css
>>   ub: ressurrect ioprio_set IOPRIO_WHO_UBC
>>   ub: ressurrect iostat and ioprio reporting
>>   ub: account writeback io
>>   ub: do not include block/blk-cgroup.h from io_prio.c
ACK for 1-7'th patches. The only minor question for 'ub: account writeback io'
please see comments below.
>+static int
>+__writeback_single_inode(struct inode *inode, struct writeback_control  *wbc)
>+{
>+   struct user_beancounter *ub = inode->i_mapping->dirtied_ub;
>+   int ret;
>+
>+   if (likely(get_exec_ub() == ub || !ub))
 if (ub == NULL) then we will use the current exec ub, which skews the
 statistics a bit. Can we do set_exec_ub(NULL) here?
>+  return __do_writeback_single_inode(inode, wbc);
>+
>+   ub = get_beancounter_rcu(ub) ? set_exec_ub(ub) : NULL;
>+   ret = __do_writeback_single_inode(inode, wbc);
>+   if (ub)
>+  put_beancounter(set_exec_ub(ub));
>+
>+   return ret;

>> 
>>  block/cfq-iosched.c  |  72 +++
>>  fs/fs-writeback.c|  19 +++-
>>  fs/ioprio.c  |  14 +++---
>>  include/bc/beancounter.h |  18 ++--
>>  kernel/bc/Makefile   |   2 -
>>  kernel/bc/beancounter.c  |  15 +--
>>  kernel/bc/io_prio.c  | 110 
>> ---
>>  7 files changed, 140 insertions(+), 110 deletions(-)
>> 


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [RH7 PATCH] ext4: fallocate mode - convert and extend

2015-06-23 Thread Dmitry Monakhov
Original patch rh6/diff-ext4-fallocate-mode-convert-and-extend-v3

The patch introduces a new fallocate mode: FALLOC_FL_CONVERT_AND_EXTEND. It
performs two actions:
 - convert all uninitialized extents in the range
 - set i_size to "offset + length"

The feature will be used by the ploop io_direct module for optimizing the
submit_alloc path.
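
From user space the new mode would be used like this (a sketch; the numeric
flag value lives in the include/uapi/linux/falloc.h hunk and is shown here
only as an assumption):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <linux/falloc.h>

	#ifndef FALLOC_FL_CONVERT_AND_EXTEND
	#define FALLOC_FL_CONVERT_AND_EXTEND	0x100	/* assumption */
	#endif

	/* Convert uninitialized extents in [off, off + len) and set i_size
	 * to off + len in one call. Both off and len must be block-aligned,
	 * otherwise the kernel returns -EINVAL. */
	static int convert_and_extend(int fd, off_t off, off_t len)
	{
		return fallocate(fd, FALLOC_FL_CONVERT_AND_EXTEND, off, len);
	}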

Changed in v2 (thanks to Dima for findings):
 - moved journal start/stop into while(){...}
 - added update_fsync_trans call

Changed in v3 (thanks again to Dima for findings):
 - protected operations on the extent tree by i_data_sem

https://jira.sw.ru/browse/PSBM-22381

Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/extents.c   |  135 ++-
 include/uapi/linux/falloc.h |3 +
 2 files changed, 136 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 606a47c..dfa4e7a 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4448,6 +4448,131 @@ static void ext4_falloc_update_inode(struct inode 
*inode,
 
 }
 
+
+static int ext4_convert_and_extend_locked(struct inode *inode, loff_t offset,
+ loff_t len)
+{
+   struct ext4_ext_path *path = NULL;
+   loff_t new_size = offset + len;
+   ext4_lblk_t iblock = offset >> inode->i_blkbits;
+   ext4_lblk_t new_iblock = new_size >> inode->i_blkbits;
+   unsigned int max_blocks = new_iblock - iblock;
+   handle_t *handle;
+   unsigned int credits;
+   int err = 0;
+   int ret = 0;
+
+   if ((loff_t)iblock << inode->i_blkbits != offset ||
+   (loff_t)new_iblock << inode->i_blkbits != new_size)
+   return -EINVAL;
+
+   while (max_blocks > 0) {
+   struct ext4_extent *ex;
+   ext4_lblk_t ee_block;
+   ext4_fsblk_t ee_start;
+   unsigned short ee_len;
+   int depth;
+
+   /*
+* credits to insert 1 extents into extent tree
+*/
+   credits = ext4_chunk_trans_blocks(inode, max_blocks);
+   handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
+   if (IS_ERR(handle))
+  return PTR_ERR(handle);
+
+   down_write((&EXT4_I(inode)->i_data_sem));
+
+   /* find extent for this block */
+   path = ext4_ext_find_extent(inode, iblock, NULL);
+   if (IS_ERR(path)) {
+   err = PTR_ERR(path);
+   goto done;
+   }
+
+   depth = ext_depth(inode);
+   ex = path[depth].p_ext;
+   BUG_ON(ex == NULL && depth != 0);
+
+   if (ex == NULL) {
+   err = -ENOENT;
+   goto done;
+   }
+
+   ee_block = le32_to_cpu(ex->ee_block);
+   ee_start = ext4_ext_pblock(ex);
+   ee_len = ext4_ext_get_actual_len(ex);
+   if (!in_range(iblock, ee_block, ee_len)) {
+   err = -ERANGE;
+   goto done;
+   }
+
+   if (ext4_ext_is_uninitialized(ex)) {
+   struct ext4_map_blocks map = {0};
+
+   map.m_lblk = iblock;
+   map.m_len = max_blocks;
+   err = ext4_convert_unwritten_extents_endio(handle, 
inode,
+  &map,
+  path);
+   if (err < 0)
+   goto done;
+
+   ext4_update_inode_fsync_trans(handle, inode, 1);
+   err = check_eofblocks_fl(handle, inode, iblock, path,
+max_blocks);
+   if (err)
+   goto done;
+   }
+
+
+   up_write((&EXT4_I(inode)->i_data_sem));
+
+   iblock += ee_len;
+   max_blocks -= (ee_len < max_blocks) ? ee_len : max_blocks;
+
+   if (!max_blocks && new_size > i_size_read(inode)) {
+   i_size_write(inode, new_size);
+   ext4_update_i_disksize(inode, new_size);
+   }
+
+   ret = ext4_mark_inode_dirty(handle, inode);
+done:
+   if (err)
+   up_write((&EXT4_I(inode)->i_data_sem));
+   else
+   err = ret;
+
+   if (path) {
+   ext4_ext_drop_refs(path);
+   kfree(path);
+   }
+
+   ret = ext4_journal_stop(handle);
+   if (!err && ret)
+   err = ret;
+   if (err)
+   return 

[Devel] [RH7 PATCH 1/3] ext4: fix wrong assert in ext4_mb_normalize_request()

2015-06-23 Thread Dmitry Monakhov
Original commit 5b60778558

The variable "size" is expressed as a number of blocks and not as a
number of clusters; this could trigger a kernel panic when using
ext4 with a cluster size different from the block size.

Cc: sta...@vger.kernel.org
Signed-off-by: Maurizio Lombardi 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/mballoc.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index dca78da..ebc7255 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3117,7 +3117,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context 
*ac,
}
BUG_ON(start + size <= ac->ac_o_ex.fe_logical &&
start > ac->ac_o_ex.fe_logical);
-   BUG_ON(size <= 0 || size > EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));
+   BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
 
/* now prepare goal request */
 
-- 
1.7.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [RH7 PATCH 2/3] ext4: CVE-2014-8086 prevent bugon on race between write/fcntl

2015-06-23 Thread Dmitry Monakhov
Original commit a41537e69b4aa4
The O_DIRECT flag can be toggled via fcntl(F_SETFL), but its value is checked
twice, inside ext4_file_write_iter() and __generic_file_write(), which
results in a BUG_ON inside ext4_direct_IO().

Let's initialize iocb->private unconditionally.

TESTCASE: xfstest:generic/036  https://patchwork.ozlabs.org/patch/402445/
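
The race itself is a flag-toggling loop of this shape (a condensed sketch of
what generic/036 exercises):

	/* thread 1: keep writing */
	for (;;)
		pwrite(fd, buf, 4096, 0);

	/* thread 2: keep flipping O_DIRECT on the same fd */
	for (;;) {
		int fl = fcntl(fd, F_GETFL);
		fcntl(fd, F_SETFL, fl ^ O_DIRECT);
	}

Without the fix, the first O_DIRECT check may see the flag clear (so
iocb->private is never set up), while the second check sees it set and
enters ext4_direct_IO() with uninitialized iocb->private -> BUG_ON.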


https://bugzilla.redhat.com/show_bug.cgi?id=1151353

#TYPICAL STACK TRACE:
kernel BUG at fs/ext4/inode.c:3165!
invalid opcode:  [#1] SMP
Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror 
dm_region_hash dm_log dm_mod
CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 
#161
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS 
SE5C600.86B.99.99.x028.061320111235 06/13/2011
task: 88080e95a7c0 ti: 88080f908000 task.ti: 88080f908000
RIP: 0010:[]  [] ext4_direct_IO+0x162/0x3d0
RSP: 0018:88080f90bb58  EFLAGS: 00010246
RAX: 0400 RBX: 88080fdb2a28 RCX: a802c818
RDX: 0408 RSI: 88080d8aeb80 RDI: 0001
RBP: 88080f90bbc8 R08:  R09: 1581
R10:  R11:  R12: 88080d8aeb80
R13: 88080f90bbf8 R14: 88080fdb28c8 R15: 88080fdb2a28
FS:  7f23b2055700() GS:88081840() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f23b2045000 CR3: 00080cedf000 CR4: 000407e0
Stack:
 88080f90bb98  7ffe 88080fdb2c30
 0200 0200 0001 0200
 88080f90bbc8 88080fdb2c30 88080f90be08 0200
Call Trace:
 [] generic_file_direct_write+0xed/0x180
 [] __generic_file_write_iter+0x222/0x370
 [] ext4_file_write_iter+0x34b/0x400
 [] ? aio_run_iocb+0x239/0x410
 [] ? aio_run_iocb+0x239/0x410
 [] ? local_clock+0x25/0x30
 [] ? __lock_acquire+0x274/0x700
 [] ? ext4_unwritten_wait+0xb0/0xb0
 [] aio_run_iocb+0x286/0x410
 [] ? local_clock+0x25/0x30
 [] ? lock_release_holdtime+0x29/0x190
 [] ? lookup_ioctx+0x4b/0xf0
 [] do_io_submit+0x55b/0x740
 [] ? do_io_submit+0x3ca/0x740
 [] SyS_io_submit+0x10/0x20
 [] system_call_fastpath+0x16/0x1b
Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 
c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c
24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89
RIP  [] ext4_direct_IO+0x162/0x3d0
 RSP 

Reported-by: Sasha Levin 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Dmitry Monakhov 
Cc: sta...@vger.kernel.org
---
 fs/ext4/file.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2ba3bec..160fceb 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -172,6 +172,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 {
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
+   int overwrite = 0;
 
/*
 * If we have encountered a bitmap-format file, the size limit
@@ -192,6 +193,8 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
}
}
 
+   iocb->private = &overwrite;
+
if (unlikely(iocb->ki_filp->f_flags & O_DIRECT))
ret = ext4_file_dio_write(iocb, iov, nr_segs, pos);
else
-- 
1.7.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [RH7 PATCH 3/3] [fs] file_table: get rid of s_files and files_lock

2015-06-23 Thread Dmitry Monakhov
From: Al Viro 

backport: rh7/229.7.2/fs-file_table-get-rid-of-s_files-and-files_lock.patch
https://jira.sw.ru/browse/PSBM-34421

>From b95bd197b67d8be70a420bb08b87b816517d1935 Mon Sep 17 00:00:00 2001
[fs] file_table: get rid of s_files and files_lock

Message-id: <1415044849-10555-3-git-send-email-gdua...@redhat.com>
Patchwork-id: 99371
O-Subject: [RHEL7.1 PATCH BZ 1112805 2/2 v2] get rid of s_files and files_lock
Bugzilla: 1112805
RH-Acked-by: Mateusz Guzik 
RH-Acked-by: Jeff Moyer 

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1112805
Upstream Status: eee5cc2702929fd41cce28058dc6d6717f723f87

This version (v2) addresses Mateusz Guzik's comment:
"You leave now-unused fields unaltered which may lead to trouble in the future,
please rename them and properly ifdef with _GENKSYMS__ so that it still passes
abi checks (see s_files_deprecated in my patch, I messed up with
f_sb_list_cpu_deprecated)."

Mateusz Guzik's original patch description:

Fix a lockup of form:


[] file_sb_list_del+0x21/0x50
[] fput+0x25/0xc0
[] aio_put_req+0x2e/0x80
[] aio_complete+0x1b0/0x2b0
[.]

[] ? lg_local_lock+0x1e/0x60
[] file_sb_list_add+0x1e/0x60

Turns out the lock (and the list) are not needed, so just remove them.

Forceful read-only remount is already honored by mnt_want_write, so
there is no reason to go over all file pointers and change their open mode.

commit eee5cc2702929fd41cce28058dc6d6717f723f87
Author: Al Viro 
Date:   Fri Oct 4 11:06:42 2013 -0400

get rid of s_files and files_lock

The only thing we need it for is alt-sysrq-r (emergency remount r/o)
and these days we can do just as well without going through the
list of files.

Signed-off-by: Al Viro 

Signed-off-by: Jarod Wilson 
Signed-off-by: Dmitry Monakhov 
---
 fs/file_table.c|  122 
 fs/internal.h  |3 -
 fs/open.c  |2 -
 fs/super.c |   23 +-
 include/linux/fs.h |   14 ++
 5 files changed, 16 insertions(+), 148 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index fa537c1..5a04306 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -40,8 +40,6 @@ struct files_stat_struct files_stat = {
.max_files = NR_FILE
 };
 
-DEFINE_STATIC_LGLOCK(files_lglock);
-
 /* SLAB cache for file structures */
 static struct kmem_cache *filp_cachep __read_mostly;
 
@@ -323,7 +321,6 @@ void fput(struct file *file)
struct task_struct *task = current;
unsigned long flags;
 
-   file_sb_list_del(file);
if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
init_task_work(&file->f_u.fu_rcuhead, fput);
if (!task_work_add(task, &file->f_u.fu_rcuhead, true))
@@ -348,7 +345,6 @@ void __fput_sync(struct file *file)
 {
if (atomic_long_dec_and_test(&file->f_count)) {
struct task_struct *task = current;
-   file_sb_list_del(file);
BUG_ON(!(task->flags & PF_KTHREAD));
__fput(file);
}
@@ -360,127 +356,10 @@ void put_filp(struct file *file)
 {
if (atomic_long_dec_and_test(&file->f_count)) {
security_file_free(file);
-   file_sb_list_del(file);
file_free(file);
}
 }
 
-static inline int file_list_cpu(struct file *file)
-{
-#ifdef CONFIG_SMP
-   return file->f_sb_list_cpu;
-#else
-   return smp_processor_id();
-#endif
-}
-
-/* helper for file_sb_list_add to reduce ifdefs */
-static inline void __file_sb_list_add(struct file *file, struct super_block 
*sb)
-{
-   struct list_head *list;
-#ifdef CONFIG_SMP
-   int cpu;
-   cpu = smp_processor_id();
-   file->f_sb_list_cpu = cpu;
-   list = per_cpu_ptr(sb->s_files, cpu);
-#else
-   list = &sb->s_files;
-#endif
-   list_add(&file->f_u.fu_list, list);
-}
-
-/**
- * file_sb_list_add - add a file to the sb's file list
- * @file: file to add
- * @sb: sb to add it to
- *
- * Use this function to associate a file with the superblock of the inode it
- * refers to.
- */
-void file_sb_list_add(struct file *file, struct super_block *sb)
-{
-   lg_local_lock(&files_lglock);
-   __file_sb_list_add(file, sb);
-   lg_local_unlock(&files_lglock);
-}
-
-/**
- * file_sb_list_del - remove a file from the sb's file list
- * @file: file to remove
- * @sb: sb to remove it from
- *
- * Use this function to remove a file from its superblock.
- */
-void file_sb_list_del(struct file *file)
-{
-   if (!list_empty(&file->f_u.fu_list)) {
-   lg_local_lock_cpu(&files_lglock, file_list_cpu(file));
-   list_del_init(&file->f_u.fu_list);
-   lg_local_unlock_cpu(&files_lglock, file_list_cpu(file));
-   }
-}
-
-#ifdef CONFIG_SMP
-
-/*
- * These macro

[Devel] [RH7 PATCH 1/2] fs: check container odirect and fsync settings in __dentry_open

2015-06-24 Thread Dmitry Monakhov
sys_open for conventional filesystems doesn't call dentry_open();
it calls __dentry_open() (in nameidata_to_filp), so we have to move
the checks for odirect and fsync behaviour to __dentry_open()
to make them work on ploop containers.

https://jira.sw.ru/browse/PSBM-17157

Signed-off-by: Dmitry Guryanov 
Acked-by: Dmitry Monakhov 
Signed-off-by: Dmitry Monakhov 
---
 fs/open.c |   10 +-
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 435794f..d64cfad 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -683,6 +683,11 @@ static int do_dentry_open(struct file *f,
struct inode *inode;
int error;
 
+   if (!may_use_odirect())
+   f->f_flags &= ~O_DIRECT;
+   if (ve_fsync_behavior() == FSYNC_NEVER)
+   f->f_flags &= ~O_SYNC;
+
f->f_mode = OPEN_FMODE(f->f_flags) | FMODE_LSEEK |
FMODE_PREAD | FMODE_PWRITE;
 
@@ -824,11 +829,6 @@ struct file *dentry_open(const struct path *path, int 
flags,
/* We must always pass in a valid mount pointer. */
BUG_ON(!path->mnt);
 
-   if (!may_use_odirect())
-   flags &= ~O_DIRECT;
-   if (ve_fsync_behavior() == FSYNC_NEVER)
-   flags &= ~O_SYNC;
-
f = get_empty_filp();
if (!IS_ERR(f)) {
f->f_flags = flags;
-- 
1.7.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [RH7 PATCH 2/2] jbd2: raid amnesia protection for the journal

2015-06-24 Thread Dmitry Monakhov
https://jira.sw.ru/browse/PSBM-15484

Some block devices can return different data on read requests from the same
block after a power failure (for example, a mirrored raid that is out of sync
while resync is in progress). In that case the following situation is
possible:

A power failure happens after the commit block was issued for
transaction 'D'; on next boot the first disk will have the commit block, but
the second one will not.
mirror1: journal={Ac-Bc-Cc-Dc }
mirror2: journal={Ac-Bc-Cc-D  }
Now let's assume that we read from mirror1 and find that 'D' has a
valid commit block, so journal replay will replay that transaction. But a
second power failure may happen before journal_reset(), so the next
journal_replay() may read from mirror2 and find that 'C' is the last valid
transaction. This results in corruption because we have already replayed
transaction 'D'.
In order to avoid such ambiguity we should perform a 'stabilize write':
1) read and rewrite the latest commit block;
2) invalidate the next block, to guarantee that the journal head
becomes stable.

Signed-off-by: Dmitry Monakhov 
---
 fs/jbd2/recovery.c |   77 +++-
 1 files changed, 76 insertions(+), 1 deletions(-)

diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 626846b..253485c 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -36,6 +36,9 @@ struct recovery_info
int nr_replays;
int nr_revokes;
int nr_revoke_hits;
+
+   unsigned intlast_log_block;
+   struct buffer_head  *last_commit_bh;
 };
 
 enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
@@ -233,6 +236,71 @@ do {\
var -= ((journal)->j_last - (journal)->j_first);\
 } while (0)
 
+/*
+ * The 'Raid amnesia' effect protection: https://jira.sw.ru/browse/PSBM-15484
+ *
+ * Some block devices can return different data on read requests from the
+ * same block after a power failure (for example, a mirrored raid that is
+ * out of sync while resync is in progress). In that case the following
+ * situation is possible:
+ *
+ * A power failure happens after the commit block for transaction 'D' was
+ * issued; on the next boot the first disk will have the commit block, but
+ * the second one will not.
+ * mirror1: journal={Ac-Bc-Cc-Dc }
+ * mirror2: journal={Ac-Bc-Cc-D  }
+ * Now assume we read from mirror1 and find that 'D' has a valid commit
+ * block, so journal_replay will replay that transaction; but a second
+ * power failure may happen before journal_reset(), so the next
+ * journal_replay() may read from mirror2 and find that 'C' is the last
+ * valid transaction. This results in corruption because we have already
+ * replayed transaction 'D'.
+ * To avoid such ambiguity we perform a 'stabilize write':
+ * 1) Read and rewrite the latest commit block
+ * 2) Invalidate the next block, to guarantee that the journal head
+ *    becomes stable.
+ * Yes, I know the 'stabilize write' approach is ugly, but it is the only
+ * way to run a filesystem on block devices with the 'raid amnesia' effect.
+ */
+static int stabilize_journal_head(journal_t *journal, struct recovery_info *info)
+{
+   struct buffer_head *bh[2] = {NULL, NULL};
+   int err, err2, i;
+
+   if (!info->last_commit_bh)
+   return 0;
+
+   bh[0] = info->last_commit_bh;
+   info->last_commit_bh = NULL;
+
+   err = jread(&bh[1], journal, info->last_log_block);
+   if (err)
+   goto out;
+
+   for (i = 0; i < 2; i++) {
+   lock_buffer(bh[i]);
+   /* Explicitly invalidate block beyond last commit block */
+   if (i == 1)
+   memset(bh[i]->b_data, 0, journal->j_blocksize);
+
+   BUFFER_TRACE(bh[i], "marking dirty");
+   set_buffer_uptodate(bh[i]);
+   mark_buffer_dirty(bh[i]);
+   BUFFER_TRACE(bh[i], "marking uptodate");
+   unlock_buffer(bh[i]);
+   }
+   err = sync_blockdev(journal->j_dev);
+   /* Make sure data is on permanent storage */
+   if (journal->j_flags & JBD2_BARRIER) {
+   err2 = blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL);
+   if (!err)
+   err = err2;
+   }
+out:
+   brelse(bh[0]);
+   brelse(bh[1]);
+   return err;
+}
+
 /**
 * jbd2_journal_recover - recovers an on-disk journal
  * @journal: the journal to recover
@@ -270,6 +338,8 @@ int jbd2_journal_recover(journal_t *journal)
 
err = do_one_pass(journal, &info, PASS_SCAN);
if (!err)
+   err = stabilize_journal_head(journal, &info);
+   if (!err)
err = do_one_pass(journal, &info, PASS_REVOKE);
 

[Devel] [RH7 PATCH 1/6] vfs: add support for a lazytime mount option

2015-06-25 Thread Dmitry Monakhov
ML-commit: 0ae45f63d4ef8d8eeec49c7d8b44a1775fff13e8

Add a new mount option which enables a new "lazytime" mode.  This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode.  The on-disk times will only get
updated when (a) the inode needs to be updated for some non-time
related change, (b) userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests it via an fsync(2) call.

For workloads which feature a large number of random writes to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table.  The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives.  Even on conventional HDDs, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact long tail latencies --- which
is a very big deal for web serving tiers (for example).
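
A minimal userspace sketch of using the new mode (device, mountpoint and
the fallback #define are illustrative; MS_LAZYTIME itself comes from the
include/uapi/linux/fs.h change in the diffstat below):

    #include <sys/mount.h>

    #ifndef MS_LAZYTIME
    #define MS_LAZYTIME (1 << 25)   /* value added by this series */
    #endif

    /* timestamp updates now only dirty the in-memory inode */
    mount("/dev/sdb1", "/mnt/data", "ext4", MS_LAZYTIME, NULL);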

Google-Bug-Id: 18297052
PSBM-Bug-id: https://jira.sw.ru/browse/PSBM-20411

Signed-off-by: Theodore Ts'o 
Signed-off-by: Al Viro 
Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/inode.c  |6 
 fs/fs-writeback.c|   62 +++--
 fs/gfs2/file.c   |4 +-
 fs/inode.c   |   56 --
 fs/jfs/file.c|2 +-
 fs/libfs.c   |2 +-
 fs/proc_namespace.c  |1 +
 fs/sync.c|9 +
 include/linux/backing-dev.h  |1 +
 include/linux/fs.h   |5 +++
 include/trace/events/writeback.h |   60 -
 include/uapi/linux/fs.h  |4 ++-
 mm/backing-dev.c |   10 +-
 13 files changed, 188 insertions(+), 34 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3f520b3..b8173af 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5105,11 +5105,17 @@ int ext4_mark_inode_dirty(handle_t *handle, struct inode *inode)
  * If the inode is marked synchronous, we don't honour that here - doing
  * so would cause a commit on atime updates, which we don't bother doing.
  * We handle synchronous inodes at the highest possible level.
+ *
+ * If only the I_DIRTY_TIME flag is set, we can skip everything.  If
+ * I_DIRTY_TIME and I_DIRTY_SYNC is set, the only inode fields we need
+ * to copy into the on-disk inode structure are the timestamp fields.
  */
 void ext4_dirty_inode(struct inode *inode, int flags)
 {
handle_t *handle;
 
+   if (flags == I_DIRTY_TIME)
+   return;
handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
if (IS_ERR(handle))
goto out;
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8161c40..5355fad 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -241,14 +241,19 @@ static bool inode_dirtied_after(struct inode *inode, unsigned long t)
return ret;
 }
 
+#define EXPIRE_DIRTY_ATIME 0x0001
+
 /*
  * Move expired (dirtied before work->older_than_this) dirty inodes from
  * @delaying_queue to @dispatch_queue.
  */
 static int move_expired_inodes(struct list_head *delaying_queue,
   struct list_head *dispatch_queue,
+  int flags,
   struct wb_writeback_work *work)
 {
+   unsigned long *older_than_this = NULL;
+   unsigned long expire_time;
LIST_HEAD(tmp);
struct list_head *pos, *node;
struct super_block *sb = NULL;
@@ -256,13 +261,25 @@ static int move_expired_inodes(struct list_head *delaying_queue,
int do_sb_sort = 0;
int moved = 0;
 
+
+   if ((flags & EXPIRE_DIRTY_ATIME) == 0)
+   older_than_this = &work->older_than_this;
+   else if ((work->reason == WB_REASON_SYNC) == 0) {
+   expire_time = jiffies - (HZ * 86400);
+   older_than_this = &expire_time;
+   }
+
WARN_ON_ONCE(!work->older_than_this_is_set);
while (!list_empty(delaying_queue)) {
inode = wb_inode(delaying_queue->prev);
if (inode_dirtied_after(inode, work->older_than_this))
break;
+
list_move(&inode->i_wb_list, &tmp);
moved++;
+   if (flags & EXPIRE_DIRTY_ATIME)
+   set_bit(__I_DIRTY_TIME_EXPIRED, &inode->i_state);
+
if (sb_is_blkdev_sb(inode->i_sb))
continue;
if (sb && sb != inode->i_sb)
@@ -303,9 +320,12 @@ out:
 static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work)
 {
int moved;

[Devel] [RH7 PATCH 2/6] vfs: add find_inode_nowait() function

2015-06-25 Thread Dmitry Monakhov
ML-commit: fe032c422c5ba562ba9c2d316f55e258e03259c6

Add a new function find_inode_nowait() which is an even more general
version of ilookup5_nowait().  It is designed for callers which need
very fine grained control over when the function is allowed to block
or increment the inode's reference count.
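
A minimal sketch of a @match callback honouring the contract documented
in the hunk below (the function name and the stop-after-first-hit policy
are illustrative, not part of the patch; __iget() is the fs-internal
reference helper):

    static int sample_match(struct inode *inode, unsigned long ino, void *data)
    {
    	if (inode->i_ino != ino)
    		return 0;		/* no match, keep scanning */
    	spin_lock(&inode->i_lock);
    	if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
    		spin_unlock(&inode->i_lock);
    		return -1;		/* stop, take no reference */
    	}
    	__iget(inode);			/* ref taken under i_lock; never sleeps */
    	spin_unlock(&inode->i_lock);
    	return 1;			/* find_inode_nowait() returns this inode */
    }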

Signed-off-by: Theodore Ts'o 
Signed-off-by: Al Viro 
Signed-off-by: Dmitry Monakhov 
---
 fs/inode.c |   50 ++
 include/linux/fs.h |5 +
 2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 5d73316..297951b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1298,6 +1298,56 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
 }
 EXPORT_SYMBOL(ilookup);
 
+/**
+ * find_inode_nowait - find an inode in the inode cache
+ * @sb:super block of file system to search
+ * @hashval:   hash value (usually inode number) to search for
+ * @match: callback used for comparisons between inodes
+ * @data:  opaque data pointer to pass to @match
+ *
+ * Search for the inode specified by @hashval and @data in the inode
+ * cache, where the helper function @match will return 0 if the inode
+ * does not match, 1 if the inode does match, and -1 if the search
+ * should be stopped.  The @match function must be responsible for
+ * taking the i_lock spin_lock and checking i_state for an inode being
+ * freed or being initialized, and incrementing the reference count
+ * before returning 1.  It also must not sleep, since it is called with
+ * the inode_hash_lock spinlock held.
+ *
+ * This is an even more generalized version of ilookup5() when the
+ * function must never block --- find_inode() can block in
+ * __wait_on_freeing_inode() --- or when the caller cannot increment
+ * the reference count because the resulting iput() might cause an
+ * inode eviction.  The tradeoff is that the @match function must be
+ * very carefully implemented.
+ */
+struct inode *find_inode_nowait(struct super_block *sb,
+   unsigned long hashval,
+   int (*match)(struct inode *, unsigned long,
+void *),
+   void *data)
+{
+   struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+   struct inode *inode, *ret_inode = NULL;
+   int mval;
+
+   spin_lock(&inode_hash_lock);
+   hlist_for_each_entry(inode, head, i_hash) {
+   if (inode->i_sb != sb)
+   continue;
+   mval = match(inode, hashval, data);
+   if (mval == 0)
+   continue;
+   if (mval == 1)
+   ret_inode = inode;
+   goto out;
+   }
+out:
+   spin_unlock(&inode_hash_lock);
+   return ret_inode;
+}
+EXPORT_SYMBOL(find_inode_nowait);
+
 int insert_inode_locked(struct inode *inode)
 {
struct super_block *sb = inode->i_sb;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6397d36..041edd2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2571,6 +2571,11 @@ extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
 
 extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
 extern struct inode * iget_locked(struct super_block *, unsigned long);
+extern struct inode *find_inode_nowait(struct super_block *,
+  unsigned long,
+  int (*match)(struct inode *,
+   unsigned long, void *),
+  void *data);
 extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
 extern int insert_inode_locked(struct inode *);
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-- 
1.7.1



[Devel] [RH7 PATCH 3/6] ext4: add optimization for the lazytime mount option

2015-06-25 Thread Dmitry Monakhov
ML-commit: a26f49926da938f47561f386be56a83dd37a496d

Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.

Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME.  We can eventually make this go away once util-linux has
added support.
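
In outline, the optimization pairs find_inode_nowait() from the previous
patch with a match callback: whenever one inode in an inode table block
is written out, every sibling inode in that same block that is dirty
only in its timestamps gets its on-disk times refreshed for free (see
other_inode_match() and ext4_update_other_inodes_time() below).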

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o 
Signed-off-by: Al Viro 
Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/inode.c |   60 +++
 fs/ext4/super.c |   10 +++
 include/trace/events/ext4.h |   30 +
 3 files changed, 100 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b8173af..137e828 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4437,6 +4437,63 @@ static int ext4_inode_blocks_set(handle_t *handle,
return 0;
 }
 
+struct other_inode {
+   unsigned long   orig_ino;
+   struct ext4_inode   *raw_inode;
+};
+
+static int other_inode_match(struct inode * inode, unsigned long ino,
+void *data)
+{
+   struct other_inode *oi = (struct other_inode *) data;
+
+   if ((inode->i_ino != ino) ||
+   (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW |
+  I_DIRTY_SYNC | I_DIRTY_DATASYNC)) ||
+   ((inode->i_state & I_DIRTY_TIME) == 0))
+   return 0;
+   spin_lock(&inode->i_lock);
+   if (((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW |
+   I_DIRTY_SYNC | I_DIRTY_DATASYNC)) == 0) &&
+   (inode->i_state & I_DIRTY_TIME)) {
+   struct ext4_inode_info  *ei = EXT4_I(inode);
+
+   inode->i_state &= ~(I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED);
+
+   EXT4_INODE_SET_XTIME(i_ctime, inode, oi->raw_inode);
+   EXT4_INODE_SET_XTIME(i_mtime, inode, oi->raw_inode);
+   EXT4_INODE_SET_XTIME(i_atime, inode, oi->raw_inode);
+   ext4_inode_csum_set(inode, oi->raw_inode, ei);
+   spin_unlock(&inode->i_lock);
+   trace_ext4_other_inode_update_time(inode, oi->orig_ino);
+   return -1;
+   }
+   spin_unlock(&inode->i_lock);
+   return -1;
+}
+
+/*
+ * Opportunistically update the other time fields for other inodes in
+ * the same inode table block.
+ */
+static void ext4_update_other_inodes_time(struct super_block *sb,
+ unsigned long orig_ino, char *buf)
+{
+   struct other_inode oi;
+   unsigned long ino;
+   int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+   int inode_size = EXT4_INODE_SIZE(sb);
+
+   oi.orig_ino = orig_ino;
+   ino = orig_ino & ~(inodes_per_block - 1);
+   for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
+   if (ino == orig_ino)
+   continue;
+   oi.raw_inode = (struct ext4_inode *) buf;
+   (void) find_inode_nowait(sb, ino, other_inode_match, &oi);
+   }
+}
+
 /*
  * Post the struct inode info into an on-disk inode location in the
  * buffer-cache.  This gobbles the caller's reference to the
@@ -4553,6 +4610,9 @@ static int ext4_do_update_inode(handle_t *handle,
}
 
ext4_inode_csum_set(inode, raw_inode, ei);
+   if (inode->i_sb->s_flags & MS_LAZYTIME)
+   ext4_update_other_inodes_time(inode->i_sb, inode->i_ino,
+ bh->b_data);
 
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
rc = ext4_handle_dirty_metadata(handle, NULL, bh);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 80975b7..194f271 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1164,6 +1164,7 @@ enum {
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
Opt_usrquota, Opt_grpquota, Opt_i_version,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
+   Opt_lazytime, Opt_nolazytime,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
@@ -1227,6 +1228,8 @@ static const match_table_t tokens = {
{Opt_i_version, "i_version"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
+   {Opt_lazytime, "lazytime"},
+   {Opt_nolazytime, "nolazytime"},
{Opt_nodelalloc, "nodelalloc"},
{Opt_removed, "mblk_io_submit"},
{Opt_removed, "nomblk_io_submit&

[Devel] [RH7 PATCH 4/6] fs: make sure the timestamps for lazytime inodes eventually get written

2015-06-25 Thread Dmitry Monakhov
ML-commit: a2f4870697a5bcf4a87073ec6b32dd2928c1211d

Jan Kara pointed out that if there is an inode which is constantly
getting dirtied with I_DIRTY_PAGES, an inode with an updated timestamp
will never be written since inode->dirtied_when is constantly getting
updated.  We fix this by adding an extra field to the inode,
dirtied_time_when, so inodes with a stale dirtytime can get detected
and handled.

In addition, if we have a dirtytime inode caused by an atime update,
and there is no write activity on the file system, we need to have a
secondary system to make sure these inodes get written out.  We do
this by setting up a second delayed work structure which wakes up the
CPU much more rarely compared to writeback_expire_centisecs.
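
In outline the secondary mechanism is just a rarely-firing delayed work
item; a simplified sketch follows (the real code is in the truncated
fs/fs-writeback.c hunk of this patch and may differ in detail):

    static void wakeup_dirtytime_writeback(struct work_struct *w);
    static DECLARE_DELAYED_WORK(dirtytime_work, wakeup_dirtytime_writeback);

    static int __init start_dirtytime_writeback(void)
    {
    	/* the work fn walks the bdi list and then re-arms itself */
    	schedule_delayed_work(&dirtytime_work,
    			      dirtytime_expire_interval * HZ);
    	return 0;
    }
    __initcall(start_dirtytime_writeback);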

Signed-off-by: Theodore Ts'o 
Reviewed-by: Jan Kara 
Signed-off-by: Dmitry Monakhov 
---
 fs/fs-writeback.c  |   83 +--
 include/linux/fs.h |1 +
 2 files changed, 74 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5355fad..d7fb340 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -59,6 +59,18 @@ struct wb_writeback_work {
struct completion *done;/* set if the caller waits */
 };
 
+/*
+ * If an inode is constantly having its pages dirtied, but then the
+ * updates stop dirtytime_expire_interval seconds in the past, it's
+ * possible for the worst case time between when an inode has its
+ * timestamps updated and when they finally get written out to be two
+ * dirtytime_expire_intervals.  We set the default to 12 hours (in
+ * seconds), which means most of the time inodes will have their
+ * timestamps written to disk after 12 hours, but in the worst case a
+ * few inodes might not have their timestamps updated for 24 hours.
+ */
+unsigned int dirtytime_expire_interval = 12 * 60 * 60;
+
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -264,8 +276,8 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 
if ((flags & EXPIRE_DIRTY_ATIME) == 0)
older_than_this = &work->older_than_this;
-   else if ((work->reason == WB_REASON_SYNC) == 0) {
-   expire_time = jiffies - (HZ * 86400);
+   else if (!work->for_sync) {
+   expire_time = jiffies - (dirtytime_expire_interval * HZ);
older_than_this = &expire_time;
}
 
@@ -449,6 +461,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 */
redirty_tail(inode, wb);
} else if (inode->i_state & I_DIRTY_TIME) {
+   inode->dirtied_when = jiffies;
list_move(&inode->i_wb_list, &wb->b_dirty_time);
} else {
/* The inode is clean. Remove from writeback lists. */
@@ -497,13 +510,19 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
/* Clear I_DIRTY_PAGES if we've written out all dirty pages */
if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
inode->i_state &= ~I_DIRTY_PAGES;
+
dirty = inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC);
-   if (((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) &&
-(inode->i_state & I_DIRTY_TIME)) ||
-   (inode->i_state & I_DIRTY_TIME_EXPIRED)) {
-   dirty |= I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED;
-   trace_writeback_lazytime(inode);
-   }
+   if (inode->i_state & I_DIRTY_TIME) {
+   if ((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) ||
+   unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) ||
+   unlikely(time_after(jiffies,
+   (inode->dirtied_time_when +
+dirtytime_expire_interval * HZ)))) {
+   dirty |= I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED;
+   trace_writeback_lazytime(inode);
+   }
+   } else
+   inode->i_state &= ~I_DIRTY_TIME_EXPIRED;
inode->i_state &= ~dirty;
spin_unlock(&inode->i_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
@@ -1122,6 +1141,45 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
rcu_read_unlock();
 }
 
+/*
+ * Wake up bdi's periodically to make sure dirtytime inodes gets
+ * written back periodically.  We deliberately do *not* check the
+ * b_dirtytime list in wb_has_dirty_io(), since this would cause the
+ * kernel to be constantly waking up once there are any dirtytime
+ * inodes on the system.  So instead we define a separate delayed work
+ * function which gets called much more rarely.  (By default, only
+ * once every 12 hours.)
+ *
+ * 

[Devel] [RH7 PATCH 5/6] fs: add dirtytime_expire_seconds sysctl

2015-06-25 Thread Dmitry Monakhov
ML-commit: 1efff914afac8a965ad63817ecf8861a927c2ace

Add a tuning knob so we can adjust the dirtytime expiration timeout,
which is very useful for testing lazytime.
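
In practice the knob shows up as /proc/sys/vm/dirtytime_expire_seconds
(it is added to the vm sysctl table below); note that the handler also
kicks dirtytime_work immediately on a write, so a lowered timeout takes
effect without waiting for the previously armed timer.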

Signed-off-by: Theodore Ts'o 
Reviewed-by: Jan Kara 
Signed-off-by: Dmitry Monakhov 
---
 fs/fs-writeback.c |   11 +++
 include/linux/writeback.h |3 +++
 kernel/sysctl.c   |8 
 3 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d7fb340..c390188 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1180,6 +1180,17 @@ static int __init start_dirtytime_writeback(void)
 }
 __initcall(start_dirtytime_writeback);
 
+int dirtytime_interval_handler(struct ctl_table *table, int write,
+  void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   int ret;
+
+   ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+   if (ret == 0 && write)
+   mod_delayed_work(system_wq, &dirtytime_work, 0);
+   return ret;
+}
+
 static noinline void block_dump___mark_inode_dirty(struct inode *inode)
 {
if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d00fdc0..32e0bf8 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -135,6 +135,7 @@ extern int vm_dirty_ratio;
 extern unsigned long vm_dirty_bytes;
 extern unsigned int dirty_writeback_interval;
 extern unsigned int dirty_expire_interval;
+extern unsigned int dirtytime_expire_interval;
 extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
@@ -151,6 +152,8 @@ extern int dirty_ratio_handler(struct ctl_table *table, int write,
 extern int dirty_bytes_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
+int dirtytime_interval_handler(struct ctl_table *table, int write,
+  void __user *buffer, size_t *lenp, loff_t *ppos);
 
 struct ctl_table;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index fe20216..2380136 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1246,6 +1246,14 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
{
+   .procname   = "dirtytime_expire_seconds",
+   .data   = &dirtytime_expire_interval,
+   .maxlen = sizeof(dirty_expire_interval),
+   .mode   = 0644,
+   .proc_handler   = dirtytime_interval_handler,
+   .extra1 = &zero,
+   },
+   {
.procname   = "nr_pdflush_threads",
.mode   = 0444 /* read-only */,
.proc_handler   = pdflush_proc_obsolete,
-- 
1.7.1



[Devel] [RH7 PATCH 6/6] ext4: fix lazytime optimization

2015-06-25 Thread Dmitry Monakhov
ML-commit: 8f4d855839179f410fa910a26eb81d646d628f26

We had a fencepost error in the lazytime optimization which meant that
timestamps could get written to the wrong inode.
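
To see the fencepost, assume 32 inodes per inode table block and recall
that ext4 inode numbers are 1-based: for orig_ino = 33 (the first inode
of its block) the old expression 33 & ~31 yields 32, so the scan started
one slot early and in-memory timestamps were written into the
neighbouring inode's on-disk slot; (33 & ~31) + 1 = 33 starts the scan
at the right inode.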

Signed-off-by: Theodore Ts'o 
Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/inode.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 137e828..c0300e9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4485,7 +4485,7 @@ static void ext4_update_other_inodes_time(struct super_block *sb,
int inode_size = EXT4_INODE_SIZE(sb);
 
oi.orig_ino = orig_ino;
-   ino = orig_ino & ~(inodes_per_block - 1);
+   ino = (orig_ino & ~(inodes_per_block - 1)) + 1;
for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
if (ino == orig_ino)
continue;
-- 
1.7.1



Re: [Devel] [PATCH RH7 v2 1/4] ve: remove sysctl_fsync_enable and use ve_fsync_behavior instead

2015-06-30 Thread Dmitry Monakhov
Pavel Tikhomirov  writes:

> - sysctl_fsync_enable is always = 2 and checking it is meaningless
> - we already changed it with ve_fsync_behavior in setfl and dentry_open
> - in do_fsync and syncfs we have both checks
> - in msync, replace it with a ve_fsync_behavior() check
> - in sync_file_range we don't need replacement according to patch
>   diff-ve-fsync-behavior-sanitize:
> * Don't filter syncs in sync_file_range, since this syscall is
> not technically sync, then name is misleading
>
> Reviewed-by: Vladimir Davydov 
> Signed-off-by: Pavel Tikhomirov 
ACK
> ---
>  fs/sync.c  | 9 -
>  include/linux/fs.h | 1 -
>  mm/msync.c | 2 +-
>  3 files changed, 1 insertion(+), 11 deletions(-)
>
> diff --git a/fs/sync.c b/fs/sync.c
> index 45649b6..b5a2f58 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -24,8 +24,6 @@
>  #define VALID_FLAGS (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE| \
>   SYNC_FILE_RANGE_WAIT_AFTER)
>  
> -int sysctl_fsync_enable = 2;
> -
>  /*
>   * Do the filesystem syncing work. For simple filesystems
>   * writeback_inodes_sb(sb) just dirties buffers with inodes so we have to
> @@ -327,8 +325,6 @@ SYSCALL_DEFINE1(syncfs, int, fd)
>   if (is_child_reaper(task_pid(current)))
>   goto fdput;
>  
> - if (!sysctl_fsync_enable)
> - goto fdput;
>   fsb = __ve_fsync_behavior(ve);
>   if (fsb == FSYNC_NEVER)
>   goto fdput;
> @@ -405,8 +401,6 @@ static int do_fsync(unsigned int fd, int datasync)
>   struct fd f;
>   int ret = -EBADF;
>  
> - if (!ve_is_super(get_exec_env()) && !sysctl_fsync_enable)
> - return 0;
>   if (ve_fsync_behavior() == FSYNC_NEVER)
>   return 0;
>  
> @@ -502,9 +496,6 @@ SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
>   loff_t endbyte; /* inclusive */
>   umode_t i_mode;
>  
> - if (!ve_is_super(get_exec_env()) && !sysctl_fsync_enable)
> - return 0;
> -
>   ret = -EINVAL;
>   if (flags & ~VALID_FLAGS)
>   goto out;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fdade5c..9bdf99f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -58,7 +58,6 @@ extern struct inodes_stat_t inodes_stat;
>  extern int leases_enable, lease_break_time;
>  extern int sysctl_protected_symlinks;
>  extern int sysctl_protected_hardlinks;
> -extern int sysctl_fsync_enable;
>  
>  struct buffer_head;
>  typedef int (get_block_t)(struct inode *inode, sector_t iblock,
> diff --git a/mm/msync.c b/mm/msync.c
> index b7d634a..f47b2d7 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -48,7 +48,7 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
>   if (end < start)
>   goto out;
>   error = 0;
> - if (!ve_is_super(get_exec_env()) && !sysctl_fsync_enable)
> + if (ve_fsync_behavior() == FSYNC_NEVER)
>   goto out;
>   if (end == start)
>   goto out;
> -- 
> 1.9.3




Re: [Devel] [PATCH RH7 v2 2/4] ve: initialize fsync_enable also for non ve0 environment

2015-06-30 Thread Dmitry Monakhov
Pavel Tikhomirov  writes:

> v2: only on ve cgroup creation
ACK
>
> https://jira.sw.ru/browse/PSBM-34286
> Signed-off-by: Pavel Tikhomirov 
> ---
>  kernel/ve/ve.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
> index 8bbba1f..212c781 100644
> --- a/kernel/ve/ve.c
> +++ b/kernel/ve/ve.c
> @@ -637,6 +637,8 @@ static struct cgroup_subsys_state *ve_create(struct cgroup *cg)
>   if (!ve->ve_name)
>   goto err_name;
>  
> + ve->fsync_enable = 2;
> +
>   ve->sched_lat_ve.cur = alloc_percpu(struct kstat_lat_pcpu_snap_struct);
>   if (!ve->sched_lat_ve.cur)
>   goto err_lat;
> -- 
> 1.9.3




Re: [Devel] [PATCH RH7 v2 4/4] ve: cgroup: initialize odirect_enable, features and _randomize_va_space

2015-06-30 Thread Dmitry Monakhov
Pavel Tikhomirov  writes:

> v2: move initialization from init_ve_struct to ve_create, remove
> get_ve_features
>
> Signed-off-by: Pavel Tikhomirov 
> ---
>  kernel/ve/ve.c  |  5 +
>  kernel/ve/vecalls.c | 23 ---
>  2 files changed, 5 insertions(+), 23 deletions(-)
ACK
>
> diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
> index 802dc79..e50b9ee 100644
> --- a/kernel/ve/ve.c
> +++ b/kernel/ve/ve.c
> @@ -665,6 +665,11 @@ static struct cgroup_subsys_state *ve_create(struct cgroup *cg)
>   if (!ve->ve_name)
>   goto err_name;
>  
> + ve->_randomize_va_space = ve0._randomize_va_space;
> +
> + ve->features = VE_FEATURES_DEF;
> +
> + ve->odirect_enable = 2;
>   ve->fsync_enable = 2;
>  
>  #ifdef CONFIG_VE_IPTABLES
> diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c
> index 71ee93d..b171492 100644
> --- a/kernel/ve/vecalls.c
> +++ b/kernel/ve/vecalls.c
> @@ -205,33 +205,10 @@ static inline int init_ve_namespaces(void)
>   return 0;
>  }
>  
> -static __u64 get_ve_features(env_create_param_t *data, int datalen)
> -{
> - __u64 known_features;
> -
> - if (datalen < sizeof(struct env_create_param3))
> - /* this version of vzctl is aware of VE_FEATURES_OLD only */
> - known_features = VE_FEATURES_OLD;
> - else
> - known_features = data->known_features;
> -
> - /*
> -  * known features are set as required
> -  * yet unknown features are set as in VE_FEATURES_DEF
> -  */
> - return (data->feature_mask & known_features) |
> - (VE_FEATURES_DEF & ~known_features);
> -}
> -
>  static int init_ve_struct(struct ve_struct *ve,
>   u32 class_id, env_create_param_t *data, int datalen)
>  {
>   ve->class_id = class_id;
> - ve->features = get_ve_features(data, datalen);
> -
> - ve->_randomize_va_space = ve0._randomize_va_space;
> -
> - ve->odirect_enable = 2;
>  
>  #ifdef CONFIG_VE_IPTABLES
>   /* Set up ipt_mask as it will be used during
> -- 
> 1.9.3




Re: [Devel] [PATCH rh7 1/3] vfs: fix data corruption when blocksize < pagesize for mmaped data

2015-07-01 Thread Dmitry Monakhov
Cyrill Gorcunov  writes:

> From: Jan Kara 
>
> ->page_mkwrite() is used by filesystems to allocate blocks under a page
> which is becoming writeably mmapped in some process' address space. This
> allows a filesystem to return a page fault if there is not enough space
> available, user exceeds quota or similar problem happens, rather than
> silently discarding data later when writepage is called.
>
> However VFS fails to call ->page_mkwrite() in all the cases where
> filesystems need it when blocksize < pagesize. For example when
> blocksize = 1024, pagesize = 4096 the following is problematic:
>   ftruncate(fd, 0);
>   pwrite(fd, buf, 1024, 0);
>   map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
>   map[0] = 'a';   ----> page_mkwrite() for index 0 is called
>   ftruncate(fd, 1); /* or even pwrite(fd, buf, 1, 1) */
>   mremap(map, 1024, 1, 0);
>   map[4095] = 'a';  ----> no page_mkwrite() called
>
> At the moment ->page_mkwrite() is called, filesystem can allocate only
> one block for the page because i_size == 1024. Otherwise it would create
> blocks beyond i_size which is generally undesirable. But later at
> ->writepage() time, we also need to store data at offset 4095 but we
> don't have block allocated for it.
>
> This patch introduces a helper function filesystems can use to have
> ->page_mkwrite() called at all the necessary moments.
>
> gorcunov@:
>  - ML 90a8020278c1598fafd071736a0846b38510309c
>  - https://jira.sw.ru/browse/PSBM-34383
>
> Signed-off-by: Jan Kara 
> Signed-off-by: Theodore Ts'o 
> Signed-off-by: Cyrill Gorcunov 
> ---
>  fs/buffer.c|3 ++
>  include/linux/mm.h |1 
>  mm/truncate.c  |   56 -
>  3 files changed, 59 insertions(+), 1 deletion(-)
>
> Index: linux-pcs7.git/fs/buffer.c
> ===
> --- linux-pcs7.git.orig/fs/buffer.c
> +++ linux-pcs7.git/fs/buffer.c
> @@ -2056,6 +2056,7 @@ int generic_write_end(struct file *file,
>   struct page *page, void *fsdata)
>  {
>   struct inode *inode = mapping->host;
> + loff_t old_size = inode->i_size;
>   int i_size_changed = 0;
>  
>   copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> @@ -2075,6 +2076,8 @@ int generic_write_end(struct file *file,
>   unlock_page(page);
>   page_cache_release(page);
>  
> + if (old_size < pos)
> + pagecache_isize_extended(inode, old_size, pos);
>   /*
>* Don't mark the inode dirty under page lock. First, it unnecessarily
>* makes the holding time of page lock longer. Second, it forces lock
> Index: linux-pcs7.git/include/linux/mm.h
> ===
> --- linux-pcs7.git.orig/include/linux/mm.h
> +++ linux-pcs7.git/include/linux/mm.h
> @@ -1072,6 +1072,7 @@ static inline void unmap_shared_mapping_
>  
>  extern void truncate_pagecache(struct inode *inode, loff_t old, loff_t new);
>  extern void truncate_setsize(struct inode *inode, loff_t newsize);
> +void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
>  void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
>  int truncate_inode_page(struct address_space *mapping, struct page *page);
>  int generic_error_remove_page(struct address_space *mapping, struct page *page);
> Index: linux-pcs7.git/mm/truncate.c
> ===
> --- linux-pcs7.git.orig/mm/truncate.c
> +++ linux-pcs7.git/mm/truncate.c
> @@ -20,6 +20,7 @@
>  #include <linux/buffer_head.h>	/* grr. try_to_release_page,
> 					   do_invalidatepage */
>  #include <linux/cleancache.h>
> +#include <linux/rmap.h>
>  #include "internal.h"
>  
>  static void clear_exceptional_entry(struct address_space *mapping,
> @@ -689,12 +690,65 @@ void truncate_setsize(struct inode *inod
>  
>   oldsize = inode->i_size;
>   i_size_write(inode, newsize);
> -
> + if (newsize > oldsize)
> + pagecache_isize_extended(inode, oldsize, newsize);
>   truncate_pagecache(inode, oldsize, newsize);
>  }
>  EXPORT_SYMBOL(truncate_setsize);
>  
>  /**
> + * pagecache_isize_extended - update pagecache after extension of i_size
> + * @inode:   inode for which i_size was extended
> + * @from:original inode size
> + * @to:  new inode size
> + *
> + * Handle extension of inode size either caused by extending truncate or by
> + * write starting after current i_size. We mark the page straddling current
> + * i_size RO so that page_mkwrite() is called on the nearest write access to
> + * the page.  This way filesystem can be sure that page_mkwrite() is called 
> on
> + * the page before user writes to the page via mmap after the i_size has been
> + * changed.
> + *
> + * The function must be called after i_size is updated so that page fault
> + * coming after we unlock the page will already see the new i_size

Re: [Devel] [PATCH 2/3] ext4: fix lost truncate due to race with writeback

2015-07-01 Thread Dmitry Monakhov
Cyrill Gorcunov  writes:

> From: Jan Kara 
>
> The following race can lead to a loss of i_disksize update from truncate
> thus resulting in a wrong inode size if the inode size isn't updated
> again before inode is reclaimed:
>
> ext4_setattr()                          mpage_map_and_submit_extent()
>   EXT4_I(inode)->i_disksize = attr->ia_size;
>   ...                                   ...
>                                         disksize = ((loff_t)mpd->first_page)
>                                                      << PAGE_CACHE_SHIFT
>                                         /* False because i_size isn't
>                                          * updated yet */
>                                         if (disksize > i_size_read(inode))
>                                         /* True, because i_disksize is
>                                          * already truncated */
>                                         if (disksize > EXT4_I(inode)->i_disksize)
>                                                 /* Overwrite i_disksize
>                                                  * update from truncate */
>                                                 ext4_update_i_disksize()
>   i_size_write(inode, attr->ia_size);
>
> For other places updating i_disksize such race cannot happen because
> i_mutex prevents these races. Writeback is the only place where we do
> not hold i_mutex and we cannot grab it there because of lock ordering.
>
> We fix the race by doing both i_disksize and i_size update in truncate
> atomically under i_data_sem and in mpage_map_and_submit_extent() we move
> the check against i_size under i_data_sem as well.
>
> gorcunov@:
>  - ML 90e775b71ac4e685898c7995756fe58c135adaa6
>  - https://jira.sw.ru/browse/PSBM-34383
>
> Signed-off-by: Jan Kara 
> Signed-off-by: "Theodore Ts'o" 
> Signed-off-by: Cyrill Gorcunov 
ACK
> ---
>  fs/ext4/ext4.h  |   24 
>  fs/ext4/inode.c |   15 ---
>  2 files changed, 32 insertions(+), 7 deletions(-)
>
> Index: linux-pcs7.git/fs/ext4/ext4.h
> ===
> --- linux-pcs7.git.orig/fs/ext4/ext4.h
> +++ linux-pcs7.git/fs/ext4/ext4.h
> @@ -2400,16 +2400,32 @@ do {\
>  #define EXT4_FREECLUSTERS_WATERMARK 0
>  #endif
>  
> +/* Update i_disksize. Requires i_mutex to avoid races with truncate */
>  static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
>  {
> - /*
> -  * XXX: replace with spinlock if seen contended -bzzz
> -  */
> + WARN_ON_ONCE(S_ISREG(inode->i_mode) &&
> +  !mutex_is_locked(&inode->i_mutex));
>   down_write(&EXT4_I(inode)->i_data_sem);
>   if (newsize > EXT4_I(inode)->i_disksize)
>   EXT4_I(inode)->i_disksize = newsize;
>   up_write(&EXT4_I(inode)->i_data_sem);
> - return ;
> +}
> +
> +/*
> + * Update i_disksize after writeback has been started. Races with truncate
> + * are avoided by checking i_size under i_data_sem.
> + */
> +static inline void ext4_wb_update_i_disksize(struct inode *inode, loff_t newsize)
> +{
> + loff_t i_size;
> +
> + down_write(&EXT4_I(inode)->i_data_sem);
> + i_size = i_size_read(inode);
> + if (newsize > i_size)
> + newsize = i_size;
> + if (newsize > EXT4_I(inode)->i_disksize)
> + EXT4_I(inode)->i_disksize = newsize;
> + up_write(&EXT4_I(inode)->i_data_sem);
>  }
>  
>  struct ext4_group_info {
> Index: linux-pcs7.git/fs/ext4/inode.c
> ===
> --- linux-pcs7.git.orig/fs/ext4/inode.c
> +++ linux-pcs7.git/fs/ext4/inode.c
> @@ -1788,7 +1788,7 @@ static void mpage_da_map_and_submit(stru
>   if (disksize > i_size_read(mpd->inode))
>   disksize = i_size_read(mpd->inode);
>   if (disksize > EXT4_I(mpd->inode)->i_disksize) {
> - ext4_update_i_disksize(mpd->inode, disksize);
> + ext4_wb_update_i_disksize(mpd->inode, disksize);
>   err = ext4_mark_inode_dirty(handle, mpd->inode);
>   if (err)
>   ext4_error(mpd->inode->i_sb,
> @@ -4831,18 +4831,27 @@ int ext4_setattr(struct dentry *dentry,
>   error = ext4_orphan_add(handle, inode);
>   orphan = 1;
>   }
> + down_write(&EXT4_I(inode)->i_data_sem);
>   EXT4_I(inode)->i_disksize = attr->ia_size;
>   rc = ext4_mark_inode_dirty(handle, inode);
>   if (!error)
>   error = rc;
> + /*
> +  * We have to update i_size under i_data_sem together
> +  * with i_disksize to avoid races with writeback code
> +  * running ext4_wb_update_i_disksize().
> +  */
> + if (!error)
> + i_size_write(

Re: [Devel] [PATCH rh7 3/3] ext4: fix mmap data corruption when blocksize < pagesize

2015-07-01 Thread Dmitry Monakhov
Cyrill Gorcunov  writes:

> From: Jan Kara 
>
> Use pagecache_isize_extended() when a hole is being created in a file so
> that ->page_mkwrite() will get called for the partial tail page if it is
> mmapped (see the first patch in the series for details).
>
> gorcunov@:
>  - ML d6320cbfc92910a3e5f10c42d98c231c98db4f60
>  - https://jira.sw.ru/browse/PSBM-34383
>
> Signed-off-by: Jan Kara 
> Signed-off-by: Theodore Ts'o 
> Signed-off-by: Cyrill Gorcunov 
ACK
> ---
>  fs/ext4/inode.c |6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> Index: linux-pcs7.git/fs/ext4/inode.c
> ===
> --- linux-pcs7.git.orig/fs/ext4/inode.c
> +++ linux-pcs7.git/fs/ext4/inode.c
> @@ -4849,8 +4849,12 @@ int ext4_setattr(struct dentry *dentry,
>   ext4_orphan_del(NULL, inode);
>   goto err_out;
>   }
> - } else
> + } else {
> + loff_t oldsize = inode->i_size;
> +
>   i_size_write(inode, attr->ia_size);
> + pagecache_isize_extended(inode, oldsize, inode->i_size);
> + }
>  
>   /*
>* Blocks are going to be removed from the inode. Wait




[Devel] [RH7 PATCH] ext4: ext4_ext_drop_refs ignore null path

2015-07-02 Thread Dmitry Monakhov
This hunk was part of the following patch:
[RH7 PATCH 10/10] ext4: update defragmentation codebase
Date: Thu, 18 Jun 2015 15:42:57 +0400
Message-Id: <143462-3815-11-git-send-email-dmonak...@openvz.org>

But for some unknown reason it was lost somewhere. This results in a
panic in xfstest ext4/304.

Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/extents.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 37d04d3..8b4a7fc 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -584,9 +584,12 @@ static void ext4_ext_show_move(struct inode *inode, struct ext4_ext_path *path,
 
 void ext4_ext_drop_refs(struct ext4_ext_path *path)
 {
-   int depth = path->p_depth;
+   int depth;
int i;
 
+   if (!path)
+   return;
+   depth = path->p_depth;
for (i = 0; i <= depth; i++, path++)
if (path->p_bh) {
brelse(path->p_bh);
-- 
1.7.1



[Devel] [PATCH] cfq-iosched: Fix wrong children_weight calculation

2015-07-17 Thread Dmitry Monakhov
From: Toshiaki Makita 

cfq_group_service_tree_add() is applying new_weight at the beginning of
the function via cfq_update_group_weight().
This actually allows weight to change between adding it to and subtracting
it from children_weight, and triggers WARN_ON_ONCE() in
cfq_group_service_tree_del(), or even causes an oops by divide error during
vfr calculation in cfq_group_service_tree_add().

The detailed scenario is as follows:
1. Create blkio cgroups X and Y as a child of X.
   Set X's weight to 500 and perform some I/O to apply new_weight.
   This X's I/O completes before starting Y's I/O.
2. Y starts I/O and cfq_group_service_tree_add() is called with Y.
3. cfq_group_service_tree_add() walks up the tree during children_weight
   calculation and adds parent X's weight (500) to children_weight of root.
   children_weight becomes 500.
4. Set X's weight to 1000.
5. X starts I/O and cfq_group_service_tree_add() is called with X.
6. cfq_group_service_tree_add() applies its new_weight (1000).
7. I/O of Y completes and cfq_group_service_tree_del() is called with Y.
8. I/O of X completes and cfq_group_service_tree_del() is called with X.
9. cfq_group_service_tree_del() subtracts X's weight (1000) from
   children_weight of root. children_weight becomes -500.
   This triggers WARN_ON_ONCE().
10. Set X's weight to 500.
11. X starts I/O and cfq_group_service_tree_add() is called with X.
12. cfq_group_service_tree_add() applies its new_weight (500) and adds it
to children_weight of root. children_weight becomes 0. Calculation of
vfr triggers an oops by divide error.

weight should be updated right before adding it to children_weight.
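
Concretely, the fix below splits the update in two: leaf_weight is still
applied when the group is added to its service tree, while the group
weight is now applied inside the parent-propagation loop, immediately
before it is added to parent->children_weight.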

Reported-by: Ruki Sekiya 
Signed-off-by: Toshiaki Makita 
Acked-by: Tejun Heo 
Cc: sta...@vger.kernel.org
Signed-off-by: Jens Axboe 
Signed-off-by: Dmitry Monakhov 
---
 block/cfq-iosched.c |   11 ---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index cadc378..d749463 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1275,12 +1275,16 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 static void
 cfq_update_group_weight(struct cfq_group *cfqg)
 {
-   BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
-
if (cfqg->new_weight) {
cfqg->weight = cfqg->new_weight;
cfqg->new_weight = 0;
}
+}
+
+static void
+cfq_update_group_leaf_weight(struct cfq_group *cfqg)
+{
+   BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
 
if (cfqg->new_leaf_weight) {
cfqg->leaf_weight = cfqg->new_leaf_weight;
@@ -1299,7 +1303,7 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
/* add to the service tree */
BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
 
-   cfq_update_group_weight(cfqg);
+   cfq_update_group_leaf_weight(cfqg);
__cfq_group_service_tree_add(st, cfqg);
 
/*
@@ -1323,6 +1327,7 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 */
while ((parent = cfqg_parent(pos))) {
if (propagate) {
+   cfq_update_group_weight(pos);
propagate = !parent->nr_active++;
parent->children_weight += pos->weight;
}
-- 
1.7.1



[Devel] [RH7 PATCH] ext4: move ext4_truncate_data_csum out of transaction

2015-07-20 Thread Dmitry Monakhov
ext4_truncate_data_csum() implicitly requires a journal transaction, so it
cannot be called inside an already-opened transaction.

BAD_CHAIN-#1:
 ->generic_file_buffered_write_iter
   ->ext4_da_write_begin
 ->ext4_journal_start( ,,1) : reserve 1 journal block
   ->ext4_write_end
 ->ext4_update_data_csum
   ->ext4_truncate_data_csum
 ->ext4_xattr_set
    ->ext4_journal_start(,,20): requires 20 blocks,
    but since the journal is already started
    it uses the existing handle
->jbd2_journal_dirty_metadata
   J_ASSERT_JH(jh, handle->h_buffer_credits > 0) -> ASSERT
BAD_CHAIN-#2
ext4_evict_inode
 ->ext4_journal_start_sb
 ->ext4_truncate
   ->ext4_truncate_data_csum
 ->ext4_close_pfcache
   ->close_mapping_peer
 ->touch_atime
   ->update_time
 ->ext4_dirty_inode
   ->ext4_journal_start_sb -> start journal on another FS ->BUGON

Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/inode.c   |2 ++
 fs/ext4/pfcache.c |3 +++
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2b05910..30ae6b4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -237,6 +237,8 @@ void ext4_evict_inode(struct inode *inode)
 * protection against it
 */
sb_start_intwrite(inode->i_sb);
+   if (inode->i_blocks && ext4_test_inode_state(inode, EXT4_STATE_PFCACHE_CSUM))
+   ext4_truncate_data_csum(inode, inode->i_size);
handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE,
ext4_blocks_for_truncate(inode)+3);
if (IS_ERR(handle)) {
diff --git a/fs/ext4/pfcache.c b/fs/ext4/pfcache.c
index bf45504..902bc0d 100644
--- a/fs/ext4/pfcache.c
+++ b/fs/ext4/pfcache.c
@@ -446,6 +446,8 @@ static int ext4_save_data_csum(struct inode *inode, u8 *csum)
 {
int ret;
 
+   WARN_ON(journal_current_handle());
+
if (ext4_test_inode_state(inode, EXT4_STATE_PFCACHE_CSUM) &&
EXT4_I(inode)->i_data_csum_end < 0 &&
memcmp(EXT4_I(inode)->i_data_csum, csum, EXT4_DATA_CSUM_SIZE))
@@ -501,6 +503,7 @@ int ext4_truncate_data_csum(struct inode *inode, loff_t pos)
return 0;
 
if (EXT4_I(inode)->i_data_csum_end < 0) {
+   WARN_ON(journal_current_handle());
ext4_xattr_set(inode, EXT4_XATTR_INDEX_TRUSTED,
EXT4_DATA_CSUM_NAME, NULL, 0, 0);
ext4_close_pfcache(inode);
-- 
1.7.1



[Devel] [RH7 PATCH 1/3] compile fix for ext4-add-mfsync-support

2015-07-21 Thread Dmitry Monakhov
ext4_flush_unwritten_io was removed in rh7-3.10.0-229.7.2

https://jira.sw.ru/browse/PSBM-34909

Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/fsync.c |3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 99582b8..8235438 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -209,9 +209,6 @@ int ext4_sync_files(struct file **files, unsigned int *flags, unsigned int nr_fi
}
 
mutex_lock(&inode->i_mutex);
-   err2 = ext4_flush_unwritten_io(inode);
-   if (!err || err2 == -EIO)
-   err = err2;
force_commit  |= ext4_should_journal_data(inode);
datawriteback |= ext4_should_writeback_data(inode);
tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
-- 
1.7.1


