Re: KVM "fake DAX" flushing interface - discussion

2018-01-12 Thread Pankaj Gupta

Hello Dan,

> Not a flag, but a new "Address Range Type GUID". See section "5.2.25.2
> System Physical Address (SPA) Range Structure" in the ACPI 6.2A
> specification. Since it is a GUID we could define a Linux specific
> type for this case, but spec changes would allow non-Linux hypervisors
> to advertise a standard interface to guests.
> 

I have added a new SPA range with a GUID for this memory type, and I can add
this new memory type to the system memory map. I need help with the namespace
handling for this new type. As mentioned in the discussion at [1]:

- Create a new namespace for this new memory type
- Teach libnvdimm how to handle this new namespace

I have some queries on this:

1] How should namespace handling for this new memory type work?

2] There are existing namespace types:
  ND_DEVICE_NAMESPACE_IO, ND_DEVICE_NAMESPACE_PMEM, ND_DEVICE_NAMESPACE_BLK

  How will libnvdimm handle this new namespace type in conjunction with the
  existing memory types, regions & namespaces?

3] For sending guest-to-host flush commands, do we still need some
   asynchronous mechanism?

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08404.html 
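For reference, the SPA "Address Range Type GUID" is a plain 16-byte value, so dispatching on a new type is just a GUID compare. A minimal sketch, using the well-known byte-addressable PM GUID from ACPI 6.2 plus a made-up placeholder value for the proposed new type (a real value would have to be assigned through the spec process or as a Linux-specific definition):

```c
#include <stdint.h>
#include <string.h>

/* The SPA Range Structure (ACPI 6.2, sec 5.2.25.2) identifies the range
 * type by a 16-byte GUID in the usual mixed-endian UEFI byte layout.
 * This is the well-known "byte addressable persistent memory" type,
 * 66F0D379-B4F3-4074-AC43-0D3318B78CDB. */
static const uint8_t SPA_PM_GUID[16] = {
	0x79, 0xd3, 0xf0, 0x66, 0xf3, 0xb4, 0x74, 0x40,
	0xac, 0x43, 0x0d, 0x33, 0x18, 0xb7, 0x8c, 0xdb,
};

/* Placeholder for the proposed "fake DAX / flushable" range type.  This
 * value is invented purely for illustration. */
static const uint8_t SPA_VPMEM_GUID[16] = {
	0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
	0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff,
};

/* Range-type dispatch is a straight byte compare of the two GUIDs. */
static int spa_range_type_is(const uint8_t *range_guid, const uint8_t *type)
{
	return memcmp(range_guid, type, 16) == 0;
}
```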

Thanks,
Pankaj
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Delivery reports about your e-mail

2018-01-12 Thread Mail Administrator
ºHsÌHÀ¤BWT°½ßãÀ¥IÏÀìnäK¢‹"ôcB“ó¹Þ»°åh_ë3e!ÞdHœä¿õ¼~_Nu:0ªÍWsR_I>Š8ùv‚âÉ®3Šn`VöRšþß\rà­üo
§£MÎÍk,6’I•…º39ë,Ÿ”ovi–’s¥æúÆ1Ç*âu
†ÛÑÊå–Õï1È>!^L‘ržñ3$ãoXÛ¹0§wÕÃ÷±Údº¤E#Y`èú,©>sŸsìÑ'
C]Ԇå­qQ•µcž%\"G¨?¬û»ž4‹µ
{Ó͐®Õ›|?o
Žï
ZíÒ¸HÙ7;„Þ?ÂsßÜõ3*uˆ´”êu±{³mJÏà­".-®§²w04ãJ§¤«×%*^µØQ(}àrøQÃJ#eWóIø™éæ¸ä?89EEÙ°º¤‚œ­MËï-—ø®zšéᨒÆáȧ—lˆàˆŒ¨¸~Î>]•#Ê.ûAªyx0/"«a*>\µ!ÞlO¯5”£î)者
f|^˜Ú Cj7¡Ü¼t,2ÚLÐR6¾ÌçŒ5%¼Ë¥Ï¨!§‰c¾…EW_Ë1¿!bõ±½Œáê?ÇÍÔæmáËè{È%}yú¹x¡™¸(°âí…
çé9àŠÔâúµ³ª|`MB6n‚žÃüˌú…‹Sì¹ìý­R,â8A;’Ñc|½#: êœéŠþð—K®ŽVŸßØ
ø-%>%ÖЯÅx]oÝ2ðÒT*Óo
uÛáC`³¨¿zȖÛ,6²ÝÅ÷”PbÀ:RÙ۞Ðu¬h†´÷r”ƒ÷éÀäÚZ¢×§Úú„ÖW
zŠT?/ØÜÆ}Áó¿Ô
Œ:„o¿Å£IêL‰dë/
xxw½“Ójr¯9éêÐVlÚ1;ÇÃkV菚ØP‡6UîޗÙûɳòxFäÄË>«ñ›åàĨÚwnÚ£ƒîˆ 
QçYlAâs1Á;ŽrÁ¦Jèfïêâ>Á
XWt*tnš<Øqh½
;°v¨¬®2KµÀ–dao§¹®èøŽw\ô½,yz
Åéë~³:•­HïEä,½d XTu?­”^hÆï¤Þ9™ ¯‚¯Šei
º.Ør.îq­
*4ÌÆíJÀ)C˜çà´¢ 8[Ԟ^”†·µ‘.9Ù#&VeʕZDØÒßh„~™ü矸Ÿ™]„ey5îiË
Š„´…H_¢ÏBŠœ¶'ž±
«õoÄo5m;œÊÁíFoŠÛ|¨h­Ûž;¹å´à¤î 
øIÊÑá6º7CEj¼7‡‡³­|ùÈ}'W<̕]Þ~Ttdµô":³Ø‘s3߶ÎïN“p÷ï?"bÜÒx“ØèÞp…
N.CåªÎ`Åî}Ö§x^ÐKqwx<
B´d}A‘S®4ƒþk…YcÈî¼÷¥ae_jiâè÷¸ÊÂÐÉw
R‚VÀ‰ÂfJÀ–±~»°.øGDÆAi½”aG>N©·I‰n³¤C‰ØÛZÕè¼zIٓ‹„:‹5ēOÁ´4†Mìµø
 › ëkàÞE¹ìŪ©U]ñQ©¼l¸}h/ÝƟÄ]ä⨓4_ìåØvªVK
Îë½·‰‹¤yØÁ~¢ðpAçôY2â2’nHoƒ5ÝÙÖqÓ4gßÐ
Wŏo¬µg¼³® »lHâ9$¿š™”:Âo÷C, È6®
—÷fP9vøU8 /[,qVˆk!¢Èã­ùËzšF…œwN™GSêu>Ô1­¡ý¡~ÍÖ"ÑEئ‡¿cš2ä
Û^Âf.0âK£×ÃÕÏQß÷a`q_tïXzûïß%c4ÚæÜáéHÎà%º–vxÖ.:ÐÇ:Ŧë­ññð9;
¾Ð¡ôŠŠQá¦YµD¬R~
¥ãRš\§"P(GÑA¬O_lÞR
ᓔ䐼 Ö²_££Ž
{flö9±…ó8Š$uéë×ώÉKÈóþƒýÒð7-G‰k,Õ¸é‘z6µ±Á¾å!¦5ˌœ4h~ºk™Zó°ü½ð瑡(m5¾Îó
íf^ñh±¯¯ºpR‡¡ÖN_KAˆ…#Ánù«¢"ZqÖ$Iôµ–ü¯^„olh…y©až
œ{ä>Èyôå…"Ì¡OÁÞûhM
nÀ”0ˆ|c¿ØòT…csìÛíûCœ°:WJ}j
‘¦s> œ­& 
â¾7±§î£DóìoÓíŸa2Å_%‹p¬Æ'-Î÷J¯.g–%´ˆP¡¥à÷§ÌwQÀ˜£_¥–Ó$B$Ú»Xª2`2¦þæ÷2bˆÆU”7±|0¸¦„c¤#ìa‡É©Z
°lÀI!’§×ý3êÛ½«gè
ÓÃ](,Æôú$¶K.rÔ¯‡÷«¯W‡òÊév¾qÑ0±Úø{·.±±$1ì~<ŒÖ3oñRy1b½!ôª&Ý,"ÀBÑJ‹xæq˜



Re: DAX 2MB mappings for XFS

2018-01-12 Thread Kani, Toshi
On Fri, 2018-01-12 at 15:52 -0800, Darrick J. Wong wrote:
> On Fri, Jan 12, 2018 at 11:15:00PM +, Kani, Toshi wrote:
 :
> > > > ext4 creates multiple smaller extents for the same request.
> > > 
> > > Yes, because it has much, much smaller block groups so "allocation >
> > > max extent size (128MB)" is a common path.
> > > 
> > > It's not a common path on XFS - filesystems (and hence AGs) are
> > > typically orders of magnitude larger than the maximum extent size
> > > (8GB) so the problem only shows up when we're near ENOSPC. XFS is
> > > really not optimised for tiny filesystems, and when it comes to pmem
> > > we were led to believe we'd have multiple terabytes of pmem in
> > > systems by now, not still be stuck with 8GB NVDIMMs. Hence we've
> > > spent very little time worrying about such issues because we
> > > weren't aiming to support such small capacities for very long...
> > 
> > I see.  Yes, there will be multiple terabytes of capacity, but it will
> > also be possible to divide it into multiple smaller namespaces.  So, users
> > may continue to have relatively small namespaces for their use cases.  If
> > a user allocates a namespace that is just big enough to host several
> > active files, it may hit this issue regardless of size.
> 
> I am curious: why not just give XFS all the space and let it manage it?

Well, I am not sure whether having multiple namespaces will be a popular
use case.  But it could be useful when a system hosts multiple guests or
containers that require isolation of storage space.

Thanks,
-Toshi


[ndctl PATCH 2/2] ndctl: add an option to check-namespace to rewrite the log

2018-01-12 Thread Vishal Verma
Add a --rewrite-log option to ndctl check-namespace which reads the
active log entries, and rewrites them as though initializing a new BTT.
This allows us to convert an old (pre 4.15) format of log/padding layout
to a new one that is compatible with other BTT implementations.

In the btt-pad-compat unit test, add testing for the format conversion
operation.

Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 Documentation/ndctl/ndctl-check-namespace.txt | 11 +
 ndctl/check.c | 70 ++-
 ndctl/namespace.c |  6 ++-
 test/btt-pad-compat.sh|  9 
 4 files changed, 93 insertions(+), 3 deletions(-)
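To illustrate what the rewrite leaves on media, here is a stand-alone sketch of the operation the patch performs per lane (simplified stand-in structures, endianness handling omitted; this is not the actual ndctl code):

```c
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the BTT on-media log structures; on real
 * media the fields are little-endian. */
struct log_entry { uint32_t lba, old_map, new_map, seq; };
struct log_group { struct log_entry ent[4]; };	/* four slots per lane */

/* The rewrite zeroes the whole group and recreates only slot-0, holding
 * the lane's current live entry with seq == 1.  Slot-0 is common to
 * both the pre-4.15 (0, 2) and the new (0, 1) padding schemes, so a
 * reader using either scheme can initialize its free list from it. */
static void rewrite_group(struct log_group *grp, struct log_entry live)
{
	memset(grp, 0, sizeof(*grp));
	live.seq = 1;
	grp->ent[0] = live;
}
```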

diff --git a/Documentation/ndctl/ndctl-check-namespace.txt b/Documentation/ndctl/ndctl-check-namespace.txt
index 49353b1..ea4183a 100644
--- a/Documentation/ndctl/ndctl-check-namespace.txt
+++ b/Documentation/ndctl/ndctl-check-namespace.txt
@@ -42,6 +42,17 @@ OPTIONS
Perform metadata repairs if possible. Without this option,
the raw namespace contents will not be touched.
 
+-L::
+--rewrite-log::
+   Regenerate the BTT log and write it to media. This can be used to
+   convert from the old (pre 4.15) padding format that was incompatible
+   with other BTT implementations to the updated format. This requires
+   the --repair option to be provided.
+
+   WARNING: Do not interrupt this operation as it can potentially cause
+   unrecoverable metadata corruption. It is highly recommended to create
+   a backup of the raw namespace before attempting this.
+
 -f::
 --force::
Unless this option is specified, a check-namespace operation
diff --git a/ndctl/check.c b/ndctl/check.c
index d3aa1aa..09dd125 100644
--- a/ndctl/check.c
+++ b/ndctl/check.c
@@ -46,6 +46,7 @@ struct check_opts {
bool verbose;
bool force;
bool repair;
+   bool logfix;
 };
 
 struct btt_chk {
@@ -246,6 +247,12 @@ static void btt_log_group_read(struct arena_info *a, u32 lane,
memcpy(log, &a->map.log[lane], LOG_GRP_SIZE);
 }
 
+static void btt_log_group_write(struct arena_info *a, u32 lane,
+   struct log_group *log)
+{
+   memcpy(&a->map.log[lane], log, LOG_GRP_SIZE);
+}
+
 static u32 log_seq(struct log_group *log, int log_idx)
 {
return le32_to_cpu(log->ent[log_idx].seq);
@@ -358,6 +365,7 @@ enum btt_errcodes {
BTT_LOG_MAP_ERR,
BTT_MAP_OOB,
BTT_BITMAP_ERROR,
+   BTT_LOGFIX_ERR,
 };
 
 static void btt_xlat_status(struct arena_info *a, int errcode)
@@ -405,6 +413,11 @@ static void btt_xlat_status(struct arena_info *a, int errcode)
		"arena %d: bitmap error: internal blocks are incorrectly referenced\n",
a->num);
break;
+   case BTT_LOGFIX_ERR:
+   err(a->bttc,
+		"arena %d: rewrite-log error: log may be in an unknown/unrecoverable state\n",
+   a->num);
+   break;
default:
err(a->bttc, "arena %d: unknown error: %d\n",
a->num, errcode);
@@ -563,6 +576,44 @@ static int btt_check_bitmap(struct arena_info *a)
return rc;
 }
 
+static int btt_rewrite_log(struct arena_info *a)
+{
+   struct log_group log;
+   int rc;
+   u32 i;
+
+   info(a->bttc, "arena %d: rewriting log\n", a->num);
+   /*
+* To rewrite the log, we implicitly use the 'new' padding scheme of
+* (0, 1), resetting the log to a completely initial state (i.e.
+* slot-0 contains a made-up entry containing the 'free' block from
+* the existing current log entry, and a sequence number of '1'). All
+* other slots are zeroed.
+*
+* This way of rewriting the log is the most flexible as it can be
+* (ab)used to convert a new padding format back to the old one.
+* Since it only recreates slot-0, which is common between both
+* existing formats, an older kernel will simply initialize the free
+* list using those slot-0 entries, and run with it as though slot-2
+* is the other valid slot.
+*/
+   memset(&log, 0, LOG_GRP_SIZE);
+   for (i = 0; i < a->nfree; i++) {
+   struct log_entry ent;
+
+   rc = btt_log_read(a, i, &ent);
+   if (rc)
+   return BTT_LOGFIX_ERR;
+
+   log.ent[0].lba = ent.lba;
+   log.ent[0].old_map = ent.old_map;
+   log.ent[0].new_map = ent.new_map;
+   log.ent[0].seq = 1;
+   btt_log_group_write(a, i, &log);
+   }
+   return 0;
+}
+
 static int btt_check_arenas(struct btt_chk *bttc)
 {
struct arena_info *a = NULL;
@@ -591,6 +642,12 @@ static int btt_check_arenas(struct btt_chk *bttc)
rc = btt_check_bitmap(a);
if (rc)
break;
+
+ 

[ndctl PATCH 1/2] ndctl/check-namespace: Updates for BTT log compatibility

2018-01-12 Thread Vishal Verma
Update ndctl check-namespace with the BTT log compatibility fixes. This
detects the existing log/padding scheme, and uses that to perform its
checks.

Reported-by: Juston Li 
Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 ndctl/check.c | 205 +-
 ndctl/namespace.h |  46 +++-
 2 files changed, 216 insertions(+), 35 deletions(-)
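The new old/new-slot selection is easiest to see in isolation: sequence numbers cycle 1 -> 2 -> 3 -> 1 (0 means never written), and the 'old' slot is the one whose number the other slot's immediately follows. A stand-alone version of the logic in btt_log_get_old(), with the log_index indirection flattened out:

```c
#include <stdint.h>

/* Returns which of a lane's two in-use slots holds the older entry.
 * seq values cycle 1 -> 2 -> 3 -> 1; a difference of exactly 1 means
 * the larger value is newer, anything else means the sequence wrapped. */
static int old_slot(uint32_t seq0, uint32_t seq1)
{
	if (seq0 == 0)		/* freshly initialized lane */
		return 0;
	if (seq0 < seq1)
		return (seq1 - seq0 == 1) ? 0 : 1;	/* (1,2): 0 old; (1,3): wrap, 1 old */
	return (seq0 - seq1 == 1) ? 1 : 0;		/* (2,1): 1 old; (3,1): wrap, 0 old */
}
```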

diff --git a/ndctl/check.c b/ndctl/check.c
index 3d58f89..d3aa1aa 100644
--- a/ndctl/check.c
+++ b/ndctl/check.c
@@ -82,6 +82,7 @@ struct arena_info {
u32 flags;
int num;
struct btt_chk *bttc;
+   int log_index[2];
 };
 
 static sigjmp_buf sj_env;
@@ -239,10 +240,15 @@ static int btt_map_write(struct arena_info *a, u32 lba, u32 mapping)
return 0;
 }
 
-static void btt_log_read_pair(struct arena_info *a, u32 lane,
-   struct log_entry *ent)
+static void btt_log_group_read(struct arena_info *a, u32 lane,
+   struct log_group *log)
 {
-   memcpy(ent, &a->map.log[lane * 2], 2 * sizeof(struct log_entry));
+   memcpy(log, &a->map.log[lane], LOG_GRP_SIZE);
+}
+
+static u32 log_seq(struct log_group *log, int log_idx)
+{
+   return le32_to_cpu(log->ent[log_idx].seq);
 }
 
 /*
@@ -250,22 +256,24 @@ static void btt_log_read_pair(struct arena_info *a, u32 lane,
  * find the 'older' entry. The return value indicates which of the two was
  * the 'old' entry
  */
-static int btt_log_get_old(struct log_entry *ent)
+static int btt_log_get_old(struct arena_info *a, struct log_group *log)
 {
+   int idx0 = a->log_index[0];
+   int idx1 = a->log_index[1];
int old;
 
-   if (ent[0].seq == 0) {
-   ent[0].seq = cpu_to_le32(1);
+   if (log_seq(log, idx0) == 0) {
+   log->ent[idx0].seq = cpu_to_le32(1);
return 0;
}
 
-   if (le32_to_cpu(ent[0].seq) < le32_to_cpu(ent[1].seq)) {
-   if (le32_to_cpu(ent[1].seq) - le32_to_cpu(ent[0].seq) == 1)
+   if (log_seq(log, idx0) < log_seq(log, idx1)) {
+   if ((log_seq(log, idx1) - log_seq(log, idx0)) == 1)
old = 0;
else
old = 1;
} else {
-   if (le32_to_cpu(ent[0].seq) - le32_to_cpu(ent[1].seq) == 1)
+   if ((log_seq(log, idx0) - log_seq(log, idx1)) == 1)
old = 1;
else
old = 0;
@@ -277,13 +285,13 @@ static int btt_log_get_old(struct log_entry *ent)
 static int btt_log_read(struct arena_info *a, u32 lane, struct log_entry *ent)
 {
int new_ent;
-   struct log_entry log[2];
+   struct log_group log;
 
if (ent == NULL)
return -EINVAL;
-   btt_log_read_pair(a, lane, log);
-   new_ent = 1 - btt_log_get_old(log);
-   memcpy(ent, &log[new_ent], sizeof(struct log_entry));
+   btt_log_group_read(a, lane, &log);
+   new_ent = 1 - btt_log_get_old(a, &log);
+   memcpy(ent, &log.ent[a->log_index[new_ent]], LOG_ENT_SIZE);
return 0;
 }
 
@@ -406,6 +414,8 @@ static void btt_xlat_status(struct arena_info *a, int errcode)
 /* Check that log entries are self consistent */
 static int btt_check_log_entries(struct arena_info *a)
 {
+   int idx0 = a->log_index[0];
+   int idx1 = a->log_index[1];
unsigned int i;
int rc = 0;
 
@@ -413,28 +423,30 @@ static int btt_check_log_entries(struct arena_info *a)
 * First, check both 'slots' for sequence numbers being distinct
 * and in bounds
 */
-   for (i = 0; i < (2 * a->nfree); i+=2) {
-   if (a->map.log[i].seq == a->map.log[i + 1].seq)
+   for (i = 0; i < a->nfree; i++) {
+   struct log_group *log = &a->map.log[i];
+
+   if (log_seq(log, idx0) == log_seq(log, idx1))
return BTT_LOG_EQL_SEQ;
-   if (a->map.log[i].seq > 3 || a->map.log[i + 1].seq > 3)
+   if (log_seq(log, idx0) > 3 || log_seq(log, idx1) > 3)
return BTT_LOG_OOB_SEQ;
}
/*
 * Next, check only the 'new' slot in each lane for the remaining
-* entries being in bounds
+* fields being in bounds
 */
for (i = 0; i < a->nfree; i++) {
-   struct log_entry log;
+   struct log_entry ent;
 
-   rc = btt_log_read(a, i, &log);
+   rc = btt_log_read(a, i, &ent);
if (rc)
return rc;
 
-   if (log.lba >= a->external_nlba)
+   if (ent.lba >= a->external_nlba)
return BTT_LOG_OOB_LBA;
-   if (log.old_map >= a->internal_nlba)
+   if (ent.old_map >= a->internal_nlba)
return BTT_LOG_OOB_OLD;
-   if (log.new_map >= a->internal_nlba)
+   if (ent.new_map >= a->internal_nlba)
return BT

Re: DAX 2MB mappings for XFS

2018-01-12 Thread Darrick J. Wong
On Fri, Jan 12, 2018 at 11:15:00PM +, Kani, Toshi wrote:
> On Sat, 2018-01-13 at 09:27 +1100, Dave Chinner wrote:
> > On Fri, Jan 12, 2018 at 09:38:22PM +, Kani, Toshi wrote:
> > > On Sat, 2018-01-13 at 08:19 +1100, Dave Chinner wrote:
> > >  :
> > > > IOWs, what you are seeing is trying to do a very large allocation on
> > > > a very small (8GB) XFS filesystem.  It's rare someone asks to
> > > > allocate >25% of the filesystem space in one allocation, so it's not
> > > > surprising it triggers ENOSPC-like algorithms because it doesn't fit
> > > > into a single AG
> > > > 
> > > > We can probably look to optimise this, but I'm not sure if we can
> > > > easily differentiate this case (i.e. allocation request larger than
> > > > contiguous free space) from the same situation near ENOSPC when we
> > > > really do have to trim to fit...
> > > > 
> > > > Remember: stripe unit allocation alignment is a hint in XFS that we
> > > > can and do ignore when necessary - it's not a binding rule.
> > > 
> > > Thanks for the clarification!  Can XFS allocate smaller extents so that
> > > each extent will fit to an AG?
> > 
> > I've already answered that question:
> > 
> > I'm not sure if we can easily differentiate this case (i.e.
> > allocation request larger than contiguous free space) from
> > the same situation near ENOSPC when we really do have to
> > trim to fit...
> 
> Right.  I was thinking to limit the extent size (i.e. a half or quarter
> of AG size) regardless of the ENOSPC condition, but it may be the same
> thing.
> 
> > > ext4 creates multiple smaller extents for the same request.
> > 
> > Yes, because it has much, much smaller block groups so "allocation >
> > max extent size (128MB)" is a common path.
> > 
> > It's not a common path on XFS - filesystems (and hence AGs) are
> > typically orders of magnitude larger than the maximum extent size
> > (8GB) so the problem only shows up when we're near ENOSPC. XFS is
> > really not optimised for tiny filesystems, and when it comes to pmem
> > we were led to believe we'd have multiple terabytes of pmem in
> > systems by now, not still be stuck with 8GB NVDIMMs. Hence we've
> > spent very little time worrying about such issues because we
> > weren't aiming to support such small capacities for very long...
> 
> I see.  Yes, there will be multiple terabytes of capacity, but it will
> also be possible to divide it into multiple smaller namespaces.  So, users
> may continue to have relatively small namespaces for their use cases.  If
> a user allocates a namespace that is just big enough to host several
> active files, it may hit this issue regardless of size.

I am curious: why not just give XFS all the space and let it manage it?

--D

> Thanks,
> -Toshi


Re: DAX 2MB mappings for XFS

2018-01-12 Thread Kani, Toshi
On Sat, 2018-01-13 at 09:27 +1100, Dave Chinner wrote:
> On Fri, Jan 12, 2018 at 09:38:22PM +, Kani, Toshi wrote:
> > On Sat, 2018-01-13 at 08:19 +1100, Dave Chinner wrote:
> >  :
> > > IOWs, what you are seeing is trying to do a very large allocation on
> > > a very small (8GB) XFS filesystem.  It's rare someone asks to
> > > allocate >25% of the filesystem space in one allocation, so it's not
> > > surprising it triggers ENOSPC-like algorithms because it doesn't fit
> > > into a single AG
> > > 
> > > We can probably look to optimise this, but I'm not sure if we can
> > > easily differentiate this case (i.e. allocation request larger than
> > > contiguous free space) from the same situation near ENOSPC when we
> > > really do have to trim to fit...
> > > 
> > > Remember: stripe unit allocation alignment is a hint in XFS that we
> > > can and do ignore when necessary - it's not a binding rule.
> > 
> > Thanks for the clarification!  Can XFS allocate smaller extents so that
> > each extent will fit to an AG?
> 
> I've already answered that question:
> 
>   I'm not sure if we can easily differentiate this case (i.e.
>   allocation request larger than contiguous free space) from
>   the same situation near ENOSPC when we really do have to
>   trim to fit...

Right.  I was thinking to limit the extent size (i.e. a half or quarter
of AG size) regardless of the ENOSPC condition, but it may be the same
thing.

> > ext4 creates multiple smaller extents for the same request.
> 
> Yes, because it has much, much smaller block groups so "allocation >
> max extent size (128MB)" is a common path.
> 
> It's not a common path on XFS - filesystems (and hence AGs) are
> typically orders of magnitude larger than the maximum extent size
> (8GB) so the problem only shows up when we're near ENOSPC. XFS is
> really not optimised for tiny filesystems, and when it comes to pmem
> we were led to believe we'd have multiple terabytes of pmem in
> systems by now, not still be stuck with 8GB NVDIMMs. Hence we've
> spent very little time worrying about such issues because we
> weren't aiming to support such small capacities for very long...

I see.  Yes, there will be multiple terabytes of capacity, but it will
also be possible to divide it into multiple smaller namespaces.  So, users
may continue to have relatively small namespaces for their use cases.  If
a user allocates a namespace that is just big enough to host several
active files, it may hit this issue regardless of size.

Thanks,
-Toshi


Re: DAX 2MB mappings for XFS

2018-01-12 Thread Dave Chinner
On Fri, Jan 12, 2018 at 09:38:22PM +, Kani, Toshi wrote:
> On Sat, 2018-01-13 at 08:19 +1100, Dave Chinner wrote:
>  :
> > IOWs, what you are seeing is trying to do a very large allocation on
> > a very small (8GB) XFS filesystem.  It's rare someone asks to
> > allocate >25% of the filesystem space in one allocation, so it's not
> > surprising it triggers ENOSPC-like algorithms because it doesn't fit
> > into a single AG
> > 
> > We can probably look to optimise this, but I'm not sure if we can
> > easily differentiate this case (i.e. allocation request larger than
> > contiguous free space) from the same situation near ENOSPC when we
> > really do have to trim to fit...
> > 
> > Remember: stripe unit allocation alignment is a hint in XFS that we
> > can and do ignore when necessary - it's not a binding rule.
> 
> Thanks for the clarification!  Can XFS allocate smaller extents so that
> each extent will fit to an AG?

I've already answered that question:

I'm not sure if we can easily differentiate this case (i.e.
allocation request larger than contiguous free space) from
the same situation near ENOSPC when we really do have to
trim to fit...

> ext4 creates multiple smaller extents for the same request.

Yes, because it has much, much smaller block groups so "allocation >
max extent size (128MB)" is a common path.

It's not a common path on XFS - filesystems (and hence AGs) are
typically orders of magnitude larger than the maximum extent size
(8GB) so the problem only shows up when we're near ENOSPC. XFS is
really not optimised for tiny filesystems, and when it comes to pmem
we were led to believe we'd have multiple terabytes of pmem in
systems by now, not still be stuck with 8GB NVDIMMs. Hence we've
spent very little time worrying about such issues because we
weren't aiming to support such small capacities for very long...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: DAX 2MB mappings for XFS

2018-01-12 Thread Kani, Toshi
On Sat, 2018-01-13 at 08:19 +1100, Dave Chinner wrote:
 :
> IOWs, what you are seeing is trying to do a very large allocation on
> a very small (8GB) XFS filesystem.  It's rare someone asks to
> allocate >25% of the filesystem space in one allocation, so it's not
> surprising it triggers ENOSPC-like algorithms because it doesn't fit
> into a single AG
> 
> We can probably look to optimise this, but I'm not sure if we can
> easily differentiate this case (i.e. allocation request larger than
> contiguous free space) from the same situation near ENOSPC when we
> really do have to trim to fit...
> 
> Remember: stripe unit allocation alignment is a hint in XFS that we
> can and do ignore when necessary - it's not a binding rule.

Thanks for the clarification!  Can XFS allocate smaller extents so that
each extent will fit to an AG?  ext4 creates multiple smaller extents
for the same request.

-Toshi


Re: DAX 2MB mappings for XFS

2018-01-12 Thread Dave Chinner
On Fri, Jan 12, 2018 at 07:40:25PM +, Kani, Toshi wrote:
> Hello,
> 
> I noticed that DAX 2MB mmap no longer works on XFS.  I used the
> following steps on a 4.15-rc7 kernel.  Am I missing something, or is
> there a problem in XFS?
> 
> # mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
> # mount -o dax /dev/pmem0 /mnt/pmem0
> # xfs_io -c "extsize 2m" /mnt/pmem0
> 
> fio with libpmem engine (which uses mmap) is slow since it gets
> serialized by 4KB page faults.
> 
> # numactl --cpunodebind=0 --membind=0 fio --filename=/mnt/pmem0/testfile 
> --rw=read --ioengine=libpmem --iodepth=1 --numjobs=16 --runtime=60 --
> group_reporting --name=perf_test --thread=1 --size=6g --bs=128k --
> direct=1
>   :
> Run status group 0 (all jobs):
>READ: bw=4357MiB/s (4569MB/s), 4357MiB/s-4357MiB/s (4569MB/s-
> 4569MB/s), io=96.0GiB (103GB), run=22560-22560msec
> 
> The resulting file blocks in "testfile" are not 2MB-aligned.
> 
> # filefrag -v /mnt/pmem0/testfile
> Filesystem type is: 58465342
> File size of testfile is 6442450944 (1572864 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected:
> flags:
>0:0..  26:520..261631: 261112:
>1:   261112..  261348: 12..   248:237: 261632:
>2:   261349..  522705: 261644..523000: 261357:249:
>3:   522706..  784062: 523276..784632: 261357: 523001:
>4:   784063.. 1045419: 784908..   1046264: 261357: 784633:
>5:  1045420.. 1304216:1049100..   1307896: 258797:1046265:
>6:  1304217.. 1565573:1308172..   1569528: 261357:1307897:
>7:  1565574.. 1572863:1570304..   1577593:   7290:1569529: 
> last,eof
> testfile: 8 extents found
> 
> A file created by fallocate also shows that the physical offset starts at
> block 520, which is not 2MB-aligned.
> 
> # fallocate --length 1G /mnt/pmem0/data
> # filefrag -v /mnt/pmem0/data
> Filesystem type is: 58465342
> File size of /mnt/pmem0/data is 1073741824 (262144 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected:
> flags:
>0:0..  260607:520..261127:
> 260608: unwritten
>1:   260608..  262143: 262144..263679:   1536: 261128:
> last,unwritten,eof
> /mnt/pmem0/data: 2 extents found

/me really dislikes filefrag output.

$ sudo xfs_bmap -vvp /mnt/scratch/data
/mnt/scratch/data:
 EXT: FILE-OFFSET BLOCK-RANGE  AG AG-OFFSET  TOTAL FLAGS
   0: [0..2088959]:   4160..2093119 0 (4160..2093119)  2088960 01
   1: [2088960..2097151]: 2101248..2109439  1 (4096..12287)   8192 01
 FLAG Values:
010 Shared extent
001 Unwritten preallocated extent
0001000 Doesn't begin on stripe unit
100 Doesn't end   on stripe unit
010 Doesn't begin on stripe width
001 Doesn't end   on stripe width

Yeah, thought so. The bmap output clearly tells me that the
allocation being asked for doesn't fit into a single AG, so it's
trimmed to fit.

To confirm this is the issue, let's do two smaller allocations:

$ sudo rm /mnt/scratch/data
dave@test4:~$ sudo xfs_io -f -c "falloc 0 512m" -c "falloc 512m 512m" -c stat 
-c "bmap -vvp" /mnt/scratch/data
fd.path = "/mnt/scratch/data"
fd.flags = non-sync,non-direct,read-write
stat.ino = 4099
stat.type = regular file
stat.size = 1073741824
stat.blocks = 2097152
fsxattr.xflags = 0x802 [-pe--]
fsxattr.projid = 0
fsxattr.extsize = 2097152
fsxattr.cowextsize = 0
fsxattr.nextents = 2
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136
/mnt/scratch/data:
 EXT: FILE-OFFSET BLOCK-RANGE  AG AG-OFFSET  TOTAL FLAGS
   0: [0..1048575]:   8192..1056767 0 (8192..1056767)  1048576 01
   1: [1048576..2097151]: 2101248..3149823  1 (4096..1052671)  1048576 01
 FLAG Values:
010 Shared extent
001 Unwritten preallocated extent
0001000 Doesn't begin on stripe unit
100 Doesn't end   on stripe unit
010 Doesn't begin on stripe width
001 Doesn't end   on stripe width

Yup, all blocks are 2MB aligned.

IOWs, what you are seeing is trying to do a very large allocation on
a very small (8GB) XFS filesystem.  It's rare someone asks to
allocate >25% of the filesystem space in one allocation, so it's not
surprising it triggers ENOSPC-like algorithms because it doesn't fit
into a single AG

We can probably look to optimise this, but I'm not sure if we can
easily differentiate this case (i.e. allocation request larger than
contiguous free space) from the same situation near ENOSPC when we
really do have to trim to fit...

Remember: stripe unit allocation alignment is a hint in XFS that we
can and do ignore when necessary - it's not a binding rule.
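The "doesn't fit into a single AG" arithmetic is easy to check back-of-envelope (assuming mkfs.xfs's default of 4 allocation groups for an 8GB device; the actual agcount depends on mkfs heuristics):

```c
#include <stdint.h>

#define GiB (1ULL << 30)

/* Upper bound on a single-AG allocation: an AG can never hand out more
 * than its own size, and real free space is less since AGs also hold
 * metadata.  With an 8 GiB filesystem split into 4 AGs, no contiguous
 * extent can exceed 2 GiB, so a 6 GiB request has to be trimmed or
 * split across AGs, just as the bmap output above shows. */
static uint64_t ag_size(uint64_t fs_bytes, unsigned agcount)
{
	return fs_bytes / agcount;
}
```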

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

DAX 2MB mappings for XFS

2018-01-12 Thread Kani, Toshi
Hello,

I noticed that DAX 2MB mmap no longer works on XFS.  I used the
following steps on a 4.15-rc7 kernel.  Am I missing something, or is
there a problem in XFS?

# mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
# mount -o dax /dev/pmem0 /mnt/pmem0
# xfs_io -c "extsize 2m" /mnt/pmem0

fio with libpmem engine (which uses mmap) is slow since it gets
serialized by 4KB page faults.

# numactl --cpunodebind=0 --membind=0 fio --filename=/mnt/pmem0/testfile 
--rw=read --ioengine=libpmem --iodepth=1 --numjobs=16 --runtime=60 --
group_reporting --name=perf_test --thread=1 --size=6g --bs=128k --
direct=1
  :
Run status group 0 (all jobs):
   READ: bw=4357MiB/s (4569MB/s), 4357MiB/s-4357MiB/s (4569MB/s-
4569MB/s), io=96.0GiB (103GB), run=22560-22560msec

The resulting file blocks in "testfile" are not 2MB-aligned.

# filefrag -v /mnt/pmem0/testfile
Filesystem type is: 58465342
File size of testfile is 6442450944 (1572864 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected:
flags:
   0:0..  26:520..261631: 261112:
   1:   261112..  261348: 12..   248:237: 261632:
   2:   261349..  522705: 261644..523000: 261357:249:
   3:   522706..  784062: 523276..784632: 261357: 523001:
   4:   784063.. 1045419: 784908..   1046264: 261357: 784633:
   5:  1045420.. 1304216:1049100..   1307896: 258797:1046265:
   6:  1304217.. 1565573:1308172..   1569528: 261357:1307897:
   7:  1565574.. 1572863:1570304..   1577593:   7290:1569529: 
last,eof
testfile: 8 extents found

A file created by fallocate also shows that the physical offset starts at
block 520, which is not 2MB-aligned.

# fallocate --length 1G /mnt/pmem0/data
# filefrag -v /mnt/pmem0/data
Filesystem type is: 58465342
File size of /mnt/pmem0/data is 1073741824 (262144 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected:
flags:
   0:0..  260607:520..261127:
260608: unwritten
   1:   260608..  262143: 262144..263679:   1536: 261128:
last,unwritten,eof
/mnt/pmem0/data: 2 extents found
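The misalignment can be verified directly from the filefrag numbers: physical offsets are reported in 4096-byte filesystem blocks, and a 2MB mapping needs the offset to be a multiple of 512 such blocks. A quick sketch of that check:

```c
#include <stdint.h>

/* A DAX PMD (2 MiB) mapping requires the file's physical placement to
 * be 2 MiB aligned.  With filefrag's 4096-byte blocks, that means the
 * physical block number must be a multiple of 512 (e.g. XFS's starting
 * block 520 above fails, 520 %% 512 == 8). */
static int is_2mb_aligned(uint64_t physical_block, uint64_t block_size)
{
	return (physical_block * block_size) % (2ULL << 20) == 0;
}
```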

ext4 does not have this issue with the same steps.

# mkfs.ext4 -b 4096 -E stride=512 -F /dev/pmem1
# mount -o dax /dev/pmem1 /mnt/pmem1
# numactl --cpunodebind=0 --membind=0 fio --filename=/mnt/pmem1/testfile 
--rw=read --ioengine=libpmem --iodepth=1 --numjobs=16 --runtime=60 --
group_reporting --name=perf_test --thread=1 --size=6g --bs=128k --
direct=1  
  :
Run status group 0 (all jobs):
   READ: bw=44.4GiB/s (47.7GB/s), 44.4GiB/s-44.4GiB/s (47.7GB/s-
47.7GB/s), io=96.0GiB (103GB), run=2160-2160msec

All blocks are aligned by 2MB.

# filefrag -v /mnt/pmem1/testfile
Filesystem type is: ef53
File size of testfile is 6442450944 (1572864 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected:
flags:
   0:0..   32767:  34816.. 67583:  32768:
   1:32768..   63487:  67584.. 98303:  30720:
   2:63488..   96255: 100352..133119:  32768:  98304:
   3:96256..  126975: 133120..163839:  30720:
:

# fallocate --length 1G /mnt/pmem1/data
# filefrag -v /mnt/pmem1/data
Filesystem type is: ef53
File size of /mnt/pmem1/data is 1073741824 (262144 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected:
flags:
   0:0..   30719:  34816.. 65535:  30720:   unwritten
   1:30720..   61439:  65536.. 96255:  30720:   unwritten
   2:61440..   63487:  96256.. 98303:   2048:   unwritten
   :

Thanks,
-Toshi


Fwd: / linux-nvdimm Partner Risk-Avoidance, Shanghai Session

2018-01-12 Thread 冀樾飞
linux-nvdimm@lists.01.org
Why is the partnership model so popular right now? Because the halo of capital is fading; this is a new era in which people, not capital, are king!
In the past, founders went it alone; now, partners fight as a team.
In the past, profits were allocated top-down; now, partners share the gains.
In the past, professional managers voted with their feet; now, partners stand back-to-back and advance or retreat together.



Please see the attached outline.