[lustre-discuss] FIEMAP and DoM

2024-05-24 Thread John Bauer
Is FIEMAP supported on a DoM file?  I have a simple file that is a 
2-component PFL.  The first component is DoM.  My utility program, which 
issues the FIEMAP ioctl and works with non-DoM PFL files, does not work 
when the first component is DoM.
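
The call in question boils down to roughly the following (a minimal sketch, 
not the actual utility; the fe_device overlay follows the Lustre convention 
discussed later in this archive), and it is this ioctl that fails on the 
DoM file:

/*
 * Minimal FIEMAP sketch: map the whole file in one ioctl and print each
 * extent.  Real code would loop on fm_start when fm_extent_count is too
 * small for the file.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#ifndef fe_device                       /* Lustre overlays fe_reserved[0] */
#define fe_device fe_reserved[0]
#endif

static int dump_extents(const char *path)
{
	unsigned int count = 1024, i;
	struct fiemap *fm;
	int fd, rc;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	fm = calloc(1, sizeof(*fm) + count * sizeof(struct fiemap_extent));
	if (fm == NULL) {
		close(fd);
		return -1;
	}
	fm->fm_start = 0;
	fm->fm_length = ~0ULL;          /* whole file */
	fm->fm_extent_count = count;

	rc = ioctl(fd, FS_IOC_FIEMAP, fm);
	if (rc < 0) {
		perror("FS_IOC_FIEMAP");    /* errno 524 is the kernel's ENOTSUPP */
	} else {
		for (i = 0; i < fm->fm_mapped_extents; i++)
			printf("logical=%llu device=%u length=%llu\n",
			       (unsigned long long)fm->fm_extents[i].fe_logical,
			       (unsigned int)fm->fm_extents[i].fe_device,
			       (unsigned long long)fm->fm_extents[i].fe_length);
	}
	free(fm);
	close(fd);
	return rc;
}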


ioctl(FS_IOC_FIEMAP) fe_device=0 fe_logical=0 file_size=1048576000 error 
Unknown error 524


Is this as expected?

John


pfe27.jbauer2 286> lfs getstripe /nobackup/jbauer2/dd.dir/dd_test.dat
/nobackup/jbauer2/dd.dir/dd_test.dat
  lcm_layout_gen:    3
  lcm_mirror_count:  1
  lcm_entry_count:   2
    lcme_id: 1
    lcme_mirror_id:  0
    lcme_flags:  init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
  lmm_stripe_count:  0
  lmm_stripe_size:   1048576
  lmm_pattern:   mdt
  lmm_layout_gen:    0
  lmm_stripe_offset: 0
  lmm_pool:  ssd-pool

    lcme_id: 2
    lcme_mirror_id:  0
    lcme_flags:  init
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   EOF
  lmm_stripe_count:  6
  lmm_stripe_size:   5242880
  lmm_pattern:   raid0
  lmm_layout_gen:    0
  lmm_stripe_offset: 121
  lmm_pool:  ssd-pool
  lmm_objects:
  - 0: { l_ost_idx: 121, l_fid: [0x10079:0x3273e83:0x0] }
  - 1: { l_ost_idx: 131, l_fid: [0x10083:0x325c4b2:0x0] }
  - 2: { l_ost_idx: 116, l_fid: [0x10074:0x31e236f:0x0] }
  - 3: { l_ost_idx: 117, l_fid: [0x10075:0x317dbf2:0x0] }
  - 4: { l_ost_idx: 124, l_fid: [0x1007c:0x320ec8d:0x0] }
  - 5: { l_ost_idx: 125, l_fid: [0x1007d:0x32094a8:0x0] }

pfe27.jbauer2 287> llfie /nobackup/jbauer2/dd.dir/dd_test.dat
StripeChunks_get() ioctl(FS_IOC_FIEMAP) fe_device=0 fe_logical=0 
file_size=1048576000 error Unknown error 524

StripeChunks_get() fm_mapped_extents=0
pfe27.jbauer2 288>


[lustre-discuss] Unexpected result with overstriping

2024-05-17 Thread John Bauer

Good morning all,

I am playing around with overstriping a bit and I found a behavior that, 
to me, would seem unexpected.  The documentation for -C -1  indicates 
that the file should be striped over all available OSTs.  The pool, 
which happens to be the default, is ssd-pool, which has 32 OSTs.  I got a 
stripe count of 2000.  Is this as expected?


pfe20.jbauer2 213> rm -f /nobackup/jbauer2/ddd.dat
pfe20.jbauer2 214> lfs setstripe -C -1 /nobackup/jbauer2/ddd.dat
pfe20.jbauer2 215> lfs getstripe /nobackup/jbauer2/ddd.dat
/nobackup/jbauer2/ddd.dat
lmm_stripe_count:  2000
lmm_stripe_size:   1048576
lmm_pattern:   raid0,overstriped
lmm_layout_gen:    0
lmm_stripe_offset: 119
lmm_pool:  ssd-pool
    obdidx         objid         objid         group
       119      52386287        0x31f59ef     0
       123      52347947        0x31ec42b     0
       127      52734487        0x324aa17     0
       121      52839396        0x32643e4     0
       131      52742709        0x324ca35     0
       116      52242659        0x31d28e3     0
       117      51831125        0x316e155     0
       124      52425218        0x31ff202     0
       125      52402722        0x31f9a22     0
       106      52700581        0x32425a5     0

edited for brevity



Re: [lustre-discuss] lustre-discuss Digest, Vol 217, Issue 19 Files created in append mode don't obey

2024-04-29 Thread John Bauer
I ran strace just to determine what is called here.  bash used 
openat(..., O_WRONLY|O_CREAT|O_APPEND) to open the file, and I thought 
openat() might be the issue, so I wrote a simple program to test.  It turns 
out that it is the O_APPEND bit, with either open() or openat(), that 
causes the behavior.
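
A minimal sketch of such a test program (file names here are placeholders):

/* Create one file without O_APPEND and one with it, then compare
 * "lfs getstripe" on the two; only the first picks up the directory
 * default stripe count. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd1 = open("no_append.dat", O_WRONLY | O_CREAT, 0644);
	int fd2 = open("with_append.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);

	if (fd1 < 0 || fd2 < 0) {
		perror("open");
		return 1;
	}
	close(fd1);
	close(fd2);
	return 0;
}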


On 4/29/24 10:21, lustre-discuss-requ...@lists.lustre.org wrote:



Today's Topics:

1. Re: [EXTERNAL] [BULK] Files created in append mode don't obey
   directory default stripe count
   (Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.])


--

Message: 1
Date: Mon, 29 Apr 2024 15:21:06 +
From: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]"

To: "Otto, Frank" ,
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] [EXTERNAL] [BULK] Files created in
append mode don't obey directory default stripe count
Message-ID:



Content-Type: text/plain; charset="windows-1252"

Wow, I would say that is definitely not expected.  I can recreate this on both 
of our LFSs.  One is community Lustre 2.14, the other is a DDN EXAScaler.  
Shown below is our community Lustre, but we also have a 3-segment PFL on our 
EXAScaler and the behavior is the same there.

$ echo > aaa
$ echo >> bbb
$ lfs getstripe aaa bbb
aaa
   lcm_layout_gen:3
   lcm_mirror_count:  1
   lcm_entry_count:   3
 lcme_id: 1
 lcme_mirror_id:  0
 lcme_flags:  init
 lcme_extent.e_start: 0
 lcme_extent.e_end:   33554432
   lmm_stripe_count:  1
   lmm_stripe_size:   4194304
   lmm_pattern:   raid0
   lmm_layout_gen:0
   lmm_stripe_offset: 6
   lmm_objects:
   - 0: { l_ost_idx: 6, l_fid: [0x10006:0xace8112:0x0] }

 lcme_id: 2
 lcme_mirror_id:  0
 lcme_flags:  0
 lcme_extent.e_start: 33554432
 lcme_extent.e_end:   10737418240
   lmm_stripe_count:  4
   lmm_stripe_size:   4194304
   lmm_pattern:   raid0
   lmm_layout_gen:0
   lmm_stripe_offset: -1

 lcme_id: 3
 lcme_mirror_id:  0
 lcme_flags:  0
 lcme_extent.e_start: 10737418240
 lcme_extent.e_end:   EOF
   lmm_stripe_count:  8
   lmm_stripe_size:   4194304
   lmm_pattern:   raid0
   lmm_layout_gen:0
   lmm_stripe_offset: -1

bbb
lmm_stripe_count:  1
lmm_stripe_size:   2097152
lmm_pattern:   raid0
lmm_layout_gen:0
lmm_stripe_offset: 3
     obdidx       objid       objid       group
          3   179773949   0xab721fd           0


From: lustre-discuss  on behalf of Otto, 
Frank via lustre-discuss 
Date: Monday, April 29, 2024 at 8:33 AM
To: lustre-discuss@lists.lustre.org 
Subject: [EXTERNAL] [BULK] [lustre-discuss] Files created in append mode don't 
obey directory default stripe count


See subject. Is it a known issue? Is it expected? Easy to reproduce:


# lfs getstripe .
.
stripe_count:  4 stripe_size:   1048576 pattern:   raid0 stripe_offset: -1

# echo > aaa
# echo >> bbb
# lfs getstripe .
.
stripe_count:  4 stripe_size:   1048576 pattern:   raid0 stripe_offset: -1

./aaa
lmm_stripe_count:  4
lmm_stripe_size:   1048576
lmm_pattern:   raid0
lmm_layout_gen:0
lmm_stripe_offset: 0
     obdidx      objid      objid      group
          0       2830      0xb0e          0
          1       2894      0xb4e          0
          2       2831      0xb0f          0
          3       2895      0xb4f          0

./bbb
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:   raid0
lmm_layout_gen:0
lmm_stripe_offset: 4
     obdidx      objid      objid      group
          4       2831      0xb0f          0



As you see, file "bbb" is created with stripe count 1 instead of 4.
Observed in Lustre 2.12.x and Lustre 2.15.4.

Thanks,
Frank

--
Dr. Frank Otto
Senior Research Infrastructure Developer
UCL Centre for Advanced Research Computing
Tel: 020 7679 1506

Re: [lustre-discuss] lustre-discuss Digest, Vol 213, Issue 10

2023-12-07 Thread John Bauer

I'm having a bad typing day.  One last time:

Here's the image that was removed.

https://www.dropbox.com/scl/fi/augs88r7lcdfd6wb7nwrf/pfe27_allOSC_cached.png?rlkey=ynaw60yknwmfjavfy5gxsuk76=0


[lustre-discuss] Lustre caching and NUMA nodes

2023-12-07 Thread John Bauer

Peter,

A delayed reply to one more of your questions, "What makes you think 
"lustre" is doing that?" , as I had to make another run and gather OSC 
stats on all the Lustre file systems mounted on the host that I run dd on.


This host has 12 Lustre file systems, comprising 507 OSTs. While dd 
was running I instrumented the amount of cached data associated with all 
507 OSCs.  That is reflected in the bottom frame of the image below.  
Note that in the top frame there was always about 5GB of free memory, 
and 50GB of cached data.  I believe it has to be a Lustre issue, as the 
Linux buffer cache has no knowledge that a page is a Lustre page.  How 
is it that every OSC, on all 12 file systems on the host, has its memory 
dropped to 0, yet all the other 50GB of cached data on the host remains? 
It's as though drop_caches is being run on only the Lustre file systems.  
My googling around finds no such feature in drop_caches that would allow 
file-system-specific dropping.  Is there some tunable that gives Lustre 
pages higher potential for eviction than other cached data?
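
The free/cached numbers in the top frame are the kind of thing that can be 
sampled straight from /proc/meminfo; a minimal sketch of such a sampling 
loop (illustrative only; the actual instrumentation, including the per-OSC 
data in the bottom frame, is not shown):

/* Sample MemFree and Cached from /proc/meminfo every 50 ms for ~10 s. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int i;

	for (i = 0; i < 200; i++) {
		FILE *f = fopen("/proc/meminfo", "r");
		char line[128];
		long memfree = 0, cached = 0;

		if (f == NULL)
			return 1;
		while (fgets(line, sizeof(line), f)) {
			sscanf(line, "MemFree: %ld kB", &memfree);
			sscanf(line, "Cached: %ld kB", &cached);
		}
		fclose(f);
		printf("%d MemFree=%ld kB Cached=%ld kB\n", i, memfree, cached);
		usleep(50000);          /* 50 ms sample interval */
	}
	return 0;
}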


Another subtle point of interest.  Note that dd writing resumes, as 
reflected in the growth of the cached data for its 8 OSTs, before all 
the other OSCs have finished dumping.  This is most visible around 2.1 
seconds into the run.  Also different is that this dumping phenomenon 
happened 3 times in the course of a 10 second run, instead of just 1 as 
in the previous run I was referencing, costing this dd run 1.2 seconds.


John


On 12/6/23 14:24, lustre-discuss-requ...@lists.lustre.org wrote:



Today's Topics:

1. Coordinating cluster start and shutdown? (Jan Andersen)
2. Re: Lustre caching and NUMA nodes (Peter Grandi)
3. Re: Coordinating cluster start and shutdown?
   (Bertschinger, Thomas Andrew Hjorth)
4. Lustre server still try to recover the lnet reply to the
   depreciated clients (Huang, Qiulan)


--

Message: 1
Date: Wed, 6 Dec 2023 10:27:11 +
From: Jan Andersen
To: lustre
Subject: [lustre-discuss] Coordinating cluster start and shutdown?
Message-ID:<696fac02-df18-4fe1-967c-02c3bca42...@comind.io>
Content-Type: text/plain; charset=UTF-8; format=flowed

Are there any tools for coordinating the start and shutdown of lustre 
filesystem, so that the OSS systems don't attempt to mount disks before the MGT 
and MDT are online?


--

Message: 2
Date: Wed, 6 Dec 2023 12:40:54 +
From:p...@lustre.list.sabi.co.uk  (Peter Grandi)
To: list Lustre discussion
Subject: Re: [lustre-discuss] Lustre caching and NUMA nodes
Message-ID:<25968.27606.536270.208...@petal.ty.sabi.co.uk>
Content-Type: text/plain; charset=iso-8859-1


I have an OSC caching question.  I am running a dd process
which writes an 8GB file.  The file is on lustre, striped
8x1M.

How the Lustre instance servers store the data may not have a
huge influence on what happens in the client's system buffer
cache.


This is run on a system that has 2 NUMA nodes (cpu sockets).
[...] Why does lustre go to the trouble of dumping node1 and
then not use node1's memory, when there was always plenty of
free memory on node0?

What makes you think "lustre" is doing that?

Are you aware of the values of the flusher settings such as
'dirty_bytes', 'dirty_ratio', 'dirty_expire_centisecs'?

Have you considered looking at NUMA policies e.g. as described
in 'man numactl'?

Also while you surely know better I usually try to avoid
buffering large amounts of to-be-written data in RAM (whether on
the OSC or the OSS), and to my taste 8GiB "in-flight" is large.


--

Message: 3
Date: Wed, 6 Dec 2023 16:00:38 +
From: "Bertschinger, Thomas Andrew Hjorth"
To: Jan Andersen, lustre

Subject: Re: [lustre-discuss] Coordinating cluster start and shutdown?
Message-ID:



Content-Type: text/plain; charset="iso-8859-1"

Hello Jan,

You can use the Pacemaker / Corosync high-availability software stack for this: 
specifically, ordering constraints [1] can be used.

Unfortunately, Pacemaker is probably over-the-top if you don't need HA -- its 
configuration is complex and difficult to get right, and it significantly 
complicates system administration. One downside of Pacemaker is that it is not 
easy to decouple the Pacemaker service from the Lustre services, meaning if you 
stop the Pacemaker service, it 

Re: [lustre-discuss] lustre-discuss Digest, Vol 213, Issue 7

2023-12-06 Thread John Bauer

Peter,

I've been reading the Baeldung pages, among others, to gain some insight 
on Linux buffer cache behavior.


https://www.baeldung.com/linux/file-system-caching  
andhttps://docs.kernel.org/admin-guide/sysctl/vm.html

As can be seen in the first image below, Lustre is having no trouble 
keeping up with the dirty pages.  Dirty pages are never more than 400MB 
on a 64GB system, well under 1%, and far below the thresholds implied by 
the settings below (with dirty_background_ratio = 10 on a 64GB system, 
background writeback is not even forced until roughly 6GB of dirty data).  
This dirty page data is drawn from /proc/meminfo while dd is running.  
Here are some of the vm dirty settings.


vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 40
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200

I am not sure what to make of your following comment.  I should have 
stated that the dd command used for this was *dd if=/dev/zero of=pfl.dat 
bs=1M count=8000*.  I will also point out that I came across this 
behavior while debugging another problem; I was simply using dd to 
create a PFL-striped file so I could check how the file was laid out on 
the OSTs.  Over the course of many runs I kept noticing the pauses in 
the writes, and it strikes me that the behavior is odd in that there is 
typically a significant amount of inactive file pages and free memory 
(second image below).  I don't understand why those inactive file pages 
are not evicted, or free memory used, before evicting the pfl.dat pages 
which were just written.  What is driving the LRU eviction here?  I should 
also point out that the cached memory is always well under the 
50% limit that is configured as Lustre's max.



Also while you surely know better I usually try to avoid
buffering large amounts of to-be-written data in RAM (whether on
the OSC or the OSS), and to my taste 8GiB "in-flight" is large.


https://www.dropbox.com/scl/fi/5seamxgscdrat1eu2t5zn/dd_swapped.png?rlkey=oyicyq2a8eeqlgohndgalisy0=0



On 12/6/23 14:24, lustre-discuss-requ...@lists.lustre.org wrote:



Today's Topics:

1. Coordinating cluster start and shutdown? (Jan Andersen)
2. Re: Lustre caching and NUMA nodes (Peter Grandi)
3. Re: Coordinating cluster start and shutdown?
   (Bertschinger, Thomas Andrew Hjorth)
4. Lustre server still try to recover the lnet reply to the
   depreciated clients (Huang, Qiulan)


--

Message: 1
Date: Wed, 6 Dec 2023 10:27:11 +
From: Jan Andersen
To: lustre
Subject: [lustre-discuss] Coordinating cluster start and shutdown?
Message-ID:<696fac02-df18-4fe1-967c-02c3bca42...@comind.io>
Content-Type: text/plain; charset=UTF-8; format=flowed

Are there any tools for coordinating the start and shutdown of lustre 
filesystem, so that the OSS systems don't attempt to mount disks before the MGT 
and MDT are online?


--

Message: 2
Date: Wed, 6 Dec 2023 12:40:54 +
From:p...@lustre.list.sabi.co.uk  (Peter Grandi)
To: list Lustre discussion
Subject: Re: [lustre-discuss] Lustre caching and NUMA nodes
Message-ID:<25968.27606.536270.208...@petal.ty.sabi.co.uk>
Content-Type: text/plain; charset=iso-8859-1


I have an OSC caching question.  I am running a dd process
which writes an 8GB file.  The file is on lustre, striped
8x1M.

How the Lustre instance servers store the data may not have a
huge influence on what happens in the client's system buffer
cache.


This is run on a system that has 2 NUMA nodes (cpu sockets).
[...] Why does lustre go to the trouble of dumping node1 and
then not use node1's memory, when there was always plenty of
free memory on node0?

What makes you think "lustre" is doing that?

Are you aware of the values of the flusher settings such as
'dirty_bytes', 'dirty_ratio', 'dirty_expire_centisecs'?

Have you considered looking at NUMA policies e.g. as described
in 'man numactl'?

Also while you surely know better I usually try to avoid
buffering large amounts of to-be-written data in RAM (whether on
the OSC or the OSS), and to my taste 8GiB "in-flight" is large.


--

Message: 3
Date: Wed, 6 Dec 2023 16:00:38 +
From: "Bertschinger, Thomas Andrew Hjorth"
To: Jan Andersen, lustre

Subject: Re: [lustre-discuss] Coordinating cluster start and shutdown?
Message-ID:



Content-Type: text/plain; charset="iso-8859-1"

Hello Jan,

You can use the Pacemaker / Corosync high-availability software stack for this: 

Re: [lustre-discuss] Lustre caching and NUMA nodes

2023-12-05 Thread John Bauer

Andreas,

Thanks for the reply.

Client version is 2.14.0_ddn98.  Here is a plot of 
*write_RPCs_in_flight*, snapshotted every 50ms.  The max for any of 
the samples for any of the OSCs was 1.  No RPCs were in flight while the OSCs 
were dumping memory.  The number following the OSC name in the legends 
is the sum of *write_RPCs_in_flight* over all the intervals.  To be 
honest, I have never really looked at the RPCs-in-flight numbers.  I'm 
running as a lowly user, so I don't have access to any of the server 
data, so I have nothing on osd-ldiskfs.*.brw_stats.


I should also point out that the backing storage on the servers is SSD, 
so I would think that committing to storage on the server side should be 
pretty quick.


I'm trying to get a handle on how Linux buffer cache works. Everything I 
find on the web is pretty old.  Here's one from 2012. 
https://lwn.net/Articles/495543/


Can someone point me to something more current, and perhaps Lustre related?

As for images, I think the list server strips the images.  In previous 
postings, when I would include images, what I got back when the list 
server broadcast it out had the images stripped.  I'll include the images 
and also a link to the image on Dropbox.


Thanks again,

John

https://www.dropbox.com/scl/fi/fgmz4wazr6it9q2aeo0mb/write_RPCs_in_flight.png?rlkey=d3ri2w2n7isggvn05se4j3a6b=0


On 12/5/23 22:33, Andreas Dilger wrote:


On Dec 4, 2023, at 15:06, John Bauer  wrote:


I have an OSC caching question.  I am running a dd process which 
writes an 8GB file.  The file is on lustre, striped 8x1M. This is run 
on a system that has 2 NUMA nodes (cpu sockets). All the data is 
apparently stored on one NUMA node (node1 in the plot below) until 
node1 runs out of free memory.  Then it appears that dd comes to a 
stop (no more writes complete) until lustre dumps the data from the 
node1.  Then dd continues writing, but now the data is stored on the 
second NUMA node, node0.  Why does lustre go to the trouble of 
dumping node1 and then not use node1's memory, when there was always 
plenty of free memory on node0?


I'll forego the explanation of the plot.  Hopefully it is clear 
enough.  If someone has questions about what the plot is depicting, 
please ask.


https://www.dropbox.com/scl/fi/pijgnnlb8iilkptbeekaz/dd.png?rlkey=3abonv5tx8w5w5m08bn24qb7x=0 
<https://www.dropbox.com/scl/fi/pijgnnlb8iilkptbeekaz/dd.png?rlkey=3abonv5tx8w5w5m08bn24qb7x=0>


Hi John,
thanks for your detailed analysis.  It would be good to include the 
client kernel and Lustre version in this case, as the page cache 
behaviour can vary dramatically between different versions.


The allocation of the page cache pages may actually be out of the 
control of Lustre, since they are typically being allocated by the 
kernel VM affine to the core where the process that is doing the IO is 
running.  It may be that the "dd" is rescheduled to run on node0 
during the IO, since the ptlrpcd threads will be busy processing all 
of the RPCs during this time, and then dd will start allocating pages 
from node0.


That said, it isn't clear why the client doesn't start flushing the 
dirty data from cache earlier?  Is it actually sending the data to the 
OSTs, but then waiting for the OSTs to reply that the data has been 
committed to the storage before dropping the cache?


It would be interesting to plot the 
osc.*.rpc_stats::write_rpcs_in_flight and ::pending_write_pages to see 
if the data is already in flight.  The osd-ldiskfs.*.brw_stats on the 
server would also be useful to graph over the same period, if possible.


It *does* look like the "node1 dirty" is kept at a low value for the 
entire run, so it at least appears that RPCs are being sent, but there 
is no page reclaim triggered until memory is getting low.  Doing page 
reclaim is really the kernel's job, but it seems possible that the 
Lustre client may not be suitably notifying the kernel about the dirty 
pages and kicking it in the butt earlier to clean up the pages.


PS: my preference would be to just attach the image to the email 
instead of hosting it externally, since it is only 55 KB.  Is this 
blocked by the list server?


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud








[lustre-discuss] Lustre client caching question

2023-08-14 Thread John Bauer
I have an application that reads a 70GB section of a file forwards and 
backwards multiple (14) times.  This is on a 64 GB system.  Monitoring 
/proc/meminfo shows that the memory consumed by file cache bounces 
around the 32GB value.  The forward reads go at about 3.3GB/s.  What is 
disappointing is the backwards read performance.  One would think that 
after the file is read forwards, the most recently used 32GB of the file 
should be in the system buffers, and the first 32GB read while going 
backwards should come out of those buffers.  But the backwards reads 
generally perform at about 500MB/s.  Generally the first 1GB going 
backwards is at 5.5GB/s, but then the remaining backwards read is at the 
500MB/s.
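
A minimal sketch of the access pattern being described (illustrative only: 
path, chunk size, and region size are placeholders, this is not the real 
application, and 32-bit builds would need -D_FILE_OFFSET_BITS=64):

/* Read a region forwards in 1 MiB chunks, then read it backwards. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)

static void read_region(int fd, off_t start, off_t end, int backwards)
{
	char *buf = malloc(CHUNK);
	off_t off;

	if (buf == NULL)
		return;
	if (!backwards) {
		for (off = start; off < end; off += CHUNK)
			(void)pread(fd, buf, CHUNK, off);
	} else {
		for (off = end - CHUNK; off >= start; off -= CHUNK)
			(void)pread(fd, buf, CHUNK, off);
	}
	free(buf);
}

int main(void)
{
	int fd = open("bigfile.dat", O_RDONLY);         /* placeholder path */
	off_t region = 70LL * 1024 * 1024 * 1024;       /* 70GB section */
	int pass;

	if (fd < 0)
		return 1;
	for (pass = 0; pass < 14; pass++)
		read_region(fd, 0, region, pass & 1);   /* alternate direction */
	close(fd);
	return 0;
}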


The Lustre client version is 2.12.9_ddn18.  The file is striped at 4x1M.

Is this expected behavior?

John



[lustre-discuss] LUG23 speakers/presentations

2023-03-24 Thread John Bauer
Is the speaker/presentation itinerary for LUG23 going to be posted 
before the early registration date passes?




[lustre-discuss] lfs setstripe with stripe_count=0

2023-02-21 Thread John Bauer
Something doesn't make sense to me when using lfs setstripe and 
specifying 0 for the stripe_count.  This first command works as 
expected.  The pool is the one specified, 2_hdd, and the -c 0 results in 
a stripe_count of 1, which I believe is the file-system default 
(per the lfs setstripe man page).


pfe24.jbauer2 1224> lfs setstripe -c 0 -p 2_hdd /nobackupp18/jbauer2/testing/

pfe24.jbauer2 1225> lfs getstripe -d /nobackupp18/jbauer2/testing/
stripe_count:  1 stripe_size:   1048576 pattern:   raid0 stripe_offset: -1 pool:  2_hdd


pfe24.jbauer2 1226>


If I do not specify the pool and only specify the stripe_count as 0, the 
resulting striping is what I believe is the PFL striping from the root 
directory of the file system.  Is this what is expected?  I would expect 
a stripe_count of 1, as above, with the pool from the parent directory's 
striping.


pfe24.jbauer2 1226> lfs setstripe -c 0 /nobackupp18/jbauer2/testing/
pfe24.jbauer2 1227> lfs getstripe -d /nobackupp18/jbauer2/testing/
  lcm_layout_gen:    0
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id: N/A
    lcme_mirror_id:  N/A
    lcme_flags:  prefer
    lcme_extent.e_start: 0
    lcme_extent.e_end:   268435456
  stripe_count:  1   stripe_size:   16777216   pattern:   raid0   stripe_offset: -1   pool:  ssd-pool


    lcme_id: N/A
    lcme_mirror_id:  N/A
    lcme_flags:  prefer
    lcme_extent.e_start: 268435456
    lcme_extent.e_end:   5368709120
  stripe_count:  -1   stripe_size:   16777216   pattern:   raid0   stripe_offset: -1   pool:  ssd-pool


    lcme_id: N/A
    lcme_mirror_id:  N/A
    lcme_flags:  0
    lcme_extent.e_start: 5368709120
    lcme_extent.e_end:   EOF
  stripe_count:  16   stripe_size:   16777216   pattern:   raid0   stripe_offset: -1   pool:  hdd-pool


pfe24.jbauer2 1228>



Re: [lustre-discuss] liblustreapi.so llapi_layout_get_by_fd() taking a long time to complete

2022-11-25 Thread John Bauer

Andreas,

Thanks for the reply.  After days of debugging, thinking it was my 
instrumentation code at the root of the problem, this morning I 
discovered the jobs will sometimes have the 260 second pause even when 
not using my instrumentation package.  The intermittent nature sent me 
on that wild goose chase.  What's odd is now the 260 second delay shows 
up during the MPI_File_write_at() phase of the job.  I no longer invoke 
my package's option to set and get the striping info for the file, so I 
am no longer calling llapi_layout_get_by_fd().  This is an h5perf run:


mpirun -n 16 h5perf -A mpiio -B 2K -e 256M -i 1 -p 16 -P 16 -w -x 128M 
-X 128M -I -o ~/h5perf/${output}.log


Here are a few of my instrumentation plots of the pwrite64() calls for all 
the ranks, with file position on the Y axis and time on the X.  Note 
that each rank does 131072 pwrite64() calls of 2k bytes, strided by 
32k.  Lots of potential for lock contention.  There is always one rank 
that gets off to a good start, I suppose because once it is in the lead 
it does not have to deal with lock contention.  What's strange about 
this run is that it is the rank that is in the lead that hits the "road 
block" first and has the 260s delay before resuming its writes.  Some of 
the trailing ranks blow through the "road block" and continue writing.  
The other ranks also block on the same area of the file and pause for 
the 260 seconds.  Again, all ranks are on different nodes (hosts).  Why 
would some ranks pause and some not?  Why would the lead rank even pause 
at all?  Today I will try to associate which OSTs/OSSs are behind the 
road block.  Is there something in /proc/fs/lustre/osc/ that sheds light 
on RPC timeouts?
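
For reference, the per-rank write pattern described above boils down to 
roughly this sketch (illustrative only; the 32k stride assumes 16 ranks 
interleaving 2k blocks, this is not h5perf's actual code, and 32-bit 
builds would need -D_FILE_OFFSET_BITS=64):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Each rank writes 131072 blocks of 2 KiB, strided by 32 KiB. */
static void rank_writes(int fd, int rank)
{
	char buf[2048] = { 0 };
	long i;

	for (i = 0; i < 131072; i++) {
		off_t off = (off_t)i * 32768 + (off_t)rank * 2048;
		(void)pwrite(fd, buf, sizeof(buf), off);
	}
}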


John

https://www.dropbox.com/s/6uac96xhr79pj73/h5perf_1.png?dl=0

https://www.dropbox.com/s/tebm1iy0jipnqgx/h5perf_2.png?dl=0

https://www.dropbox.com/s/ydgzstkg9qrk6z4/h5perf_3.png?dl=0
On 11/24/22 20:47, Andreas Dilger wrote:

On Nov 22, 2022, at 13:57, John Bauer  wrote:


Hi all,

I am making a call to *llapi_layout_get_by_fd()*  from each rank of a 
16 rank MPI job.  One rank per node.


About 75% of the time, one of the ranks, typically rank 0, takes a 
very long time to complete this call.  I have placed fprintf() calls 
with wall clock timers around the call.  If it does take a long time 
it is generally about 260 seconds.  Otherwise it takes only 
micro-seconds.


How I access llapi_layout_get_by_fd() :

liblustreapi = dlopen("liblustreapi.so", RTLD_LAZY ) ;
LLAPI.layout_get_by_fd = dlsym( liblustreapi, "llapi_layout_get_by_fd" ) ;


How I call llapi_layout_get_by_fd() :

if(dbg)fprintf(stderr,"%s %12.8f %s() before LLAPI.layout_get_by_fd()\n",host,rtc(),__func__);
struct llapi_layout *layout = (*LLAPI.layout_get_by_fd)( fd, 0);
if(dbg)fprintf(stderr,"%s %12.8f %s() after LLAPI.layout_get_by_fd()\n",host,rtc(),__func__);


The resulting prints from rank 0 :

r401i2n10   7.22477698 LustreLayout_get_by_fd() before LLAPI.layout_get_by_fd()
r401i2n10 269.52539992 LustreLayout_get_by_fd() after LLAPI.layout_get_by_fd()


Any ideas on what might be triggering this?  The layout returned seems 
to be correct every time, whether it takes a long time or not.  The 
layout returned has the correct striping information, but the 
component has no OSTs as the component has yet to be instantiated for 
the new file.




Running under strace/ltrace would show where the slowdown is, and 
Lustre kernel debug logs would be needed to isolate this to a specific 
piece of code.  Given the length of time it is likely that an RPC is 
timing out (presumably nothing is printed on the console logs), but 
you'd need to look at exactly what is happening.


It's a *bit* strange, because this call is essentially equivalent to 
"getxattr", but over the years a bunch of cruft has been added and it 
is probably doing a lot more than it should...  You could potentially 
use (approximately):


       fgetxattr(fd, "lustre.lov", buf, buflen);
       llapi_layout_get_by_xattr(buf, buflen, 0);

but then we wouldn't know what is making this slow and you couldn't 
submit a patch to fix it...
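
For reference, a slightly fuller sketch of that workaround (hedged: the 
llapi_layout_get_by_xattr() prototype is assumed to match recent 
lustreapi.h, and the buffer size is arbitrary):

#include <sys/xattr.h>
#include <lustre/lustreapi.h>

static struct llapi_layout *layout_from_xattr(int fd)
{
	char buf[65536];        /* large enough for a wide composite layout */
	ssize_t len = fgetxattr(fd, "lustre.lov", buf, sizeof(buf));

	if (len < 0)
		return NULL;
	return llapi_layout_get_by_xattr(buf, len, 0);
}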


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud








[lustre-discuss] liblustreapi.so llapi_layout_get_by_fd() taking a long time to complete

2022-11-22 Thread John Bauer

Hi all,

I am making a call to *llapi_layout_get_by_fd()*  from each rank of a 16 
rank MPI job.  One rank per node.


About 75% of the time, one of the ranks, typically rank 0, takes a very 
long time to complete this call.  I have placed fprintf() calls with 
wall clock timers around the call.  If it does take a long time it is 
generally about 260 seconds.  Otherwise it takes only micro-seconds.


How I access llapi_layout_get_by_fd() :

liblustreapi = dlopen("liblustreapi.so", RTLD_LAZY ) ;
LLAPI.layout_get_by_fd = dlsym( liblustreapi, "llapi_layout_get_by_fd" ) ;

How I call llapi_layout_get_by_fd() :

if(dbg)fprintf(stderr,"%s %12.8f %s() before LLAPI.layout_get_by_fd()\n",host,rtc(),__func__);
struct llapi_layout *layout = (*LLAPI.layout_get_by_fd)( fd, 0);
if(dbg)fprintf(stderr,"%s %12.8f %s() after LLAPI.layout_get_by_fd()\n",host,rtc(),__func__);


The resulting prints from rank 0 :

r401i2n10   7.22477698 LustreLayout_get_by_fd() before LLAPI.layout_get_by_fd()
r401i2n10 269.52539992 LustreLayout_get_by_fd() after LLAPI.layout_get_by_fd()


Any ideas on what might be triggering this?  The layout returned seems 
to be correct every time, whether it takes a long time or not.  The 
layout returned has the correct striping information, but the component 
has no OSTs as the component has yet to be instantiated for the new file.


cat /sys/fs/lustre/version

2.12.8_ddn12




[lustre-discuss] fio and lustre performance

2022-08-25 Thread John Bauer

Hi all,

I'm trying to figure out an odd behavior when running an fio ( 
https://git.kernel.dk/cgit/fio/  ) 
benchmark on a Lustre file system.


fio --randrepeat=1 \
    --ioengine=posixaio \
    --buffered=1 \
    --gtod_reduce=1 \
    --name=test \
    --filename=${fileName} \
    --bs=1M \
    --iodepth=64 \
    --size=40G \
    --readwrite=randwrite

In short, the application queues 40,000 random aio_write64(nbyte=1M) requests 
to a maximum depth of 64, doing aio_suspend64 followed by aio_write to keep 
64 outstanding aio requests.  My I/O library that processes the aio 
requests does so with 4 pthreads removing aio requests from the queue 
and doing the I/Os as pwrite64()s.  The odd behavior is the intermittent 
pauses that can be seen in the first plot below.  The X-axis is wall 
clock time, in seconds, and the left Y-axis is file position.  The 
horizontal blue lines indicate the amount of time each of the pwrite64() 
calls is active and where in the file the I/O is occurring.  The right Y-axis 
is the cumulative cpu times for both the process and kernel during the 
run.  There is minimal user cpu time, for either the process or the kernel.  
The cumulative system cpu time attributable to the process (the red 
line) runs at a slope of ~4 system cpu seconds per wall clock second.  
Makes sense, since there are 4 pthreads at work in the user process.  The 
cumulative system cpu time for the kernel as a whole (the green line) 
is ~12 system cpu seconds per wall clock second.  Note that during the 
pauses the system cpu accumulation drops to near zero (zero slope).
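
A sketch of that queue-depth-64 aio pattern (simplified: sequential rather 
than random offsets, one shared buffer, no separate I/O-library pthreads, a 
placeholder file name, and plain aio_* names; building with 
-D_FILE_OFFSET_BITS=64 gives the *64 behaviour):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEPTH 64
#define BS    (1024 * 1024)
#define NREQ  40000

int main(void)
{
	static struct aiocb cb[DEPTH];
	const struct aiocb *list[DEPTH];
	int fd = open("fio_like.dat", O_WRONLY | O_CREAT, 0644);
	char *buf = malloc(BS);
	long submitted = 0;
	int i;

	if (fd < 0 || buf == NULL)
		return 1;
	while (submitted < NREQ) {
		/* top the queue back up to 64 outstanding requests */
		for (i = 0; i < DEPTH && submitted < NREQ; i++) {
			if (cb[i].aio_fildes == fd &&
			    aio_error(&cb[i]) == EINPROGRESS)
				continue;                    /* slot still busy */
			if (cb[i].aio_fildes == fd)
				(void)aio_return(&cb[i]);    /* reap finished request */
			memset(&cb[i], 0, sizeof(cb[i]));
			cb[i].aio_fildes = fd;
			cb[i].aio_buf    = buf;
			cb[i].aio_nbytes = BS;
			cb[i].aio_offset = (off_t)submitted * BS;
			if (aio_write(&cb[i]) == 0)
				submitted++;
		}
		for (i = 0; i < DEPTH; i++)
			list[i] = &cb[i];
		aio_suspend(list, DEPTH, NULL);  /* wait for at least one to finish */
	}
	/* drain the final in-flight requests */
	for (i = 0; i < DEPTH; i++) {
		if (cb[i].aio_fildes != fd)
			continue;
		while (aio_error(&cb[i]) == EINPROGRESS)
			aio_suspend(list, DEPTH, NULL);
		(void)aio_return(&cb[i]);
	}
	free(buf);
	close(fd);
	return 0;
}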


This is run on a dedicated ivybridge node with 40 cores Intel(R) Xeon(R) 
CPU E5-2680 v2 @ 2.80GHz


The node has 64G of memory.

The file is striped as a single-component PFL, 8x1M.  Lustre version 
*2.12.8_ddn12*.


Does anyone have any ideas what is causing the pauses?  Is there 
something else I could be looking at in the /proc file system to gain 
insight?


For comparison, the 2nd plot below is from a run on /tmp.  Note that there 
are some pwrite64() calls that take a long time, but a single pwrite64() 
taking a long time does not stop all the other pwrite64() calls active during 
the same time period.  Elapsed time for /tmp is 13 seconds; Lustre is 28 
seconds.  Both are essentially memory resident.


Just for completeness I have added a 3rd plot which is the amount of 
memory each of the OSC clients is consuming over the course of the 
Lustre run.  Nothing unusual there.  The memory consumption rate slows 
down during the pauses as one would expect.


I don't think the instrumentation is the issue, as there is not much 
more instrumentation occurring in the Lustre run versus /tmp, and they 
are both less than 6MB each in total.


John

In case the images got stripped here are some URLs to dropbox

plot1 : https://www.dropbox.com/s/ey217o053gdyse5/plot1.png?dl=0

plot2 : https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0

plot3 : https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0







Re: [lustre-discuss] fiemap, final chapter.

2022-08-19 Thread John Bauer

Andreas,

As I mentioned in an earlier email, this had been working for a long 
time.  I think that using an old header file is at the root of the 
issue.  On my development platform, which doesn't have Lustre installed 
(nor did I have e2fsprogs installed), I had simply copied the Lustre files 
I needed from the site I was working with.  The fiemap.h file I was 
using, the top of which is shown below (which I see you mentioned), has 
fe_device explicitly in the structure.  Was it this way before the 
#define fe_device was implemented?  The #define uses 
fe_reserved[0], which always had a 0 value.  What puzzles me is why this 
ever worked at all.  That will have to wait for a rainy day to mess 
with.  What started me down this path at this time was getting my Lustre 
extents plotting program working with PFL.


Again, thanks much for your excellent/quick assistance in tracking this 
down.


John

/*
* FS_IOC_FIEMAP ioctl infrastructure.
*
* Some portions copyright (C) 2007 Cluster File Systems, Inc
*
* Authors: Mark Fasheh 
* Kalpak Shah 
* Andreas Dilger 
*/

#ifndef _LINUX_FIEMAP_H
#define _LINUX_FIEMAP_H

struct fiemap_extent {
	__u64 fe_logical;   /* logical offset in bytes for the start of
			     * the extent from the beginning of the file */
	__u64 fe_physical;  /* physical offset in bytes for the start
			     * of the extent from the beginning of the disk */
	__u64 fe_length;    /* length in bytes for this extent */
	__u64 fe_reserved64[2];
	__u32 fe_flags;     /* FIEMAP_EXTENT_* flags for this extent */
	__u32 fe_device;    /* device number (fs-specific if FIEMAP_EXTENT_NET) */
	__u32 fe_reserved[2];
};
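
A quick standalone way to check what the compiler actually did with this 
header (a hypothetical test program; adjust the include path to wherever 
the copied header lives):

#include <stdio.h>
#include <stddef.h>
#include "ext2fs/fiemap.h"      /* the local copy quoted above */

int main(void)
{
	printf("sizeof(struct fiemap_extent) = %zu\n",
	       sizeof(struct fiemap_extent));
	printf("offsetof(fe_flags)    = %zu\n",
	       offsetof(struct fiemap_extent, fe_flags));
	printf("offsetof(fe_device)   = %zu\n",
	       offsetof(struct fiemap_extent, fe_device));
	printf("offsetof(fe_reserved) = %zu\n",
	       offsetof(struct fiemap_extent, fe_reserved));
	return 0;
}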

On 8/18/22 23:44, Andreas Dilger wrote:
The "fe_device" field is actually Lustre-specific, so it is a macro 
that overlays on fe_reserved[0]:


 #define fe_device       fe_reserved[0]

but that shouldn't affect compiler alignment.  On my system, "pahole 
lustre/llite/lustre.ko" reports:


struct fiemap_extent {
        __u64                      fe_logical;           /*   0     8 */
        __u64                      fe_physical;          /*   8     8 */
        __u64                      fe_length;            /*   16     8 */
        __u64                      fe_reserved64[2];     /*   24    16 */
        __u32                      fe_flags;             /*   40     4 */
        __u32                      fe_reserved[3];       /*   44    12 */

        /* size: 56, cachelines: 1, members: 6 */
        /* last cacheline: 56 bytes */
};

So there is definitely something going wrong with the struct alignment 
for fe_reserved, even though there doesn't need to be (all of the 
fields have "natural" alignment on their 4/8-byte sizes).


The other thing that is strange is that you show only 2 fe_reserved[] 
fields, when I have 3.  Is there some other field added to your 
version of struct fiemap_extent after fe_flags?  I don't see anything 
in the upstream kernel, nor in the Lustre headers.


You could try adding "__attribute__((packed))" at the end of the 
struct definition to see if that fixes the problem.


Cheers, Andreas


On Aug 18, 2022, at 21:54, John Bauer  wrote:

Andreas,

This is no longer Lustre related, but I hope you can shed some light 
on this.  It appears that my compiler, gcc 8.5.0, which I upgraded 
to recently when I upgraded my build system to CentOS 8, is not 
padding the struct fiemap_extent correctly.  I put the following 
prints in to see what's going on.  The size of the structure is good at 
56, but notice that both fe_device and fe_reserved[0] have an offset 
of 48 bytes into the structure.  Odd that the sizeof fe_flags is 4, 
but fe_device is 8 bytes away from it.  I traced the compile to ensure 
that I am getting the lustre_include/ext2fs/fiemap.h and there is 
nothing odd in the fiemap.h (it's the one I've been using for years).  
Any thoughts on how to remedy this?


John

fprintf(stderr,"%s() logical  %d %ld\n", __func__, sizeof(fm_ext->fe_logical ), (char 
*)_ext->fe_logical  -(char *)fm_ext);
fprintf(stderr,"%s() physical %d %ld\n", __func__, sizeof(fm_ext->fe_physical), (char 
*)_ext->fe_physical -(char *)fm_ext);
fprintf(stderr,"%s() length   %d %ld\n", __func__, sizeof(fm_ext->fe_length  ), (char 
*)_ext->fe_length -  (char *)fm_ext);
fprintf(stderr,"%s() res64[0] %d %ld\n", __func__, sizeof(fm_ext->fe_reserved64[0]), 
(char *)_ext->fe_reserved64[0] -  (char *)fm_ext);
fprintf(stderr,"%s() res64[1] %d %ld\n", __func__, sizeof(fm_ext->fe_reserved64[1]), 
(char *)_ext->fe_reserved64[1] -  (char *)fm_ext);
fprintf(stderr,"%s() flags%d %ld\n", __func__, sizeof(fm_ext->fe_flags   ), (char 
*)_ext->fe_flags-(char *)fm_ext);
fprintf(stderr,"%s() device   %d %ld\n", __func__, sizeof(fm_ext->fe_device  ), (char 
*)_ext->fe_device   -(char *)fm_ext);
fprintf(stderr,"%s() res32[0] %d %ld\n&q

Re: [lustre-discuss] fiemap

2022-08-18 Thread John Bauer

Andreas,

Well, that works.  I got the devices I would expect.  The ioctl() calls 
look identical.  The lengths are identical (allowing for the 1024-byte 
block factor).  But my devices are 0.  Thanks for getting me going with the 
correct filefrag.  I'll report back when I sort out my problem.


John

pfe27.jbauer2 390> strace -o filefrag.strace ./misc/filefrag -v 
/nobackupp17/jbauer2/dd.dat
Filesystem type is: bd00bd0
File size of /nobackupp17/jbauer2/dd.dat is 104857600 (102400 blocks of 1024 
bytes)
 ext:     device_logical:         physical_offset: length:  dev: flags:
   0:        0..   13311:  33431977984.. 33431991295:  13312: 0008: net
   1:        0..   13311: 164044554240..164044567551:  13312: 0009: net
   2:        0..   13311: 539103838208..539103851519:  13312: 000a: net
   3:        0..   13311:  48145154048.. 48145167359:  13312: 000b: net
   4:        0..   12287: 168782233600..168782245887:  12288: 000c: net
   5:        0..   12287: 168137900032..168137912319:  12288: 000d: net
   6:        0..   12287:  18729435136.. 18729447423:  12288: 000e: net
   7:        0..   12287: 163376496640..163376508927:  12288: 000f: last,net
/nobackupp17/jbauer2/dd.dat: 8 extents found

strace lines of interest for filefrag

ioctl(3,FS_IOC_FIEMAP,{fm_start=0, fm_length=18446744073709551615, 
fm_flags=0x4000  /* FIEMAP_FLAG_??? */, fm_extent_count=292}  
=>{fm_flags=0x4000  /* FIEMAP_FLAG_??? */, fm_mapped_extents=8, ...})  =  0
write(1," ext: device_logical: "...,75)  =  75
write(1," 0: 0.. 13311: 33431"...,72)  =  72
write(1," 1: 0.. 13311: 164044"...,72)  =  72

strace lines of interest for llfie

ioctl(3,FS_IOC_FIEMAP,{fm_start=0, fm_length=18446744073709551615, 
fm_flags=0x4000  /* FIEMAP_FLAG_??? */, fm_extent_count=1024}  
=>{fm_flags=0x4000  /* FIEMAP_FLAG_??? */, fm_mapped_extents=8, ...})  =  0
write(2,"listExtents() fe_physical=342343"...,72)  =  72
write(2,"listExtents() fe_physical=167981"...,73)  =  73
write(2,"listExtents() fe_physical=552042"...,73)  =  73
write(2,"listExtents() fe_physical=493006"...,72)  =  72
write(2,"listExtents() fe_physical=172833"...,73)  =  73

On 8/18/22 16:11, Andreas Dilger wrote:

On Aug 18, 2022, at 14:28, John Bauer  wrote:


Andreas,

Thanks for the reply.  I don't think I'm accessing the Lustre 
filefrag ( see below ).  Where would I normally find that installed? 
I downloaded the lustre-release git repository and can't find 
filefrag stuff to build my own.  Is that somewhere else?


filefrag is part of the e2fsprogs package ("rpm -qf $(which 
filefrag)"), so you need to download and install the Lustre-patched 
e2fsprogs from _https://downloads.whamcloud.com/public/e2fsprogs/latest/_



More info:

pfe27.jbauer2 334> cat /sys/fs/lustre/version
2.12.8_ddn12


You should really use "lctl get_param version", since the Lustre /proc 
and /sys files move around on occasion.


The PFL/FLR change for FIEMAP is not included in this version, but it 
_should_ be irrelevant because the file you are testing is using a 
plain layout, not PFL or FLR.

pfe27.jbauer2 335> filefrag -v /nobackupp17/jbauer2/dd.dat
Filesystem type is: bd00bd0
File size of /nobackupp17/jbauer2/dd.dat is 104857600 (25600 blocks of 4096 
bytes)
/nobackupp17/jbauer2/dd.dat: FIBMAP unsupported

pfe27.jbauer2 336> which filefrag
/usr/sbin/filefrag


John

On 8/18/22 14:57, Andreas Dilger wrote:
What version of Lustre are you using?  Does "filefrag -v" from a 
newer Lustre e2fsprogs (1.45.6.wc3+) work properly?


There was a small change to the Lustre FIEMAP handling in order to 
handle overstriped files and PFL/FLR files with many stripes and 
multiple components, since the FIEMAP "restart" mechanism was broken 
for files that had multiple objects on the same OST index.  See 
LU-11484 for details.  That change was included in the 2.14.0 release.


In essence, the fe_device field now encodes the absolute file stripe 
number in the high 16 bits of fe_device, and the device number in 
the low 16 bits (as it did before).   Since "filefrag -v" prints 
fe_device in hex and would show as "0x<stripe><device>" instead of 
"0x<device>", this was considered an acceptable tradeoff 
compared to other "less compatible" changes that would have been 
needed to implement PFL/FLR handling.


That said, I would have expected this change to result in your tool 
reporting very large values for fe_device (e.g. OST index + N * 
65536), so returning all-zero values is somewhat unexpected.


Cheers, Andreas


On Aug 18, 2022, at 06:27, John Bauer  wrote:

Hi all,

I am trying to get my llfie program (which uses fiemap) going 
again, but now the struct fiemap_extent structures I get back from 
the ioctl call, all have fe_device=0.  The output from lfs 
getstripe indicates that the devices are not all 0.  The sum of the 
fe_length members adds up to the file si

Re: [lustre-discuss] fiemap

2022-08-18 Thread John Bauer

Andreas,

Thanks for the reply.  I don't think I'm accessing the Lustre filefrag ( 
see below ).  Where would I normally find that installed? I downloaded 
the lustre-release git repository and can't find filefrag stuff to build 
my own.  Is that somewhere else? More info:


pfe27.jbauer2 334> cat /sys/fs/lustre/version
2.12.8_ddn12

pfe27.jbauer2 335> filefrag -v /nobackupp17/jbauer2/dd.dat
Filesystem type is: bd00bd0
File size of /nobackupp17/jbauer2/dd.dat is 104857600 (25600 blocks of 4096 
bytes)
/nobackupp17/jbauer2/dd.dat: FIBMAP unsupported

pfe27.jbauer2 336> which filefrag
/usr/sbin/filefrag


John

On 8/18/22 14:57, Andreas Dilger wrote:
What version of Lustre are you using?  Does "filefrag -v" from a newer 
Lustre e2fsprogs (1.45.6.wc3+) work properly?


There was a small change to the Lustre FIEMAP handling in order to 
handle overstriped files and PFL/FLR files with many stripes and 
multiple components, since the FIEMAP "restart" mechanism was broken 
for files that had multiple objects on the same OST index.  See 
LU-11484 for details.  That change was included in the 2.14.0 release.


In essence, the fe_device field now encodes the absolute file stripe 
number in the high 16 bits of fe_device, and the device number in the 
low 16 bits (as it did before). Since "filefrag -v" prints fe_device 
in hex and would show as "0x<stripe><device>" instead of 
"0x<device>", this was considered an acceptable tradeoff compared 
to other "less compatible" changes that would have been needed to 
implement PFL/FLR handling.


That said, I would have expected this change to result in your tool 
reporting very large values for fe_device (e.g. OST index + N * 
65536), so returning all-zero values is somewhat unexpected.


Cheers, Andreas


On Aug 18, 2022, at 06:27, John Bauer  wrote:

Hi all,

I am trying to get my llfie program (which uses fiemap) going again, 
but now the struct fiemap_extent structures I get back from the ioctl 
call, all have fe_device=0. The output from lfs getstripe indicates 
that the devices are not all 0.  The sum of the fe_length members 
adds up to the file size, so that is working.  The fe_physical 
members look reasonable also.  Has something changed? This used to work.


Thanks, John

pfe27.jbauer2 300> llfie /nobackupp17/jbauer2/dd.dat
LustreStripeInfo_get() lum->lmm_magic=0xbd30bd0
listExtents() fe_physical=30643484360704 fe_device=0 fe_length=16777216
listExtents() fe_physical=30646084829184 fe_device=0 fe_length=2097152
listExtents() fe_physical=5705226518528 fe_device=0 fe_length=16777216
listExtents() fe_physical=5710209351680 fe_device=0 fe_length=2097152
listExtents() fe_physical=30621271326720 fe_device=0 fe_length=16777216
listExtents() fe_physical=31761568366592 fe_device=0 fe_length=16777216
listExtents() fe_physical=24757567225856 fe_device=0 fe_length=16777216
listExtents() fe_physical=14196460748800 fe_device=0 fe_length=16777216
listExtents() nMapped=8 byteCount=104857600


pfe27.jbauer2 301> lfs getstripe /nobackupp17/jbauer2/dd.dat
/nobackupp17/jbauer2/dd.dat
lmm_stripe_count:  6
lmm_stripe_size:   2097152
lmm_pattern:   raid0
lmm_layout_gen:    0
lmm_stripe_offset: 126
lmm_pool:  ssd-pool
          obdidx           objid           objid           group
             126        13930025        0xd48e29               0
             113        13115889        0xc821f1               0
             120        14003176        0xd5abe8               0
             109        12785483        0xc3174b               0
             102        13811117        0xd2bdad               0
             116        13377285        0xcc1f05               0



Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud








[lustre-discuss] fiemap

2022-08-18 Thread John Bauer

Hi all,

I am trying to get my llfie program (which uses fiemap) going again, but 
now the struct fiemap_extent structures I get back from the ioctl call, 
all have fe_device=0.  The output from lfs getstripe indicates that the 
devices are not all 0.  The sum of the fe_length members adds up to the 
file size, so that is working.  The fe_physical members look reasonable 
also.  Has something changed?  This used to work.


Thanks, John

pfe27.jbauer2 300> llfie /nobackupp17/jbauer2/dd.dat
LustreStripeInfo_get() lum->lmm_magic=0xbd30bd0
listExtents() fe_physical=30643484360704 fe_device=0 fe_length=16777216
listExtents() fe_physical=30646084829184 fe_device=0 fe_length=2097152
listExtents() fe_physical=5705226518528 fe_device=0 fe_length=16777216
listExtents() fe_physical=5710209351680 fe_device=0 fe_length=2097152
listExtents() fe_physical=30621271326720 fe_device=0 fe_length=16777216
listExtents() fe_physical=31761568366592 fe_device=0 fe_length=16777216
listExtents() fe_physical=24757567225856 fe_device=0 fe_length=16777216
listExtents() fe_physical=14196460748800 fe_device=0 fe_length=16777216
listExtents() nMapped=8 byteCount=104857600


pfe27.jbauer2 301> lfs getstripe /nobackupp17/jbauer2/dd.dat
/nobackupp17/jbauer2/dd.dat
lmm_stripe_count:  6
lmm_stripe_size:   2097152
lmm_pattern:   raid0
lmm_layout_gen:0
lmm_stripe_offset: 126
lmm_pool:  ssd-pool
          obdidx           objid           objid           group
             126        13930025        0xd48e29               0
             113        13115889        0xc821f1               0
             120        14003176        0xd5abe8               0
             109        12785483        0xc3174b               0
             102        13811117        0xd2bdad               0
             116        13377285        0xcc1f05               0



Re: [lustre-discuss] llapi_layout_file_comp_del

2022-08-01 Thread John Bauer

Andeas and others,

Thanks again for all the info.  I guess it's about time for me to attempt 
to contribute to the Lustre code base.  A question about 
llapi_layout_dom_size().  I have code that is building up a multi-component 
layout.  After switching to the second component with 
llapi_layout_comp_use( layout, LLAPI_LAYOUT_COMP_USE_NEXT ) I call 
llapi_layout_dom_size() so I know where to set the extent start for the 
second component.  Internal to ...dom_size(), the current component gets 
set back to the first component, see below.  I then make calls thinking 
I am modifying the second component, but I am actually modifying the 
first.  Is this behavior documented somewhere, that calls to 
llapi_layout_... functions can change the current component without notice?


Is there an llapi function that returns the Lustre file system's max DoM 
size?  I mistakenly thought this was the purpose of 
llapi_layout_dom_size(), but found out I made a bad assumption.  Trying 
to find the max DoM size is what sent me down this path.


John

int llapi_layout_dom_size(struct llapi_layout *layout, uint64_t *size)
{
	uint64_t pattern, start;
	int rc;

	if (!layout || !llapi_layout_is_composite(layout)) {
		*size = 0;
		return 0;
	}

	rc = llapi_layout_comp_use(layout, LLAPI_LAYOUT_COMP_USE_FIRST);
	if (rc)
		return -errno;

	rc = llapi_layout_pattern_get(layout, &pattern);
	if (rc)
		return -errno;

	if (pattern != LOV_PATTERN_MDT && pattern != LLAPI_LAYOUT_MDT) {
		*size = 0;
		return 0;
	}

	rc = llapi_layout_comp_extent_get(layout, &start, size);
	if (rc)
		return -errno;
	if (start)
		return -ERANGE;
	return 0;
}

On 7/28/22 19:42, Andreas Dilger wrote:
John, you are probably right that allowing a passed fd would also be 
useful, but nobody has done it this way before because of the need to 
use O_LOV_DELAY_CREATE within the application code...  and to be 
honest very few applications tune their IO to this extent, especially 
with PFL layouts being available to handle 99% of the usage.


Please feel free to submit a patch that splits 
llapi_layout_file_open() into two functions:
- llapi_layout_open_fd() (or ...fd_open()? not sure) that only opens 
the file with O_LOV_DELAY_CREATE and returns the fd
- llapi_layout_set_by_fd() that sets the layout on the provided fd 
(whatever the source)


Then llapi_layout_file_open() would just call those two functions 
internally, and you could use only the llapi_layout_set_by_fd() 
function, if available (you could add:


#define HAVE_LLAPI_LAYOUT_SET_BY_FD

in the lustreapi.h header to avoid the need for a separate configure 
check.   Otherwise, your code would call your own internal wrapper of 
the same name that calls llapi_layout_file_open() and immediately 
closes the returned fd.  That would be slightly less efficient than 
the new API (two opens and closes), but would allow you to migrate to 
the new (more efficient) implementation easily in the future.
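
For illustration, a sketch of that fallback.  llapi_layout_set_by_fd() and 
HAVE_LLAPI_LAYOUT_SET_BY_FD are the *proposed* names from this message, not a 
released API, and the sketch assumes the original fd was opened with 
O_LOV_DELAY_CREATE as described below:

#include <fcntl.h>
#include <unistd.h>
#include <lustre/lustreapi.h>

/* Apply a layout to an already-open fd if the proposed fd-based call
 * exists; otherwise fall back to the path-based open and immediately
 * close the extra fd. */
static int apply_layout(const char *path, int fd,
                        const struct llapi_layout *layout)
{
#ifdef HAVE_LLAPI_LAYOUT_SET_BY_FD
    (void)path;
    return llapi_layout_set_by_fd(fd, layout);   /* proposed API */
#else
    int fd2 = llapi_layout_file_open(path, O_RDWR, 0, layout);

    if (fd2 < 0)
        return -1;
    close(fd2);                                  /* keep using the original fd */
    (void)fd;
    return 0;
#endif
}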


Cheers, Andreas


On Jul 28, 2022, at 14:03, John Bauer  wrote:

Andreas,

Thanks for the info.  A related question: I am using the 
O_LOV_DELAY_CREATE open flag mechanism to open a file and then set 
the composite layout with llapi_layout_file_open().  I was kind of 
surprised this worked.  This ends up opening the file twice and I 
simply close the fd returned from llapi_layout_file_open().  It would 
seem there should be an llapi function such as 
llapi_layout_set_by_fd() to match the llapi_layout_get_by_fd().  I 
need to use this mechanism to set striping for files where the 
pathname is not necessarily known before the open, such as the 
mkstemps() family of opens.  It also makes it easier to handle 
setting striping for files opened with openat().  It seems it would 
be more straight forward for llapi to work with an fd than a pathname 
if a valid fd already exists. Am I missing an easier way to do this?


Thanks,

John

On 7/27/22 16:25, Andreas Dilger wrote:
The HLD document was written before the feature was implemented, and 
is outdated.  The lustreapi.h and llapi_layout_file_comp_del.3 man 
page are correct.  Feel free to update the wiki to use the correct 
argument list.


I believe that it is possible to delete multiple components that 
match the flags argument (e.g. LCME_FL_NEG|LCME_FL_INIT), but I 
haven't tested that.



On Jul 26, 2022, at 14:35, John Bauer  wrote:

Hi all,

I would like to use the llapi_layout_file_comp_del() function.  I 
have found 2 prototypes in different places.  One has the 3rd 
argument, uint32_t flags, and the other doesn't.  I suspect the 
High Level Design document is incorrect.  The one line of 
documentation in lustreapi.h indicates I could delete multiple 
components with one call.  How does one do that? What are the 
applicable flags?


From version 2.12.8 lustreapi.h

/**

Re: [lustre-discuss] llapi_layout_file_comp_del

2022-07-28 Thread John Bauer

Andreas,

Thanks for the info.  A related question:  I am using the 
O_LOV_DELAY_CREATE open flag mechanism to open a file and then set the 
composite layout with llapi_layout_file_open().  I was kind of surprised 
this worked.  This ends up opening the file twice and I simply close the 
fd returned from llapi_layout_file_open().  It would seem there should 
be an llapi function such as llapi_layout_set_by_fd() to match the 
llapi_layout_get_by_fd().  I need to use this mechanism to set striping 
for files where the pathname is not necessarily known before the open, 
such as the mkstemps() family of opens.  It also makes it easier to 
handle setting striping for files opened with openat().  It seems it 
would be more straight forward for llapi to work with an fd than a 
pathname if a valid fd already exists. Am I missing an easier way to do 
this?


Thanks,

John

On 7/27/22 16:25, Andreas Dilger wrote:
The HLD document was written before the feature was implemented, and 
is outdated.  The lustreapi.h and llapi_layout_file_comp_del.3 man 
page are correct.  Feel free to update the wiki to use the correct 
argument list.


I believe that it is possible to delete multiple components that match 
the flags argument (e.g. LCME_FL_NEG|LCME_FL_INIT), but I haven't 
tested that.
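
As a hedged sketch of that flags-based form (untested, per the note above; passing 
0 for the id so that only the flags select the components is an assumption):

#include <lustre/lustreapi.h>

/* Sketch only: delete every component of the file that has NOT been
 * instantiated, matching on flags rather than on a component id.
 * LCME_FL_NEG negates the LCME_FL_INIT match; id == 0 is assumed to
 * mean "no specific component id". */
static int drop_uninstantiated_components(const char *path)
{
    return llapi_layout_file_comp_del(path, 0,
                                      LCME_FL_NEG | LCME_FL_INIT);
}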



On Jul 26, 2022, at 14:35, John Bauer  wrote:

Hi all,

I would like to use the llapi_layout_file_comp_del() function.  I 
have found 2 prototypes in different places.  One has the 3rd 
argument, uint32_t flags, and the other doesn't.  I suspect the High 
Level Design document is incorrect.  The one line of documentation in 
lustreapi.h indicates I could delete multiple components with one 
call.  How does one do that? What are the applicable flags?


From version 2.12.8 lustreapi.h

/**
* Delete component(s) by the specified component id or flags.
*/
int  llapi_layout_file_comp_del(const  char  *path, uint32_t  id, 
uint32_t  flags);



From https://wiki.lustre.org/PFL2_High_Level_Design

A new interface llapi_layout_file_comp_del(3) to delete component(s) 
by the specified component id (accepting LCME_ID_* wildcards also) 
from an existing file:


int llapi_layout_file_comp_del(const char *path, uint32_t id);

John

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] llapi_layout_file_comp_del

2022-07-26 Thread John Bauer

Hi all,

I would like to use the llapi_layout_file_comp_del() function.  I have 
found 2 prototypes in different places.  One has the 3rd argument, 
uint32_t flags, and the other doesn't.  I suspect the High Level Design 
document is incorrect.  The one line of documentation in lustreapi.h 
indicates I could delete multiple components with one call.  How does 
one do that? What are the applicable flags?


From version 2.12.8 lustreapi.h

/**
* Delete component(s) by the specified component id or flags.
*/
int llapi_layout_file_comp_del(const char *path, uint32_t id, uint32_t flags);


From https://wiki.lustre.org/PFL2_High_Level_Design

A new interface llapi_layout_file_comp_del(3) to delete component(s) by 
the specified component id (accepting LCME_ID_* wildcards also) from an 
existing file:


int llapi_layout_file_comp_del(const char *path, uint32_t id);

John

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] lfs getstripe -d

2022-07-18 Thread John Bauer

Hi all,

Looking in the documentation at 
https://doc.lustre.org/lustre_manual.xhtml it appears that lfs getstripe 
should/does echo the directory that is being queried.


$ lfs getstripe -d /mnt/testfs/pfldir
/mnt/testfs/pfldir
stripe_count:  1 stripe_size:   1048576 stripe_offset: -1
/mnt/testfs/pfldir/commonfile
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:   1
lmm_layout_gen:    0
lmm_stripe_offset: 0
    obdidx objid objid group
 2 9  0x9 0


But when I do similar on a Lustre 2.12.8 client, the directory being 
queried is not echoed.  Is this a bug?


pfe26.jbauer2 197> lfs getstripe -d /nobackupp17
  lcm_layout_gen:    0 # the directory is missing above this line
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id: N/A
    lcme_mirror_id:  N/A
    lcme_flags:  0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   268435456
  stripe_count:  1   stripe_size:   16777216 pattern:   
raid0   stripe_offset: -1 pool:  ssd-pool


    lcme_id: N/A
    lcme_mirror_id:  N/A
    lcme_flags:  0
    lcme_extent.e_start: 268435456
    lcme_extent.e_end:   5368709120
  stripe_count:  16   stripe_size:   16777216 pattern:   
raid0   stripe_offset: -1 pool:  ssd-pool


    lcme_id: N/A
    lcme_mirror_id:  N/A
    lcme_flags:  0
    lcme_extent.e_start: 5368709120
    lcme_extent.e_end:   EOF
  stripe_count:  16   stripe_size:   16777216 pattern:   
raid0   stripe_offset: -1 pool:  hdd-pool


pfe26.jbauer2 198> cat /sys/fs/lustre/version
2.12.8_ddn6
pfe26.jbauer2
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] overstriping

2022-06-15 Thread John Bauer
Just a note that the man page for lfs setstripe seems a bit misleading 
for the -C --overstripe-count option.  The man page states "creating > 1 
stripe per OST if count exceeds the number of OSTs in  the file 
system".  It appears that lustre creates more than one stripe on an OST 
if the number of stripes exceeds the number of OSTs in the pool being 
used.  In my example below I selected a pool with 8 OSTs, and striped 16 
wide.  The file system has 64 OSTs.  16 is less than 64, but I still got 
2 stripes per OST.


   -C, --overstripe-count 
  The number of stripes to create, creating > 1 stripe per 
OST if count exceeds the number of OSTs in  the  file
  system.  0  means to use the filesystem-wide default 
stripe count (default 1), and -1 means to stripe over all

  available OSTs.


pfe22.jbauer2 908> lfs setstripe --pool 1_hdd -C 16 this.dat
pfe22.jbauer2 909> lfs getstripe this.dat
this.dat
lmm_stripe_count:  16
lmm_stripe_size:   1048576
lmm_pattern:   raid0,overstriped
lmm_layout_gen:    0
lmm_stripe_offset: 7
lmm_pool:  1_hdd
    obdidx         objid         objid         group
     7      23665469        0x1691b3d     0
     0      22497141        0x1574775     0
     4      23541897        0x1673889     0
     2      23308778        0x163a9ea     0
     1      23033000        0x15f74a8     0
     6      23687573        0x1697195     0
     5      23635915        0x168a7cb     0
     3      23446533        0x165c405     0
     7      23665470        0x1691b3e     0
     0      22497142        0x1574776     0
     4      23541898        0x167388a     0
     2      23308779        0x163a9eb     0
     1      23033001        0x15f74a9     0
     6      23687574        0x1697196     0
     5      23635916        0x168a7cc     0
     3      23446534        0x165c406     0
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] llapi documentation

2022-06-15 Thread John Bauer
While sitting stuck in traffic in a torrential downpour I figured out 3 
of my 4 questions. The only remaining question:  Can something be done 
about the prints coming from the llapi_get_pool...() functions?


llapi_get_poollist() generates a print of: Pools from nbp17:

llapi_get_poolmembers() generates a print of: Pool: nbp17.1_ssd

John

On 6/15/2022 7:53 AM, lustre-discuss-requ...@lists.lustre.org wrote:


Today's Topics:

1. Re: llapi documentation (Andreas Dilger)
2. Re: llapi documentation (John Bauer)


--

Message: 1
Date: Wed, 15 Jun 2022 07:08:58 +
From: Andreas Dilger
To: John Bauer
Cc:"lustre-discuss@lists.lustre.org"

Subject: Re: [lustre-discuss] llapi documentation
Message-ID:<19c08e40-b54a-485d-99e8-925886608...@ddn.com>
Content-Type: text/plain; charset="us-ascii"

On Jun 14, 2022, at 05:32, John Bauer <bau...@iodoctors.com> wrote:

I have had little success in my search for documentation on pool functions in 
llapi. I've looked in:

https://wiki.lustre.org/PFL2_High_Level_Design

https://doc.lustre.org/lustre_manual.xhtml#managingstripingfreespace


I'm looking for info on llapi_get_poollist() and llapi_get_poolmembers().  I've 
found the prototype in /usr/include/lustre/lustreapi.h, but that's about it.

Can anyone point me to some documentation?

it looks like there aren't man pages for these functions, just the function 
comment blocks
in the code, and their usage internally:

/**
  * Get the list of pools in a filesystem.
  * \param name        filesystem name or path
  * \param poollist    caller-allocated array of char*
  * \param list_size   size of the poollist array
  * \param buffer  caller-allocated buffer for storing pool names
  * \param buffer_size size of the buffer
  *
  * \return number of pools retrieved for this filesystem
  * \retval -error failure
  */
int llapi_get_poollist(const char *name, char **poollist, int list_size,
char *buffer, int buffer_size)

/**
  * Get the list of pool members.
  * \param poolname    string of format <fsname>.<poolname>
  * \param members     caller-allocated array of char*
  * \param list_size   size of the members array
  * \param buffer  caller-allocated buffer for storing OST names
  * \param buffer_size size of the buffer
  *
  * \return number of members retrieved for this pool
  * \retval -error failure
  */
int llapi_get_poolmembers(const char *poolname, char **members,
   int list_size, char *buffer, int buffer_size)
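
For illustration, a minimal sketch combining the two calls documented above (array 
and buffer sizes are arbitrary assumptions; note the follow-up in this thread about 
copying each pool name before the buffers are reused):

#include <stdio.h>
#include <lustre/lustreapi.h>

/* Sketch: enumerate the pools of one filesystem and their member OSTs. */
static void dump_pools(const char *fsname_or_path)
{
    char *pools[64], poolbuf[8192];
    int npools = llapi_get_poollist(fsname_or_path, pools, 64,
                                    poolbuf, sizeof(poolbuf));

    for (int i = 0; i < npools; i++) {
        char name[256];
        char *members[256], membuf[16384];
        int nmemb, j;

        /* Copy the name first: the poollist storage may be reused below. */
        snprintf(name, sizeof(name), "%s", pools[i]);

        nmemb = llapi_get_poolmembers(name, members, 256,
                                      membuf, sizeof(membuf));
        for (j = 0; j < nmemb; j++)
            printf("%s member[%d]=%s\n", name, j, members[j]);
    }
}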

Patches to add llapi_get_poollist.3 and llapi_get_poolmembers.3 (and related) 
man
pages welcome. The pool related functions should probably be moved into a new
liblustreapi_pool.c file to reduce the size of liblustreapi.c.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud








--

Message: 2
Date: Wed, 15 Jun 2022 07:56:40 -0500
From: John Bauer
To: Andreas Dilger
Cc:"lustre-discuss@lists.lustre.org"

Subject: Re: [lustre-discuss] llapi documentation
Message-ID:<8b437470-1f49-1535-94f2-8492956a4...@iodoctors.com>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

Andreas,

Thanks for the info.  It got me a lot farther down the road. A few comments:

1)  It appears that the values returned in the poollist argument to
llapi_get_poollist() are temporary.  I used the values in

poollist in the call to llapi_get_poolmembers( poollist[i],...). Works
great on the first call to get_poolmembers(), but subsequent calls

fail and it appears the poollist has been overwritten.  If I make my own
copy of the strings in poollist before calling get_poolmembers,
everything works well.

If this is indeed the case, it should be noted.

2)  There are prints to stdout or stderr resulting from calls to these
llapi functions.

llapi_get_poollist generates a print of: Pools from nbp17:

llapi_get_poolmembers generates a print of: Pool: nbp17.1_ssd

Seems undesirable to have llapi doing unrequested prints.

3)  What is the buffer argument good for ( last argument for each of the
functions )?  It appears to be populated with the f

Re: [lustre-discuss] llapi documentation

2022-06-15 Thread John Bauer
()   poolNames[5]=nbp17.3_ssd members[0]=nbp17-OST0074_UUID
poollist_get()   poolNames[5]=nbp17.3_ssd members[1]=nbp17-OST0075_UUID
poollist_get()   poolNames[5]=nbp17.3_ssd members[2]=nbp17-OST0076_UUID
poollist_get()   poolNames[5]=nbp17.3_ssd members[3]=nbp17-OST0077_UUID
poollist_get()   poolNames[5]=nbp17.3_ssd members[4]=nbp17-OST0078_UUID
poollist_get()   poolNames[5]=nbp17.3_ssd members[5]=nbp17-OST0079_UUID
poollist_get()   poolNames[5]=nbp17.3_ssd members[6]=nbp17-OST007a_UUID
poollist_get()   poolNames[5]=nbp17.3_ssd members[7]=nbp17-OST007b_UUID
Pool: nbp17.4_hdd
poollist_get() poolNames[6]=nbp17.4_hdd 8 members buffer=nbp17-OST0018_UUID
poollist_get()   poolNames[6]=nbp17.4_hdd members[0]=nbp17-OST0018_UUID
poollist_get()   poolNames[6]=nbp17.4_hdd members[1]=nbp17-OST0019_UUID
poollist_get()   poolNames[6]=nbp17.4_hdd members[2]=nbp17-OST001a_UUID
poollist_get()   poolNames[6]=nbp17.4_hdd members[3]=nbp17-OST001b_UUID
poollist_get()   poolNames[6]=nbp17.4_hdd members[4]=nbp17-OST001c_UUID
poollist_get()   poolNames[6]=nbp17.4_hdd members[5]=nbp17-OST001d_UUID
poollist_get()   poolNames[6]=nbp17.4_hdd members[6]=nbp17-OST001e_UUID
poollist_get()   poolNames[6]=nbp17.4_hdd members[7]=nbp17-OST001f_UUID
Pool: nbp17.4_ssd
poollist_get() poolNames[7]=nbp17.4_ssd 8 members buffer=nbp17-OST007c_UUID
poollist_get()   poolNames[7]=nbp17.4_ssd members[0]=nbp17-OST007c_UUID
poollist_get()   poolNames[7]=nbp17.4_ssd members[1]=nbp17-OST007d_UUID
poollist_get()   poolNames[7]=nbp17.4_ssd members[2]=nbp17-OST007e_UUID
poollist_get()   poolNames[7]=nbp17.4_ssd members[3]=nbp17-OST007f_UUID
poollist_get()   poolNames[7]=nbp17.4_ssd members[4]=nbp17-OST0080_UUID
poollist_get()   poolNames[7]=nbp17.4_ssd members[5]=nbp17-OST0081_UUID
poollist_get()   poolNames[7]=nbp17.4_ssd members[6]=nbp17-OST0082_UUID
poollist_get()   poolNames[7]=nbp17.4_ssd members[7]=nbp17-OST0083_UUID
Pool: nbp17.hdd-pool
poollist_get() poolNames[8]=nbp17.hdd-pool -75 members 
buffer=nbp17-OST_UUID

Pool: nbp17.ssd-pool
poollist_get() poolNames[9]=nbp17.ssd-pool -75 members 
buffer=nbp17-OST0064_UUID


John

On 6/15/22 02:08, Andreas Dilger wrote:

On Jun 14, 2022, at 05:32, John Bauer  wrote:


I have had little success in my search for documentation on pool 
functions in llapi. I've looked in:


https://wiki.lustre.org/PFL2_High_Level_Design

https://doc.lustre.org/lustre_manual.xhtml#managingstripingfreespace


I'm looking for info on llapi_get_poollist() and 
llapi_get_poolmembers().  I've found the prototype in 
/usr/include/lustre/lustreapi.h, but that's about it.


Can anyone point me to some documentation?


it looks like there aren't man pages for these functions, just the 
function comment blocks

in the code, and their usage internally:

/**
 * Get the list of pools in a filesystem.
 * \param name        filesystem name or path
 * \param poollist    caller-allocated array of char*
 * \param list_size   size of the poollist array
 * \param buffer      caller-allocated buffer for storing pool names
 * \param buffer_size size of the buffer
 *
 * \return number of pools retrieved for this filesystem
 * \retval -error failure
 */
int llapi_get_poollist(const char *name, char **poollist, int list_size,
                       char *buffer, int buffer_size)

/**
 * Get the list of pool members.
 * \param poolname    string of format <fsname>.<poolname>
 * \param members     caller-allocated array of char*
 * \param list_size   size of the members array
 * \param buffer      caller-allocated buffer for storing OST names
 * \param buffer_size size of the buffer
 *
 * \return number of members retrieved for this pool
 * \retval -error failure
 */
int llapi_get_poolmembers(const char *poolname, char **members,
  int list_size, char *buffer, int buffer_size)

Patches to add llapi_get_poollist.3 and llapi_get_poolmembers.3 (and 
related) man
pages welcome. The pool related functions should probably be moved 
into a new

liblustreapi_pool.c file to reduce the size of liblustreapi.c.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud









[lustre-discuss] llapi documentation

2022-06-14 Thread John Bauer
I have had little success in my search for documentation on pool 
functions in llapi. I've looked in:


https://wiki.lustre.org/PFL2_High_Level_Design

https://doc.lustre.org/lustre_manual.xhtml#managingstripingfreespace


I'm looking for info on llapi_get_poollist() and 
llapi_get_poolmembers().  I've found the prototype in 
/usr/include/lustre/lustreapi.h, but that's about it.


Can anyone point me to some documentation?

Thanks,

John

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] llapi_layout_alloc

2022-06-06 Thread John Bauer
Is there any reason that a call to llapi_layout_alloc() would result 
in the following error message?


/nobackupp17/jbauer2/dd.out has no stripe info

In the code snippet below, I get the first print "on entry", followed by 
the above error message, but I don't get the print after the call to 
llapi_layout_alloc().


I'm using pointers to functions from dlopen()/dlsym() as this is in a 
library that has to run on non-Lustre systems.  I'm quite confident that 
it is pointing to llapi_layout_alloc.


fprintf(stderr,"%s() on entry\n",__func__);
   struct llapi_layout *layout = (*llapi_layout_alloc_fcn)();
fprintf(stderr,"%s() %p=llapi_layout_alloc()\n",__func__,layout);
   if( layout == NULL ) return -1 ;
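
For context, a minimal sketch of the dlopen()/dlsym() pattern described above (the 
library soname and the error handling are assumptions; the real struct llapi_layout 
comes from lustre/lustreapi.h):

#include <dlfcn.h>
#include <stdio.h>

struct llapi_layout;    /* opaque here; declared in <lustre/lustreapi.h> */

static struct llapi_layout *(*llapi_layout_alloc_fcn)(void);

/* Resolve llapi_layout_alloc at run time so the instrumentation library
 * still loads on systems without Lustre installed. */
static int resolve_llapi_layout_alloc(void)
{
    void *handle = dlopen("liblustreapi.so.1", RTLD_LAZY | RTLD_GLOBAL);

    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1;
    }
    llapi_layout_alloc_fcn = (struct llapi_layout *(*)(void))
        dlsym(handle, "llapi_layout_alloc");
    if (llapi_layout_alloc_fcn == NULL) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        return -1;
    }
    return 0;
}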
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Avoiding system cache when using ssd pfl extent

2022-05-19 Thread John Bauer

Pat,

No, not in  general.  It just seems that if one is storing data on an 
SSD it should be optional to have it not stored in memory ( why store in 
2 fast mediums ).


O_DIRECT is not of value as that would apply to all extents, whether on 
SSD on HDD.   O_DIRECT on Lustre has been problematic for me in the 
past, performance wise.


John

On 5/19/22 13:05, Patrick Farrell wrote:

No, and I'm not sure I agree with you at first glance.

Is this just generally an idea that data stored on SSD should not be 
in RAM?  If so, there's no mechanism for that other than using direct I/O.


-Patrick

*From:* lustre-discuss  on 
behalf of John Bauer 

*Sent:* Thursday, May 19, 2022 12:48 PM
*To:* lustre-discuss@lists.lustre.org 
*Subject:* [lustre-discuss] Avoiding system cache when using ssd pfl 
extent

When using PFL, and using an SSD as the first extent, it seems it would
be advantageous to not have that extent's file data consume memory in
the client's system buffers.  It would be similar to using O_DIRECT, but
on a per-extent basis.  Is there a mechanism for that already?

Thanks,

John

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Avoiding system cache when using ssd pfl extent

2022-05-19 Thread John Bauer
When using PFL, and using an SSD as the first extent, it seems it would 
be advantageous to not have that extent's file data consume memory in 
the client's system buffers.  It would be similar to using O_DIRECT, but 
on a per-extent basis.  Is there a mechanism for that already?


Thanks,

John

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] darshan-discuss

2022-04-28 Thread John Bauer
Since there seems to be considerable overlap between lustre and darshan 
users I thought I would ask here:  Is there an email list for darshan 
discussion analogous to lustre-discuss?


Thanks,

John

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre-discuss Digest, Vol 191, Issue 2

2022-02-03 Thread John Bauer
 The following loop in wdfile.f90 is pointless as the write happens 
only once for each rank.


Each rank is writing out the array once and then closing the file.  If 
the size of array 'data' is not a multiple of the Lustre stripe size 
there is going to be a lot of read-modify-write going on.


 do ii = 0, size
 if ( rank == ii ) then
    !start= MPI_Wtime()
    write(unit=iounit) data(1:nx, 1:ny, 1:nz)
    close(iounit)
    !finish = MPI_Wtime()
    !write(6,'(i5,f7.4)') rank, finish - start
 else
 end if
  end do

On 2/3/2022 9:39 AM, lustre-discuss-requ...@lists.lustre.org wrote:


Today's Topics:

1. RE-Fortran IO problem (Bertini, Denis Dr.)
2. Re: RE-Fortran IO problem (Patrick Farrell)
3. Re: RE-Fortran IO problem (Bertini, Denis Dr.)


--

Message: 1
Date: Thu, 3 Feb 2022 12:43:21 +
From: "Bertini, Denis Dr."
To:"lustre-discuss@lists.lustre.org"

Subject: [lustre-discuss] RE-Fortran IO problem
Message-ID:
Content-Type: text/plain; charset="iso-8859-1"

Hi,


Just as an add-on to my previous mail, the problem shows up also

with Intel Fortran and is not specific to the GNU Fortran compiler.

So it seems to be linked to how the Fortran I/O is handled, which

seems to be sub-optimal in the case of a Lustre filesystem.


I would be grateful if someone can confirm/disconfirm that.


Here again the access to the code i used for my benchmarks:


https://git.gsi.de/hpc/cluster/ci_ompi/-/tree/main/f/src


Best,

Denis


-
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail:d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz

--

Message: 2
Date: Thu, 3 Feb 2022 15:15:16 +
From: Patrick Farrell
To: "Bertini, Denis Dr.",
"lustre-discuss@lists.lustre.org"  
Subject: Re: [lustre-discuss] RE-Fortran IO problem
Message-ID:



Content-Type: text/plain; charset="utf-8"

Denis,

FYI, the git link you provided seems to be non-public - it asks for a GSI login.

Fortran is widely used for applications on Lustre, so it's unlikely to be a 
Fortran-specific issue.  If you're seeing I/O rates drop suddenly during 
activity, rather than being reliably low for some particular operation, I would 
look to the broader Lustre system.  It may be suddenly extremely busy or there 
could be, e.g., a temporary network issue.  Assuming this is a system belonging 
to your institution, I'd check with your admins.

Regards,
Patrick

From: lustre-discuss  on behalf of Bertini, 
Denis Dr.
Sent: Thursday, February 3, 2022 6:43 AM
To:lustre-discuss@lists.lustre.org  
Subject: [lustre-discuss] RE-Fortran IO problem


Hi,


Just as an add-on to my previous mail, the problem shows up also

with Intel Fortran and is not specific to the GNU Fortran compiler.

So it seems to be linked to how the Fortran I/O is handled, which

seems to be sub-optimal in the case of a Lustre filesystem.


I would be grateful if someone can confirm/disconfirm that.


Here again the access to the code i used for my benchmarks:


https://git.gsi.de/hpc/cluster/ci_ompi/-/tree/main/f/src


Best,

Denis


-
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail:d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz

Re: [lustre-discuss] varying sequential read performance.

2018-04-05 Thread John Bauer

Rick,

Thanks for the reply.  Also thanks to Patrick Farrell for making me rethink 
this.


I am coming to believe that it is an OSS issue.  Every time I run this 
job, the first pass of dd is slow, which I now attribute to all the OSSs
needing to initially read the data in from disk to OSS cache.  If the 
subsequent passes of dd get back soon enough I then observe good 
performance.

If not, performance goes back to initial rates.

Inspecting the individual wait times for each of the dd reads for one of 
the poor-performing dd passes, and correlating them to the OSC that 
fulfills each individual dd read, I see that 75% of the wait time is from 
a single OSC.  I suspect that this OSC is using an OSS that is under 
heavier load.

I don't have access to the OSS so I can't report on the Lustre settings.  
I think the client-side max cached is 50% of memory.


After speaking with Doug Petesch of Cray, I thought I would look into 
NUMA effects on this job.  I now also monitor the contents of

/sys/devices/system/node/node?/meminfo
and ran the job with numactl --cpunodebind=0.
Interestingly enough, I now sometimes get dd transfer rates of 
2.2GiB/s.  Plotting the .../node?/meminfo[FilePages] value versus time 
for the 2 cpunodes shows that the
data is now mostly placed on node0.  Unfortunately, the variable rates 
still remain, as one would expect if it is an OSS caching issue, but the 
poor performance is also better.


Resulting plot with all dd passes run with numactl --cpunodebind=0


Resulting plots with numactl --cpunodebind=x, where x alternates 
between 0 and 1 for each subsequent dd pass.  And indeed, the file pages 
migrate between cpunodes.




On 4/5/2018 9:22 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:

John,

I had a couple of thoughts (though not sure if they are directly relevant to 
your performance issue):

1) Do you know what caching settings are applied on the lustre servers?  This 
could have an impact on performance, especially if your tests are being run 
while others are doing IO on the system.

2) It looks like there is a parameter called llite..max_cached_mb that 
controls how much client side data is cached.  According to the manual, the default 
value is 3/4 of the host’s RAM (which would be 48GB in your case).  I don’t know why 
the cache seems to be used unevenly between your 4 OSTs, but it might explain why the 
cache for some OSTs decrease when others increase.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



On Apr 2, 2018, at 8:06 PM, John Bauer <bau...@iodoctors.com> wrote:

I am running dd 10 times consecutively to  read a 64GB file ( stripeCount=4 
stripeSize=4M ) on a Lustre client(version 2.10.3) that has 64GB of memory.
The client node was dedicated.

for pass in 1 2 3 4 5 6 7 8 9 10
do
dd of=/dev/null if=${file} count=128000 bs=512K
done

Instrumentation of the I/O from dd reveals varying performance.  In the plot 
below, the bottom frame has wall time
on the X axis, and file position of the dd reads on the Y axis, with a dot 
plotted at the wall time and starting file position of every read.
The slopes of the lines indicate the data transfer rate, which vary from 
475MB/s to 1.5GB/s.  The last 2 passes have sharp breaks
in the performance, one with increasing performance, and one with decreasing 
performance.

The top frame indicates the amount of memory used by each of the file's 4 OSCs 
over the course of the 10 dd runs.  Nothing terribly odd here except that
one of the OSC's eventually has its entire stripe ( 16GB ) cached and then 
never gives any up.

I should mention that the file system has 320 OSTs.  I found LU-6370 which 
eventually started discussing LRU management issues on systems with high
numbers of OST's leading to reduced RPC sizes.

Any explanations for the varying performance?
Thanks,
John


--
I/O Doctors, LLC
507-766-0378

bau...@iodoctors.com
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] varying sequential read performance.

2018-04-03 Thread John Bauer

Colin

Since I do not have root privileges on the system, I do not have access 
to dropcache.  So, no, I do not flush cache between the dd runs.  The 10 
dd runs were done in a single
job submission and the scheduler does dropcache between jobs, so the 
first of the dd passes does start with a virgin cache.  What strikes me 
odd about this is the first dd
run is the slowest and obviously must read all the data from the OSSs, 
which is confirmed by the plot I have added to the top, which indicates 
the total amount of data moved
via lnet during the life of each dd process.  Notice that the second dd 
run, which lnetstats indicates also moves the entire 64 GB file from the 
OSSs, is 3 times faster, and has
to work with a non-virgin cache.  Runs 4 through 10 all move only 48GB 
via lnet because one of the OSCs keeps its entire 16GB that is needed in 
cache across all the runs.
Even with the significant advantage that runs 4-10 have, you could never 
tell in the dd results.  Run 5 is slightly faster than run 2, and run 7 
is as slow as run 0.


John




On 4/3/2018 12:20 AM, Colin Faber wrote:

Are you flushing cache between test runs?

On Mon, Apr 2, 2018, 6:06 PM John Bauer <bau...@iodoctors.com> wrote:


I am running dd 10 times consecutively to  read a 64GB file (
stripeCount=4 stripeSize=4M ) on a Lustre client(version 2.10.3)
that has 64GB of memory.
The client node was dedicated.

for pass in 1 2 3 4 5 6 7 8 9 10
do
   dd of=/dev/null if=${file} count=128000 bs=512K
done
Instrumentation of the I/O from dd reveals varying performance. 
In the plot below, the bottom frame has wall time
on the X axis, and file position of the dd reads on the Y axis,
with a dot plotted at the wall time and starting file position of
every read.
The slopes of the lines indicate the data transfer rate, which
vary from 475MB/s to 1.5GB/s.  The last 2 passes have sharp breaks
in the performance, one with increasing performance, and one with
decreasing performance.

The top frame indicates the amount of memory used by each of the
file's 4 OSCs over the course of the 10 dd runs. Nothing terribly
odd here except that
one of the OSC's eventually has its entire stripe ( 16GB ) cached
and then never gives any up.

I should mention that the file system has 320 OSTs.  I found
LU-6370 which eventually started discussing LRU management issues
on systems with high
numbers of OST's leading to reduced RPC sizes.

Any explanations for the varying performance?
Thanks,
John

-- 
I/O Doctors, LLC

507-766-0378
bau...@iodoctors.com <mailto:bau...@iodoctors.com>

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
<mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] varying sequential read performance.

2018-04-02 Thread John Bauer
I am running dd 10 times consecutively to  read a 64GB file ( 
stripeCount=4 stripeSize=4M ) on a Lustre client(version 2.10.3) that 
has 64GB of memory.

The client node was dedicated.

for pass in 1 2 3 4 5 6 7 8 9 10
do
   dd of=/dev/null if=${file} count=128000 bs=512K
done
Instrumentation of the I/O from dd reveals varying performance.  In the 
plot below, the bottom frame has wall time
on the X axis, and file position of the dd reads on the Y axis, with a 
dot plotted at the wall time and starting file position of every read.
The slopes of the lines indicate the data transfer rate, which vary from 
475MB/s to 1.5GB/s.  The last 2 passes have sharp breaks
in the performance, one with increasing performance, and one with 
decreasing performance.


The top frame indicates the amount of memory used by each of the file's 
4 OSCs over the course of the 10 dd runs.  Nothing terribly odd here 
except that
one of the OSC's eventually has its entire stripe ( 16GB ) cached and 
then never gives any up.


I should mention that the file system has 320 OSTs.  I found LU-6370 
which eventually started discussing LRU management issues on systems 
with high

numbers of OST's leading to reduced RPC sizes.

Any explanations for the varying performance?
Thanks,
John

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] sudden read performance drop on sequential forward read.

2017-08-31 Thread John Bauer

All,

I have an application that writes a 100GB file forwards, and then begins 
a sequence of reading a 70 GB section of the file forwards and 
backwards. At some point in the run,
not always at the same point, the read performance degrades 
significantly.  The initial forward reads are about 1.3 GB/s.  The 
backwards reads about 300 MB/s.  In an instant,
the forward read performance drops to 2.8 MB/s.  From about 250 seconds 
on, this is the only file that is being read or written by the 
application, running on a dedicated client node.
The file has a stripe count of 4, and stripe size of 512KB.    If the 
stripe count is changed to 1, this behavior does not present itself.  
The cpu usage is minimal during the period of degraded performance.
The LNET traffic is also about 2.8 MB/s during the period of degraded 
performance.  The system has 64GB of memory, meaning Lustre can not 
cache the entire 70GB active set of the file that is being read.

The Lustre client version is 2.9.0.

Any ideas what could be causing this?  What should I be watching in the 
/proc/fs/lustre file system to find some clues?


The behavior is depicted in the image below, which shows the file 
position as a function of wall clock time.  The writes and reads are of 
size 512KB.


Thanks,

John



--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] question about /proc/fs/lustre/osc/ and llapi functions.

2017-04-12 Thread John Bauer

Andreas

First, thanks for the response.  Second, I looked at that at one time, 
as it seemed the logical answer, but I must have misread/mistyped something.  
My apologies.


Thanks again
John
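
For reference, a hedged sketch of what that enables (llapi_getname() filling a 
"fsname-instance" string; splitting on the first '-' assumes the fsname itself 
contains no dash, and the /proc path layout matches the clients discussed here):

#include <stdio.h>
#include <string.h>
#include <lustre/lustreapi.h>

/* Sketch: build "/proc/fs/lustre/osc/<fsname>-OST<idx>-osc-<instance>"
 * for a file and an OST index, instead of scanning the osc directory. */
static int osc_proc_dir(const char *file, int ost_idx,
                        char *out, size_t outlen)
{
    char name[128];
    char *dash;

    if (llapi_getname(file, name, sizeof(name)) < 0)
        return -1;

    dash = strchr(name, '-');          /* "<fsname>-<instance>" */
    if (dash == NULL)
        return -1;
    *dash = '\0';

    snprintf(out, outlen, "/proc/fs/lustre/osc/%s-OST%04x-osc-%s",
             name, ost_idx, dash + 1);
    return 0;
}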

On 4/12/2017 3:34 AM, Dilger, Andreas wrote:

On Apr 7, 2017, at 18:06, John Bauer <bau...@iodoctors.com> wrote:

In /proc/fs/lustre/osc/ is an entry for every OSC of all the Lustre file 
systems on a client node of the form nbp9-OST0124-osc-880ffaa4bc00

My I/O instrumentation library tracks OSC's associated with an application file 
by reading the files in the directory for each OSC the application file is 
striped on.
I am trying to avoid the use of opendir() and readdir() to find the OSC entry 
of interest as there are 1284 entries in the lustre/osc directory, and I do 
this for hundreds of application files on hundreds of ranks.
I would like to generate the names of the osc entries I need, on the fly, 
given the discoverable file system name and osc indices.

I can find the file system name, nbp9, with llapi_getname().  I can generate 
the OST part with the osc index from llapi_get_stripe().
My question is, where do I find the 880ffaa4bc00 that is part of the 
directory entry for each OSC?  The value would appear to be file system
related, as each OSC associated with a given file system has the same value.  
OSC's of a different file system have a different value. Is there an 
llapi_() call that will get me this value.


There is llapi_getname() which will return the fsname and the instance 
identifier.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation









--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] question about /proc/fs/lustre/osc/ and llapi functions.

2017-04-07 Thread John Bauer
In /proc/fs/lustre/osc/ is an entry for every OSC of all the Lustre 
file systems on a client node of the form 
nbp9-OST0124-osc-880ffaa4bc00


My I/O instrumentation library tracks OSC's associated with an 
application file by reading the files in the directory for each OSC the 
application file is striped on.
I am trying to avoid the use of opendir() and readdir() to find the OSC 
entry of interest as there are 1284 entries in the lustre/osc directory, 
and I do this for hundreds of application files on hundreds of ranks.
I would like to generate the names of the osc entries I need, on the 
fly, given the discoverable file system name and osc indices.


I can find the file system name, nbp9, with llapi_getname().  I can 
generate the OST part with the osc index from llapi_get_stripe().
My question is, where do I find the 880ffaa4bc00 that is part of 
the directory entry for each OSC?  The value would appear to be file system
related, as each OSC associated with a given file system has the same 
value.  OSC's of a different file system have a different value.

Is there an llapi_() call that will get me this value?

Thanks

John


--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre OSC and system cache

2016-12-12 Thread John Bauer

Andreas

The file system has lru_max_age=900.  I have been googling around to 
find out what this controls, but haven't found much.  Is there 
documentation on how the memory management works with Lustre?  I wonder 
what the lru actually means.  How is it that 2 files on the same node 
are not controlled by the same lru mechanism, as SCR300's pages are 
being lru'ed out when they are clearly used more recently than any in 
SCRATCH?


Thanks

John


On 12/12/2016 6:59 PM, Dilger, Andreas wrote:

On Dec 12, 2016, at 15:50, John Bauer <bau...@iodoctors.com> wrote:

I'm observing some undesirable caching of OSC data in the system buffers.  This 
is a single node, single process application.  There are 2 files of interest, 
SCRATCH and SCR300,  both are scratch files with stripeCount=4.  The system has 
128GB of memory.  Lustre maxes out at about 59GB of memory used for caching.
SCRATCH,  About 22GB is written/read during the first 300 seconds of the run.  
No further activity to the file ( but remains open ) until about 18,700 seconds 
into the run when another 22GB is written/read.  Illustrated in the top frame 
of the first plot below.  In the bottom frame of the first plot is the amount 
of system cache used by each of the 4 OSC's associated with the file over the 
course of the run ( nearly identical, as would be expected ).  Note that each 
the OSC's retains its 5.5GB of memory even though nothing is happening to the 
file.
SCR300,  A 110GB file, written and repeatedly read between the times of the 
above SCRATCH file's I/O.

What is of interest is that while SCR300 is doing all its I/O, and its 
associated OSC's are fighting each other for caching memory, the 4 OSC's for 
the inactive file(SCRATCH) retain their 22GB of memory.  Why are the 4 OSC's 
for the inactive file exempt from giving up their memory?  It is very 
reproducible.

You don't mention what Lustre version you are using, which makes it hard
to comment specifically.  That said, you could try reducing the lock LRU
age, which was changed by default in the 2.8 or 2.9 release to 3900s
(65 minutes) instead of 36000s (10h) via:

 lctl set_param ldlm.namespaces.*.lru_max_age=390

(though check what your current setting is, since the units are in
"jiffies" (HZ) and that may differ depending on kernel compile options).

Cheers, Andreas


The application is MSC.Nastran, which has the capability to put the data for 
SCR300 inside of SCRATCH(increasing its size to 132GB).  If run in this mode, 
the caching behavior is much better behaved and the job runs in 11,500 seconds, 
versus 19,000.  Illustrated in 3rd plot below.  While this is a solution for 
this case, it is not a general solution.

Thanks

John
Plots for SCRATCH



Plots for SCR300




Plots for SCR300 inside of SCRATCH


--
I/O Doctors, LLC
507-766-0378

bau...@iodoctors.com
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre OSC and system cache

2016-12-12 Thread John Bauer

Andreas

Realized I forgot the version after I sent the email.  From 
/proc/fs/lustre/version


build: 
2.5.2-trunk-1.0502.20758.2.7-abuild-RB-5.2UP04_2.5.2@20758-2015-09-01-23:29


I'll get in touch with the admin on the lru_max_age.

Thanks

John


On 12/12/2016 6:59 PM, Dilger, Andreas wrote:

On Dec 12, 2016, at 15:50, John Bauer <bau...@iodoctors.com> wrote:

I'm observing some undesirable caching of OSC data in the system buffers.  This 
is a single node, single process application.  There are 2 files of interest, 
SCRATCH and SCR300,  both are scratch files with stripeCount=4.  The system has 
128GB of memory.  Lustre maxes out at about 59GB of memory used for caching.
SCRATCH,  About 22GB is written/read during the first 300 seconds of the run.  
No further activity to the file ( but remains open ) until about 18,700 seconds 
into the run when another 22GB is written/read.  Illustrated in the top frame 
of the first plot below.  In the bottom frame of the first plot is the amount 
of system cache used by each of the 4 OSC's associated with the file over the 
course of the run ( nearly identical, as would be expected ).  Note that each 
the OSC's retains its 5.5GB of memory even though nothing is happening to the 
file.
SCR300,  A 110GB file, written and repeatedly read between the times of the 
above SCRATCH file's I/O.

What is of interest is that while SCR300 is doing all its I/O, and its 
associated OSC's are fighting each other for caching memory, the 4 OSC's for 
the inactive file(SCRATCH) retain their 22GB of memory.  Why are the 4 OSC's 
for the inactive file exempt from giving up their memory?  It is very 
reproducible.

You don't mention what Lustre version you are using, which makes it hard
to comment specifically.  That said, you could try reducing the lock LRU
age, which was changed by default in the 2.8 or 2.9 release to 3900s
(65 minutes) instead of 36000s (10h) via:

 lctl set_param ldlm.namespaces.*.lru_max_age=390

(though check what your current setting is, since the units are in
"jiffies" (HZ) and that may differ depending on kernel compile options).

Cheers, Andreas


The application is MSC.Nastran, which has the capability to put the data for 
SCR300 inside of SCRATCH(increasing its size to 132GB).  If run in this mode, 
the caching behavior is much better behaved and the job runs in 11,500 seconds, 
versus 19,000.  Illustrated in 3rd plot below.  While this is a solution for 
this case, it is not a general solution.

Thanks

John
Plots for SCRATCH



Plots for SCR300




Plots for SCR300 inside of SCRATCH


--
I/O Doctors, LLC
507-766-0378

bau...@iodoctors.com
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] lustre OSC and system cache

2016-12-12 Thread John Bauer
I'm observing some undesirable caching of OSC data in the system 
buffers.  This is a single node, single process application. There are 2 
files of interest, SCRATCH and SCR300, both are scratch files with 
stripeCount=4.  The system has 128GB of memory.  Lustre maxes out at 
about 59GB of memory used for caching.


SCRATCH,  About 22GB is written/read during the first 300 seconds of 
the run.  No further activity to the file ( but remains open ) until 
about 18,700 seconds into the run when another 22GB is written/read.  
Illustrated in the top frame of the first plot below.  In the bottom 
frame of the first plot is the amount of system cache used by each of 
the 4 OSC's associated with the file over the course of the run ( nearly 
identical, as would be expected ).  Note that each the OSC's retains its 
5.5GB of memory even though nothing is happening to the file.


SCR300,  A 110GB file, written and repeatedly read between the times 
of the above SCRATCH file's I/O.


What is of interest is that while SCR300 is doing all its I/O, and its 
associated OSC's are fighting each other for caching memory, the 4 OSC's 
for the inactive file(SCRATCH) retain their 22GB of memory.  Why are the 
4 OSC's for the inactive file exempt from giving up their memory?  It is 
very reproducible.


The application is MSC.Nastran, which has the capability to put the data 
for SCR300 inside of SCRATCH(increasing its size to 132GB).  If run in 
this mode, the caching behavior is much better behaved and the job runs 
in 11,500 seconds, versus 19,000. Illustrated in 3rd plot below.  While 
this is a solution for this case, it is not a general solution.


Thanks

John

Plots for SCRATCH



Plots for SCR300




Plots for SCR300 inside of SCRATCH


--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre striping and MPI

2016-10-28 Thread John Bauer

Andreas

Thanks for the reply.  I should have clarified that my setting 
stripe_count=rank was purely for debugging purposes so I could tell 
which of the 4 ranks actually did set the striping when the test case 
failed. Normally, the stripe_count is user selectable, and would be the 
same for all ranks.  I was hoping that the first rank to get to the 
open/set stripe would do what's needed and the later arriving ranks 
would just open the already existing striped file.  It doesn't matter 
which rank gets there first as they all would be requesting the same 
striping.


There are several reasons that llapi_file_open() does not satisfy my 
needs.  Most notably, when my I/O library intercepts ( using LD_PRELOAD 
) functions such as the mkstemps() family, and some of the stdio opens, 
I can't necessarily replicate the open that would have occurred.  This 
has been discussed at length already on lustre-discuss.


Thanks again,

John


On 10/26/2016 5:54 PM, Dilger, Andreas wrote:


Doing all of the ioctl() handling directly in your application is not 
a great idea, as that will not allow using a bunch of new features 
that are in the pipeline (e.g. progressive file layouts, file level 
redundancy, etc).  It would be a lot better to use the provided 
llapi_file_create() or llapi_layout_*() to isolate your application 
from the underlying implementation of how the file layout is set.


Specifics about your implementation:

- it is only possible to set the layout on a file once, when it is 
first created, so doing this from multiple threads for a single shared 
file is broken.  You should do that only from rank 0.


- it is possible to create a separate file for each thread/rank, but 
you probably don't want to set the stripe *count* == rank for each 
file.  it doesn't make sense to create a bunch of different files for 
the same application, each one with a different stripe count.  You 
probably meant to set the stripe_offset == rank so that the load is 
spread evenly across all OSTs?


- as a caveat for the above, specifying the OST index directly == rank 
can cause problems, compared to just allowing the MDT to select the 
OST indices for each file itself.  If num_ranks < ost_count then only 
the first num_ranks OSTs would ever be used, and space usage on the 
OSTs would be imbalanced. Also, if some OST is offline or overloaded 
your application would not be able to create new files, while this can 
be avoided by allowing the MDT to select the OST index for each file.  
With one file per rank it is best to use stripe_count = 1 for all 
files, since you already have parallelism at the application level.


Cheers, Andreas

--

Andreas Dilger

Lustre Principal Architect

Intel High Performance Data Division

On 2016/10/26, 06:51, "John Bauer" <bau...@iodoctors.com> wrote:


All

I am running a 4 rank MPI job where all the ranks do an open of the 
file, attempt to set the striping with ioctl() and then do a small 
write. Intermittently, I get errors on the write() and ioctl().  This 
is a synthetic test case, boiled down from a much larger real world 
job.  Note that I set the stripe_count to rank+1 so I can tell which 
of the ranks actually set the striping.


I have determined that I only get the write failure when the ioctl 
also failed with "No data available".  It also strikes me that at 
most, only one rank reports "File exists".  With a 4 rank job, I would 
think that normal behavior would be 1 rank would work as expected ( no 
error ) and the other 3 would report file exists.


Is this expected behavior?

rank=1 doIO() -1=ioctl(fd=9) No data available
rank=1 doIO() -1=write(fd=9) Bad file descriptor
rank=3 doIO() -1=ioctl(fd=9) File exists

oflags = O_CREAT|O_TRUNC|O_RDWR

void
doIO(const char *fileName, int rank){
int status ;
   int fd=open(fileName, O_RDWR|O_TRUNC|O_CREAT|O_LOV_DELAY_CREATE, 
0640 ) ;

   if( fd < 0 ) return ;

   struct lov_user_md opts = {0};
   opts.lmm_magic = LOV_USER_MAGIC;
   opts.lmm_stripe_size= 1048576;
   opts.lmm_stripe_offset  = -1 ;
   opts.lmm_stripe_count   = rank+1 ;
   opts.lmm_pattern= 0 ;

   status = ioctl ( fd , LL_IOC_LOV_SETSTRIPE, &opts );
   if(status<0)fprintf(stderr,"rank=%d %s() %d=ioctl(fd=%d) 
%s\n",rank,__func__,status,fd,strerror(errno));


   char *string = "this is it\n" ;
   int nc = strlen(string) ;
   status = write( fd, string, nc ) ;
   if( status != nc ) fprintf(stderr,"rank=%d %s() %d=write(fd=%d) 
%s\n",rank,__func__,status,fd,status<0?strerror(errno):"");

   status = close(fd) ;
   if(status<0)fprintf(stderr,"rank=%d %s() %d=close(fd=%d) 
%s\n",rank,__func__,status,fd,strerror(errno));

}

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com


--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre striping and MPI

2016-10-26 Thread John Bauer

All

I am running a 4 rank MPI job where all the ranks do an open of the file, 
attempt to set the striping with ioctl() and then do a small write.  
Intermittently, I get errors on the write() and ioctl(). This is a 
synthetic test case, boiled down from a much larger real world job.  
Note that I set the stripe_count to rank+1 so I can tell which of the 
ranks actually set the striping.


I have determined that I only get the write failure when the ioctl also 
failed with "No data available".  It also strikes me that at most, only 
one rank reports "File exists".  With a 4 rank job, I would think that 
normal behavior would be 1 rank would work as expected ( no error ) and 
the other 3 would report file exists.


Is this expected behavior?

rank=1 doIO() -1=ioctl(fd=9) No data available
rank=1 doIO() -1=write(fd=9) Bad file descriptor
rank=3 doIO() -1=ioctl(fd=9) File exists

oflags = O_CREAT|O_TRUNC|O_RDWR

void
doIO(const char *fileName, int rank){
int status ;
   int fd=open(fileName, O_RDWR|O_TRUNC|O_CREAT|O_LOV_DELAY_CREATE, 
0640 ) ;

   if( fd < 0 ) return ;

   struct lov_user_md opts = {0};
   opts.lmm_magic = LOV_USER_MAGIC;
   opts.lmm_stripe_size= 1048576;
   opts.lmm_stripe_offset  = -1 ;
   opts.lmm_stripe_count   = rank+1 ;
   opts.lmm_pattern= 0 ;

   status = ioctl ( fd , LL_IOC_LOV_SETSTRIPE, &opts );
   if(status<0)fprintf(stderr,"rank=%d %s() %d=ioctl(fd=%d) 
%s\n",rank,__func__,status,fd,strerror(errno));


   char *string = "this is it\n" ;
   int nc = strlen(string) ;
   status = write( fd, string, nc ) ;
   if( status != nc ) fprintf(stderr,"rank=%d %s() %d=write(fd=%d) 
%s\n",rank,__func__,status,fd,status<0?strerror(errno):"");

   status = close(fd) ;
   if(status<0)fprintf(stderr,"rank=%d %s() %d=close(fd=%d) 
%s\n",rank,__func__,status,fd,strerror(errno));

}
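
Not shown in the thread, but one way to avoid the create/setstripe race 
entirely is to let a single rank create the file and set the layout, and 
have the remaining ranks open the existing file without O_CREAT once the 
layout is in place.  A minimal sketch along those lines, assuming MPI is 
available and reusing the same lov_user_md setup as above (the function 
name and the fixed stripe count of 4 are placeholders):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <mpi.h>
#include <lustre/lustre_user.h>   /* O_LOV_DELAY_CREATE, LL_IOC_LOV_SETSTRIPE */

/* Hypothetical variant of doIO(): rank 0 alone creates the file and sets the
 * layout; the other ranks wait at a barrier and open the existing file. */
void
doIO_rank0_creates(const char *fileName, int rank)
{
    int fd = -1;

    if (rank == 0) {
        fd = open(fileName, O_RDWR|O_TRUNC|O_CREAT|O_LOV_DELAY_CREATE, 0640);
        if (fd >= 0) {
            struct lov_user_md opts = {0};
            opts.lmm_magic         = LOV_USER_MAGIC;
            opts.lmm_stripe_size   = 1048576;
            opts.lmm_stripe_offset = -1;
            opts.lmm_stripe_count  = 4;          /* placeholder stripe count */
            if (ioctl(fd, LL_IOC_LOV_SETSTRIPE, &opts) < 0)
                fprintf(stderr, "rank=%d setstripe: %s\n", rank, strerror(errno));
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);   /* layout exists before the other ranks open */

    if (rank != 0)
        fd = open(fileName, O_RDWR, 0640);       /* no O_CREAT, so no race */

    if (fd < 0)
        return;

    const char *string = "this is it\n";
    if (write(fd, string, strlen(string)) < 0)
        fprintf(stderr, "rank=%d write: %s\n", rank, strerror(errno));
    close(fd);
}

Whether that is acceptable depends on the application; it sidesteps the 
race rather than explaining the errors above.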

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre on ZFS poor direct I/O performance

2016-10-14 Thread John Bauer
Patrick
I thought at one time there was an inode lock held for the duration of a 
direct I/O read or write, so that even if one had multiple application threads 
writing direct, only one was "in flight" at a time. Has that changed?
John

Sent from my iPhone

> On Oct 14, 2016, at 3:16 PM, Patrick Farrell  wrote:
> 
> Sorry, I phrased one thing wrong:
> I said "transferring to the network", but it's actually until the client has 
> received confirmation that the data was received successfully, I believe.
> 
> In any case, only one I/O (per thread) can be outstanding at a time with 
> direct I/O.
>  
> From: lustre-discuss  on behalf of 
> Patrick Farrell 
> Sent: Friday, October 14, 2016 3:12:22 PM
> To: Riccardo Veraldi; lustre-discuss@lists.lustre.org
> Subject: Re: [lustre-discuss] Lustre on ZFS poor direct I/O performance
>  
> Riccardo,
> 
> While the difference is extreme, direct I/O write performance will always be 
> poor.  Direct I/O writes cannot be asynchronous, since they don't use the 
> page cache.  This means Lustre cannot return from one write (and start the 
> next) until it has finished transferring the data to the network.
> 
> This means you can only have one I/O in flight at a time.  Good write 
> performance from Lustre (or any network filesystem) depends on keeping a lot 
> of data in flight at once.
> 
> What sort of direct write performance were you hoping for?  It will never 
> match that 800 MB/s from one thread you see with buffered I/O.
> 
> - Patrick
>  
> From: lustre-discuss  on behalf of 
> Riccardo Veraldi 
> Sent: Friday, October 14, 2016 2:22:32 PM
> To: lustre-discuss@lists.lustre.org
> Subject: [lustre-discuss] Lustre on ZFS poor direct I/O performance
>  
> Hello,
> 
> I would like to know how I may improve the situation of my Lustre cluster.
> 
> I have 1 MDS and 1 OSS with 20 OST defined.
> 
> Each OST is a 8x Disks RAIDZ2.
> 
> A single process write performance is around 800MB/sec
> 
> anyway if I force direct I/O, for example using oflag=direct in dd, the 
> write performance drop as low as 8MB/sec
> 
> with 1MB block size. And each write it's about 120ms latency.
> 
> I used these ZFS settings
> 
> options zfs zfs_prefetch_disable=1
> options zfs zfs_txg_history=120
> options zfs metaslab_debug_unload=1
> 
> I am quite worried about the low performance.
> 
> Any hints or suggestions that may help me to improve the situation ?
> 
> 
> thank you
> 
> 
> Rick
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
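
To illustrate the point made in the thread above that a single direct-I/O 
writer keeps only one request in flight, here is a minimal sketch (not from 
the thread) that gives each of several pthreads its own file descriptor, 
disjoint offset range, and aligned buffer, so that several O_DIRECT writes 
are outstanding at once.  The 4096-byte alignment, the block size, and the 
thread count are assumptions to adjust for the target filesystem:

#define _GNU_SOURCE           /* O_DIRECT */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS   4
#define BLOCK      (1024*1024)          /* 1 MiB per write */
#define PER_THREAD (256LL*BLOCK)        /* 256 MiB per thread */

static const char *path;

static void *writer(void *arg)
{
    long id = (long)arg;
    int fd = open(path, O_WRONLY|O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return NULL; }

    void *buf;
    if (posix_memalign(&buf, 4096, BLOCK)) { close(fd); return NULL; }
    memset(buf, 'x', BLOCK);

    off_t off = id * PER_THREAD;        /* disjoint region per thread */
    for (long long done = 0; done < PER_THREAD; done += BLOCK)
        if (pwrite(fd, buf, BLOCK, off + done) != BLOCK) { perror("pwrite"); break; }

    free(buf);
    close(fd);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    path = argv[1];

    /* create/truncate the file once, then let the threads write their regions */
    int fd = open(path, O_WRONLY|O_CREAT|O_TRUNC, 0644);
    if (fd < 0) { perror("create"); return 1; }
    close(fd);

    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, writer, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

Each thread still writes synchronously, but with four of them the server sees 
four RPCs in flight, which is usually enough to tell whether the 8MB/sec 
figure is a concurrency limit or something else.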


[lustre-discuss] fiemap problems

2016-07-22 Thread John Bauer
I am experiencing an intermittent problem with fiemap on Lustre. This is 
running on Pleiades at NASA Ames.


lustre: 2.7.1
kernel: 3.0.101-68.1.20160209-nasa
build:  2.7.1-3nasC_mofed31v5

I create a file with *dd if=/dev/zero of=${FILE} count=100 bs=1M* and 
then run my program to do the ioctl call to get the fiemap. About 1 out 
of 10 times the contents of the fiemap has an extra extent in it. The 
bad extent info is only generated when run immediately after the file is 
created by dd.  All subsequent runs of my simple program report the 
correct extent info.  Any ideas?


Thanks, John

incorrect immediately after creation
listExtents() fe_device=132 fe_length=2097152
listExtents() fe_device=132 fe_length=12582912
listExtents() fe_device=105 fe_length=8388608
listExtents() fe_device=105 fe_length=6291456
listExtents() fe_device=36 fe_length=12582912
listExtents() fe_device=48 fe_length=4194304
listExtents() fe_device=48 fe_length=8388608
listExtents() fe_device=47 fe_length=2097152
listExtents() fe_device=47 fe_length=10485760
listExtents() fe_device=201 fe_length=1048576
listExtents() fe_device=201 fe_length=11534336
*listExtents() fe_device=59 fe_length=4194304
listExtents() fe_device=59 fe_length=6291456
listExtents() fe_device=59 fe_length=8388608*
listExtents() fe_device=218 fe_length=12582912
listExtents() nMapped=15 byteCount=49056

correct: all subsequent runs

listExtents() fe_device=132 fe_length=2097152
listExtents() fe_device=132 fe_length=12582912
listExtents() fe_device=105 fe_length=8388608
listExtents() fe_device=105 fe_length=6291456
listExtents() fe_device=36 fe_length=12582912
listExtents() fe_device=48 fe_length=4194304
listExtents() fe_device=48 fe_length=8388608
listExtents() fe_device=47 fe_length=2097152
listExtents() fe_device=47 fe_length=10485760
listExtents() fe_device=201 fe_length=1048576
listExtents() fe_device=201 fe_length=11534336
*listExtents() fe_device=59 fe_length=4194304
listExtents() fe_device=59 fe_length=8388608*
listExtents() fe_device=218 fe_length=12582912
listExtents() nMapped=14 byteCount=104857600



  int rc = ioctl(info->fd, FS_IOC_FIEMAP, (unsigned long) fiemap);
  if (rc < 0) {
 fprintf(stderr,"ioctl(FS_IOC_FIEMAP) error %s\n",strerror(errno));
 return -1 ;
  }

  listExtents( fiemap ) ;


int
listExtents( struct fiemap *fiemap ){
   int i ;
   int nMapped = fiemap->fm_mapped_extents ;
   long long byteCount = 0 ;
   for(i=0;i<nMapped;i++){
      struct fiemap_extent *cur = fiemap->fm_extents+i ;
      byteCount += cur->fe_length ;
      fprintf(stderr,"%s() fe_device=%d fe_length=%lld\n",__func__,cur->fe_device,cur->fe_length);
   }
   fprintf(stderr,"%s() nMapped=%d byteCount=%lld\n",__func__,nMapped,byteCount);
   return nMapped ;
}
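
For completeness, the snippet above does not show how the fiemap buffer is 
sized before the ioctl.  A minimal sketch of the setup as I would write it, 
using only the generic linux/fiemap.h fields; the 64-extent limit and the 
function name are arbitrary assumptions, and FIEMAP_FLAG_SYNC is included 
because it forces dirty data to be flushed before mapping, which may matter 
when the mapping is taken immediately after dd finishes:

#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>        /* FS_IOC_FIEMAP */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define MAX_EXTENTS 64       /* assumption: enough for this test file */

int dump_fiemap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    /* struct fiemap is followed in memory by fm_extent_count extents */
    size_t sz = sizeof(struct fiemap) + MAX_EXTENTS * sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, sz);
    if (!fm) { close(fd); return -1; }

    fm->fm_start        = 0;
    fm->fm_length       = FIEMAP_MAX_OFFSET;   /* whole file */
    fm->fm_flags        = FIEMAP_FLAG_SYNC;    /* flush dirty pages before mapping */
    fm->fm_extent_count = MAX_EXTENTS;

    int rc = ioctl(fd, FS_IOC_FIEMAP, fm);
    if (rc < 0)
        perror("FS_IOC_FIEMAP");
    else
        for (unsigned i = 0; i < fm->fm_mapped_extents; i++)
            fprintf(stderr, "extent %u: logical=%llu length=%llu\n",
                    i, (unsigned long long)fm->fm_extents[i].fe_logical,
                    (unsigned long long)fm->fm_extents[i].fe_length);

    free(fm);
    close(fd);
    return rc;
}

The fe_device field printed in the listings above is Lustre-specific; if I 
remember correctly it comes from Lustre's own fiemap header, where it is 
mapped onto one of the reserved extent fields, so the generic sketch above 
prints only the standard fields.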




--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] llapi_file_get_stripe() and /proc/fs/lustre/osc restated

2016-07-18 Thread John Bauer

Robert

Excellent. It works like a champ, after changing llapi_get_name() to 
llapi_getname().


Thanks much.  If we meet up at a conference some day, I owe you dinner.

John


On 7/18/2016 6:13 PM, Read, Robert wrote:

Hi John,

The initial string in the OSC name is the filesystem name and the long 
hexadecimal number at the end is the client ID. The client ID is 
specific to the mount point that “owns” the OSC, and typically changes 
each time the filesystem is mounted. It is there to differentiate 
devices  when the same filesystem is mounted multiple times.


You can retrieve both of the values by passing the mount directory to 
llapi_get_name(). This function returns a string with filesystem name 
and client ID in the format “FSNAME-ID". You can split the string on 
the ‘-‘ to extract those values, and then use them construct the right 
name for the current OSC being used for that file on that particular 
mountpoint.


Note, like the client ID, the OSC name is specific to that mount 
point. The actual OST name is just FSNAME-OST.


robert
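
A minimal sketch of the lookup described above, assuming the 
llapi_getname(path, buf, buflen) form of the call and the 
<lustre/lustreapi.h> header (both may differ slightly by release); the 
snprintf pattern for the /proc path is an assumption based on the directory 
names shown below:

#include <stdio.h>
#include <string.h>

#include <lustre/lustreapi.h>   /* llapi_getname(); header name may vary by release */

/* Build the /proc/fs/lustre/osc directory name for one OST index of the
 * filesystem instance mounted at "mnt".  Returns 0 on success. */
int osc_proc_dir(const char *mnt, int ost_idx, char *out, size_t outlen)
{
    char instance[256];          /* "FSNAME-CLIENTID", e.g. "nbp2-881038061c00" */
    char *dash;

    if (llapi_getname(mnt, instance, sizeof(instance)) != 0)
        return -1;

    dash = strchr(instance, '-');
    if (dash == NULL)
        return -1;
    *dash = '\0';                /* instance -> FSNAME, dash+1 -> client ID */

    snprintf(out, outlen, "/proc/fs/lustre/osc/%s-OST%04x-osc-%s",
             instance, ost_idx, dash + 1);
    return 0;
}

Called with the nbp2 mount point and OST index 12, this would produce 
/proc/fs/lustre/osc/nbp2-OST000c-osc-881038061c00.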

On Jul 18, 2016, at 15:22, John Bauer <bau...@iodoctors.com> wrote:


I will restate the problem I am having with Lustre.

With my I/O instrumentation library, I want to use 
*llapi_file_get_stripe*() to find the OSTs that a file of interest is 
striped on and then monitor only those OST's using files in the 
directory /proc/fs/lustre/osc.  This needs to be done 
programmatically, and in a general sense, with potentially only a 
relative path name.


*llapi_file_get_stripe*() yields
lmm_stripe_count:   1
lmm_stripe_size:1048576
lmm_pattern:1
lmm_layout_gen: 0
lmm_stripe_offset:  12

lmm_oi.oi_fid 0x194aa

    obdidx       objid       objid   group
        12    31515860   0x1e0e4d4       0

If I use *obdidx=12=0xc *to find the OST in directory 
/proc/fs/lustre/osc, I get multiple OSTs as there are multiple file 
systems with an ost of index 12 ( note that obdidx is decimal and 
entries in */proc/fs/lustre/osc* are hexadecimal, so we are looking 
for OST000c ).


%ls -ld /proc/fs/lustre/osc/*OST000c*
dr-xr-xr-x 2 root root 0 Jul 18 14:31 
/proc/fs/lustre/osc/nbp1-OST000c-osc-88090509f000
dr-xr-xr-x 2 root root 0 Jul 18 14:31 
/proc/fs/lustre/osc/nbp2-OST000c-osc-881038061c00
dr-xr-xr-x 2 root root 0 Jul 18 14:31 
/proc/fs/lustre/osc/nbp6-OST000c-osc-88084c405400
dr-xr-xr-x 2 root root 0 Jul 18 14:31 
/proc/fs/lustre/osc/nbp7-OST000c-osc-8807a8e1d400
dr-xr-xr-x 2 root root 0 Jul 18 14:31 
/proc/fs/lustre/osc/nbp8-OST000c-osc-88078339b800
dr-xr-xr-x 2 root root 0 Jul 18 14:31 
/proc/fs/lustre/osc/nbp9-OST000c-osc-8807833a5400


So I need to figure out which directory entry applies to the OST of 
my file of interest.


I looked at the inode for clues.  I did an stat() of the file to get

dev_t st_dev=0xcc5d43c2
ino_t st_ino=0x2000311ed0194aa

I notice the *lov_user_md->oi_fid=0x0194aa*, populated by 
*llapi_file_get_stripe*(), is reflected in the lower part of 
*stat.st_ino=0x2000311ed0194aa*.  My question is, "Does the 
remainder of st_ino, *2000311ed*, give me any clue as to which OST I 
should use out of */proc/fs/lustre/osc*?"  The same question applies 
to the OST's objid=0x1e0e4d4 and the file's st_dev=0xcc5d43c2.


Because I know a priori that the file is in the lov nbp2, I know I 
need to find /proc/fs/lustre/osc/nbp2-OST000c-osc-881038061c00.  What 
does the 881038061c00 represent?  It is the same value for all OSTs in 
a given lov, so I am guessing it is lov related.

There are over 1200 OST on the node, so I want to minimize the number 
that I instrument.
Any information that would shed some light on this would be greatly 
appreciated.

John
--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] llapi_file_get_stripe() and /proc/fs/lustre/osc restated

2016-07-18 Thread John Bauer

I will restate the problem I am having with Lustre.

With my I/O instrumentation library, I want to use 
*llapi_file_get_stripe*() to find the OSTs that a file of interest is 
striped on and then monitor only those OST's using files in the 
directory /proc/fs/lustre/osc.  This needs to be done programmatically, 
and in a general sense, with potentially only a relative path name.


*llapi_file_get_stripe*() yields
lmm_stripe_count:   1
lmm_stripe_size:1048576
lmm_pattern:1
lmm_layout_gen: 0
lmm_stripe_offset:  12

lmm_oi.oi_fid 0x194aa

    obdidx       objid       objid   group
        12    31515860   0x1e0e4d4       0

If I use *obdidx=12=0xc *to find the OST in directory 
/proc/fs/lustre/osc, I get multiple OSTs as there are multiple file 
systems with an ost of index 12 ( note that obdidx is decimal and 
entries in */proc/fs/lustre/osc* are hexadecimal, so we are looking for 
OST000c ).


%ls -ld /proc/fs/lustre/osc/*OST000c*
dr-xr-xr-x 2 root root 0 Jul 18 14:31 
/proc/fs/lustre/osc/nbp1-OST000c-osc-88090509f000
dr-xr-xr-x 2 root root 0 Jul 18 14:31 
/proc/fs/lustre/osc/nbp2-OST000c-osc-881038061c00
dr-xr-xr-x 2 root root 0 Jul 18 14:31 
/proc/fs/lustre/osc/nbp6-OST000c-osc-88084c405400
dr-xr-xr-x 2 root root 0 Jul 18 14:31 
/proc/fs/lustre/osc/nbp7-OST000c-osc-8807a8e1d400
dr-xr-xr-x 2 root root 0 Jul 18 14:31 
/proc/fs/lustre/osc/nbp8-OST000c-osc-88078339b800
dr-xr-xr-x 2 root root 0 Jul 18 14:31 
/proc/fs/lustre/osc/nbp9-OST000c-osc-8807833a5400


So I need to figure out which directory entry applies to the OST of my 
file of interest.


I looked at the inode for clues.  I did an stat() of the file to get

dev_t st_dev=0xcc5d43c2
ino_t st_ino=0x2000311ed0194aa

I notice the *lov_user_md->oi_fid=0x0194aa,* populated by 
*llapi_file_get_stripe*(), is reflected in the lower part of 
*stat.st_ino=0x2000311ed0194aa*.  My question is, "Does the remainder of 
st_ino, *2000311ed*, give me any clue as to which OST I should use out 
of */proc/fs/lustre/osc*?"  The same question applies to the OST's 
objid=0x1e0e4d4 and the file's st_dev=0xcc5d43c2.


Because I know a priori that the file is in the lov nbp2, I know I need 
to find /proc/fs/lustre/osc/nbp2-OST000c-osc-881038061c00.  What does 
the 881038061c00 represent?  It is the same value for all OSTs in a 
given lov, so I am guessing it is lov related.

There are over 1200 OST on the node, so I want to minimize the number 
that I instrument.
Any information that would shed some light on this would be greatly 
appreciated.

John

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] llapi_file_get_stripe() and /proc/fs/lustre/osc/ entries

2016-07-16 Thread John Bauer
I am using *llapi_file_get_stripe()* to get the OST indexes that a file 
is striped on.  That part is working fine.  But there are multiple Lustre 
file systems on the node, resulting in multiple *OST* entries in the 
directory /proc/fs/lustre/osc.  Is there something in *struct 
lov_user_ost_data* or *struct lov_user_md* that would indicate which of 
the following directories pertains to the file's OST?


dr-xr-xr-x 2 root root 0 Jul 16 12:31 nbp1-OST-osc-880287ae4c00
dr-xr-xr-x 2 root root 0 Jul 16 12:31 nbp2-OST-osc-881034d99000
dr-xr-xr-x 2 root root 0 Jul 16 12:31 nbp6-OST-osc-881003cd7800
dr-xr-xr-x 2 root root 0 Jul 16 12:31 nbp7-OST-osc-880ffe051c00
dr-xr-xr-x 2 root root 0 Jul 16 12:31 nbp8-OST-osc-880ffe054c00
dr-xr-xr-x 2 root root 0 Jul 16 12:31 nbp9-OST-osc-880fcf179400

Thanks

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] more on lustre striping

2016-06-10 Thread John Bauer
To confirm the point that you cannot intercept the open called by fopen 
by using LD_PRELOAD, I have written a simple test case. Note that the 
runtime linker never looks for open(), only fopen().


*$ cat a.c*
#include <stdio.h>

int
main(int argc, char ** argv ){
   FILE *f = fopen("a", "r" ) ;
   fprintf(stderr,"f=%p\n",f);
   fclose(f);
}
*$ file a*
a: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically 
linked (uses shared libs), for GNU/Linux 2.6.32, 
BuildID[sha1]=dfe043b4ec8cf19d5fd3fab524d7c72ed1453574, not stripped

*$ cat a.csh*
#!/bin/csh
setenv LD_DEBUG all
./a >&! a.cpr
*$ ./a.csh*
*$ grep -i open a.cpr*
120584: symbol=fopen;  lookup in file=./a [0]
120584: symbol=fopen;  lookup in file=/lib64/libc.so.6 [0]
120584: binding file ./a [0] to /lib64/libc.so.6 [0]: normal 
symbol `fopen' [GLIBC_2.2.5]

*$*


On 6/10/2016 7:29 AM, Ashley Pittman wrote:

On 22/05/16 02:56, John Bauer wrote:


Oleg

I can intercept the fopen(), but that does me no good as I can't set 
the O_LOV_DELAY_CREATE bit.  What I can not intercept is the open() 
downstream of fopen().  If one examines the symbols in libc you will 
see there are no unsatisfied externals relating to open, which means 
there is nothing for the runtime linker to find concerning open's.  I 
will have a look at the Lustre 1.8 source, but I seriously doubt that 
the open beneath fopen() was intercepted with LD_PRELOAD.  I would 
love to find a way to do that.  I could throw away a lot of code. 
Thanks,  John




Could you not intercept fopen() and implement it with calls to open() 
and fdopen() yourself which would give you full control over what 
you're looking for here?


Ashley.
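
A minimal sketch of what Ashley suggests: an LD_PRELOAD definition of 
fopen() that creates the file with O_LOV_DELAY_CREATE, sets the layout 
through the descriptor, and returns a stream built with fdopen().  Only the 
plain "w" case is handled, the striping values are placeholders, and a real 
wrapper would forward unhandled cases to the libc fopen looked up with 
dlsym(RTLD_NEXT, ...):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <lustre/lustre_user.h>   /* O_LOV_DELAY_CREATE, LL_IOC_LOV_SETSTRIPE */

/* Preloaded fopen(): handles only mode "w" here; anything else should be
 * forwarded to the real fopen found with dlsym(RTLD_NEXT, "fopen"). */
FILE *fopen(const char *path, const char *mode)
{
    int fd;
    struct lov_user_md lum = {0};

    if (strcmp(mode, "w") != 0)
        return NULL;                        /* sketch only: forward in real code */

    fd = open(path, O_WRONLY|O_CREAT|O_TRUNC|O_LOV_DELAY_CREATE, 0644);
    if (fd < 0)
        return NULL;

    lum.lmm_magic         = LOV_USER_MAGIC;
    lum.lmm_stripe_size   = 1 << 20;        /* placeholder striping values */
    lum.lmm_stripe_count  = 4;
    lum.lmm_stripe_offset = -1;
    if (ioctl(fd, LL_IOC_LOV_SETSTRIPE, &lum) < 0 && errno != EEXIST)
        fprintf(stderr, "LL_IOC_LOV_SETSTRIPE: %s\n", strerror(errno));

    return fdopen(fd, "w");                 /* caller receives a normal FILE* */
}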


--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] more on lustre striping

2016-05-21 Thread John Bauer

Oleg

I can intercept the fopen(), but that does me no good as I can't set the 
O_LOV_DELAY_CREATE bit.  What I can not intercept is the open() 
downstream of fopen().  If one examines the symbols in libc you will see 
there are no unsatisfied externals relating to open, which means there 
is nothing for the runtime linker to find concerning open's.  I will 
have a look at the Lustre 1.8 source, but I seriously doubt that the 
open beneath fopen() was intercepted with LD_PRELOAD.  I would love to 
find a way to do that.  I could throw away a lot of code. Thanks,  John


% nm -g /lib64/libc.so.6 | grep open
00033d70 T catopen
003bfb80 B _dl_open_hook
000b9a60 W fdopendir
0006b140 T fdopen@@GLIBC_2.2.5
000755c0 T fmemopen
0006ba00 W fopen64
0006bb60 T fopencookie@@GLIBC_2.2.5
0006ba00 T fopen@@GLIBC_2.2.5
000736f0 T freopen
00074b50 T freopen64
000ead40 T fts_open
0000 T iconv_open
0006b140 T _IO_fdopen@@GLIBC_2.2.5
00077220 T _IO_file_fopen@@GLIBC_2.2.5
00077170 T _IO_file_open
0006ba00 T _IO_fopen@@GLIBC_2.2.5
0006d1d0 T _IO_popen@@GLIBC_2.2.5
0006cee0 T _IO_proc_open@@GLIBC_2.2.5
00130b20 T __libc_dlopen_mode
000e7840 W open
000e7840 W __open
000ec690 T __open_2
000e7840 W open64
000e7840 W __open64
000ec6b0 T __open64_2
000e78d0 W openat
000e79b0 T __openat_2
000e78d0 W openat64
000e79b0 W __openat64_2
000f6e00 T open_by_handle_at
000340b0 T __open_catalog
000b9510 W opendir
000f0850 T openlog
00073e90 T open_memstream
000731b0 T open_wmemstream
0006d1d0 T popen@@GLIBC_2.2.5
0012fbd0 W posix_openpt
000e6460 T posix_spawn_file_actions_addopen
%

John


On 5/21/2016 7:33 PM, Drokin, Oleg wrote:

btw I find it strange that you cannot intercept fopen (and in fact intercepting 
every library call like that is counterproductive).

We used to have this "liblustre" library that you can LD_PRELOAD into your 
application, and it would work with Lustre even if you are not root and if Lustre is not 
mounted on that node
(and in fact even if the node is not Linux at all). That had no problem at all 
intercepting all sorts of opens by intercepting syscalls.
I wonder if you can intercept something deeper like sys_open or something like 
that?
Perhaps check out the lustre 1.8 sources (or even 2.1) and see how we did it back 
then?

On May 21, 2016, at 4:25 PM, John Bauer wrote:


Oleg

So in my simple test, the second open of the file caused the layout to be 
created.  Indeed, a write to the original fd did fail.
That complicates things considerably.

Disregard the entire topic.

Thanks

John


On 5/21/2016 3:08 PM, Drokin, Oleg wrote:

The thing is, when you open a file with no layout (the one you create with 
O_LOV_DELAY_CREATE) for write the next time -
the default layout is created just the same as it would have been created on 
the first open.
So if you want custom layouts - you do need to insert a setstripe call between 
the creation and the actual open for write.

On the other hand if you open with O_LOV_DELAY_CREATE and then try to write 
into that fd - you will get a failure.


On May 21, 2016, at 4:01 PM, John Bauer wrote:



Andreas,

Thanks for the reply.  For what it's worth, extending a file that does not have 
layout set does work.

% rm -f file.dat
% ./no_stripe.exe file.dat
fd=3
% lfs getstripe file.dat
file.dat has no stripe info
% date >> file.dat
% lfs getstripe file.dat
file.dat
lmm_stripe_count:   1
lmm_stripe_size:1048576
lmm_pattern:1
lmm_layout_gen: 0
lmm_stripe_offset:  21
     obdidx       objid       objid       group
         21     6143298    0x5dbd42           0

%
The LD_PRELOAD is exactly what I am doing in my I/O library.  Unfortunately, 
one cannot intercept the open() that results from a call to fopen().  That 
open is hard linked to the open in libc and not satisfied by the runtime 
linker.  This is what is driving this topic for me. I cannot conveniently set 
the striping for a file opened with fopen() and other functions where the open 
is called from inside libc. I used to believe that not many applications use 
stdio for heavy I/O, but I have come across several recently.

John

On 5/21/2016 12:51 AM, Dilger, Andreas wrote:


This is probably getting to be more of a topic for lustre-devel.

There currently isn't any way to do what you ask, since (IIRC) it will cause an 
error for apps that try to write to the files before the layout is set.

What you could do is to create an LD_PRELOAD library to intercept the open() 
calls and set O_LOV_DELAY_CREATE and set the layout explicitly for each file. 
This might be a win if each file needs a different layout, but since it uses 
two RPCs per file it would be slower than using the default layout.

Re: [lustre-discuss] more on lustre striping

2016-05-21 Thread John Bauer

Oleg

So in my simple test, the second open of the file caused the layout to 
be created.  Indeed, a write to the original fd did fail.


That complicates things considerably.

Disregard the entire topic.

Thanks

John


On 5/21/2016 3:08 PM, Drokin, Oleg wrote:

The thing is, when you open a file with no layout (the one you create with 
O_LOV_DELAY_CREATE) for write the next time -
the default layout is created just the same as it would have been created on 
the first open.
So if you want custom layouts - you do need to insert a setstripe call between 
the creation and the actual open for write.

On the other hand if you open with O_LOV_DELAY_CREATE and then try to write 
into that fd - you will get a failure.
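
A small standalone illustration of the behaviour described above (my sketch, 
with error handling trimmed): the write through the O_LOV_DELAY_CREATE 
descriptor is expected to fail, while a second ordinary open of the same 
pathname for write instantiates the default layout:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <lustre/lustre_user.h>   /* O_LOV_DELAY_CREATE */

int main(int argc, char **argv)
{
    if (argc < 2) return 1;

    /* create the file with no layout */
    int fd1 = open(argv[1], O_WRONLY|O_CREAT|O_TRUNC|O_LOV_DELAY_CREATE, 0640);
    if (fd1 < 0) { perror("open(delay)"); return 1; }

    /* expected to fail: this fd has no layout to write through */
    if (write(fd1, "x", 1) < 0)
        fprintf(stderr, "write via delayed fd: %s (expected)\n", strerror(errno));

    /* a second, ordinary open for write instantiates the default layout */
    int fd2 = open(argv[1], O_WRONLY);
    if (fd2 >= 0 && write(fd2, "x", 1) == 1)
        fprintf(stderr, "write via second open succeeded; default layout now set\n");

    close(fd2);
    close(fd1);
    return 0;
}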


On May 21, 2016, at 4:01 PM, John Bauer wrote:


Andreas,

Thanks for the reply.  For what it's worth, extending a file that does not have 
layout set does work.

% rm -f file.dat
% ./no_stripe.exe file.dat
fd=3
% lfs getstripe file.dat
file.dat has no stripe info
% date >> file.dat
% lfs getstripe file.dat
file.dat
lmm_stripe_count:   1
lmm_stripe_size:1048576
lmm_pattern:1
lmm_layout_gen: 0
lmm_stripe_offset:  21
     obdidx       objid       objid       group
         21     6143298    0x5dbd42           0

%
The LD_PRELOAD is exactly what I am doing in my I/O library.  Unfortunately, 
one cannot intercept the open() that results from a call to fopen().  That 
open is hard linked to the open in libc and not satisfied by the runtime 
linker.  This is what is driving this topic for me. I cannot conveniently set 
the striping for a file opened with fopen() and other functions where the open 
is called from inside libc. I used to believe that not many applications use 
stdio for heavy I/O, but I have come across several recently.

John

On 5/21/2016 12:51 AM, Dilger, Andreas wrote:

This is probably getting to be more of a topic for lustre-devel.

There currently isn't any way to do what you ask, since (IIRC) it will cause an 
error for apps that try to write to the files before the layout is set.

What you could do is to create an LD_PRELOAD library to intercept the open() 
calls and set O_LOV_DELAY_CREATE and set the layout explicitly for each file. 
This might be a win if each file needs a different layout, but since it uses 
two RPCs per file it would be slower than using the default layout.

Cheers, Andreas

On May 18, 2016, at 16:46, John Bauer <bau...@iodoctors.com> wrote:


Since today's topic seems to be Lustre striping, I will revisit a previous line 
of questions I had.

Andreas had put me on to O_LOV_DELAY_CREATE, which I have been experimenting 
with. My question is: Is there a way to flag a directory with 
O_LOV_DELAY_CREATE so that a file created in that directory will be created 
with O_LOV_DELAY_CREATE also?  Much like a file can inherit a directory's 
stripe count and stripe size, it would be convenient if a file could also 
inherit O_LOV_DELAY_CREATE.  That way, for open()s that I cannot intercept 
(and thus cannot set O_LOV_DELAY_CREATE in oflags), such as those issued by 
fopen(), I can then get the fd with fileno() and set the striping with 
ioctl(fd, LL_IOC_LOV_SETSTRIPE, lum).

Thanks

John
--
I/O Doctors, LLC
507-766-0378

bau...@iodoctors.com
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

--
I/O Doctors, LLC
507-766-0378

bau...@iodoctors.com
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] more on lustre striping

2016-05-18 Thread John Bauer
Since today's topic seems to be Lustre striping, I will revisit a 
previous line of questions I had.


Andreas had put me on to O_LOV_DELAY_CREATE, which I have been 
experimenting with. My question is: Is there a way to flag a directory 
with O_LOV_DELAY_CREATE so that a file created in that directory will be 
created with O_LOV_DELAY_CREATE also?  Much like a file can inherit a 
directory's stripe count and stripe size, it would be convenient if a 
file could also inherit O_LOV_DELAY_CREATE.  That way, for open()s that 
I cannot intercept (and thus cannot set O_LOV_DELAY_CREATE in oflags), 
such as those issued by fopen(), I can then get the fd with fileno() and 
set the striping with ioctl(fd, LL_IOC_LOV_SETSTRIPE, lum).


Thanks

John

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Delaying Lustre striping until first extent

2016-05-03 Thread John Bauer
Has there been any discussion about allowing a user to modify the 
striping of a file until the first extent is made?  There are a lot of 
opens that cannot be easily replaced with *llapi_file_open*(), such as 
the *openat*() family, the *mkstemp*() family, and the *fopen*() family.


It seems that it should be feasible to change the file's striping even 
after the file is created but not written to.


It would be really handy to have a function *llapi_fd_set_stripe*( int 
fd, ... ) to set the striping for a file that has been opened but not 
written to.  I have gotten past my immediate need for setting the 
striping of a file opened with *fopen64*() by doing the 
*llapi_file_open*() and then using the resulting fd in *fdopen*().  But 
this approach is not applicable to all listed above.



Thanks, John
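
For reference, a minimal sketch of the fopen64() workaround mentioned above, 
assuming the llapi_file_open(name, flags, mode, stripe_size, stripe_offset, 
stripe_count, stripe_pattern) prototype; the striping values and the function 
name are placeholders:

#include <fcntl.h>
#include <stdio.h>

#include <lustre/lustreapi.h>   /* llapi_file_open(); header name may vary by release */

FILE *fopen_striped(const char *path)
{
    /* stripe_size 1MiB, stripe_offset -1 (any OST), stripe_count 8, pattern 0 */
    int fd = llapi_file_open(path, O_WRONLY|O_CREAT|O_TRUNC, 0644,
                             1ULL << 20, -1, 8, 0);
    if (fd < 0)
        return NULL;

    return fdopen(fd, "w");     /* stdio stream on the already-striped fd */
}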

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] llapi_file_get_stripe() and lmm_stripe_offset

2016-04-30 Thread John Bauer
I have noticed some inconsistencies in the *lfs setstripe/getstripe* 
commands and the *llapi_file_get_stripe()* function.  Notice in the lfs 
setstripe/getstripe example below that specifying 7 for the offset with 
lfs setstripe -i 7 results in lfs getstripe reporting 
lmm_stripe_offset=7.  That all looks good.  But the simple program below 
that calls *llapi_file_get_stripe()* and prints out the *lmm_XXX* values 
reports *lmm_stripe_offset=0*.  That doesn't seem right.


% lfs setstripe -c 4 -i 7 file.dat
% lfs getstripe file.dat
file.dat
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  7
        obdidx       objid       objid       group
             7     5611546    0x55a01a           0
             0     5607846    0x5591a6           0
             8     5612725    0x55a4b5           0
            16     5611434    0x559faa           0


But whenever I call llapi_file_get_stripe(), the value 
*lmm_stripe_offset* that is returned in the *struct lov_user_md* is 
always 0.



% *./main.exe file.dat*
0=llapi_file_get_stripe()
lmm_stripe_count=4
lmm_stripe_size=1048576
lmm_pattern=1
lmm_layout_gen=0
lmm_stripe_offset=0
% *cat main.c*
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#include "lustre/lustre_user.h"

main(int argc, char **argv){
   if( argc < 2 ) return -1 ;
   char *fileName = argv[1] ;
   struct lov_user_md_v3 lum ;
   memset( &lum, 0, sizeof(lum) ) ;
   int status = llapi_file_get_stripe(fileName, (struct lov_user_md *)&lum );
   fprintf(stderr,"%d=llapi_file_get_stripe()\n",status);
   if( status != 0 ) return -1 ;
   fprintf(stderr,"lmm_stripe_count=%d\n",lum.lmm_stripe_count);
   fprintf(stderr,"lmm_stripe_size=%lld\n",lum.lmm_stripe_size);
   fprintf(stderr,"lmm_pattern=%d\n",lum.lmm_pattern);
   fprintf(stderr,"lmm_layout_gen=%d\n",lum.lmm_layout_gen);
fprintf(stderr,"lmm_stripe_offset=%d\n",(int)lum.lmm_stripe_offset);
   exit(0);
}
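
One observation about the program above, separate from the lmm_stripe_offset 
question: llapi_file_get_stripe() copies the whole layout, including the 
per-stripe object array, into the buffer it is given, so a bare struct 
lov_user_md_v3 on the stack leaves no room for lmm_objects.  A sketch with a 
fully sized buffer, which also reads the first object's l_ost_idx (which, as 
far as I can tell, is where lfs getstripe gets the offset it reports):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <lustre/lustre_user.h>   /* struct lov_user_md_v3, lov_user_ost_data_v1 */
#include <lustre/lustreapi.h>     /* llapi_file_get_stripe(); header may vary */

int main(int argc, char **argv)
{
    size_t sz;
    struct lov_user_md_v3 *lum;
    int status;

    if (argc < 2) return -1;

    /* leave room for the per-stripe object array that follows the header */
    sz = sizeof(struct lov_user_md_v3) +
         LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data_v1);
    lum = calloc(1, sz);
    if (lum == NULL) return -1;

    status = llapi_file_get_stripe(argv[1], (struct lov_user_md *)lum);
    fprintf(stderr, "%d=llapi_file_get_stripe()\n", status);
    if (status != 0) { free(lum); return -1; }

    fprintf(stderr, "lmm_stripe_count=%d\n", lum->lmm_stripe_count);
    if (lum->lmm_stripe_count > 0)
        fprintf(stderr, "first object l_ost_idx=%d\n",
                (int)lum->lmm_objects[0].l_ost_idx);

    free(lum);
    return 0;
}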


--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] llapi_file_open() messages to stderr

2016-04-30 Thread John Bauer

Dennis

Thanks for the response.  There are other reasons I want to suppress 
these messages.


1) I am trying to minimize system calls.

2) It's kind of overkill for llapi_file_open() to print to stderr.  The 
function already returns -1 with errno=EEXIST. Shouldn't it be up to the 
caller to decide if a message should be printed?


3) There are other messages, different from the *stripe already set*, 
that I also would like to suppress.  Most notably "*Invalid argument*".  
I hate to think that I have to check the validity of all the striping 
arguments to prevent the messages on stderr.  I'll also point out that 
even *lfs setstripe* doesn't go to the trouble of telling you which 
argument is invalid.


% lfs setstripe -c 486 -i 45 file.dat
error on ioctl 0x4008669a for 'file.dat' (3): Invalid argument
error: setstripe: create stripe file 'file.dat' failed
%

John
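
One possibility, offered as a guess rather than a verified fix: liblustreapi 
routes these messages through its own message-level machinery, and builds I 
have seen expose llapi_msg_set_level().  If the installed library provides 
it, lowering the level before the llapi_file_open() calls may silence the 
ioctl chatter.  A sketch, assuming the function and the LLAPI_MSG_OFF 
constant are available:

#include <lustre/lustreapi.h>   /* llapi_msg_set_level(), LLAPI_MSG_*: availability assumed */

/* Call once during library initialisation, before any llapi_file_open(). */
static void quiet_llapi(void)
{
    llapi_msg_set_level(LLAPI_MSG_OFF);   /* or LLAPI_MSG_FATAL to keep fatal errors */
}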


On 4/30/2016 12:27 PM, Dennis Nelson wrote:
Check for existing files before making the call?  And don't issue the 
call if the file exists.  You cannot change stripe attributes on an 
existing file.


Sent from my iPhone

On Apr 30, 2016, at 11:51 AM, John Bauer <bau...@iodoctors.com> wrote:


I am implementing the use of *llapi_file_open()* in my I/O library so 
I can set the striping.  If the file already exists I get the 
following message on stderr:


*error on ioctl 0x4008669a for 'dd.dat' (4): stripe already set*

Is there some way to suppress this message?  The end-users of my 
library will get these messages in the stderr of their job and will 
likely have no clue as to why.  My library detects the failed 
llapi_file_open() and retries with open64() and spits out its own 
error message to the library's log.


Thanks

John
--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread John Bauer

David

You note that you write a 6GB file.  I suspect that your Linux systems 
have significantly more memory than 6GB, meaning your file will end up 
cached in the system buffers.  It won't matter how many OSTs you use, as 
you probably are not measuring the speed to the OSTs, but rather 
the memory copy speed.

What transfer rate are you seeing?

John
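
A quick way to check whether the page cache is what is being measured (a 
generic sketch, not specific to the code in question): time the write loop 
as-is, then time it again including an fdatasync() before stopping the 
clock.  If the second number collapses, the 6GB was mostly landing in 
memory:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_WRONLY|O_CREAT|O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    size_t bs = 4 << 20;                 /* 4 MiB writes */
    long long total = 6LL << 30;         /* 6 GiB, as in the test described */
    char *buf = malloc(bs);
    memset(buf, 'x', bs);

    double t0 = now();
    for (long long done = 0; done < total; done += bs)
        if (write(fd, buf, bs) != (ssize_t)bs) { perror("write"); return 1; }
    double t1 = now();

    fdatasync(fd);                       /* force the data out to the OSTs */
    double t2 = now();

    printf("buffered: %.0f MB/s   with fdatasync: %.0f MB/s\n",
           (total / 1e6) / (t1 - t0), (total / 1e6) / (t2 - t0));

    free(buf);
    close(fd);
    return 0;
}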

On 5/19/2015 10:40 AM, Schneider, David A. wrote:

I am trying to get good performance with parallel writing to one file through 
MPI. Our cluster has high performance when I write to separate files, but when 
I use one file - I see very little performance increase.

As I understand, our cluster defaults to use one OST per file. There are many 
OST's though, which is how we get good performance when writing to multiple 
files. I have been using the command

  lfs setstripe

to change the stripe count and block size. I can see that this works, when I do 
lfs getstripe, I see the output file is striped, but I'm getting very little 
I/O performance when I create the striped file.

When working from hdf5 and mpi, I have seen a number of references about tuning 
parameters, I haven't dug into this yet. I first want to make sure lustre has 
the high output performance at a basic level. I tried to write a C program uses 
simple POSIX calls (open and looping over writes) but I don't see much increase 
in performance (I've tried 8 and 19 OST's, 1MB and 4MB chunks, I write a 6GB 
file).

Does anyone know if this should work? What is the simplest C program I could 
write to see an increase in output performance after I stripe? Do I need 
separate processes/threads with separate file handles? I am on linux red hat 5. 
I'm not sure what version of lustre this is. I have skimmed through a 450 page 
pdf of lustre documentation, I saw references to destructive testing one does 
in the beginning, but I'm not sure what I can do now. I think this is the first 
work we've done to get high performance when writing a single file, so I'm 
worried there is something buried in the lustre configuration that needs to be 
changed. I can run /usr/sbin/lctl, maybe there are certain parameters I should 
check?

best,

David Schneider
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org