Re: [Lustre-discuss] Swap over lustre

2011-08-18 Thread Temple Jason
Hello,

I experimented with swap on lustre in as many ways as possible (without
touching the code), keeping the path to swap as short as possible, all to no
avail.  The code simply cannot handle it, and the system always hung.

Without serious code rewrites, this isn't going to work for you.

-Jason

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of John Hanks
Sent: Thursday, 18 August 2011 05:55
To: land...@scalableinformatics.com
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Swap over lustre

On Wed, Aug 17, 2011 at 8:57 PM, Joe Landman
land...@scalableinformatics.com wrote:
 On 08/17/2011 10:43 PM, John Hanks wrote:
 As a rule of thumb, you should try to keep the path to swap as simple as
 possible.  No memory/buffer allocations on the way to a paging event, if
 you can possibly manage it.

I do have a long path there, will try simplifying that and see if it helps.

 The lustre client (and most NFS or even network block devices) all do
 memory allocation of buffers ... which is anathema to migrating pages
 out to disk.  You can easily wind up in a death spiral race condition
 (and it sounds like you are there).  You might be able to do something
 with iSCSI or SRP (though these also do block allocations and could
 trigger death spirals).  If you can limit the number of buffers they
 allocate, and then force them to allocate the buffers at startup (by
 forcing some activity to the block device, and then pin this memory so
 that they can't be ejected ...) you might have a chance to do it as a
 block device.  I think SRP can do this, not sure if iSCSI initiators can
 pin buffers in RAM.

 You might look at the swapz patches (we haven't integrated them into our
 kernel yet, but have been looking at it) to compress swap pages and
 store them ... in ram.  This may not work for you, but it could be an
 option.

I wasn't aware of swapz; that sounds really interesting. The codes
that run the nodes out of memory tend to be sequencing applications,
which seem like good candidates for memory compression.

 Is there any particular reason you can't use a local drive for this
 (such as you don't have local drives, or they aren't big/fast enough)?

We're doing this on diskless nodes. I'm not looking to get a huge
amount of swap, just enough to provide a place for the root filesystem
to page out of the tmpfs so we can squeeze out all the RAM possible
for applications. Since I don't expect it to get heavily used, I'm
considering running vblade on a server and carving out small AoE LUNs.
It seems logical that if a host can boot off of iSCSI or AoE, you could
have a swap space there, but I've never tried it with either protocol.

FWIW, mounting a file on lustre via loopback to provide a local
scratch filesystem works really well.

jbh
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Swap over lustre

2011-08-18 Thread John Hanks
On Thu, Aug 18, 2011 at 12:36 AM, Temple Jason jtem...@cscs.ch wrote:
 Hello,

 I experimented with swap on lustre in as many ways as possible (without
 touching the code), keeping the path to swap as short as possible, all to no
 avail.  The code simply cannot handle it, and the system always hung.

 Without serious code rewrites, this isn't going to work for you.

 -Jason


Lacking the skills and time for what Andreas suggested, I think my
approach will be to abandon this direction for the moment. Thanks to
everyone for your responses. If I find time to learn how to provide
useful debugging and feedback, the list will be the first to know :)

jbh
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Swap over lustre

2011-08-18 Thread Andreas Dilger
On 2011-08-18, at 12:36 AM, Temple Jason jtem...@cscs.ch wrote:
 I experimented with swap on lustre in as many ways as possible (without
 touching the code), keeping the path to swap as short as possible, all to no
 avail.  The code simply cannot handle it, and the system always hung.

Jason, did you try the lloop device?  That was written for swap to use, to
avoid the VFS, filesystem, and locking layers.  It never made it to production
quality, since no customer was interested in completing it, but it is
definitely the best starting point.

 Without serious code rewrites, this isn't going to work for you.

That's a difficult assessment to make.  A bunch of effort went into removing 
allocations in the IO path at one time, but it was never a priority to keep 
lloop working, so things may have regressed over time.

IMHO it probably isn't a huge effort to get this working again, but someone in
the community would need to invest the time to investigate the problems and fix
the code.

It would be best to start with just getting the lloop block device to work 
reliably, and use lctl set_param debug=+malloc to find allocations along the 
IO path, then move on to debugging swap.
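
Something along these lines should surface the allocations (device name
illustrative, and assuming the lloop device is already attached):

  lctl set_param debug=+malloc
  swapon /dev/lloop0              # exercise the IO path
  lctl dk /tmp/lustre-debug.log   # dump the kernel debug buffer
  grep -i alloc /tmp/lustre-debug.log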

Cheers, Andreas

 -Original Message-
 From: lustre-discuss-boun...@lists.lustre.org 
 [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of John Hanks
 Sent: Thursday, 18 August 2011 05:55
 To: land...@scalableinformatics.com
 Cc: lustre-discuss@lists.lustre.org
 Subject: Re: [Lustre-discuss] Swap over lustre
 
 On Wed, Aug 17, 2011 at 8:57 PM, Joe Landman
 land...@scalableinformatics.com wrote:
 On 08/17/2011 10:43 PM, John Hanks wrote:
 As a rule of thumb, you should try to keep the path to swap as simple as
 possible.  No memory/buffer allocations on the way to a paging event, if
 you can possibly manage it.
 
 I do have a long path there, will try simplifying that and see if it helps.
 
 The lustre client (and most NFS or even network block devices) all do
 memory allocation of buffers ... which is anathema to migrating pages
 out to disk.  You can easily wind up in a death spiral race condition
 (and it sounds like you are there).  You might be able to do something
 with iSCSI or SRP (though these also do block allocations and could
 trigger death spirals).  If you can limit the number of buffers they
 allocate, and then force them to allocate the buffers at startup (by
 forcing some activity to the block device, and then pin this memory so
 that they can't be ejected ...) you might have a chance to do it as a
 block device.  I think SRP can do this, not sure if iSCSI initiators can
 pin buffers in RAM.
 
 You might look at the swapz patches (we haven't integrated them into our
 kernel yet, but have been looking at it) to compress swap pages and
 store them ... in ram.  This may not work for you, but it could be an
 option.
 
 I wasn't aware of swapz; that sounds really interesting. The codes
 that run the nodes out of memory tend to be sequencing applications,
 which seem like good candidates for memory compression.
 
 Is there any particular reason you can't use a local drive for this
 (such as you don't have local drives, or they aren't big/fast enough)?
 
 We're doing this on diskless nodes. I'm not looking to get a huge
 amount of swap, just enough to provide a place for the root filesystem
 to page out of the tmpfs so we can squeeze out all the RAM possible
 for applications. Since I don't expect it to get heavily used, I'm
 considering running vblade on a server and carving out small AoE LUNs.
 It seems logical that if a host can boot off of iSCSI or AoE, you could
 have a swap space there, but I've never tried it with either protocol.
 
 FWIW, mounting a file on lustre via loopback to provide a local
 scratch filesystem works really well.
 
 jbh
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Swap over lustre

2011-08-18 Thread Phil Sharfstein
Since you are considering other options, I would recommend NBD swap for this 
type of minimal swapping application.  Look at the Linux Terminal Server 
Project (LTSP) implementation using nbdswapd, a simple inetd-driven script that 
creates and exports NBD swap spaces.  We have been using this successfully on 
our diskless nodes for several years.

You can grab the nbdswapd script for the server-side and the ltsp-init-common 
script for an example client setup from any of the latest LTSP binary 
distributions.
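
A minimal sketch of the moving parts (the port number and script paths
here follow typical LTSP 5 packaging -- check your distribution):

  # server: inetd.conf entry that hands each connection to nbdswapd
  #   9572 stream tcp nowait nobody /usr/sbin/tcpd /usr/sbin/nbdswapd
  # client:
  modprobe nbd
  nbd-client swapserver 9572 /dev/nbd0
  swapon /dev/nbd0   # nbdswapd serves up a ready-made swap image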

-Phil

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org on behalf of John Hanks
Sent: Thu 8/18/2011 7:30 AM
To: Temple Jason
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Swap over lustre
 
On Thu, Aug 18, 2011 at 12:36 AM, Temple Jason jtem...@cscs.ch wrote:
 Hello,

 I experimented with swap on lustre in as many ways as possible (without
 touching the code), keeping the path to swap as short as possible, all to no
 avail.  The code simply cannot handle it, and the system always hung.

 Without serious code rewrites, this isn't going to work for you.

 -Jason


Lacking the skills and time for what Andreas suggested, I think my
approach will be to abandon this direction for the moment. Thanks to
everyone for your responses. If I find time to learn how to provide
useful debugging and feedback, the list will be the first to know :)

jbh
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Swap over lustre

2011-08-17 Thread John Hanks
Hi,

I've been trying to get swap on lustre to work, with not much success,
using both blockdev_attach (and the resulting lloop0 device) and
losetup (and the resulting loop device). This thread
(http://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg00856.html)
claims that it works, but in all my attempts, almost as soon as swap is
used (testing with memhog), the host hangs. In some cases it hangs
hard, but on occasion, if I'm patient enough, the OOM killer will
eventually kill something and the node will become responsive again.
If I carefully increase memory with each successive memhog run I can
get some pages to swap, but any real pressure always results in a
hang. I'm attempting this on Red Hat EL 5.6 with a Lustre 1.8.4
patchless client over IB.
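
For concreteness, the basic recipe I've been testing looks like this
(file name, size, and device names illustrative):

  dd if=/dev/zero of=/mnt/lustre/swap.img bs=1M count=4096
  # via the Lustre lloop driver:
  lctl blockdev_attach /mnt/lustre/swap.img /dev/lloop0
  # ...or via the generic loop driver:
  #   losetup /dev/loop0 /mnt/lustre/swap.img
  mkswap /dev/lloop0
  swapon /dev/lloop0
  memhog 8g   # drive the node into swap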

Digging around search results for swap over NFS, I've found a lot of
discussion about race conditions and different patches to address
them, but CONFIG_NFS_SWAP seems to be missing from the Red Hat kernel.
And upon trying swap to an NFS server, I see the same behavior. Is
swap to a network device doomed to always fail on Red Hat EL 5 and, if
not, does anyone have a recipe for getting swap on lustre to work?

I've also fiddled with min_free_kbytes and swappiness in an attempt to
induce swapping before the node's memory is actually all gone, but all
this results in is an earlier hang with less memory having been used.
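
For reference, these are the knobs I mean (values illustrative):

  sysctl -w vm.min_free_kbytes=131072   # keep more pages free
  sysctl -w vm.swappiness=100           # page out as eagerly as possible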

Thanks,

jbh
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Swap over lustre

2011-08-17 Thread Joe Landman
On 08/17/2011 10:43 PM, John Hanks wrote:
 Hi,

 I've been trying to get swap on lustre to work, with not much success,
 using both blockdev_attach (and the resulting lloop0 device) and
 losetup (and the resulting loop device). This thread
 (http://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg00856.html)
 claims that it works, but in all my attempts, almost as soon as swap is
 used (testing with memhog), the host hangs. In some cases it hangs
 hard, but on occasion, if I'm patient enough, the OOM killer will
 eventually kill something and the node will become responsive again.
 If I carefully increase memory with each successive memhog run I can
 get some pages to swap, but any real pressure always results in a
 hang. I'm attempting this on Red Hat EL 5.6 with a Lustre 1.8.4
 patchless client over IB.

As a rule of thumb, you should try to keep the path to swap as simple as
possible.  No memory/buffer allocations on the way to a paging event, if
you can possibly manage it.

The lustre client (and most NFS or even network block devices) all do 
memory allocation of buffers ... which is anathema to migrating pages 
out to disk.  You can easily wind up in a death spiral race condition 
(and it sounds like you are there).  You might be able to do something 
with iSCSI or SRP (though these also do block allocations and could 
trigger death spirals).  If you can limit the number of buffers they 
allocate, and then force them to allocate the buffers at startup (by 
forcing some activity to the block device, and then pin this memory so 
that they can't be ejected ...) you might have a chance to do it as a
block device.  I think SRP can do this, not sure if iSCSI initiators can
pin buffers in RAM.

You might look at the swapz patches (we haven't integrated them into our 
kernel yet, but have been looking at it) to compress swap pages and 
store them ... in ram.  This may not work for you, but it could be an 
option.
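
The compcache/zram staging driver does the same compress-into-RAM trick
and is easy to experiment with.  A rough sketch -- this is not the swapz
patch set itself, and it assumes a kernel that ships the zram module
(module name and sysfs interface assumed):

  modprobe zram num_devices=1
  echo $((512*1024*1024)) > /sys/block/zram0/disksize  # size in bytes
  mkswap /dev/zram0
  swapon -p 100 /dev/zram0   # prefer it over any network swap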

Is there any particular reason you can't use a local drive for this 
(such as you don't have local drives, or they aren't big/fast enough)?

 Digging around search results for swap over NFS, I've found a lot of
 discussion about race conditions and different patches to address
 them, but CONFIG_NFS_SWAP seems to be missing from the Red Hat kernel.
 And upon trying swap to an NFS server, I see the same behavior. Is
 swap to a network device doomed to always fail on Red Hat EL 5 and, if
 not, does anyone have a recipe for getting swap on lustre to work?

 I've also fiddled with min_free_kbytes and swappiness in an attempt to
 induce swapping before the node's memory is actually all gone, but all
 this results in is an earlier hang with less memory having been used.


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Swap over lustre

2011-08-17 Thread David Dillow
On Wed, 2011-08-17 at 22:57 -0400, Joe Landman wrote:
 The lustre client (and most NFS or even network block devices) all do 
 memory allocation of buffers ... which is anathema to migrating pages 
 out to disk.  You can easily wind up in a death spiral race condition 
 (and it sounds like you are there).  You might be able to do something 
 with iSCSI or SRP (though these also do block allocations and could 
 trigger death spirals).

Your post is generally correct, but minor nit here: there is no memory
allocation on the command path of the Linux SRP initiator, so the death
spiral is not possible there. I suspect the iSCSI initiator takes
similar precautions -- or uses mempools -- to avoid this fate, but I'm
not as familiar with that code.

Cheers,
Dave

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Swap over lustre

2011-08-17 Thread John Hanks
On Wed, Aug 17, 2011 at 8:57 PM, Joe Landman
land...@scalableinformatics.com wrote:
 On 08/17/2011 10:43 PM, John Hanks wrote:
 As a rule of thumb, you should try to keep the path to swap as simple as
 possible.  No memory/buffer allocations on the way to a paging event, if
 you can possibly manage it.

I do have a long path there, will try simplifying that and see if it helps.

 The lustre client (and most NFS or even network block devices) all do
 memory allocation of buffers ... which is anathema to migrating pages
 out to disk.  You can easily wind up in a death spiral race condition
 (and it sounds like you are there).  You might be able to do something
 with iSCSI or SRP (though these also do block allocations and could
 trigger death spirals).  If you can limit the number of buffers they
 allocate, and then force them to allocate the buffers at startup (by
 forcing some activity to the block device, and then pin this memory so
 that they can't be ejected ...) you might have a chance to do it as a
 block device.  I think SRP can do this, not sure if iSCSI initiators can
 pin buffers in RAM.

 You might look at the swapz patches (we haven't integrated them into our
 kernel yet, but have been looking at it) to compress swap pages and
 store them ... in ram.  This may not work for you, but it could be an
 option.

I wasn't aware of swapz; that sounds really interesting. The codes
that run the nodes out of memory tend to be sequencing applications,
which seem like good candidates for memory compression.

 Is there any particular reason you can't use a local drive for this
 (such as you don't have local drives, or they aren't big/fast enough)?

We're doing this on diskless nodes. I'm not looking to get a huge
amount of swap, just enough to provide a place for the root filesystem
to page out of the tmpfs so we can squeeze out all the RAM possible
for applications. Since I don't expect it to get heavily used, I'm
considering running vblade on a server and carving out small AoE LUNs.
It seems logical that if a host can boot off of iSCSI or AoE, you could
have a swap space there, but I've never tried it with either protocol.
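
Roughly what I have in mind (shelf/slot numbers, interface, and paths
illustrative):

  # server: export a file as AoE shelf 0, slot 1 over eth0
  dd if=/dev/zero of=/srv/aoe/node001.swap bs=1M count=2048
  vbladed 0 1 eth0 /srv/aoe/node001.swap
  # diskless client:
  modprobe aoe
  mkswap /dev/etherd/e0.1
  swapon /dev/etherd/e0.1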

FWIW, mounting a file on lustre via loopback to provide a local
scratch filesystem works really well.
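
That setup is nothing more than (paths and size illustrative):

  dd if=/dev/zero of=/mnt/lustre/scratch.img bs=1M count=10240
  mkfs.ext2 -F /mnt/lustre/scratch.img
  mount -o loop /mnt/lustre/scratch.img /local/scratch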

jbh
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Swap over lustre

2011-08-17 Thread Joe Landman
On 08/17/2011 11:42 PM, David Dillow wrote:
 On Wed, 2011-08-17 at 22:57 -0400, Joe Landman wrote:
 The lustre client (and most NFS or even network block devices) all do
 memory allocation of buffers ... which is anathema to migrating pages
 out to disk.  You can easily wind up in a death spiral race condition
 (and it sounds like you are there).  You might be able to do something
 with iSCSI or SRP (though these also do block allocations and could
 trigger death spirals).

 Your post is generally correct, but minor nit here: there is no memory
 allocation on the command path of the Linux SRP initiator, so the death
 spiral is not possible there. I suspect the iSCSI initiator takes

Thanks for clarifying that.  I know that during startup there is an 
allocation, but I wasn't sure after that.

 similar precautions -- or uses mempools -- to avoid this fate, but I'm
 not as familiar with that code.

I think they also try to pre-allocate as much as possible.

One issue we've seen in conjunction with these is skb allocations in
some network drivers: in tight memory situations they can cause
problems when there are very few free pages.  Usually we get a
bunch of messages in the logs, but on some occasions the network device
shuts down (unable to allocate send/receive buffers).  We have seen this
on igb, e1000e, and e1000 based networks.




 Cheers,
 Dave


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Swap over lustre

2011-08-17 Thread Andreas Dilger
On 2011-08-17, at 8:43 PM, John Hanks wrote:
 I've been trying to get swap on lustre to work, with not much success,
 using both blockdev_attach (and the resulting lloop0 device) and
 losetup (and the resulting loop device). This thread
 (http://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg00856.html)
 claims that it works, but in all my attempts, almost as soon as swap is
 used (testing with memhog), the host hangs. In some cases it hangs
 hard, but on occasion, if I'm patient enough, the OOM killer will
 eventually kill something and the node will become responsive again.
 If I carefully increase memory with each successive memhog run I can
 get some pages to swap, but any real pressure always results in a
 hang. I'm attempting this on Red Hat EL 5.6 with a Lustre 1.8.4
 patchless client over IB.

Using IB is important for trying this out, since the Lustre RDMA will
use preallocated pages for the RPCs, unlike TCP, where there can be
problems allocating the TCP receive buffers.

That said, the swap-on-Lustre code was never really finished.  If you
are interested in debugging this and have some coding skills, you
could probably get some help with debugging on the list.

You need to have a serial console attached to the client node, and
grab the stack traces from the client to see where it is stuck
allocating memory, and then remove/avoid/preallocate it.
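
With sysrq enabled, the quickest way to grab those traces on the
console is:

  echo 1 > /proc/sys/kernel/sysrq   # enable sysrq if it isn't already
  echo t > /proc/sysrq-trigger      # dump all task stack traces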

Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss