At 10:55 AM 1/21/2009, Stas Oskin wrote:
>This is the bit I don't understand - shouldn't the Lustre nodes sync 
>the data between themselves? If a shared storage device is needed 
>somewhere, what do the Lustre storage nodes actually do?
>
>I mean, what is the point of Lustre being a cluster system if it 
>requires a central shared storage device?

Clustering and high availability are NOT the same thing.  People often 
confuse them, but they're completely different concepts.

"clustering" is simply the concept of grouping a bunch of things into 
a "cluster" so that they can interact with eachother in a desired 
fashion.  so a cluster filesystem is simply one that can be managed 
by various nodes in the cluster.  this CAN mean that all nodes have 
read/write access to the filesystem, but it doesn't HAVE to mean 
that.  it CAN mean that multiple nodes participate in a single 
filesystem (which seems to be the case in Lustre)...in other 
words,  a distributed filesystem where some files are on some nodes 
and others are on other nodes is often defined as a "cluster 
filesystem" however, this has no implication of redundancy in the data.
Generally, a "High-Availability" filesystem will add features like 
replication/redundancy in order to ensure that the filesystem 
survives a failure of some kind.

Gluster combines both of these and can be implemented as an HA 
clustered filesystem.
Without the HA translator, however, Gluster (and Lustre) are "cluster" 
filesystems, and you may run into a problem if there's a node failure.
Depending on your application this may be acceptable; if it's not, 
then you have to add HA features, realizing, of course, that HA 
features come at a cost (performance, disk, CPU, etc.).

>But the shared block device may be a distributed mirrored block device
>(like DRBD) which mirrors each data block as it is written to its peer
>node. In such a configuration the data is actually stored on both nodes in
>the failover pair. My guess is that this is not a common configuration for
>production use.
>
>
>AFAIK such a config could be achieved without Lustre at all - just 
>with 2 servers acting as storage nodes. This of course would be an 
>active-passive mode, and waste 50% of the resources.

However, what you don't get in that environment is read/write access 
to the filesystem from both nodes.  You'd get HA, in that if one node 
failed, the other node could then mount the block device/filesystem 
and continue working, but you wouldn't be able to write directly to 
the block devices on both nodes at the same time.
Writes would happen through Lustre, which would most likely have one 
node in "control" of both the local and remote block device, so the 
remote node couldn't write directly to it and instead would send its 
writes over the network to the control node, which would do the 
actual writing to the block devices.
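To make that concrete, here's a minimal sketch of the write-forwarding 
idea in Python.  The address, port, framing, and names are all 
invented for illustration - this is not Lustre's actual protocol:

import socket
import struct

CONTROL_NODE = ("192.168.1.10", 9000)  # hypothetical owner node address

def forward_write(offset, data):
    # The non-owning node never touches the shared block device;
    # it ships (offset, length, payload) to the control node and
    # waits for an acknowledgement that the write hit disk.
    with socket.create_connection(CONTROL_NODE) as sock:
        sock.sendall(struct.pack("!QI", offset, len(data)) + data)
        return sock.recv(1) == b"\x01"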

The way most shared storage volume managers work is:
They put a system id tag somewhere on the physical device, and other 
nodes read this to see who "owns" the block device.
The owner periodically updates a timestamp on the device.
If this timestamp doesn't change in x number of cycles, then another 
node may take ownership.  It overwrites the id tag, then waits and 
checks to see if another node did the same thing.  If not, it can now 
claim ownership, and from then on it manages all writes to the 
physical device.
This is because, traditionally, filesystems used memory for file 
locking (and some caching), since they never had to worry about some 
other machine modifying the filesystem.
As such, the volume manager needed to ensure that only one machine 
had access to the volumes/filesystems, or you would have severe data 
integrity issues.
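In code, that ownership dance might look something like the rough 
Python sketch below.  The on-disk offsets, timings, and record layout 
are all made up for illustration; real volume managers do this far 
more carefully (fencing, atomic updates, etc.):

import os
import struct
import time
import uuid

TAG_OFF, TAG_LEN = 0, 16   # hypothetical location of the owner id tag
STAMP_OFF = 64             # hypothetical location of the heartbeat stamp
CYCLE = 5                  # seconds between heartbeats
DEAD_AFTER = 3             # missed cycles before takeover is allowed
MY_ID = uuid.uuid4().bytes # this node's 16-byte system id

def read_at(dev, off, n):
    with open(dev, "rb") as f:
        f.seek(off)
        return f.read(n)

def write_at(dev, off, data):
    with open(dev, "r+b") as f:
        f.seek(off)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure it really hits the device

def heartbeat(dev):
    # The owner periodically refreshes its timestamp on the device.
    write_at(dev, STAMP_OFF, struct.pack("!d", time.time()))

def try_takeover(dev):
    # Watch the timestamp; if it stays frozen for DEAD_AFTER cycles,
    # the owner is presumed dead and we may try to claim the device.
    last = read_at(dev, STAMP_OFF, 8)
    for _ in range(DEAD_AFTER):
        time.sleep(CYCLE)
        if read_at(dev, STAMP_OFF, 8) != last:
            return False           # owner is still alive; back off
    write_at(dev, TAG_OFF, MY_ID)  # overwrite the id tag to stake a claim
    time.sleep(CYCLE)              # wait, then check nobody claimed over us
    return read_at(dev, TAG_OFF, TAG_LEN) == MY_ID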

Then along came cluster-aware filesystems such as OCFS2.
These eliminated filesystem caching and moved the locking from memory 
to the filesystem itself.
Now you could have 2 nodes physically accessing the same disk device, 
because they could tell if some other machine had a read lock on a 
file or block of data.
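Very roughly, the idea of a lock record living on the shared disk 
could be sketched like this.  This is a toy illustrating the concept, 
NOT how OCFS2 actually implements its locking, and the read-then-write 
below isn't atomic - which is exactly why real implementations need 
much more machinery:

import os

FREE = b"\x00" * 16  # hypothetical "unlocked" marker

def acquire_disk_lock(dev, lock_off, my_id):
    # Read the 16-byte lock record straight from the shared disk; if
    # it's free, write our id, sync, and re-read to confirm we hold it.
    with open(dev, "r+b") as f:
        f.seek(lock_off)
        if f.read(16) != FREE:
            return False            # some other machine holds the lock
        f.seek(lock_off)
        f.write(my_id)
        f.flush()
        os.fsync(f.fileno())        # every lock op costs a synced disk I/O
        f.seek(lock_off)
        return f.read(16) == my_id  # confirm nobody overwrote our claim

Note that every acquire costs at least one synced write to the 
physical media, which is where the performance point below comes from.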

These filesystems, again, have lower performance, because your file 
locking happens at disk speed instead of memory speed, and you 
totally lose any benefit of caching, since you have to ensure data 
has finished writing to the physical media before releasing a lock.
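Back-of-envelope (ballpark numbers, just for scale):

DISK_SYNC_WRITE = 5e-3  # ~5 ms for a synced write to a 2009-era disk
MEMORY_LOCK_OP = 1e-7   # ~100 ns for an in-memory lock operation
print(DISK_SYNC_WRITE / MEMORY_LOCK_OP)  # 50000.0 - 4-5 orders of magnitude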

The volume managers would do various levels of RAID.
Since the clustered filesystems want to interact directly with the 
disk devices, the cluster-aware volume managers don't always work 
(most still want only one machine to control the device at any given 
time).  So we've had to wait for these filesystems to include HA 
features (RAID/mirroring, etc.) so that we could survive disk failures.

Ideally, we would have a cluster-aware volume manager which handled 
the RAID issues and could manage remote physical devices, and on top 
of that, cluster-aware filesystems (like OCFS2) which would give us 
distributed access to the filesystem along with the conveniences of 
a volume manager.

I think we're years away from having anything like that, but it'd be ideal. 


_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://zresearch.com/cgi-bin/mailman/listinfo/gluster-users
