[Gluster-users] Not giving up [ was: Re: read-subvolume]

2013-07-10 Thread Allan Latham
Hi all - especially Jeff, Marcus and HL

I couldn't resist a quick test after compiling the 3.4 beta. Looks
good. Same (very quick) times to do md5sums on both servers, so it must
be doing local reads. So gluster is still in the running.

I repeat - you guys are doing a great job. Software like gluster is not
trivial. You are making great progress. My comments are not a rant and
are intended to be helpful.

All the best and many thanks again

Allan

On 10/07/13 17:55, HL wrote:
> Been there ...
> here is my 10-cent advice
> a) Prepare for tomorrow
> b) Rest
> c) Think
> d) Plan
> e) act
> 
> I am sure it will work for you once you have calmed down
> 
> Tech hints.
> ifconfig iface mtu 9000 (or whatever your NIC can afford)
> Having only 100Mbit is not a good idea.
> I've recently located a dual-port 1Gbit NIC on eBay for $15.
> Get them.
> 
> And last but not least
> In case you happen to have a switch between the nodes make sure that you
> enable jumbo frames on it.
> Otherwise you are in DEEP Trouble.
> 
> stat'ing a 120G file in my case takes milliseconds, not even seconds.
> 
> GL
> Harry.
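Harry's MTU advice boils down to two steps: raise the MTU, then prove jumbo frames survive the whole path. A sketch, assuming the modern `ip` tool, an interface named `eth0`, a 9000-byte MTU, and a peer at 192.168.253.2 (all placeholders; substitute your own values):

```shell
# Raise the MTU on the storage interface (interface name and MTU size
# are assumptions; use whatever your NIC and switch actually support).
ip link set dev eth0 mtu 9000

# Prove jumbo frames survive end-to-end: 8972 = 9000 - 20 (IP) - 8 (ICMP),
# and -M do forbids fragmentation, so a switch with jumbo frames disabled
# makes this fail loudly instead of silently fragmenting.
ping -c 3 -M do -s 8972 192.168.253.2
```

If the ping fails with "Message too long", some hop (often the switch) is still at the default 1500 MTU.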
> 
> 
> On 10/07/2013 02:01 PM, Allan Latham wrote:
>> Hi all
>>
>> Thanks to all those volunteers who are working to get gluster into a
>> state where it can be used for live work.
>>
>> I understand that you are giving your free time and I very much
>> appreciate it on this project and the many others we use for live
>> production work.
>>
>> There seems to be a problem with the way gluster is going.
>> For me it would be an ideal solution if it actually worked.
>>
>> I have a simple scenario and it just simply doesn't work. Reading over
>> the network when the file is available locally is plainly wrong. Our
>> application cannot take the performance hit nor the extra network
>> traffic.
>>
>> I would suggest:
>>
>> 1. get a simple minimalist configuration working - 2 hosts and
>> replication only.
>> 2. make it bomb-proof.
>> 2a. it must cope with network failures, random reboots etc.
>> 2b. if it stops it has to auto-recover quickly.
>> 2c. if it can't it needs thorough documentation and adequate logs so a
>> reasonable sysop can rescue it.
>> 2d. it needs a fast validation scanner which verifies that data is where
>> it should be and is identical everywhere (md5sum).
>> 3. make it efficient (read local whenever possible - use rsync
>> techniques - remove scalability obstacles so it doesn't get
>> exponentially slower as more files are replicated)
>> 4. when that works expand to multiple hosts and clever distribution
>> techniques.
>> (repeat items 2 and 3 in the more complex environment)
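Item 2d (a validation scanner) can be approximated today with a plain checksum sweep over the brick directories. A minimal, self-contained sketch; the two temp directories below are stand-ins for real brick paths (the remote one reachable via e.g. an sshfs mount):

```shell
# Stand-in "bricks" with identical contents; in real use, point
# brick_a/brick_b at the actual brick directories.
brick_a=$(mktemp -d); brick_b=$(mktemp -d)
echo "hello" > "$brick_a/f1"
echo "hello" > "$brick_b/f1"

# Emit "checksum  relative-path" for every regular file, sorted so the
# two listings are directly comparable.
check_brick() {
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 )
}

if [ "$(check_brick "$brick_a")" = "$(check_brick "$brick_b")" ]; then
    result="bricks match"
else
    result="bricks differ"
fi
echo "$result"
```

The same sweep run against live bricks doubles as the "fast validation scanner" of item 2d, at the cost of reading every file.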
>>
>> If it doesn't work rock solid in a simple scenario it will never work in
>> a large scale cluster.
>>
>> Until point 3 is reached I cannot use it - which is a great
>> disappointment for me as well as the good guys doing the development.
>>
>> Good luck and thanks again
>>
>> Allan
>>
>>
>> On 04/07/13 13:10, Allan Latham wrote:
>>> Hi all
>>>
>>> Does anyone use read-subvolume?
>>>
>>> Has anyone tested read-subvolume?
>>>
>>> Does read-subvolume work in such a way that if the file is present on
>>> the local node the local copy is used rather than a remote one?
>>>
>>> Alternatively is there any way to configure (or patch) gluster to always
>>> prefer the local file?
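For what it's worth, AFR does expose a read-subvolume knob through the normal volume-option interface. A hedged sketch: `myvol` and the `myvol-client-0` subvolume name are placeholders, and the option name should be verified against the docs for your exact release:

```shell
# Pin AFR reads to one replica. The subvolume name is the client
# translator name from the generated volfile (a placeholder here).
gluster volume set myvol cluster.read-subvolume myvol-client-0
```

Whether this actually short-circuits the network read in 3.3 is exactly the question in this thread, so treat it as something to test, not a fix.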
>>>
>>> I have read everything available and have found no answer.
>>>
>>> Unison works very well in our environment but is not real time and needs
>>> to be run every few minutes and/or be kicked off with inotify.
>>>
>>> If I could get gluster to always read the local copy it would be a much
>>> better drop in replacement.
>>>
>>> This is a small-scale deployment, not a massive cluster, but I can imagine
>>> there are many potential users of gluster in this mode. It should beat
>>> unison and similar solutions in every way - but it doesn't because it is
>>> reading from the network even when it has a local up-to-date copy. This
>>> can't be intended behaviour.
>>>
>>> So what have I configured wrong?
>>>
>>> Thanks in advance
>>>
>>> Allan
>>>
>>>
>>> On 02/07/13 13:38, Allan Latham wrote:
 Hi everyone

 I have installed 3.3.1-1 from the Debian repository you provide.

 I am using a simple 2 node cluster and running in replication mode. The
connection between the nodes is limited to 100Mb/sec (that's bits, not
bytes!). Usage will be mainly for read access and since there is always
 a local copy available [ exactly 2 replicas on exactly 2 machines ] I
 expect very fast read performance. Writes are low volume and very
 infrequent - performance is not an issue.

 Almost everything works as I would expect.

Write speed is limited to 10MB (bytes) per second, which is what I would
 expect and is adequate for the application.

But read speed is either super fast or 10MB/sec, i.e. read operations
take place on the local copy or the remote one seemingly at random.

This is not the 'small files problem'. I am aware

Re: [Gluster-users] unable to resolve brick error

2013-07-10 Thread Joe Julian

By the way, this less-than-useful error message has been reworked for 3.4.

On 07/10/2013 05:54 PM, Joe Julian wrote:
That error means (and if it means this, then why doesn't it just say 
this???) that the hostname provided could not be converted to its 
uuid. That probably means that the hostname assigned to the brick is 
not in the peer list.


The hostname of the brick has to be a case-insensitive match for the 
peer hostname as displayed in "peer probe" (the function uses 
"strncasecmp"), or the brick hostname has to be able to resolve to an 
IP address which then must resolve to a hostname that matches the peer 
hostname.
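The first half of that rule is just a case-insensitive string compare; a rough shell analogue of the `strncasecmp` check (the hostnames here are placeholders):

```shell
# Case-insensitive hostname comparison, mirroring the strncasecmp-style
# match glusterd applies between the brick host and the peer name.
matches() {
    a=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    b=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    if [ "$a" = "$b" ]; then echo yes; else echo no; fi
}

matches Gluster1 gluster1   # -> yes (case differs, still a match)
matches gluster1 gluster2   # -> no  (must then resolve via DNS/hosts)
```

For the fallback path, `getent hosts gluster1` shows what the brick hostname resolves to; that address must then map back to a peer hostname.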


On 07/10/2013 05:18 PM, Matthew Sacks wrote:

Here is the startup sequence: https://gist.github.com/msacks/5971418


On Wed, Jul 10, 2013 at 3:02 PM, Matthew Sacks wrote:


Hello,
I have a gluster cluster which keeps complaining about
ops.c:842:glusterd_op_stage_start_volume] 0-: Unable to resolve
brick gluster1:/export/brick1/sdb1

here is the full output : https://gist.github.com/msacks/5970713

Not sure how this happened or how to fix it.

All my peers are connected.

I tried to stop and start the volume but that didn't work.

Cheers for your help.




___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] unable to resolve brick error

2013-07-10 Thread Joe Julian
That error means (and if it means this, then why doesn't it just say 
this???) that the hostname provided could not be converted to its uuid. 
That probably means that the hostname assigned to the brick is not in 
the peer list.


The hostname of the brick has to be a case-insensitive match for the 
peer hostname as displayed in "peer probe" (the function uses "strncasecmp"), 
or the brick hostname has to be able to resolve to an IP address which 
then must resolve to a hostname that matches the peer hostname.


On 07/10/2013 05:18 PM, Matthew Sacks wrote:

Here is the startup sequence: https://gist.github.com/msacks/5971418


On Wed, Jul 10, 2013 at 3:02 PM, Matthew Sacks wrote:


Hello,
I have a gluster cluster which keeps complaining about
ops.c:842:glusterd_op_stage_start_volume] 0-: Unable to resolve
brick gluster1:/export/brick1/sdb1

here is the full output : https://gist.github.com/msacks/5970713

Not sure how this happened or how to fix it.

All my peers are connected.

I tried to stop and start the volume but that didn't work.

Cheers for your help.

Re: [Gluster-users] unable to resolve brick error

2013-07-10 Thread Todd Stansell
Check out https://bugzilla.redhat.com/show_bug.cgi?id=911290

It seems similar so hopefully it'll help...

Todd

On Wed, Jul 10, 2013 at 05:18:46PM -0700, Matthew Sacks wrote:
> Here is the startup sequence: https://gist.github.com/msacks/5971418
> 
> 
> On Wed, Jul 10, 2013 at 3:02 PM, Matthew Sacks wrote:
> 
> > Hello,
> > I have a gluster cluster which keeps complaining about
> > ops.c:842:glusterd_op_stage_start_volume] 0-: Unable to resolve brick
> > gluster1:/export/brick1/sdb1
> >
> > here is the full output : https://gist.github.com/msacks/5970713
> >
> > Not sure how this happened or how to fix it.
> >
> > All my peers are connected.
> >
> > I tried to stop and start the volume but that didn't work.
> >
> > Cheers for your help.
> >



Re: [Gluster-users] unable to resolve brick error

2013-07-10 Thread Matthew Sacks
Here is the startup sequence: https://gist.github.com/msacks/5971418


On Wed, Jul 10, 2013 at 3:02 PM, Matthew Sacks wrote:

> Hello,
> I have a gluster cluster which keeps complaining about
> ops.c:842:glusterd_op_stage_start_volume] 0-: Unable to resolve brick
> gluster1:/export/brick1/sdb1
>
> here is the full output : https://gist.github.com/msacks/5970713
>
> Not sure how this happened or how to fix it.
>
> All my peers are connected.
>
> I tried to stop and start the volume but that didn't work.
>
> Cheers for your help.
>

Re: [Gluster-users] One node goes offline, the other node can't see the replicated volume anymore

2013-07-10 Thread Greg Scott
And here is ps ax | grep gluster from both nodes when fw1 is offline.  Note I 
have it mounted right now with the 'backupvolfile-server=' mount 
option.  The ps ax | grep gluster output looks the same now as it did 
when both nodes were online.

From fw1:
[root@chicago-fw1 gregs]# ./ruletest.sh
[root@chicago-fw1 gregs]#
[root@chicago-fw1 gregs]#
[root@chicago-fw1 gregs]# ps ax | grep gluster
1019 ?   Ssl   0:09 /usr/sbin/glusterd -p /run/glusterd.pid
1274 ?   Ssl   0:32 /usr/sbin/glusterfsd -s 192.168.253.1 --volfile-id 
firewall-scripts.192.168.253.1.gluster-fw1 -p 
/var/lib/glusterd/vols/firewall-scripts/run/192.168.253.1-gluster-fw1.pid -S 
/var/run/3eea976403bb07230cae75b885406920.socket --brick-name /gluster-fw1 -l 
/var/log/glusterfs/bricks/gluster-fw1.log --xlator-option 
*-posix.glusterd-uuid=e13d53de-c7ed-4e63-bcb1-dc69ae25cc15 --brick-port 49152 
--xlator-option firewall-scripts-server.listen-port=49152
1280 ?   Ssl   0:05 /usr/sbin/glusterfs -s localhost --volfile-id 
gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log 
-S /var/run/ec00b40c3ed179eccfdd89f5fcd540cc.socket
1285 ?   Ssl   0:05 /usr/sbin/glusterfs -s localhost --volfile-id 
gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l 
/var/log/glusterfs/glustershd.log -S 
/var/run/fa9d586a6fab73a52bba6fc92ddd5d91.socket --xlator-option 
*replicate*.node-uuid=e13d53de-c7ed-4e63-bcb1-dc69ae25cc15
12649 ?  Ssl   0:00 /usr/sbin/glusterfs --volfile-id=/firewall-scripts 
--volfile-server=192.168.253.1 /firewall-scripts
12991 pts/1  S+    0:00 grep --color=auto gluster
[root@chicago-fw1 gregs]#
[root@chicago-fw1 gregs]# more ruletest.sh
iptables -I INPUT 1 -i enp5s4 -s 192.168.253.2 -j REJECT
[root@chicago-fw1 gregs]#

You can see from above fw1 is now offline.  Here are the gluster processes 
still on fw2 - they look the same as before.

[root@chicago-fw2 gregs]# ps ax | grep gluster
1027 ?   Ssl   0:11 /usr/sbin/glusterd -p /run/glusterd.pid
1291 ?   Ssl   0:14 /usr/sbin/glusterfsd -s 192.168.253.2 --volfile-id 
firewall-scripts.192.168.253.2.gluster-fw2 -p 
/var/lib/glusterd/vols/firewall-scripts/run/192.168.253.2-gluster-fw2.pid -S 
/var/run/380dca5c55990acea8ab30f5a08375a7.socket --brick-name /gluster-fw2 -l 
/var/log/glusterfs/bricks/gluster-fw2.log --xlator-option 
*-posix.glusterd-uuid=a2334360-d1d3-40c1-8c0e-7d62a5318899 --brick-port 49152 
--xlator-option firewall-scripts-server.listen-port=49152
1306 ?   Ssl   0:06 /usr/sbin/glusterfs -s localhost --volfile-id 
gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log 
-S /var/run/12903cdbca94bee4abfc3b4df24e2e61.socket
1310 ?   Ssl   0:06 /usr/sbin/glusterfs -s localhost --volfile-id 
gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l 
/var/log/glusterfs/glustershd.log -S 
/var/run/a2dee45b1271f43ae8a8d9003567b428.socket --xlator-option 
*replicate*.node-uuid=a2334360-d1d3-40c1-8c0e-7d62a5318899
12663 ?  Ssl   0:01 /usr/sbin/glusterfs --volfile-id=/firewall-scripts 
--volfile-server=192.168.253.2 /firewall-scripts
13008 pts/0  S+    0:00 grep --color=auto gluster
[root@chicago-fw2 gregs]#
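The check raghav asked for (is the brick process actually up?) is easy to script by classifying `ps` output. A sketch; the sample text below stands in for a live `ps ax -o command=` run:

```shell
# Count brick processes in ps output: only bricks run as
# /usr/sbin/glusterfsd, while glusterd and the client/nfs/shd daemons
# are /usr/sbin/glusterd or /usr/sbin/glusterfs and don't match.
count_bricks() {
    grep -c '^/usr/sbin/glusterfsd'
}

# Sample standing in for live output of: ps ax -o command= | count_bricks
sample='/usr/sbin/glusterd -p /run/glusterd.pid
/usr/sbin/glusterfsd -s 192.168.253.2 --volfile-id firewall-scripts.192.168.253.2.gluster-fw2
/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs'

bricks=$(printf '%s\n' "$sample" | count_bricks)
echo "brick processes: $bricks"   # -> brick processes: 1
```

On a live node, zero brick processes for a started volume is exactly the failure mode raghav suspected on fw2.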


-  Greg

From: gluster-users-boun...@gluster.org 
[mailto:gluster-users-boun...@gluster.org] On Behalf Of Greg Scott
Sent: Wednesday, July 10, 2013 4:57 PM
To: 'raghav'; gluster-users@gluster.org List
Subject: Re: [Gluster-users] One node goes offline, the other node can't see 
the replicated volume anymore

> It looks like the brick processes on fw2 machine are not running and
> hence when fw1 is down, the entire replication process is stalled. Can you
> do a ps and get the status of all the gluster processes and
> ensure that the brick process is up on fw2.
I was away from this most of the day.  Here is a ps ax | grep gluster from both 
fw1 and fw2 while both nodes are online.

From fw1:

[root@chicago-fw1 glusterfs]# ps ax | grep gluster
1019 ?   Ssl   0:09 /usr/sbin/glusterd -p /run/glusterd.pid
1274 ?   Ssl   0:32 /usr/sbin/glusterfsd -s 192.168.253.1 --volfile-id 
firewall-scripts.192.168.253.1.gluster-fw1 -p 
/var/lib/glusterd/vols/firewall-scripts/run/192.168.253.1-gluster-fw1.pid -S 
/var/run/3eea976403bb07230cae75b885406920.socket --brick-name /gluster-fw1 -l 
/var/log/glusterfs/bricks/gluster-fw1.log --xlator-option 
*-posix.glusterd-uuid=e13d53de-c7ed-4e63-bcb1-dc69ae25cc15 --brick-port 49152 
--xlator-option firewall-scripts-server.listen-port=49152
1280 ?   Ssl   0:05 /usr/sbin/glusterfs -s localhost --volfile-id 
gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log 
-S /var/run/ec00b40c3ed179eccfdd89f5fcd540cc.socket
1285 ?   Ssl   0:05 /usr/sbin/glusterfs -s localhost --volfile-id 
gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l 
/var/log/glusterfs/glustershd.log -S 
/var/run/fa9d586a6fab73a52bba6fc92ddd5d91.socket --xlator-

[Gluster-users] unable to resolve brick error

2013-07-10 Thread Matthew Sacks
Hello,
I have a gluster cluster which keeps complaining about
ops.c:842:glusterd_op_stage_start_volume] 0-: Unable to resolve brick
gluster1:/export/brick1/sdb1

here is the full output : https://gist.github.com/msacks/5970713

Not sure how this happened or how to fix it.

All my peers are connected.

I tried to stop and start the volume but that didn't work.

Cheers for your help.

Re: [Gluster-users] One node goes offline, the other node can't see the replicated volume anymore

2013-07-10 Thread Greg Scott
> It looks like the brick processes on fw2 machine are not running and
> hence when fw1 is down, the entire replication process is stalled. Can you
> do a ps and get the status of all the gluster processes and
> ensure that the brick process is up on fw2.

I was away from this most of the day.  Here is a ps ax | grep gluster from both 
fw1 and fw2 while both nodes are online.

From fw1:

[root@chicago-fw1 glusterfs]# ps ax | grep gluster
 1019 ?   Ssl   0:09 /usr/sbin/glusterd -p /run/glusterd.pid
 1274 ?   Ssl   0:32 /usr/sbin/glusterfsd -s 192.168.253.1 --volfile-id 
firewall-scripts.192.168.253.1.gluster-fw1 -p 
/var/lib/glusterd/vols/firewall-scripts/run/192.168.253.1-gluster-fw1.pid -S 
/var/run/3eea976403bb07230cae75b885406920.socket --brick-name /gluster-fw1 -l 
/var/log/glusterfs/bricks/gluster-fw1.log --xlator-option 
*-posix.glusterd-uuid=e13d53de-c7ed-4e63-bcb1-dc69ae25cc15 --brick-port 49152 
--xlator-option firewall-scripts-server.listen-port=49152
 1280 ?   Ssl   0:05 /usr/sbin/glusterfs -s localhost --volfile-id 
gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log 
-S /var/run/ec00b40c3ed179eccfdd89f5fcd540cc.socket
 1285 ?   Ssl   0:05 /usr/sbin/glusterfs -s localhost --volfile-id 
gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l 
/var/log/glusterfs/glustershd.log -S 
/var/run/fa9d586a6fab73a52bba6fc92ddd5d91.socket --xlator-option 
*replicate*.node-uuid=e13d53de-c7ed-4e63-bcb1-dc69ae25cc15
12649 ?  Ssl   0:00 /usr/sbin/glusterfs --volfile-id=/firewall-scripts 
--volfile-server=192.168.253.1 /firewall-scripts
12959 pts/1  S+    0:00 grep --color=auto gluster
[root@chicago-fw1 glusterfs]#

And from fw2:

[root@chicago-fw2 gregs]# ps ax | grep gluster
 1027 ?   Ssl   0:11 /usr/sbin/glusterd -p /run/glusterd.pid
 1291 ?   Ssl   0:14 /usr/sbin/glusterfsd -s 192.168.253.2 --volfile-id 
firewall-scripts.192.168.253.2.gluster-fw2 -p 
/var/lib/glusterd/vols/firewall-scripts/run/192.168.253.2-gluster-fw2.pid -S 
/var/run/380dca5c55990acea8ab30f5a08375a7.socket --brick-name /gluster-fw2 -l 
/var/log/glusterfs/bricks/gluster-fw2.log --xlator-option 
*-posix.glusterd-uuid=a2334360-d1d3-40c1-8c0e-7d62a5318899 --brick-port 49152 
--xlator-option firewall-scripts-server.listen-port=49152
 1306 ?   Ssl   0:06 /usr/sbin/glusterfs -s localhost --volfile-id 
gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log 
-S /var/run/12903cdbca94bee4abfc3b4df24e2e61.socket
 1310 ?   Ssl   0:06 /usr/sbin/glusterfs -s localhost --volfile-id 
gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l 
/var/log/glusterfs/glustershd.log -S 
/var/run/a2dee45b1271f43ae8a8d9003567b428.socket --xlator-option 
*replicate*.node-uuid=a2334360-d1d3-40c1-8c0e-7d62a5318899
12663 ?  Ssl   0:01 /usr/sbin/glusterfs --volfile-id=/firewall-scripts 
--volfile-server=192.168.253.2 /firewall-scripts
12958 pts/0  S+    0:00 grep --color=auto gluster
[root@chicago-fw2 gregs]#

-   Greg



Re: [Gluster-users] tips/nest practices for gluster rdma?

2013-07-10 Thread Matthew Nicholson
Ryan,

10 (storage) nodes. I did some tests with 1 brick per node, and another round
with 4 per node. Each is FDR connected, but all on the same switch.

I'd love to hear about your setup, gluster version, OFED stack, etc.



--
Matthew Nicholson
Research Computing Specialist
Harvard FAS Research Computing
matthew_nichol...@harvard.edu



On Wed, Jul 10, 2013 at 4:33 PM, Ryan Aydelott  wrote:

> How many nodes make up that volume that you were using for testing?
>
> Over 100 nodes running at QDR/IPoIB using 100 threads, we ran around
> 60GB/s read and somewhere in the 40GB/s range for writes (IIRC).
>
> On Jul 10, 2013, at 1:49 PM, Matthew Nicholson <
> matthew_nichol...@harvard.edu> wrote:
>
> Well, first of all, thanks for the responses. The volume WAS failing over
> to tcp just as predicted, though WHY is unclear as the fabric is known to
> be working (has about 28K compute cores on it all doing heavy MPI testing on
> it), and the OFED/verbs stack is consistent across all client/storage
> systems (actually, the OS image is identical).
>
> That's quite sad RDMA isn't going to make 3.4. We put a good deal of hopes
> and effort into planning for 3.4 for this storage system, specifically
> for RDMA support (well, with warnings to the team that it wasn't in/tested
> for 3.3 and that all we could do was HOPE it was in 3.4 and in time for
> when we want to go live). We're getting "okay" performance out of IPoIB
> right now, and our bottleneck actually seems to be the fabric
> design/layout, as we're peaking at about 4.2GB/s writing 10TB over 160
> threads to this distributed volume.
>
> When it IS ready and in 3.4.1 (hopefully!), having good docs around it,
> and maybe even a simple printf for the tcp failover would be huge for us.
>
>
>
> --
> Matthew Nicholson
> Research Computing Specialist
> Harvard FAS Research Computing
> matthew_nichol...@harvard.edu
>
>
>
> On Wed, Jul 10, 2013 at 3:18 AM, Justin Clift  wrote:
>
>> Hi guys,
>>
>> As an FYI, from discussion on gluster-devel IRC yesterday, the RDMA code
>> still isn't in a good enough state for production usage with 3.4.0. :(
>>
>> There are still outstanding bugs with it, and I'm working to make the
>> Gluster Test Framework able to work with RDMA so we can help shake out
>> more of them:
>>
>>
>> http://www.gluster.org/community/documentation/index.php/Using_the_Gluster_Test_Framework
>>
>> Hopefully RDMA will be ready for 3.4.1, but don't hold me to that at
>> this stage. :)
>>
>> Regards and best wishes,
>>
>> Justin Clift
>>
>>
>> On 09/07/2013, at 8:36 PM, Ryan Aydelott wrote:
>> > Matthew,
>> >
>> > Personally - I have experienced this same problem (even with the mount
>> being something.rdma). Running 3.4beta4, if I mounted a volume via RDMA
>> that also had TCP configured as a transport option (which obviously you do
>> based on the mounts you gave below), if there is ANY issue with RDMA not
>> working the mount will silently fall back to TCP. This problem is described
>> here: https://bugzilla.redhat.com/show_bug.cgi?id=982757
>> >
>> > The way to test for this behavior is create a new volume specifying
>> ONLY RDMA as the transport. If you mount this and your RDMA is broken for
>> whatever reason - it will simply fail to mount.
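Ryan's RDMA-only smoke test might look like the following; the volume, brick, and mount-point names are placeholders, and both the create syntax and the `transport=rdma` mount option should be double-checked against your release:

```shell
# A throwaway volume with RDMA as the ONLY transport: if RDMA is broken,
# the mount fails loudly instead of silently falling back to tcp.
gluster volume create rdmatest transport rdma node01-ib:/bricks/rdmatest
gluster volume start rdmatest

# Mount over RDMA only, then tail the client log for rdma_cm errors
# if it refuses to mount.
mount -t glusterfs -o transport=rdma node01-ib:/rdmatest /mnt/rdmatest
```

A failed mount here, with `rdma_cm` errors in the volume log, points at the missing kernel module problem Ryan describes below.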
>> >
>> > Assuming this test fails, I would then tail the logs for the volume to
>> get a hint of what's going on. In my case there was an RDMA_CM kernel
>> module that was not loaded which started to matter as of 3.4beta2 IIRC as
>> they did a complete rewrite for this based on poor performance in prior
>> releases. The clue in my volume log file was "no such file or directory"
>> preceded with an rdma_cm.
>> >
>> > Hope that helps!
>> >
>> >
>> > -ryan
>> >
>> >
>> > On Jul 9, 2013, at 2:03 PM, Matthew Nicholson <
>> matthew_nichol...@harvard.edu> wrote:
>> >
>> >> Hey guys,
>> >>
>> >> So, we're testing Gluster RDMA storage, and are having some issues.
>> Things are working...just not as we expected them. There isn't a whole lot
>> in the way of docs that I've found for gluster rdma, aside from basically
>> "install gluster-rdma", create a volume with transport=rdma, and mount w/
>> transport=rdma
>> >>
>> >> I've done that...and the IB fabric is known to be good...however, a
>> volume created with transport=rdma,tcp and mounted w/ transport=rdma, still
>> seems to go over tcp?
>> >>
>> >> A little more info about the setup:
>> >>
>> we've got 10 storage nodes/bricks, each of which has a single 1Gb NIC
>> and an FDR IB port. Likewise for the test clients. Now, the 1Gb NIC is for
>> management only, and we have all of the systems on this fabric configured
>> with IPoIB, so there is eth0, and ib0 on each node.
>> >>
>> >> All storage nodes are peer'd using the ib0 interface, ie:
>> >>
>> >> gluster peer probe storage_node01-ib
>> >> etc
>> >>
>> >> thats all well and good.
>> >>
>> >> Volume was created:
>> >>
>> >> gluster volume create holyscratch transport rdma,tcp
>> holyscratch01-ib:/holyscratch01/brick
>> >> for i in `seq 

Re: [Gluster-users] tips/nest practices for gluster rdma?

2013-07-10 Thread Ryan Aydelott
How many nodes make up that volume that you were using for testing?

Over 100 nodes running at QDR/IPoIB using 100 threads, we ran around 60GB/s 
read and somewhere in the 40GB/s range for writes (IIRC).

On Jul 10, 2013, at 1:49 PM, Matthew Nicholson  
wrote:

> Well, first of all, thanks for the responses. The volume WAS failing over to 
> tcp just as predicted, though WHY is unclear as the fabric is known to be 
> working (has about 28K compute cores on it all doing heavy MPI testing), and 
> the OFED/verbs stack is consistent across all client/storage systems 
> (actually, the OS image is identical). 
> 
> That's quite sad RDMA isn't going to make 3.4. We put a good deal of hopes and 
> effort into planning for 3.4 for this storage system, specifically for 
> RDMA support (well, with warnings to the team that it wasn't in/tested for 3.3 
> and that all we could do was HOPE it was in 3.4 and in time for when we want 
> to go live). We're getting "okay" performance out of IPoIB right now, and our 
> bottleneck actually seems to be the fabric design/layout, as we're peaking 
> at about 4.2GB/s writing 10TB over 160 threads to this distributed volume. 
> 
> When it IS ready and in 3.4.1 (hopefully!), having good docs around it, and 
> maybe even a simple printf for the tcp failover would be huge for us. 
> 
> 
> 
> --
> Matthew Nicholson
> Research Computing Specialist
> Harvard FAS Research Computing
> matthew_nichol...@harvard.edu
> 
> 
> 
> On Wed, Jul 10, 2013 at 3:18 AM, Justin Clift  wrote:
>> Hi guys,
>> 
>> As an FYI, from discussion on gluster-devel IRC yesterday, the RDMA code
>> still isn't in a good enough state for production usage with 3.4.0. :(
>> 
>> There are still outstanding bugs with it, and I'm working to make the
>> Gluster Test Framework able to work with RDMA so we can help shake out
>> more of them:
>> 
>>   
>> http://www.gluster.org/community/documentation/index.php/Using_the_Gluster_Test_Framework
>> 
>> Hopefully RDMA will be ready for 3.4.1, but don't hold me to that at
>> this stage. :)
>> 
>> Regards and best wishes,
>> 
>> Justin Clift
>> 
>> 
>> On 09/07/2013, at 8:36 PM, Ryan Aydelott wrote:
>> > Matthew,
>> >
>> > Personally - I have experienced this same problem (even with the mount 
>> > being something.rdma). Running 3.4beta4, if I mounted a volume via RDMA 
>> > that also had TCP configured as a transport option (which obviously you do 
>> > based on the mounts you gave below), if there is ANY issue with RDMA not 
>> > working the mount will silently fall back to TCP. This problem is 
>> > described here: https://bugzilla.redhat.com/show_bug.cgi?id=982757
>> >
>> > The way to test for this behavior is create a new volume specifying ONLY 
>> > RDMA as the transport. If you mount this and your RDMA is broken for 
>> > whatever reason - it will simply fail to mount.
>> >
>> > Assuming this test fails, I would then tail the logs for the volume to get 
>> > a hint of what's going on. In my case there was an RDMA_CM kernel module 
>> > that was not loaded which started to matter as of 3.4beta2 IIRC as they 
>> > did a complete rewrite for this based on poor performance in prior 
>> > releases. The clue in my volume log file was "no such file or directory" 
>> > preceded with an rdma_cm.
>> >
>> > Hope that helps!
>> >
>> >
>> > -ryan
>> >
>> >
>> > On Jul 9, 2013, at 2:03 PM, Matthew Nicholson 
>> >  wrote:
>> >
>> >> Hey guys,
>> >>
>> >> So, we're testing Gluster RDMA storage, and are having some issues. 
>> >> Things are working...just not as we expected them. There isn't a whole 
>> >> lot in the way of docs that I've found for gluster rdma, aside from 
>> >> basically "install gluster-rdma", create a volume with transport=rdma, 
>> >> and mount w/ transport=rdma
>> >>
>> >> I've done that...and the IB fabric is known to be good...however, a 
>> >> volume created with transport=rdma,tcp and mounted w/ transport=rdma, 
>> >> still seems to go over tcp?
>> >>
>> >> A little more info about the setup:
>> >>
>> >> we've got 10 storage nodes/bricks, each of which has a single 1Gb NIC and 
>> >> an FDR IB port. Likewise for the test clients. Now, the 1Gb NIC is for 
>> >> management only, and we have all of the systems on this fabric configured 
>> >> with IPoIB, so there is eth0, and ib0 on each node.
>> >>
>> >> All storage nodes are peer'd using the ib0 interface, ie:
>> >>
>> >> gluster peer probe storage_node01-ib
>> >> etc
>> >>
>> >> thats all well and good.
>> >>
>> >> Volume was created:
>> >>
>> >> gluster volume create holyscratch transport rdma,tcp 
>> >> holyscratch01-ib:/holyscratch01/brick
>> >> for i in `seq -w 2 10` ; do gluster volume add-brick holyscratch 
>> >> holyscratch${i}-ib:/holyscratch${i}/brick; done
>> >>
>> >> yielding:
>> >>
>> >> Volume Name: holyscratch
>> >> Type: Distribute
>> >> Volume ID: 788e74dc-6ae2-4aa5-8252-2f30262f0141
>> >> Status: Started
>> >> Number of Bricks: 10
>> >> Transport-type: tcp,rdma
>> >> Bricks:
>> >> Brick1: holyscrat

Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Joe Landman

On 07/10/2013 03:18 PM, Joe Julian wrote:


The "small file" complaint is all about latency though. There's very
little disk overhead (all inode lookups) to doing a self-heal check. "ls
-l" on a 50k file directory and nearly all the delay is from network RTT
for self-heal checks (check that with wireshark).



Try it with localhost.  Build a small test gluster brick, take 
networking out of the loop, create 50k files, and launch the self-heal. 
RTT is part of it, but not the majority (last I checked it wasn't a 
significant fraction relative to other metadata bits).


I did an experiment with 3.3.x a while ago with 2x ramdisks: I created a 
set of files, looped them back with losetup, built xfs filesystems atop them, 
mirrored them with glusterfs, and then set about doing metadata/small-file 
heavy workloads.  Performance was still abysmal.  Pretty sure none 
of that was RTT.  Definitely a stack-traversal problem, but I didn't 
trace it far enough back to be definitively sure where it was.
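Joe's ramdisk experiment can be reproduced in outline like this; sizes, paths, and the hostname are placeholders, everything needs root, and gluster may insist on a resolvable hostname rather than localhost for the bricks:

```shell
# Two ramdisk-backed images, looped back, xfs on top: a zero-network
# replica pair for separating stack overhead from RTT.
truncate -s 2G /dev/shm/img1 /dev/shm/img2
losetup /dev/loop1 /dev/shm/img1
losetup /dev/loop2 /dev/shm/img2
mkfs.xfs /dev/loop1 && mkdir -p /mnt/b1 && mount /dev/loop1 /mnt/b1
mkfs.xfs /dev/loop2 && mkdir -p /mnt/b2 && mount /dev/loop2 /mnt/b2

# Mirror the two local bricks, then drive a small-file workload at the
# FUSE mount; any remaining latency is stack, not network.
gluster volume create ramtest replica 2 myhost:/mnt/b1/brick myhost:/mnt/b2/brick
gluster volume start ramtest
mkdir -p /mnt/ramtest && mount -t glusterfs myhost:/ramtest /mnt/ramtest
```

Comparing a 50k-file `ls -l` here against the same run over a real network isolates how much of the delay RTT actually accounts for.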



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/siflash
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Joe Julian


On 07/10/2013 11:51 AM, Joe Landman wrote:

On 07/10/2013 02:36 PM, Joe Julian wrote:


1) http://www.solarflare.com makes sub-microsecond-latency adapters that
can utilize a userspace driver pinned to the CPU doing the request,
eliminating a context switch


We've used open-onload in the past on Solarflare hardware.  And with 
GlusterFS.


Just say no.  Seriously.  You don't want to go there.

Bummer. That sounded like an interesting idea.



2) http://www.aristanetworks.com/en/products/7100t is a 2.5 microsecond
switch


Neither choice will impact overall performance much for GlusterFS, 
even in heavily loaded situations.


What impacts performance more than anything else is node/brick design, 
implementation, and specific choices in that mix.  Storage latency, 
bandwidth, and overall design will be more impactful than low latency 
networking.  Distribution, kernel and filesystem choices (including 
layout, lower level features, etc.) will matter significantly more 
than low latency networking.  You can completely remove the networking 
impact by trying your changes out on localhost, and seeing what 
impact your design changes have.


If you don't start out with a fast box, you are not going to have fast 
aggregated storage.  This observation has not changed since the pre-2.0 
GlusterFS days (it's as true today as it was years ago).


The "small file" complaint is all about latency though. There's very 
little disk overhead (all inode lookups) to doing a self-heal check. "ls 
-l" on a 50k file directory and nearly all the delay is from network RTT 
for self-heal checks (check that with wireshark).
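The RTT cost is easy to sanity-check with arithmetic: one lookup round trip per directory entry. A sketch; the 200-microsecond RTT is an assumed, GbE-ish figure:

```shell
# Back-of-envelope cost of per-entry self-heal checks at a given RTT.
entries=50000
rtt_us=200   # assumed ~0.2 ms LAN round trip

total_us=$((entries * rtt_us))
total_s=$((total_us / 1000000))
echo "$entries entries x ${rtt_us}us RTT = ${total_s}s of pure network wait"
```

Ten seconds of nothing but round trips for a single 50k-entry listing, which lines up with the "ls -l" experience described above.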



Re: [Gluster-users] tips/best practices for gluster rdma?

2013-07-10 Thread Justin Clift
On 10/07/2013, at 8:05 PM, Matthew Nicholson wrote:
> justin,
> 
> yeah, this fabric is all brand-new Mellanox, and all nodes are running their 
> v2 stack.

Cool.  The only thing that worries me about the v2 stack is that they've dropped
SDP support.  SDP seemed to have limited scope (speeding up IPoIB), but for that
specific use case it was apparently brilliant.

Some people are quoting much higher throughput numbers with v1 + SDP vs IPoIB on
v2:

  http://community.mellanox.com/message/1924#1852

This is just from what I've read though, no hands-on experience with them. :(


> Oh, for a bug report, sure thing. I was thinking I would tack on a comment 
> here:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=982757
> 
> since that's about the silent failure.

Good thought, that would work. :)

Regards and best wishes,

Justin Clift

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift

___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] tips/best practices for gluster rdma?

2013-07-10 Thread Matthew Nicholson
justin,

yeah, this fabric is all brand-new Mellanox, and all nodes are running their
v2 stack.

Oh, for a bug report, sure thing. I was thinking I would tack on a comment
here:

https://bugzilla.redhat.com/show_bug.cgi?id=982757

since that's about the silent failure.

--
Matthew Nicholson
Research Computing Specialist
Harvard FAS Research Computing
matthew_nichol...@harvard.edu



On Wed, Jul 10, 2013 at 3:00 PM, Justin Clift  wrote:

> On 10/07/2013, at 7:49 PM, Matthew Nicholson wrote:
> > Well, first of all, thanks for the responses. The volume WAS failing over
> to tcp just as predicted, though WHY is unclear, as the fabric is known
> working (has about 28K compute cores on it all doing heavy MPI testing on
> it), and the OFED/verbs stack is consistent across all client/storage
> systems (actually, the OS image is identical).
> >
> > That's quite sad that RDMA isn't going to make 3.4. We put a good deal of
> hopes and effort around planning for 3.4 for this storage system,
> specifically for RDMA support (well, with warnings to the team that it
> wasn't in/tested for 3.3 and that all we could do was HOPE it was in 3.4 and
> in time for when we want to go live). We're getting "okay" performance out
> of IPoIB right now, and our bottleneck actually seems to be the fabric
> design/layout, as we're peaking at about 4.2GB/s writing 10TB over 160
> threads to this distributed volume.
>
> Out of curiosity, are you running the stock OS provided infiniband stack,
> or are you using the "vendor optimised" version?  (eg "Mellanox OFED" if
> you're using Mellanox cards)
>
> Asking because although I've not personally done any perf measurements
> between them, Mellanox swears the new v2 of their OFED stack is much higher
> performance than both the stock drivers or their v1 stack.  IPoIB is
> especially tuned.
>
> I'd really like to get around to testing that some time, but it won't be
> soon. :(
>
>
> > When it IS ready and in 3.4.1 (hopefully!), having good docs around it,
> and maybe even a simple printf for the tcp failover would be huge for us.
>
> Would you be ok to create a Bugzilla ticket, asking for that printf item?
>
>
> https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS&component=rdma
>
> It doesn't have to be complicated or super in depth or anything. :)
>
> Asking because when something is a ticket, the "task" is much easier to
> hand
> to someone so it gets done.
>
> If that's too much effort though, just tell me what you'd like as the
> ticket
> summary line + body text and I'll go create it. :)
>
> Regards and best wishes,
>
> Justin Clift
>
> --
> Open Source and Standards @ Red Hat
>
> twitter.com/realjustinclift
>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] tips/best practices for gluster rdma?

2013-07-10 Thread Justin Clift
On 10/07/2013, at 7:49 PM, Matthew Nicholson wrote:
> Well, first of all, thanks for the responses. The volume WAS failing over to 
> tcp just as predicted, though WHY is unclear, as the fabric is known working 
> (has about 28K compute cores on it all doing heavy MPI testing on it), and 
> the OFED/verbs stack is consistent across all client/storage systems 
> (actually, the OS image is identical). 
> 
> That's quite sad that RDMA isn't going to make 3.4. We put a good deal of hopes 
> and effort around planning for 3.4 for this storage system, specifically for 
> RDMA support (well, with warnings to the team that it wasn't in/tested for 3.3 
> and that all we could do was HOPE it was in 3.4 and in time for when we want 
> to go live). We're getting "okay" performance out of IPoIB right now, and our 
> bottleneck actually seems to be the fabric design/layout, as we're peaking 
> at about 4.2GB/s writing 10TB over 160 threads to this distributed volume.

Out of curiosity, are you running the stock OS provided infiniband stack, or 
are you using the "vendor optimised" version?  (eg "Mellanox OFED" if you're 
using Mellanox cards)

Asking because although I've not personally done any perf measurements between 
them, Mellanox swears the new v2 of their OFED stack is much higher performance 
than both the stock drivers or their v1 stack.  IPoIB is especially tuned.

I'd really like to get around to testing that some time, but it won't be soon. 
:(


> When it IS ready and in 3.4.1 (hopefully!), having good docs around it, and 
> maybe even a simple printf for the tcp failover would be huge for us.

Would you be ok to create a Bugzilla ticket, asking for that printf item?

  https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS&component=rdma

It doesn't have to be complicated or super in depth or anything. :)

Asking because when something is a ticket, the "task" is much easier to hand
to someone so it gets done.

If that's too much effort though, just tell me what you'd like as the ticket
summary line + body text and I'll go create it. :)

Regards and best wishes,

Justin Clift

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift

___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Joe Landman

On 07/10/2013 02:36 PM, Joe Julian wrote:


1) http://www.solarflare.com makes sub microsecond latency adapters that
can utilize a userspace driver pinned to the cpu doing the request
eliminating a context switch


We've used open-onload in the past on Solarflare hardware.  And with 
GlusterFS.


Just say no.  Seriously.  You don't want to go there.


2) http://www.aristanetworks.com/en/products/7100t is a 2.5 microsecond
switch


Neither choice will impact overall performance much for GlusterFS, even 
in heavily loaded situations.


What impacts performance more than anything else is node/brick design, 
implementation, and specific choices in that mix.  Storage latency, 
bandwidth, and overall design will be more impactful than low latency 
networking.  Distribution, kernel and filesystem choices (including 
layout, lower level features, etc.) will matter significantly more than 
low latency networking.  You can completely remove the networking impact 
by trying your changes out on localhost and seeing what impact your 
design changes have.


If you don't start out with a fast box, you are not going to have fast 
aggregated storage.  This observation has not changed since the pre 2.0 
GlusterFS days (it's as true today as it was years ago).


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/siflash
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] tips/best practices for gluster rdma?

2013-07-10 Thread Matthew Nicholson
Well, first of all, thanks for the responses. The volume WAS failing over to
tcp just as predicted, though WHY is unclear, as the fabric is known working
(has about 28K compute cores on it all doing heavy MPI testing on it), and
the OFED/verbs stack is consistent across all client/storage systems
(actually, the OS image is identical).

That's quite sad that RDMA isn't going to make 3.4. We put a good deal of hopes
and effort around planning for 3.4 for this storage system, specifically
for RDMA support (well, with warnings to the team that it wasn't in/tested
for 3.3 and that all we could do was HOPE it was in 3.4 and in time for
when we want to go live). We're getting "okay" performance out of IPoIB
right now, and our bottleneck actually seems to be the fabric
design/layout, as we're peaking at about 4.2GB/s writing 10TB over 160
threads to this distributed volume.

When it IS ready and in 3.4.1 (hopefully!), having good docs around it, and
maybe even a simple printf for the tcp failover would be huge for us.



--
Matthew Nicholson
Research Computing Specialist
Harvard FAS Research Computing
matthew_nichol...@harvard.edu



On Wed, Jul 10, 2013 at 3:18 AM, Justin Clift  wrote:

> Hi guys,
>
> As an FYI, from discussion on gluster-devel IRC yesterday, the RDMA code
> still isn't in a good enough state for production usage with 3.4.0. :(
>
> There are still outstanding bugs with it, and I'm working to make the
> Gluster Test Framework able to work with RDMA so we can help shake out
> more of them:
>
>
> http://www.gluster.org/community/documentation/index.php/Using_the_Gluster_Test_Framework
>
> Hopefully RDMA will be ready for 3.4.1, but don't hold me to that at
> this stage. :)
>
> Regards and best wishes,
>
> Justin Clift
>
>
> On 09/07/2013, at 8:36 PM, Ryan Aydelott wrote:
> > Matthew,
> >
> > Personally - I have experienced this same problem (even with the mount
> being something.rdma). Running 3.4beta4, if I mounted a volume via RDMA
> that also had TCP configured as a transport option (which obviously you do
> based on the mounts you gave below), if there is ANY issue with RDMA not
> working the mount will silently fall back to TCP. This problem is described
> here: https://bugzilla.redhat.com/show_bug.cgi?id=982757
> >
> > The way to test for this behavior is create a new volume specifying ONLY
> RDMA as the transport. If you mount this and your RDMA is broken for
> whatever reason - it will simply fail to mount.
> >
> > Assuming this test fails, I would then tail the logs for the volume to
> get a hint of what's going on. In my case there was an RDMA_CM kernel
> module that was not loaded which started to matter as of 3.4beta2 IIRC as
> they did a complete rewrite for this based on poor performance in prior
> releases. The clue in my volume log file was "no such file or directory"
> preceded with an rdma_cm.
> >
> > Hope that helps!
> >
> >
> > -ryan
> >
> >
> > On Jul 9, 2013, at 2:03 PM, Matthew Nicholson <
> matthew_nichol...@harvard.edu> wrote:
> >
> >> Hey guys,
> >>
> >> So, we're testing Gluster RDMA storage, and are having some issues.
> Things are working... just not as we expected. There isn't a whole lot
> in the way of docs that I've found for gluster rdma, aside from basically
> "install gluster-rdma", create a volume with transport=rdma, and mount w/
> transport=rdma
> >>
> >> I've done that...and the IB fabric is known to be good...however, a
> volume created with transport=rdma,tcp and mounted w/ transport=rdma, still
> seems to go over tcp?
> >>
> >> A little more info about the setup:
> >>
> >> we've got 10 storage nodes/bricks, each of which has a single 1Gb NIC
> and an FDR IB port. Likewise for the test clients. Now, the 1Gb NIC is for
> management only, and we have all of the systems on this fabric configured
> with IPoIB, so there is eth0, and ib0 on each node.
> >>
> >> All storage nodes are peer'd using the ib0 interface, ie:
> >>
> >> gluster peer probe storage_node01-ib
> >> etc
> >>
> >> thats all well and good.
> >>
> >> Volume was created:
> >>
> >> gluster volume create holyscratch transport rdma,tcp
> holyscratch01-ib:/holyscratch01/brick
> >> for i in `seq -w 2 10` ; do gluster volume add-brick holyscratch
> holyscratch${i}-ib:/holyscratch${i}/brick; done
> >>
> >> yielding:
> >>
> >> Volume Name: holyscratch
> >> Type: Distribute
> >> Volume ID: 788e74dc-6ae2-4aa5-8252-2f30262f0141
> >> Status: Started
> >> Number of Bricks: 10
> >> Transport-type: tcp,rdma
> >> Bricks:
> >> Brick1: holyscratch01-ib:/holyscratch01/brick
> >> Brick2: holyscratch02-ib:/holyscratch02/brick
> >> Brick3: holyscratch03-ib:/holyscratch03/brick
> >> Brick4: holyscratch04-ib:/holyscratch04/brick
> >> Brick5: holyscratch05-ib:/holyscratch05/brick
> >> Brick6: holyscratch06-ib:/holyscratch06/brick
> >> Brick7: holyscratch07-ib:/holyscratch07/brick
> >> Brick8: holyscratch08-ib:/holyscratch08/brick
> >> Brick9: holyscratch09-ib:/holyscratch09/brick
> >> Brick10: holyscratch10-ib:/holyscratch10

Re: [Gluster-users] fuse 3.3.1 fails/crashes on flush on Distributed-Striped-Replicate volume

2013-07-10 Thread Kushnir, Michael (NIH/NLM/LHC) [C]
I had the same problem with striped-replicated. 

https://bugzilla.redhat.com/show_bug.cgi?id=861423

Best, 
Michael 

-Original Message-
From: Benedikt Fraunhofer 
[mailto:benedikt.fraunhofer.l.gluster.fxy-3zz-...@traced.net] 
Sent: Monday, July 08, 2013 3:43 AM
To: Gluster-users@gluster.org
Subject: [Gluster-users] fuse 3.3.1 fails/crashes on flush on 
Distributed-Striped-Replicate volume

Hello List,

before filing a bug I wanted to check with the community if this is a known 
issue.

I'm running gluster 3.3.1 from the semiosis PPA on ubuntu 12.04.
I created a 2x2x2 volume:

gluster> volume info

Volume Name: gv0-s2-r2
Type: Distributed-Striped-Replicate
Volume ID: 8e1f1e86-63c0-4c27-9c90-fef92bdd0ca1
Status: Started
Number of Bricks: 2 x 2 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 172.31.32.33:/gluster/brick0
Brick2: 172.31.32.34:/gluster/brick0
Brick3: 172.31.32.37:/gluster/brick0
Brick4: 172.31.32.38:/gluster/brick0
Brick5: 172.31.32.39:/gluster/brick0
Brick6: 172.31.32.40:/gluster/brick0
Brick7: 172.31.32.42:/gluster/brick0
Brick8: 172.31.32.47:/gluster/brick0

which mounts fine, but it started to smell fishy when vim wasn't able to write 
its swapfile, but I ignored that.
I hacked together a stress script which makes the fuse mount instantly 
unusable. The script just creates random file and directory names and tries 
to write data to them:
Essentially it does something like:
 mkdir /mnt/a/C/C2/C2s/
 touch /mnt/a/C/C2/C2s/C2syqS30tsA5sctL2C8j22zd04VWsZDj3fZSWCK8Uo
 head -c 12837 /dev/zero >
/mnt/a/C/C2/C2s/C2syqS30tsA5sctL2C8j22zd04VWsZDj3fZSWCK8Uo

which results in:
 stress.sh: line 28:
/mnt/a/y/yK/yK2/yK2ujHHPRJYukCaqFhYkAN5MVDJxoAmTu89nM5auLA: Software caused 
connection abort

the following calls result in:
 mkdir: cannot create directory `/mnt/a/l': Transport endpoint is not connected
because the mount is gone. I thought I'd hit the ext4 64-bit hash bug and
reformatted to xfs, but same game.

note that this only happens when done quickly; if you wait between directory 
creation, touch and write it works:

root@gluster-71:/mnt/a# mkdir -p a/b/c/d
root@gluster-71:/mnt/a# touch a/b/c/d/e
root@gluster-71:/mnt/a# head -c 32 /dev/zero > a/b/c/d/e
root@gluster-71:/mnt/a# mkdir -p b/c/d/; touch b/c/d/e; head -c 32 /dev/zero >b/c/d/e
-bash: b/c/d/e: Software caused connection abort

logging says:

[2013-07-05 15:24:05.764677] I [fuse-bridge.c:3376:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.19
[2013-07-05 15:24:05.765679] I
[afr-common.c:1965:afr_set_root_inode_on_first_lookup]
0-gv0-s2-r2-replicate-0: added root inode
[2013-07-05 15:24:05.765747] I
[afr-common.c:1965:afr_set_root_inode_on_first_lookup]
0-gv0-s2-r2-replicate-1: added root inode
[2013-07-05 15:24:05.767220] I
[afr-common.c:1965:afr_set_root_inode_on_first_lookup]
0-gv0-s2-r2-replicate-3: added root inode
[2013-07-05 15:24:05.767304] I
[afr-common.c:1965:afr_set_root_inode_on_first_lookup]
0-gv0-s2-r2-replicate-2: added root inode
[2013-07-05 15:24:09.709367] I [dict.c:317:dict_get]
(-->/usr/lib/glusterfs/3.3.1/xlator/cluster/replicate.so(afr_create_unwind+0x13c)
[0x7fd39d7262ac]
(-->/usr/lib/glusterfs/3.3.1/xlator/cluster/stripe.so(stripe_create_cbk+0x5fb)
[0x7fd39d507bab]
(-->/usr/lib/glusterfs/3.3.1/xlator/cluster/stripe.so(stripe_ctx_handle+0x90)
[0x7fd39d510150]))) 0-dict: !this ||
key=trusted.gv0-s2-r2-stripe-1.stripe-size
[2013-07-05 15:24:09.709446] E
[stripe-helpers.c:268:stripe_ctx_handle] 0-gv0-s2-r2-stripe-1: Failed to get 
stripe-size
[2013-07-05 15:24:09.714424] W [fuse-bridge.c:968:fuse_err_cbk]
0-glusterfs-fuse: 17: FLUSH() ERR => -1 (Invalid argument)
[2013-07-05 15:24:57.247208] I [dict.c:317:dict_get]
(-->/usr/lib/glusterfs/3.3.1/xlator/cluster/replicate.so(afr_create_unwind+0x13c)
[0x7fd39d7262ac]
(-->/usr/lib/glusterfs/3.3.1/xlator/cluster/stripe.so(stripe_create_cbk+0x5fb)
[0x7fd39d507bab]
(-->/usr/lib/glusterfs/3.3.1/xlator/cluster/stripe.so(stripe_ctx_handle+0x90)
[0x7fd39d510150]))) 0-dict: !this ||
key=trusted.gv0-s2-r2-stripe-0.stripe-size
[2013-07-05 15:24:57.247436] E
[stripe-helpers.c:268:stripe_ctx_handle] 0-gv0-s2-r2-stripe-0: Failed to get 
stripe-size

pending frames:
frame : type(1) op(TRUNCATE)
frame : type(1) op(OPEN)
frame : type(1) op(FLUSH)
frame : type(1) op(FLUSH)
frame : type(1) op(FLUSH)
frame : type(1) op(FLUSH)

patchset: git://git.gluster.com/glusterfs.git
signal received: 8
time of crash: 2013-07-05 15:24:57
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.1
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0)[0x7fd3a1d224a0]
/usr/lib/glusterfs/3.3.1/xlator/cluster/stripe.so(stripe_truncate+0x28c)[0x7fd39d50287c]
/usr/lib/glusterfs/3.3.1/xlator/cluster/distribute.so(dht_truncate+0x172)[0x7fd39d2dd8c2]
/usr/lib/glusterfs/3.3.1/xlator/performance/write-behind.so(wb_truncate+0x49e)[0x7fd39d0a769e]
/usr/lib/glusterfs/3.3.1/

Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Joe Julian

My minimal donation:

On 07/10/2013 04:01 AM, Allan Latham wrote:

There seems to be a problem with the way gluster is going.
For me it would be an ideal solution if it actually worked.
Actually working is always the ideal. Actually working for all possible 
use cases... may be a little more difficult (though still ideal).

I have a simple scenario and it just simply doesn't work. Reading over
the network when the file is available locally is plainly wrong. Our
application cannot take the performance hit nor the extra network traffic.

It's not "wrong", just not the way you envision it.

Typically, in a scaled scenario where clustered storage has the 
strongest advantage, you'll have a limited number of storage servers and 
a much greater number of application servers. The likelihood that any of 
those application servers is going to have the file they want locally, 
even if they're shared-use, is pretty slim. Engineering for that 
probability is the "correct" solution in that use case.

I would suggest:

1. get a simple minimalist configuration working - 2 hosts and
replication only.
2. make it bomb-proof.
2a. it must cope with network failures, random reboots etc.
2b. if it stops it has to auto-recover quickly.
So far, all done within reasonable parameters. "bomb proof" is an 
obvious exaggeration and is unattainable. If you literally blow up all 
your servers, you're going to lose data.

2c. if it can't it needs thorough documentation and adequate logs so a
reasonable sysop can rescue it.
Define "reasonable sysop". Recovering from any failure that isn't 
automatic is going to require a certain amount of understanding about 
clustering, split-brain, and split-brain recovery. That's not your 
typical first-tier sysop, IMHO.

2d. it needs a fast validation scanner which verifies that data is where
it should be and is identical everywhere (md5sum).

md5sum isn't the fastest checksum algorithm.
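As a sketch of what such a point-2d scan could look like (brick paths and demo data below are made up; md5sum could be swapped for a faster non-cryptographic hash):

```shell
# Checksum every file under two replica bricks and diff the listings.
digest_tree() {
  # Print "hash  ./relative/path" for every file, sorted by path so the
  # two listings line up.
  (cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r md5sum)
}

# Demo against two throwaway "bricks" (stand-ins for real brick dirs):
a=$(mktemp -d); b=$(mktemp -d)
echo same > "$a/f1"; echo same > "$b/f1"
echo one  > "$a/f2"; echo two  > "$b/f2"   # deliberate mismatch

da=$(mktemp); db=$(mktemp)
digest_tree "$a" > "$da"
digest_tree "$b" > "$db"
if diff "$da" "$db" > /dev/null
then echo "replicas match"
else echo "replicas differ"
fi
rm -rf "$a" "$b"; rm -f "$da" "$db"
```

On a live volume you would also have to account for files changing mid-scan, which is exactly what makes a scanner like this hard to make both fast and correct.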

3. make it efficient (read local whenever possible - use rsync
techniques - remove scalability obstacles so it doesn't get
exponentially slower as more files are replicated)
See earlier point about scaled systems. Also it does not get 
"exponentially slower as more files are replicated". That would be silly.

4. when that works expand to multiple hosts and clever distribution
techniques.
(repeat items 2 and 3 in the more complex environment)

If it doesn't work rock solid in a simple scenario it will never work in
a large scale cluster.
Not necessarily true. That's like comparing apples to orchards.


Until point 3 is reached I cannot use it - which is a great
disappointment for me as well as the good guys doing the development.
Consider expanding your thinking to bits you have more control over. 
Network latency is probably the biggest. Consider using low-latency 
10Gig cards(1) and switches(2) or infiniband.


Good luck and thanks again

Allan
1) http://www.solarflare.com makes sub microsecond latency adapters that 
can utilize a userspace driver pinned to the cpu doing the request 
eliminating a context switch
2) http://www.aristanetworks.com/en/products/7100t is a 2.5 microsecond 
switch
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Brian Candler

On 10/07/2013 13:58, Jeff Darcy wrote:



2d. it needs a fast validation scanner which verifies that data is where
it should be and is identical everywhere (md5sum).


How fast is fast?  What would be an acceptable time for such a scan on 
a volume containing (let's say) ten million files?


What would be nice is if the underlying filesystem had crypto checksums 
already (I'm thinking of ZFS) and exposed this information to the upper 
layers.

___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Jeff Darcy

On 07/10/2013 11:11 AM, Allan Latham wrote:

Correct me if I'm wrong but geo-replication is master/slave?


It is, today.  Multi-way is under development, but by its nature won't ever
have the same consistency guarantees as synchronous replication.

___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread HL

Been there ...
here is my 10-cent advice:
a) Prepare for tomorrow
b) Rest
c) Think
d) Plan
e) act

I am sure it will work for you once you've calmed down.

Tech hints:
ifconfig iface mtu 9000, or whatever your NIC can afford.
Having a 100Mbit link is not a good idea.
I've recently located a dual-port 1Gbit NIC on eBay for $15 USD.
Get them.

And last but not least
In case you happen to have a switch between the nodes make sure that you 
enable jumbo frames on it.

Otherwise you are in DEEP Trouble.
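A concrete version of that hint (interface and peer names are placeholders; raising the MTU needs root plus jumbo-frame support on every switch port in the path):

```shell
# Placeholder names throughout; the real commands would be:
#   ip link set dev eth1 mtu 9000
#   ping -M do -s $((9000 - 28)) -c 3 peer-host
# The don't-fragment ping payload is MTU minus 28 bytes of IP+ICMP
# headers; if it fails, something in the path is not passing jumbo
# frames. The lines below just compute that payload size:
mtu=9000   # target MTU (assumption)
echo "largest DF ping payload at MTU $mtu: $((mtu - 28)) bytes"
```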

stat'ing a 120G file in my case takes milliseconds, not even seconds.

GL
Harry.


On 10/07/2013 02:01 μμ, Allan Latham wrote:

Hi all

Thanks to all those volunteers who are working to get gluster into a
state where it can be used for live work.

I understand that you are giving your free time and I very much
appreciate it on this project and the many others we use for live
production work.

There seems to be a problem with the way gluster is going.
For me it would be an ideal solution if it actually worked.

I have a simple scenario and it just simply doesn't work. Reading over
the network when the file is available locally is plainly wrong. Our
application cannot take the performance hit nor the extra network traffic.

I would suggest:

1. get a simple minimalist configuration working - 2 hosts and
replication only.
2. make it bomb-proof.
2a. it must cope with network failures, random reboots etc.
2b. if it stops it has to auto-recover quickly.
2c. if it can't it needs thorough documentation and adequate logs so a
reasonable sysop can rescue it.
2d. it needs a fast validation scanner which verifies that data is where
it should be and is identical everywhere (md5sum).
3. make it efficient (read local whenever possible - use rsync
techniques - remove scalability obstacles so it doesn't get
exponentially slower as more files are replicated)
4. when that works expand to multiple hosts and clever distribution
techniques.
(repeat items 2 and 3 in the more complex environment)

If it doesn't work rock solid in a simple scenario it will never work in
a large scale cluster.

Until point 3 is reached I cannot use it - which is a great
disappointment for me as well as the good guys doing the development.

Good luck and thanks again

Allan


On 04/07/13 13:10, Allan Latham wrote:

Hi all

Does anyone use read-subvolume?

Has anyone tested read-subvolume?

Does read-subvolume work in such a way that if the file is present on
the local node the local copy is used rather than a remote one?

Alternatively is there any way to configure (or patch) gluster to always
prefer the local file?

I have read everything available and have found no answer.

Unison works very well in our environment but is not real time and needs
to be run every few minutes and/or be kicked off with inotify.

If I could get gluster to always read the local copy it would be a much
better drop in replacement.

This is a small scale deployment not a massive cluster but I can imagine
there are many potential users of gluster in this mode. It should beat
unison and similar solutions in every way - but it doesn't because it is
reading from the network even when it has a local up-to-date copy. This
can't be intended behaviour.

So what have I configured wrong?

Thanks in advance

Allan


On 02/07/13 13:38, Allan Latham wrote:

Hi everyone

I have installed 3.3.1-1 from the Debian repository you provide.

I am using a simple 2 node cluster and running in replication mode. The
connection between the nodes is limited to 100Mb/sec (that's bits, not
bytes!). Usage will be mainly for read access and since there is always
a local copy available [ exactly 2 replicas on exactly 2 machines ] I
expect very fast read performance. Writes are low volume and very
infrequent - performance is not an issue.

Almost everything works as I would expect.

Write speed is limited to 10MB (bytes) per second, which is what I would
expect and is adequate for the application.

But read speed is either super fast or 10MB/sec, i.e. read operations
take place on the local copy or the remote seemingly at random.

This not the 'small files problem'. I am aware that Gluster must use
network access for stat() etc. This is all about where the data comes
from on a read(). If I do an md5sum on a 200MB file it takes either half
a second or 18 seconds.

There is an option read-subvolume.

I have tried to understand how this works from the documentation
available and from the few examples on the web.

I have added the option using:

gluster volume set X read-subvolume Y

It has no effect even after stopping and starting the volume,
remounting, restarting gluster servers etc.

What's more I fail to see how this option could ever work at all. The
configuration changes caused by the above command are rolled out to both
nodes - but what is right for one node is exactly the wrong
configuration for the other node.

Configs attached are in /var/lib/glusterd/vols/shared except
glusterd.vol which is in /etc/glusterfs.

Here is the ou

Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Marcus Bointon
On 10 Jul 2013, at 17:11, Allan Latham  wrote:

> In short 'sync' replication is not an absolute must but we do use
> master/master quite a bit.

That's why I'm using gluster too. I'm running web servers that allow uploads, 
and if you're going to maintain a no-stickiness policy for balancing and 
failover performance, a write followed immediately by a read needs to work 
across any mixture of servers. That's what's stopping me using other solutions 
instead of gluster. If only I could get it to work reliably!

Marcus
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Allan Latham
Hi Jeff

OK - I've downloaded the source and I'm setting up to compile it for
Debian Wheezy.

I'll let you know how I get on. Maybe next week before I can run
preliminary tests.

Correct me if I'm wrong but geo-replication is master/slave?

We could maybe go with this in some scenarios as updates should be
controlled in any case so nominating a particular virtual server to be
the master is no problem. This is the 'roll out the same html files on
all the web servers' scenario. The web servers themselves mount this
volume RO. We have other similar config file scenarios.

Another use is really master/master, and this is where we copy a backup
to a directory under unison control, knowing that it will be safely on
two physical machines within a minute or two. We also use this as a lazy
way to move files between physical servers.

Later I want to also write log files to gluster so I know I've got them
on the other server if one of them goes down. The nearer this is to
'sync mode' the better but I can't stop logging just because the peer
machine is down or can't keep up. Unison can do this (I don't do it yet)
because it's async.

In short 'sync' replication is not an absolute must but we do use
master/master quite a bit.

I need to read up more on geo-replication.

All the best

Allan

On 10/07/13 15:39, Jeff Darcy wrote:
> On 07/10/2013 09:20 AM, Allan Latham wrote:
>> Where do I get a version that will solve my 'read local if we have the
>> file here' problem.
> 
> I would say 3.4 is already far better than 3.3 not only in terms of
> features but stability/maintainability/etc. as well, even though it's
> technically not out of beta yet.  You can get it here:
> 
> http://download.gluster.org/pub/gluster/glusterfs/3.4/3.4.0beta4/
> 
> Selection of the local subvolume (if there is one) should happen
> automatically, and multiple users have reported that this is in fact the
> case.  You can also use the read-subvolume-index option to gain explicit
> control over this decision at mount time (via the --xlator-option hook).
> 
>> My use case is exactly two servers at a server farm with 100Mbit between
>> them. This 100Mbit is also shared with the outside internet. Hence the
>> need to minimise use of this very limited resource.
> 
> That's very similar to the use case for the person who submitted the
> read-subvolume-index patch.
> 
>> We are still in evaluation. Current 'best' is what we are familiar with
>> = unison and inotify. I don't like it because it's really only a hack.
>> However it works. If inotify misses a change due to race conditions
>> unison gets run every five minutes anyway.
> 
> If you're running with that level of eventual consistency already, might
> geo-replication be a better fit for your needs?  It's a separate feature
> from normal (synchronous) replication, described in section 8 of the 3.3
> admin guide.
> 
> http://www.gluster.org/wp-content/uploads/2012/05/Gluster_File_System-3.3.0-Administration_Guide-en-US.pdf
> 
> 
> Unfortunately, I can't find the 3.4 admin guide on the Gluster site
> (another pet peeve) or I'd provide a link to that.
> 
> 
> 

___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Gluster Self Heal

2013-07-10 Thread HL

On 10/07/2013 04:05 PM, Vijay Bellur wrote:

A lot of volumes or a lot of delta to self-heal could trigger this crash.

3.3.2 containing this fix should be out real soon now. Appreciate 
your patience in this regard.


Thanks,
Vijay 

I hope this update will reach the debian wheezy repo.

Regards.
Harry


Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Jeff Darcy

On 07/10/2013 09:20 AM, Allan Latham wrote:

Where do I get a version that will solve my 'read local if we have the
file here' problem.


I would say 3.4 is already far better than 3.3 not only in terms of features 
but stability/maintainability/etc. as well, even though it's technically not 
out of beta yet.  You can get it here:


http://download.gluster.org/pub/gluster/glusterfs/3.4/3.4.0beta4/

Selection of the local subvolume (if there is one) should happen automatically, 
and multiple users have reported that this is in fact the case.  You can also 
use the read-subvolume-index option to gain explicit control over this decision 
at mount time (via the --xlator-option hook).
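The mount-time override mentioned here can be sketched roughly as follows. This is not verified syntax: the exact spelling of the option differs between releases, and the volume name ("shared") and replicate-translator name ("shared-replicate-0") are assumptions borrowed from Allan's configuration later in the thread.

```shell
# Rough sketch only: pass a translator option through the mount helper
# to pin reads to one replica. The translator name "shared-replicate-0"
# and brick index 0 are assumptions, not verified against any release.
mount -t glusterfs \
  -o xlator-option=shared-replicate-0.read-subvolume-index=0 \
  10.255.255.1:/shared /gluster/rw/shared
```

Index 0 would select the first brick listed in the volume definition, so the right index differs per node; that per-node asymmetry is exactly the configuration problem Allan raises further down.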



My use case is exactly two servers at a server farm with 100Mbit between
them. This 100Mbit is also shared with the outside internet. Hence the
need to minimise use of this very limited resource.


That's very similar to the use case for the person who submitted the 
read-subvolume-index patch.



We are still in evaluation. Current 'best' is what we are familiar with
= unison and inotify. I don't like it because it's really only a hack.
However it works. If inotify misses a change due to race conditions
unison gets run every five minutes anyway.


If you're running with that level of eventual consistency already, might 
geo-replication be a better fit for your needs?  It's a separate feature from 
normal (synchronous) replication, described in section 8 of the 3.3 admin guide.


http://www.gluster.org/wp-content/uploads/2012/05/Gluster_File_System-3.3.0-Administration_Guide-en-US.pdf

Unfortunately, I can't find the 3.4 admin guide on the Gluster site (another 
pet peeve) or I'd provide a link to that.





Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Allan Latham
Hi Jeff

Thanks for the reply and all the great work you are doing. I know how
hard it is - believe me.

Where do I get a version that will solve my 'read local if we have the
file here' problem.

My use case is exactly two servers at a server farm with 100Mbit between
them. This 100Mbit is also shared with the outside internet. Hence the
need to minimise use of this very limited resource.

Writes are rare and we just have to live with that load on the network.
Reads are very common and I need to keep these off the network.
Read/Write ratio is probably 1:1 or more.

Doing an md5sum on a local 500Mb file takes 500ms (probably cached - I
would have expected 5 seconds or so for real reads). On gluster it takes
500ms or 18 seconds (and in that case it's hogged the network for 18
seconds).

I'm willing to give a new version a try.

We are still in evaluation. Current 'best' is what we are familiar with
= unison and inotify. I don't like it because it's really only a hack.
However it works. If inotify misses a change due to race conditions
unison gets run every five minutes anyway.
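The unison-plus-inotify fallback described above can be sketched as a small loop. The paths and profile name are hypothetical, and it assumes inotify-tools and unison are installed.

```shell
#!/bin/sh
# Sketch of the described fallback: sync on change, but also at least
# every five minutes in case inotify drops an event. The directory and
# the unison profile name below are placeholders.
SYNC_DIR=/srv/shared
PROFILE=shared

while true; do
    # Block until something changes, or for at most 300 seconds,
    # whichever comes first; then run a batch sync either way.
    inotifywait -r -t 300 -e modify,create,delete,move "$SYNC_DIR" \
        >/dev/null 2>&1
    unison -batch "$PROFILE"
done
```

The -t 300 timeout is what makes the scheme self-healing: a missed inotify event only delays a sync, it never loses one permanently.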

It's an example of a failure mode which really does self heal.

PS my own use case is unlikely to hit exponential delays. We are talking
about a few Gb and a few tens of thousands of files. I was hoping to
help with your roadmap. My preferred method is always 'get the simple
case working properly first' then optimise.

All the best and thanks again for all your efforts

Allan


On 10/07/13 14:58, Jeff Darcy wrote:
> On 07/10/2013 07:01 AM, Allan Latham wrote:
>> I have a simple scenario and it just simply doesn't work. Reading over
>> the network when the file is available locally is plainly wrong. Our
>> application cannot take the performance hit nor the extra network
>> traffic.
> 
> Another victim of our release process.  :(  Code was added to choose the
> local subvolume whenever possible in *June 2012* (commit 0baa65b6). 
> Further fixes and related changes, including a user-submitted patch to
> force this choice for sites with more complex needs, have gone in since
> then.  None of them have made it into a release yet, since 3.4 is still
> in beta and the changes have not been backported into 3.3.anything
> (including 3.3.1 which I see you were using).  All I can offer is an
> apology.
> 
>> 1. get a simple minimalist configuration working - 2 hosts and
>> replication only.
>> 2. make it bomb-proof.
>> 2a. it must cope with network failures, random reboots etc.
>> 2b. if it stops it has to auto-recover quickly.
>> 2c. if it can't it needs thorough documentation and adequate logs so a
>> reasonable sysop can rescue it.
> 
> This is one of my own pet peeves.  I will personally be working on the
> internals documentation soon, so users will at least have a chance of
> understanding what the often-cryptic log messages really mean. 
> Improvements to logging, event reporting, and so on are also ongoing,
> albeit slowly and not under my direct purview.
> 
>> 2d. it needs a fast validation scanner which verifies that data is where
>> it should be and is identical everywhere (md5sum).
> 
> How fast is fast?  What would be an acceptable time for such a scan on a
> volume containing (let's say) ten million files?
> 
>> 3. make it efficient (read local whenever possible - use rsync
>> techniques - remove scalability obstacles so it doesn't get
>> exponentially slower as more files are replicated)
> 
> Can you explain "exponentially"?  The time for a full scan should
> increase *linearly* with number of files.  That's bad enough, and it's
> why we're starting to get away from reliance on full scans in favor of
> logging or journaling approaches, but if you're seeing exponential
> behavior then something is amiss.
> 
>> 4. when that works expand to multiple hosts and clever distribution
>> techniques.
> 
> That would be a fine sentiment for a new project, but it's not really an
> option when there are already thousands of users relying on the "clever
> distribution techniques" and many other features in production.  We do
> have to fix their bugs too, so we can't devote all of our resources to
> improving or reimplementing replication.  Believe me, I wish we could.
> 
> Thank you for your constructive feedback.  I hope that we can use it to
> make things better for everyone.
> 



Re: [Gluster-users] Gluster Self Heal

2013-07-10 Thread Vijay Bellur

On 07/10/2013 01:31 PM, Toby Corkindale wrote:

On 09/07/13 18:17, 符永涛 wrote:

Hi Toby,

What's the bug #? I want to have a look and backport it to our
production server if it helps. Thank you.


I think it was this one:
https://bugzilla.redhat.com/show_bug.cgi?id=947824

The bug being that the daemons were crashing out if you had a lot of
volumes defined, I think?



A lot of volumes or a lot of delta to self-heal could trigger this crash.

3.3.2 containing this fix should be out real soon now. Appreciate your 
patience in this regard.


Thanks,
Vijay


Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Jeff Darcy

On 07/10/2013 07:01 AM, Allan Latham wrote:

I have a simple scenario and it just simply doesn't work. Reading over
the network when the file is available locally is plainly wrong. Our
application cannot take the performance hit nor the extra network traffic.


Another victim of our release process.  :(  Code was added to choose the local 
subvolume whenever possible in *June 2012* (commit 0baa65b6).  Further fixes 
and related changes, including a user-submitted patch to force this choice for 
sites with more complex needs, have gone in since then.  None of them have made 
it into a release yet, since 3.4 is still in beta and the changes have not been 
backported into 3.3.anything (including 3.3.1 which I see you were using).  All 
I can offer is an apology.



1. get a simple minimalist configuration working - 2 hosts and
replication only.
2. make it bomb-proof.
2a. it must cope with network failures, random reboots etc.
2b. if it stops it has to auto-recover quickly.
2c. if it can't it needs thorough documentation and adequate logs so a
reasonable sysop can rescue it.


This is one of my own pet peeves.  I will personally be working on the 
internals documentation soon, so users will at least have a chance of 
understanding what the often-cryptic log messages really mean.  Improvements to 
logging, event reporting, and so on are also ongoing, albeit slowly and not 
under my direct purview.



2d. it needs a fast validation scanner which verifies that data is where
it should be and is identical everywhere (md5sum).


How fast is fast?  What would be an acceptable time for such a scan on a volume 
containing (let's say) ten million files?
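For concreteness, the kind of scanner item 2d asks for can be sketched in a few lines of Python. This is a local-only illustration with assumed directory arguments, not anything GlusterFS ships: a real scanner would have to checksum one side remotely and compare over the wire, and its runtime still grows with file count, which is the cost Jeff's question is probing.

```python
import hashlib
import os

def md5sum(path, chunk=1 << 20):
    """MD5 of a file, read in chunks so large files don't exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def compare_bricks(brick_a, brick_b):
    """Yield (relative_path, reason) for every file under brick_a that is
    absent from, or differs from, its counterpart under brick_b.
    Local-only sketch; does not detect files that exist only on brick_b."""
    for root, _dirs, files in os.walk(brick_a):
        for name in files:
            path_a = os.path.join(root, name)
            rel = os.path.relpath(path_a, brick_a)
            path_b = os.path.join(brick_b, rel)
            if not os.path.exists(path_b):
                yield rel, "missing on second brick"
            elif md5sum(path_a) != md5sum(path_b):
                yield rel, "checksum mismatch"
```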



3. make it efficient (read local whenever possible - use rsync
techniques - remove scalability obstacles so it doesn't get
exponentially slower as more files are replicated)


Can you explain "exponentially"?  The time for a full scan should increase 
*linearly* with number of files.  That's bad enough, and it's why we're 
starting to get away from reliance on full scans in favor of logging or 
journaling approaches, but if you're seeing exponential behavior then something 
is amiss.



4. when that works expand to multiple hosts and clever distribution
techniques.


That would be a fine sentiment for a new project, but it's not really an option 
when there are already thousands of users relying on the "clever distribution 
techniques" and many other features in production.  We do have to fix their 
bugs too, so we can't devote all of our resources to improving or 
reimplementing replication.  Believe me, I wish we could.


Thank you for your constructive feedback.  I hope that we can use it to make 
things better for everyone.



Re: [Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Greg Scott
Oh wow, it sounds like we both have similar issues.  Surely there is a key 
somewhere to making these simple cases work.  Otherwise, how would some of the 
big organizations using this stuff continue with it?  

- Greg



Re: [Gluster-users] One node goes offline, the other node can't see the replicated volume anymore

2013-07-10 Thread Greg Scott
Brian, I'm not ready to give up just yet.  

From Rejy:

>  Would not the mount option 'backupvolfile-server=<server-name>' help 
> at mount time, in the case of the primary server not being available?

Hmmm - this seems to be a step in the right direction.  On both nodes I did:

umount /firewall-scripts

Then on fw1:

[root@chicago-fw1 gregs]# mount -t glusterfs -o 
backupvolfile-server=192.168.253.2 192.168.253.1:/firewall-scripts 
/firewall-scripts

And on fw2:

[root@chicago-fw2 ~]#  mount -t glusterfs -o backupvolfile-server=192.168.253.1 
192.168.253.2:/firewall-scripts /firewall-scripts

For the test I just ran,  each node still uses its local copy first.  For my 
application, I'm not super concerned about conflicts between one directory and 
the other because my /firewall-scripts directory will be read-mostly when this 
is in production.  And as part of my startup, the node with the lowest IP 
Address takes itself offline for a few  seconds so the other node detects it's 
down and can assume the primary role.  That's what put me on to this Gluster 
behavior in the first place - fw2 could not find its script to take control 
even though a copy of it was sitting right there on its local disk. 
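The lowest-address election described above boils down to a single comparison. The sketch below is purely illustrative (the real logic lives in Greg's firewall scripts, which are not shown in the thread), but it makes the tie-break rule explicit.

```python
import ipaddress

def takes_startup_dip(local_ip, peer_ip):
    """True for the node that should briefly take itself offline at
    startup, per the scheme described above: the node with the lowest
    address drops out for a few seconds so its peer detects the outage
    and assumes the primary role. Illustrative sketch only."""
    return ipaddress.ip_address(local_ip) < ipaddress.ip_address(peer_ip)
```

Comparing parsed addresses rather than strings matters here: as strings, "192.168.253.10" would sort below "192.168.253.2" and both nodes could reach the wrong conclusion.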

Anyway, this time with the file system mounted as above, I took fw1 offline and 
from fw2 did, "ls /firewall-scripts".  This time fw2 waited several seconds and 
then showed me the directory listing instead of blowing up with an error.   
Which seems strange to me since I told fw2 that fw1 is its backupvolfile-server 
and fw1 went offline.  So the behavior is definitely not intuitive.  

One other detail that may be relevant - I take fw1 offline by inserting a 
firewall rule that does a REJECT on that interface.  That probably explains the 
"Connection refused" message in the log extract below.   I can try a different 
test, changing the rule to DROP so it really really is offline and see what 
happens.
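The REJECT/DROP distinction matters for what the surviving node observes: REJECT answers each connection attempt immediately with an error (hence the "Connection refused" in the log below), while DROP discards packets silently, so the client only gives up after its own timeout, which more closely imitates a dead host or pulled cable. A rough sketch of the two variants; the interface name is an assumption:

```shell
# Run on fw1 to simulate its failure toward the peer. "eth1" is a
# placeholder for whatever interface carries the gluster traffic.

# Variant 1: REJECT -- peer's connect() fails fast ("Connection refused").
iptables -I INPUT -i eth1 -p tcp -j REJECT --reject-with tcp-reset

# Variant 2: DROP -- packets vanish; peer notices only after its
# ping/network timeout expires.
iptables -I INPUT -i eth1 -p tcp -j DROP

# Undo either rule afterwards (deletes the first matching rule):
iptables -D INPUT -i eth1 -p tcp -j DROP
```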

The log on fw2 looks a little different this time.  This tail was taken after 
doing an ls from fw2.  Pranith - is this the log you mean?  If so, I can do the 
tests again and keep a tail -f in a different window when the other node goes 
offline, so we catch the messages right at that event.  Will this be helpful?  
I can send tarballs of the whole log file, but it's huge and finding the key 
messages seems like a needle in a haystack.  

[root@chicago-fw2 ~]# tail /var/log/glusterfs/firewall-scripts.log -f
[2013-07-10 10:37:59.446481] W [socket.c:514:__socket_rwv] 
0-firewall-scripts-client-0: readv failed (Connection reset by peer)
[2013-07-10 10:37:59.446558] W [socket.c:1962:__socket_proto_state_machine] 
0-firewall-scripts-client-0: reading from socket failed. Error (Connection 
reset by peer), peer (192.168.253.1:49152)
[2013-07-10 10:37:59.447322] E [rpc-clnt.c:368:saved_frames_unwind] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x48) [0x7f8974409b78] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb8) [0x7f8974408028] 
(-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f8974407f4e]))) 
0-firewall-scripts-client-0: forced unwinding frame type(GlusterFS 3.3) 
op(LOOKUP(27)) called at 2013-07-10 10:37:33.563280 (xid=0x24x)
[2013-07-10 10:37:59.447378] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 
0-firewall-scripts-client-0: remote operation failed: Transport endpoint is not 
connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-07-10 10:37:59.447716] E [rpc-clnt.c:368:saved_frames_unwind] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x48) [0x7f8974409b78] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb8) [0x7f8974408028] 
(-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f8974407f4e]))) 
0-firewall-scripts-client-0: forced unwinding frame type(GlusterFS Handshake) 
op(PING(3)) called at 2013-07-10 10:37:35.949434 (xid=0x25x)
[2013-07-10 10:37:59.447754] W [client-handshake.c:276:client_ping_cbk] 
0-firewall-scripts-client-0: timer must have expired
[2013-07-10 10:37:59.447821] I [client.c:2097:client_rpc_notify] 
0-firewall-scripts-client-0: disconnected
[2013-07-10 10:38:09.963388] E [socket.c:2157:socket_connect_finish] 
0-firewall-scripts-client-0: connection to 192.168.253.1:24007 failed 
(Connection refused)
[2013-07-10 10:38:09.963493] W [socket.c:514:__socket_rwv] 
0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:38:19.988428] W [socket.c:514:__socket_rwv] 
0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:38:53.044399] W [socket.c:514:__socket_rwv] 
0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:38:54.999683] W [socket.c:514:__socket_rwv] 
0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:38:58.010774] W [socket.c:514:__socket_rwv] 
0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:04.028362] W [socket.c:514:__socket_rwv] 
0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-10 10:39:07.03303

[Gluster-users] Giving up [ was: Re: read-subvolume]

2013-07-10 Thread Allan Latham
Hi all

Thanks to all those volunteers who are working to get gluster into a
state where it can be used for live work.

I understand that you are giving your free time and I very much
appreciate it on this project and the many others we use for live
production work.

There seems to be a problem with the way gluster is going.
For me it would be an ideal solution if it actually worked.

I have a simple scenario and it just simply doesn't work. Reading over
the network when the file is available locally is plainly wrong. Our
application cannot take the performance hit nor the extra network traffic.

I would suggest:

1. get a simple minimalist configuration working - 2 hosts and
replication only.
2. make it bomb-proof.
2a. it must cope with network failures, random reboots etc.
2b. if it stops it has to auto-recover quickly.
2c. if it can't it needs thorough documentation and adequate logs so a
reasonable sysop can rescue it.
2d. it needs a fast validation scanner which verifies that data is where
it should be and is identical everywhere (md5sum).
3. make it efficient (read local whenever possible - use rsync
techniques - remove scalability obstacles so it doesn't get
exponentially slower as more files are replicated)
4. when that works expand to multiple hosts and clever distribution
techniques.
(repeat items 2 and 3 in the more complex environment)

If it doesn't work rock solid in a simple scenario it will never work in
a large scale cluster.

Until point 3 is reached I cannot use it - which is a great
disappointment for me as well as the good guys doing the development.

Good luck and thanks again

Allan


On 04/07/13 13:10, Allan Latham wrote:
> Hi all
> 
> Does anyone use read-subvolume?
> 
> Has anyone tested read-subvolume?
> 
> Does read-subvolume work in such a way that if the file is present on
> the local node the local copy is used rather than a remote one?
> 
> Alternatively is there any way to configure (or patch) gluster to always
> prefer the local file?
> 
> I have read everything available and have found no answer.
> 
> Unison works very well in our environment but is not real time and needs
> to be run every few minutes and/or be kicked off with inotify.
> 
> If I could get gluster to always read the local copy it would be a much
> better drop in replacement.
> 
> This is a small scale deployment not a massive cluster but I can imagine
> there are many potential users of gluster in this mode. It should beat
> unison and similar solutions in every way - but it doesn't because it is
> reading from the network even when it has a local up-to-date copy. This
> can't be intended behaviour.
> 
> So what have I configured wrong?
> 
> Thanks in advance
> 
> Allan
> 
> 
> On 02/07/13 13:38, Allan Latham wrote:
>> Hi everyone
>>
>> I have installed 3.3.1-1 from the Debian repository you provide.
>>
>> I am using a simple 2 node cluster and running in replication mode. The
>> connection between the nodes is limited to 100MB/sec (that's bits not
>> bytes!). Usage will be mainly for read access and since there is always
>> a local copy available [ exactly 2 replicas on exactly 2 machines ] I
>> expect very fast read performance. Writes are low volume and very
>> infrequent - performance is not an issue.
>>
>> Almost everything works as I would expect.
>>
>> Write speed is limited to 10Mb (bytes) per second which is what I would
>> expect and is adequate for the application.
>>
>> But read speed is either super fast or 10Mb/sec. i.e. read operations
>> take place on the local copy or the remote seemingly at random.
>>
>> This not the 'small files problem'. I am aware that Gluster must use
>> network access for stat() etc. This is all about where the data comes
>> from on a read(). If I do an md5sum on a 200Mb file it takes either half
>> a second or 18 seconds.
>>
>> There is an option read-subvolume.
>>
>> I have tried to understand how this works from the documentation
>> available and from the few examples on the web.
>>
>> I have added the option using:
>>
>> gluster volume set X read-subvolume Y
>>
>> It has no effect even after stopping and starting the volume,
>> remounting, restarting gluster servers etc.
>>
>> What's more I fail to see how this option could ever work at all. The
>> configuration changes caused by the above command are rolled out to both
>> nodes - but what is right for one node is exactly the wrong
>> configuration for the other node.
>>
>> Configs attached are in /var/lib/glusterd/vols/shared except
>> glusterd.vol which is in /etc/glusterfs.
>>
>> Here is the output of the mount command filtered to just the glusterfs
>> mount:
>>
>> 10.255.255.1:/shared on /gluster/rw/shared type fuse.glusterfs
>> (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
>>
>> 10.255.255.1 is local to this host.
>>
>> I would be very thankful if someone can enlighten me. I am obviously
>> configuring this wrong. I may have missed something important.
>>
>> Best regards 

Re: [Gluster-users] One node goes offline, the other node can't see the replicated volume anymore

2013-07-10 Thread raghav

On 07/09/2013 06:47 AM, Greg Scott wrote:
I don't get this.  I have a replicated volume and 2 nodes. My 
challenge is, when I take one node offline, the other node can no 
longer access the volume until both nodes are back online again.

Details:
I have 2 nodes, fw1 and fw2.   Each node has an XFS file system, 
/gluster-fw1 on node fw1 and /gluster-fw2 on node fw2.   Node fw1 is at 
IP Address 192.168.253.1.  Node fw2 is at 192.168.253.2.
I create a gluster volume named firewall-scripts which is a replica of 
those two XFS file systems.  The volume holds a bunch of config files 
common to both fw1 and fw2.  The application is an active/standby pair 
of firewalls and the idea is to keep config files in a gluster volume.
When both nodes are online, everything works as expected. But when I 
take either node offline, node fw2 behaves badly:

[root@chicago-fw2 ~]# ls /firewall-scripts
ls: cannot access /firewall-scripts: Transport endpoint is not connected
And when I bring the offline node back online, node fw2 eventually 
behaves normally again.
What's up with that?  Gluster is supposed to be resilient and 
self-healing and able to stand up to this sort of abuse. So I must be 
doing something wrong.
Here is how I set up everything -- it doesn't get much simpler than 
this and my setup is right out the Getting Started Guide but using my 
own names.

Here are the steps I followed, all from fw1:
gluster peer probe 192.168.253.2
gluster peer status
Create and start the volume:
gluster volume create firewall-scripts replica 2 transport tcp 
192.168.253.1:/gluster-fw1 192.168.253.2:/gluster-fw2

gluster volume start firewall-scripts
On fw1:
mkdir /firewall-scripts
mount -t glusterfs 192.168.253.1:/firewall-scripts /firewall-scripts
and add this line to /etc/fstab:
192.168.253.1:/firewall-scripts /firewall-scripts glusterfs 
defaults,_netdev 0 0

on fw2:
mkdir /firewall-scripts
mount -t glusterfs 192.168.253.2:/firewall-scripts /firewall-scripts
and add this line to /etc/fstab:
192.168.253.2:/firewall-scripts /firewall-scripts glusterfs 
defaults,_netdev 0 0
That's it.  That's the whole setup.  When both nodes are online, 
everything replicates beautifully.  But take one node offline and it 
all falls apart.

Here is the output from gluster volume info, identical on both nodes:
[root@chicago-fw1 etc]# gluster volume info
Volume Name: firewall-scripts
Type: Replicate
Volume ID: 239b6401-e873-449d-a2d3-1eb2f65a1d4c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 192.168.253.1:/gluster-fw1
Brick2: 192.168.253.2:/gluster-fw2
[root@chicago-fw1 etc]#
Looking at /var/log/glusterfs/firewall-scripts.log on fw2, I see 
errors like this every couple of seconds:
[2013-07-09 00:59:04.706390] I [afr-common.c:3856:afr_local_init] 
0-firewall-scripts-replicate-0: no subvolumes up
[2013-07-09 00:59:04.706515] W [fuse-bridge.c:1132:fuse_err_cbk] 
0-glusterfs-fuse: 3160: FLUSH() ERR => -1 (Transport endpoint is not 
connected)

And then when I bring fw1 back online, I see these messages on fw2:
[2013-07-09 01:01:35.006782] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 
0-firewall-scripts-client-0: changing port to 49152 (from 0)
[2013-07-09 01:01:35.006932] W [socket.c:514:__socket_rwv] 
0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-09 01:01:35.018546] I 
[client-handshake.c:1658:select_server_supported_programs] 
0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num 
(1298437), Version (330)
[2013-07-09 01:01:35.019273] I 
[client-handshake.c:1456:client_setvolume_cbk] 
0-firewall-scripts-client-0: Connected to 192.168.253.1:49152, 
attached to remote volume '/gluster-fw1'.
[2013-07-09 01:01:35.019356] I 
[client-handshake.c:1468:client_setvolume_cbk] 
0-firewall-scripts-client-0: Server and Client lk-version numbers are 
not same, reopening the fds
[2013-07-09 01:01:35.019441] I 
[client-handshake.c:1308:client_post_handshake] 
0-firewall-scripts-client-0: 1 fds open - Delaying child_up until they 
are re-opened
[2013-07-09 01:01:35.020070] I 
[client-handshake.c:930:client_child_up_reopen_done] 
0-firewall-scripts-client-0: last fd open'd/lock-self-heal'd - 
notifying CHILD-UP
[2013-07-09 01:01:35.020282] I [afr-common.c:3698:afr_notify] 
0-firewall-scripts-replicate-0: Subvolume 'firewall-scripts-client-0' 
came back up; going online.
[2013-07-09 01:01:35.020616] I 
[client-handshake.c:450:client_set_lk_version_cbk] 
0-firewall-scripts-client-0: Server lk version = 1
So how do I make glusterfs survive a node failure, which is the whole 
point of all this?


It looks like the brick processes on the fw2 machine are not running, and 
hence when fw1 is down, the entire replication process is stalled. Can you 
do a ps and get the status of all the gluster processes, and ensure that 
the brick process is up on fw2?


Regards
Raghav

Re: [Gluster-users] One node goes offline, the other node can't see the replicated volume anymore

2013-07-10 Thread Frank Sonntag
On 10/07/2013, at 7:59 PM, Rejy M Cyriac wrote:

> On 07/10/2013 11:38 AM, Frank Sonntag wrote:
>> Hi Greg,
>> 
>> Try using the same server on both machines when mounting, instead of 
>> mounting off the local gluster server on both.
>> I've used the same approach like you in the past and got into all kinds of 
>> split-brain problems.
>> The drawback of course is that mounts will fail if the machine you chose is 
>> not available at mount time. It's one of my gripes with gluster that you 
>> cannot list more than one server in your mount command.
>> 
>> Frank
> 
> Would not the mount option 'backupvolfile-server=<server-name>' help
> at mount time, in the case of the primary server not being available?
> 
> - rejy (rmc)
I am still on 3.2 which does not have that option (as far as I know).
But thanks for bringing this up. Useful to know.
And the OP can make use of it of course.


Frank




> 
>> 
>> 
>> 
>> On 10/07/2013, at 5:26 PM, Greg Scott wrote:
>> 
>>> Bummer.   Looks like I’m on my own with this one.
>>> 
>>> -  Greg
>>> 
>>> From: gluster-users-boun...@gluster.org 
>>> [mailto:gluster-users-boun...@gluster.org] On Behalf Of Greg Scott
>>> Sent: Tuesday, July 09, 2013 12:37 PM
>>> To: 'gluster-users@gluster.org'
>>> Subject: Re: [Gluster-users] One node goes offline, the other node can't 
>>> see the replicated volume anymore
>>> 
>>> No takers?   I am running gluster 3.4beta3 that came with Fedora 19.   Is 
>>> my issue a consequence of some kind of quorum split-brain thing?
>>> 
>>> thanks
>>> 
>>> -  Greg Scott
>>> 
>>> From: gluster-users-boun...@gluster.org 
>>> [mailto:gluster-users-boun...@gluster.org] On Behalf Of Greg Scott
>>> Sent: Monday, July 08, 2013 8:17 PM
>>> To: 'gluster-users@gluster.org'
>>> Subject: [Gluster-users] One node goes offline, the other node can't see 
>>> the replicated volume anymore
>>> 
>>> I don’t get this.  I have a replicated volume and 2 nodes.  My challenge 
>>> is, when I take one node offline, the other node can no longer access the 
>>> volume until both nodes are back online again.
>>> 
>>> Details:
>>> 
>>> I have 2 nodes, fw1 and fw2.   Each node has an XFS file system, 
>>> /gluster-fw1 on node fw1 and /gluster-fw2 on node fw2.   Node fw1 is at IP 
>>> Address 192.168.253.1.  Node fw2 is at 192.168.253.2. 
>>> 
>>> I create a gluster volume named firewall-scripts which is a replica of 
>>> those two XFS file systems.  The volume holds a bunch of config files 
>>> common to both fw1 and fw2.  The application is an active/standby pair of 
>>> firewalls and the idea is to keep config files in a gluster volume.
>>> 
>>> When both nodes are online, everything works as expected.  But when I take 
>>> either node offline, node fw2 behaves badly:
>>> 
>>> [root@chicago-fw2 ~]# ls /firewall-scripts
>>> ls: cannot access /firewall-scripts: Transport endpoint is not connected
>>> 
>>> And when I bring the offline node back online, node fw2 eventually behaves 
>>> normally again. 
>>> 
>>> What’s up with that?  Gluster is supposed to be resilient and self-healing 
>>> and able to stand up to this sort of abuse.  So I must be doing something 
>>> wrong. 
>>> 
>>> Here is how I set up everything – it doesn’t get much simpler than this and 
>>> my setup is right out the Getting Started Guide but using my own names. 
>>> 
>>> Here are the steps I followed, all from fw1:
>>> 
>>> gluster peer probe 192.168.253.2
>>> gluster peer status
>>> 
>>> Create and start the volume:
>>> 
>>> gluster volume create firewall-scripts replica 2 transport tcp 
>>> 192.168.253.1:/gluster-fw1 192.168.253.2:/gluster-fw2
>>> gluster volume start firewall-scripts
>>> 
>>> On fw1:
>>> 
>>> mkdir /firewall-scripts
>>> mount -t glusterfs 192.168.253.1:/firewall-scripts /firewall-scripts
>>> 
>>> and add this line to /etc/fstab:
>>> 192.168.253.1:/firewall-scripts /firewall-scripts glusterfs 
>>> defaults,_netdev 0 0
>>> 
>>> on fw2:
>>> 
>>> mkdir /firewall-scripts
>>> mount -t glusterfs 192.168.253.2:/firewall-scripts /firewall-scripts
>>> 
>>> and add this line to /etc/fstab:
>>> 192.168.253.2:/firewall-scripts /firewall-scripts glusterfs 
>>> defaults,_netdev 0 0
>>> 
>>> That’s it.  That’s the whole setup.  When both nodes are online, everything 
>>> replicates beautifully.  But take one node offline and it all falls apart. 
>>> 
>>> Here is the output from gluster volume info, identical on both nodes:
>>> 
>>> [root@chicago-fw1 etc]# gluster volume info
>>> 
>>> Volume Name: firewall-scripts
>>> Type: Replicate
>>> Volume ID: 239b6401-e873-449d-a2d3-1eb2f65a1d4c
>>> Status: Started
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: 192.168.253.1:/gluster-fw1
>>> Brick2: 192.168.253.2:/gluster-fw2
>>> [root@chicago-fw1 etc]#
>>> 
>>> Looking at /var/log/glusterfs/firewall-scripts.log on fw2, I see errors 
>>> like this every couple of seconds:
>>> 
>>> [2013-07-09 00:59:04.706390] I [afr-common.c:3856:afr_local_init] 
>>> 0-firewall-scripts-replicate-0: no

Re: [Gluster-users] Gluster Self Heal

2013-07-10 Thread Toby Corkindale

On 09/07/13 18:17, 符永涛 wrote:

Hi Toby,

What's the bug #? I want to have a look and backport it to our
production server if it helps. Thank you.


I think it was this one:
https://bugzilla.redhat.com/show_bug.cgi?id=947824

The bug being that the daemons were crashing out if you had a lot of 
volumes defined, I think?


Toby


2013/7/9 Toby Corkindale <toby.corkind...@strategicdata.com.au>

On 09/07/13 15:38, Bobby Jacob wrote:

Hi,

I have a 2-node gluster with 3 TB storage.

1) I believe the “glusterfsd” is responsible for the self-healing between
the 2 nodes.

2) Due to some network error, the replication stopped for some reason, but
the application was accessing the data from node1.  When I manually try to
start the “glusterfsd” service, it's not starting.

Please advise on how I can maintain the integrity of the data so that we
have all the data in both locations.


There were some bugs in the self-heal daemon present in 3.3.0 and
3.3.1. Our systems see the SHD crash out with segfaults quite often,
and it does not recover.

I reported this bug a long time ago, and it was fixed in trunk
relatively quickly -- however version 3.3.2 has still not been
released, despite the fix being found six months ago.

I find this quite disappointing.
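For anyone stuck on a 3.3.x system in the meantime, the heal state can at least be inspected, and a full heal kicked off by hand, from the CLI. A minimal sketch (the volume name "myvol" is a placeholder; these commands exist from 3.3 onward, but they need a live cluster, so this is untested here):

```shell
# List entries the self-heal machinery still considers pending
# (placeholder volume name "myvol").
gluster volume heal myvol info

# Force a full crawl-and-heal of the volume instead of waiting for the
# (possibly crashed) self-heal daemon's periodic cycle.
gluster volume heal myvol full
```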

T
_
Gluster-users mailing list
Gluster-users@gluster.org 
http://supercolony.gluster.org/mailman/listinfo/gluster-users





--
符永涛



Re: [Gluster-users] One node goes offline, the other node can't see the replicated volume anymore

2013-07-10 Thread Rejy M Cyriac
On 07/10/2013 11:38 AM, Frank Sonntag wrote:
> Hi Greg,
> 
> Try using the same server on both machines when mounting, instead of mounting 
> off the local gluster server on each.
> I've used the same approach as you in the past and ran into all kinds of 
> split-brain problems.
> The drawback of course is that mounts will fail if the machine you chose is 
> not available at mount time. It's one of my gripes with gluster that you 
> cannot list more than one server in your mount command.
> 
> Frank

Would not the mount option 'backupvolfile-server=<server-name>' help
at mount time, in the case of the primary server not being available?
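For reference, a minimal sketch of what that looks like with the addresses from Greg's setup below (this assumes the `backupvolfile-server` FUSE mount option of the 3.3/3.4 client; it needs a live cluster, so it is untested here):

```shell
# Sketch: mount from fw1, falling back to fw2 if fw1 is unreachable
# at mount time.
mount -t glusterfs -o backupvolfile-server=192.168.253.2 \
    192.168.253.1:/firewall-scripts /firewall-scripts

# Equivalent /etc/fstab entry (all on one line):
# 192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev,backupvolfile-server=192.168.253.2 0 0
```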

- rejy (rmc)


> 
> 
> 
> On 10/07/2013, at 5:26 PM, Greg Scott wrote:
> 
>> Bummer.   Looks like I’m on my own with this one.
>>  
>> -  Greg
>>  
>> From: gluster-users-boun...@gluster.org 
>> [mailto:gluster-users-boun...@gluster.org] On Behalf Of Greg Scott
>> Sent: Tuesday, July 09, 2013 12:37 PM
>> To: 'gluster-users@gluster.org'
>> Subject: Re: [Gluster-users] One node goes offline, the other node can't see 
>> the replicated volume anymore
>>  
>> No takers?   I am running gluster 3.4beta3 that came with Fedora 19.   Is my 
>> issue a consequence of some kind of quorum split-brain thing?
>>  
>> thanks
>>  
>> -  Greg Scott
>>  
>> From: gluster-users-boun...@gluster.org 
>> [mailto:gluster-users-boun...@gluster.org] On Behalf Of Greg Scott
>> Sent: Monday, July 08, 2013 8:17 PM
>> To: 'gluster-users@gluster.org'
>> Subject: [Gluster-users] One node goes offline, the other node can't see the 
>> replicated volume anymore
>>  
>> I don’t get this.  I have a replicated volume and 2 nodes.  My challenge is, 
>> when I take one node offline, the other node can no longer access the volume 
>> until both nodes are back online again.
>>  
>> Details:
>>  
>> I have 2 nodes, fw1 and fw2.   Each node has an XFS file system, 
>> /gluster-fw1 on node fw1 and /gluster-fw2 on node fw2.   Node fw1 is at IP 
>> address 192.168.253.1.  Node fw2 is at 192.168.253.2. 
>>  
>> I create a gluster volume named firewall-scripts which is a replica of those 
>> two XFS file systems.  The volume holds a bunch of config files common to 
>> both fw1 and fw2.  The application is an active/standby pair of firewalls 
>> and the idea is to keep config files in a gluster volume.
>>  
>> When both nodes are online, everything works as expected.  But when I take 
>> either node offline, node fw2 behaves badly:
>>  
>> [root@chicago-fw2 ~]# ls /firewall-scripts
>> ls: cannot access /firewall-scripts: Transport endpoint is not connected
>>  
>> And when I bring the offline node back online, node fw2 eventually behaves 
>> normally again. 
>>  
>> What’s up with that?  Gluster is supposed to be resilient and self-healing 
>> and able to stand up to this sort of abuse.  So I must be doing something 
>> wrong. 
>>  
>> Here is how I set up everything – it doesn’t get much simpler than this, and 
>> my setup is right out of the Getting Started Guide, just using my own names. 
>>  
>> Here are the steps I followed, all from fw1:
>>  
>> gluster peer probe 192.168.253.2
>> gluster peer status
>>  
>> Create and start the volume:
>>  
>> gluster volume create firewall-scripts replica 2 transport tcp 
>> 192.168.253.1:/gluster-fw1 192.168.253.2:/gluster-fw2
>> gluster volume start firewall-scripts
>>  
>> On fw1:
>>  
>> mkdir /firewall-scripts
>> mount -t glusterfs 192.168.253.1:/firewall-scripts /firewall-scripts
>>  
>> and add this line to /etc/fstab:
>> 192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 
>> 0 0
>>  
>> on fw2:
>>  
>> mkdir /firewall-scripts
>> mount -t glusterfs 192.168.253.2:/firewall-scripts /firewall-scripts
>>  
>> and add this line to /etc/fstab:
>> 192.168.253.2:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 
>> 0 0
>>  
>> That’s it.  That’s the whole setup.  When both nodes are online, everything 
>> replicates beautifully.  But take one node offline and it all falls apart. 
>>  
>> Here is the output from gluster volume info, identical on both nodes:
>>  
>> [root@chicago-fw1 etc]# gluster volume info
>>  
>> Volume Name: firewall-scripts
>> Type: Replicate
>> Volume ID: 239b6401-e873-449d-a2d3-1eb2f65a1d4c
>> Status: Started
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: 192.168.253.1:/gluster-fw1
>> Brick2: 192.168.253.2:/gluster-fw2
>> [root@chicago-fw1 etc]#
>>  
>> Looking at /var/log/glusterfs/firewall-scripts.log on fw2, I see errors like 
>> this every couple of seconds:
>>  
>> [2013-07-09 00:59:04.706390] I [afr-common.c:3856:afr_local_init] 
>> 0-firewall-scripts-replicate-0: no subvolumes up
>> [2013-07-09 00:59:04.706515] W [fuse-bridge.c:1132:fuse_err_cbk] 
>> 0-glusterfs-fuse: 3160: FLUSH() ERR => -1 (Transport endpoint is not 
>> connected)
>>  
>> And then when I bring fw1 back online, I see these messages on fw2:
>>  
>> [2013-07-09 01:01:35.006782] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 
>> 0

Re: [Gluster-users] One node goes offline, the other node can't see the replicated volume anymore

2013-07-10 Thread Brian Candler

On 10/07/2013 06:26, Greg Scott wrote:


Bummer. Looks like I'm on my own with this one.

I'm afraid this is the problem with gluster: everything works great on 
the happy path, but as soon as anything goes wrong, you're stuffed. 
There is neither recovery-procedure documentation nor detailed 
internals documentation (so you could work out for yourself what is 
going on and fix it). In my opinion, gluster is unsupportable in its 
current form.


For this reason I have recently stripped gluster out of a production 
network. We're back to using simple NFSv4 at the moment. At some point I 
will be evaluating ceph and maybe swift.


As you've observed, there's little point in having "resilient" copies of 
data if they are not retrievable in error scenarios.


Regards,

Brian.


Re: [Gluster-users] tips/best practices for gluster rdma?

2013-07-10 Thread Justin Clift
Hi guys,

As an FYI, from discussion on gluster-devel IRC yesterday, the RDMA code
still isn't in a good enough state for production usage with 3.4.0. :(

There are still outstanding bugs with it, and I'm working to make the
Gluster Test Framework able to work with RDMA so we can help shake out
more of them:

  
http://www.gluster.org/community/documentation/index.php/Using_the_Gluster_Test_Framework

Hopefully RDMA will be ready for 3.4.1, but don't hold me to that at
this stage. :)

Regards and best wishes,

Justin Clift


On 09/07/2013, at 8:36 PM, Ryan Aydelott wrote:
> Matthew, 
> 
> Personally - I have experienced this same problem (even with the mount being 
> something.rdma). Running 3.4beta4, if I mounted a volume via RDMA that also 
> had TCP configured as a transport option (which obviously you do based on the 
> mounts you gave below), if there is ANY issue with RDMA not working the mount 
> will silently fall back to TCP. This problem is described here: 
> https://bugzilla.redhat.com/show_bug.cgi?id=982757
> 
> The way to test for this behavior is create a new volume specifying ONLY RDMA 
> as the transport. If you mount this and your RDMA is broken for whatever 
> reason - it will simply fail to mount. 
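Ryan's test can be sketched as follows (the hostname, brick path, and mount point are placeholders; this needs a working gluster/InfiniBand setup, so it is untested here):

```shell
# Create a throwaway volume with RDMA as the ONLY transport; with no TCP
# to fall back on, a broken RDMA stack makes the mount fail loudly.
gluster volume create rdmatest transport rdma node01-ib:/bricks/rdmatest
gluster volume start rdmatest

# If this mount fails, RDMA itself is broken; check that the rdma_cm
# kernel module is loaded (e.g. 'lsmod | grep rdma_cm').
mkdir -p /mnt/rdmatest
mount -t glusterfs -o transport=rdma node01-ib:/rdmatest /mnt/rdmatest
```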
> 
> Assuming this test fails, I would then tail the logs for the volume to get a 
> hint of what's going on. In my case there was an RDMA_CM kernel module that 
> was not loaded, which started to matter as of 3.4beta2 IIRC, as they did a 
> complete rewrite of the RDMA transport based on poor performance in prior 
> releases. The clue in my volume log file was "no such file or directory" 
> preceded by an rdma_cm. 
> 
> Hope that helps!
> 
> 
> -ryan
> 
> 
> On Jul 9, 2013, at 2:03 PM, Matthew Nicholson  
> wrote:
> 
>> Hey guys,
>> 
>> So, we're testing Gluster RDMA storage, and are having some issues. Things 
>> are working...just not as we expected them. There isn't a whole lot in the 
>> way of docs that I've found for gluster rdma, aside from basically "install 
>> gluster-rdma", create a volume with transport=rdma, and mount w/ 
>> transport=rdma
>> 
>> I've done that...and the IB fabric is known to be good...however, a volume 
>> created with transport=rdma,tcp and mounted w/ transport=rdma, still seems 
>> to go over tcp? 
>> 
>> A little more info about the setup:
>> 
>> we've got 10 storage nodes/bricks, each of which has a single 1GbE NIC and an 
>> FDR IB port. Likewise for the test clients. Now, the 1GbE NIC is for 
>> management only, and we have all of the systems on this fabric configured 
>> with IPoIB, so there is eth0, and ib0 on each node. 
>> 
>> All storage nodes are peer'd using the ib0 interface, ie:
>> 
>> gluster peer probe storage_node01-ib
>> etc
>> 
>> thats all well and good. 
>> 
>> Volume was created:
>> 
>> gluster volume create holyscratch transport rdma,tcp 
>> holyscratch01-ib:/holyscratch01/brick
>> for i in `seq -w 2 10` ; do gluster volume add-brick holyscratch 
>> holyscratch${i}-ib:/holyscratch${i}/brick; done
>> 
>> yielding:
>> 
>> Volume Name: holyscratch
>> Type: Distribute
>> Volume ID: 788e74dc-6ae2-4aa5-8252-2f30262f0141
>> Status: Started
>> Number of Bricks: 10
>> Transport-type: tcp,rdma
>> Bricks:
>> Brick1: holyscratch01-ib:/holyscratch01/brick
>> Brick2: holyscratch02-ib:/holyscratch02/brick
>> Brick3: holyscratch03-ib:/holyscratch03/brick
>> Brick4: holyscratch04-ib:/holyscratch04/brick
>> Brick5: holyscratch05-ib:/holyscratch05/brick
>> Brick6: holyscratch06-ib:/holyscratch06/brick
>> Brick7: holyscratch07-ib:/holyscratch07/brick
>> Brick8: holyscratch08-ib:/holyscratch08/brick
>> Brick9: holyscratch09-ib:/holyscratch09/brick
>> Brick10: holyscratch10-ib:/holyscratch10/brick
>> Options Reconfigured:
>> nfs.disable: on
>> 
>> 
>> For testing, we wanted to see how rdma stacked up vs tcp using IPoIB, so we 
>> mounted this like:
>> 
>> [root@holy2a01202 holyscratch.tcp]# df -h |grep holyscratch
>> holyscratch:/holyscratch
>>   273T  4.1T  269T   2% /n/holyscratch.tcp
>> holyscratch:/holyscratch.rdma
>>   273T  4.1T  269T   2% /n/holyscratch.rdma
>> 
>> so, 2 mounts, same volume different transports. fstab looks like:
>> 
>> holyscratch:/holyscratch/n/holyscratch.tcp  glusterfs   
>> transport=tcp,fetch-attempts=10,gid-timeout=2,acl,_netdev   0   0
>> holyscratch:/holyscratch/n/holyscratch.rdma glusterfs   
>> transport=rdma,fetch-attempts=10,gid-timeout=2,acl,_netdev  0   0
>> 
>> where holyscratch is an RRDNS entry covering all the IPoIB interfaces, used for 
>> fetching the volfile (something that, it seems, just like peering, MUST be tcp?) 
>> 
>> but, again, when running just dumb,dumb,dumb tests (160 threads of dd over 8 
>> nodes w/ each thread writing 64GB, so a 10TB throughput test), I'm seeing 
>> all the traffic on the IPoIB interface for both RDMA and TCP 
>> transports...when i really shouldn't be seeing ANY tcp traffic, aside from 
>> volfile fetches/management on the