Re: [Gluster-devel] On Gluster resiliency

2017-01-01 Thread Raghavendra G
Ivan,

That's good to hear. Thanks for posting :)

regards,
Raghavendra

On Fri, Dec 23, 2016 at 10:10 PM, Ivan Rossi wrote:

> The last few days have been tense because an R3 3.8.5 Gluster cluster that I
> built has been plagued by problems.
>
> The first symptom was a continuous stream of messages like this in the client logs:
>
> [2016-12-17 15:55:02.047508] E [MSGID: 108009]
> [afr-open.c:187:afr_openfd_fix_open_cbk]
> 0-hisap-prod-1-replicate-0: Failed to open
> /home/galaxy/HISAP/java/lib/java/jre1.7.0_51/jre/lib/rt.jar on subvolume
> hisap-prod-1-client-2 [Transport endpoint is not connected]
>
> followed by very frequent peer disconnections/reconnections and a
> continuous stream of files to be healed on several volumes.
>
> The problem was traced back to a flaky X540-T2 10GbE NIC embedded
> in the motherboard of one of the peers, which could not maintain the
> correct 10 Gbit speed negotiation with the switch.
>
> The motherboard on that peer has been replaced, and the volumes then
> healed quickly to complete health.  All of this happened while the users
> kept running some heavy-duty bioinformatics applications (NGS data
> analysis) on top of Gluster.  No user noticed ANYTHING despite a major
> hardware problem and the off-lining of a peer.
>
> This is a RESILIENT system, in my book.
>
> Gluster people, despite the constant stream of problems and requests
> for help that you see on the ML and IRC, rest assured that you are
> building a nice piece of software, at least IMHO.
>
> Keep up the good work and Merry Christmas.
>
> Ivan Rossi
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G

[Gluster-devel] On Gluster resiliency

2016-12-24 Thread Ivan Rossi
The last few days have been tense because an R3 3.8.5 Gluster cluster that I
built has been plagued by problems.

The first symptom was a continuous stream of messages like this in the client logs:

[2016-12-17 15:55:02.047508] E [MSGID: 108009]
[afr-open.c:187:afr_openfd_fix_open_cbk]
0-hisap-prod-1-replicate-0: Failed to open
/home/galaxy/HISAP/java/lib/java/jre1.7.0_51/jre/lib/rt.jar on subvolume
hisap-prod-1-client-2 [Transport endpoint is not connected]

followed by very frequent peer disconnections/reconnections and a
continuous stream of files to be healed on several volumes.

The problem was traced back to a flaky X540-T2 10GbE NIC embedded
in the motherboard of one of the peers, which could not maintain the
correct 10 Gbit speed negotiation with the switch.

The motherboard on that peer has been replaced, and the volumes then
healed quickly to complete health.  All of this happened while the users
kept running some heavy-duty bioinformatics applications (NGS data
analysis) on top of Gluster.  No user noticed ANYTHING despite a major
hardware problem and the off-lining of a peer.

This is a RESILIENT system, in my book.

Gluster people, despite the constant stream of problems and requests
for help that you see on the ML and IRC, rest assured that you are
building a nice piece of software, at least IMHO.

Keep up the good work and Merry Christmas.

Ivan Rossi