Also, this time files are not the same! root@stor1:~# md5sum /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2 32411360c53116b96a059f17306caeda /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
root@stor2:~# md5sum /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2 65b8a6031bcb6f5fb3a11cb1e8b1c9c9 /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2 2014-08-05 16:33 GMT+03:00 Roman <rome...@gmail.com>: > Nope, it is not working. But this time it went a bit other way > > root@gluster-client:~# dmesg > Segmentation fault > > > I was not able even to start the VM after I done the tests > > Could not read qcow2 header: Operation not permitted > > And it seems, it never starts to sync files after first disconnect. VM > survives first disconnect, but not second (I waited around 30 minutes). > Also, I've got network.ping-timeout: 2 in volume settings, but logs react > on first disconnect around 30 seconds. Second was faster, 2 seconds. > > Reaction was different also: > > slower one: > [2014-08-05 13:26:19.558435] W [socket.c:514:__socket_rwv] 0-glusterfs: > readv failed (Connection timed out) > [2014-08-05 13:26:19.558485] W > [socket.c:1962:__socket_proto_state_machine] 0-glusterfs: reading from > socket failed. Error (Connection timed out), peer (10.250.0.1:24007) > [2014-08-05 13:26:21.281426] W [socket.c:514:__socket_rwv] > 0-HA-fast-150G-PVE1-client-0: readv failed (Connection timed out) > [2014-08-05 13:26:21.281474] W > [socket.c:1962:__socket_proto_state_machine] 0-HA-fast-150G-PVE1-client-0: > reading from socket failed. Error (Connection timed out), peer ( > 10.250.0.1:49153) > [2014-08-05 13:26:21.281507] I [client.c:2098:client_rpc_notify] > 0-HA-fast-150G-PVE1-client-0: disconnected > > the fast one: > 2014-08-05 12:52:44.607389] C > [client-handshake.c:127:rpc_client_ping_timer_expired] > 0-HA-fast-150G-PVE1-client-1: server 10.250.0.2:49153 has not responded > in the last 2 seconds, disconnecting. > [2014-08-05 12:52:44.607491] W [socket.c:514:__socket_rwv] > 0-HA-fast-150G-PVE1-client-1: readv failed (No data available) > [2014-08-05 12:52:44.607585] E [rpc-clnt.c:368:saved_frames_unwind] > (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8) > [0x7fcb1b4b0558] > (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) > [0x7fcb1b4aea63] > (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe) > [0x7fcb1b4ae97e]))) 0-HA-fast-150G-PVE1-client-1: forced unwinding frame > type(GlusterFS 3.3) op(LOOKUP(27)) called at 2014-08-05 12:52:42.463881 > (xid=0x381883x) > [2014-08-05 12:52:44.607604] W > [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-HA-fast-150G-PVE1-client-1: > remote operation failed: Transport endpoint is not connected. Path: / > (00000000-0000-0000-0000-000000000001) > [2014-08-05 12:52:44.607736] E [rpc-clnt.c:368:saved_frames_unwind] > (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8) > [0x7fcb1b4b0558] > (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) > [0x7fcb1b4aea63] > (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe) > [0x7fcb1b4ae97e]))) 0-HA-fast-150G-PVE1-client-1: forced unwinding frame > type(GlusterFS Handshake) op(PING(3)) called at 2014-08-05 12:52:42.463891 > (xid=0x381884x) > [2014-08-05 12:52:44.607753] W [client-handshake.c:276:client_ping_cbk] > 0-HA-fast-150G-PVE1-client-1: timer must have expired > [2014-08-05 12:52:44.607776] I [client.c:2098:client_rpc_notify] > 0-HA-fast-150G-PVE1-client-1: disconnected > > > > I've got SSD disks (just for an info). > Should I go and give a try for 3.5.2? > > > > 2014-08-05 13:06 GMT+03:00 Pranith Kumar Karampuri <pkara...@redhat.com>: > > reply along with gluster-users please :-). May be you are hitting 'reply' >> instead of 'reply all'? >> >> Pranith >> >> On 08/05/2014 03:35 PM, Roman wrote: >> >> To make sure and clean, I've created another VM with raw format and goint >> to repeat those steps. So now I've got two VM-s one with qcow2 format and >> other with raw format. I will send another e-mail shortly. >> >> >> 2014-08-05 13:01 GMT+03:00 Pranith Kumar Karampuri <pkara...@redhat.com>: >> >>> >>> On 08/05/2014 03:07 PM, Roman wrote: >>> >>> really, seems like the same file >>> >>> stor1: >>> a951641c5230472929836f9fcede6b04 >>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2 >>> >>> stor2: >>> a951641c5230472929836f9fcede6b04 >>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2 >>> >>> >>> one thing I've seen from logs, that somehow proxmox VE is connecting >>> with wrong version to servers? >>> [2014-08-05 09:23:45.218550] I >>> [client-handshake.c:1659:select_server_supported_programs] >>> 0-HA-fast-150G-PVE1-client-0: Using Program GlusterFS 3.3, Num (1298437), >>> Version (330) >>> >>> It is the rpc (over the network data structures) version, which is not >>> changed at all from 3.3 so thats not a problem. So what is the conclusion? >>> Is your test case working now or not? >>> >>> Pranith >>> >>> but if I issue: >>> root@pve1:~# glusterfs -V >>> glusterfs 3.4.4 built on Jun 28 2014 03:44:57 >>> seems ok. >>> >>> server use 3.4.4 meanwhile >>> [2014-08-05 09:23:45.117875] I [server-handshake.c:567:server_setvolume] >>> 0-HA-fast-150G-PVE1-server: accepted client from >>> stor1-9004-2014/08/05-09:23:45:93538-HA-fast-150G-PVE1-client-1-0 (version: >>> 3.4.4) >>> [2014-08-05 09:23:49.103035] I >>> [server-handshake.c:567:server_setvolume] 0-HA-fast-150G-PVE1-server: >>> accepted client from >>> stor1-8998-2014/08/05-09:23:45:89883-HA-fast-150G-PVE1-client-0-0 (version: >>> 3.4.4) >>> >>> if this could be the reason, of course. >>> I did restart the Proxmox VE yesterday (just for an information) >>> >>> >>> >>> >>> >>> 2014-08-05 12:30 GMT+03:00 Pranith Kumar Karampuri <pkara...@redhat.com> >>> : >>> >>>> >>>> On 08/05/2014 02:33 PM, Roman wrote: >>>> >>>> Waited long enough for now, still different sizes and no logs about >>>> healing :( >>>> >>>> stor1 >>>> # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2 >>>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000 >>>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000 >>>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921 >>>> >>>> root@stor1:~# du -sh /exports/fast-test/150G/images/127/ >>>> 1.2G /exports/fast-test/150G/images/127/ >>>> >>>> >>>> stor2 >>>> # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2 >>>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000 >>>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000 >>>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921 >>>> >>>> >>>> root@stor2:~# du -sh /exports/fast-test/150G/images/127/ >>>> 1.4G /exports/fast-test/150G/images/127/ >>>> >>>> According to the changelogs, the file doesn't need any healing. Could >>>> you stop the operations on the VMs and take md5sum on both these machines? >>>> >>>> Pranith >>>> >>>> >>>> >>>> >>>> >>>> 2014-08-05 11:49 GMT+03:00 Pranith Kumar Karampuri <pkara...@redhat.com >>>> >: >>>> >>>>> >>>>> On 08/05/2014 02:06 PM, Roman wrote: >>>>> >>>>> Well, it seems like it doesn't see the changes were made to the volume >>>>> ? I created two files 200 and 100 MB (from /dev/zero) after I disconnected >>>>> the first brick. Then connected it back and got these logs: >>>>> >>>>> [2014-08-05 08:30:37.830150] I >>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk] 0-glusterfs: No change in >>>>> volfile, continuing >>>>> [2014-08-05 08:30:37.830207] I [rpc-clnt.c:1676:rpc_clnt_reconfig] >>>>> 0-HA-fast-150G-PVE1-client-0: changing port to 49153 (from 0) >>>>> [2014-08-05 08:30:37.830239] W [socket.c:514:__socket_rwv] >>>>> 0-HA-fast-150G-PVE1-client-0: readv failed (No data available) >>>>> [2014-08-05 08:30:37.831024] I >>>>> [client-handshake.c:1659:select_server_supported_programs] >>>>> 0-HA-fast-150G-PVE1-client-0: Using Program GlusterFS 3.3, Num (1298437), >>>>> Version (330) >>>>> [2014-08-05 08:30:37.831375] I >>>>> [client-handshake.c:1456:client_setvolume_cbk] >>>>> 0-HA-fast-150G-PVE1-client-0: Connected to 10.250.0.1:49153, attached >>>>> to remote volume '/exports/fast-test/150G'. >>>>> [2014-08-05 08:30:37.831394] I >>>>> [client-handshake.c:1468:client_setvolume_cbk] >>>>> 0-HA-fast-150G-PVE1-client-0: Server and Client lk-version numbers are not >>>>> same, reopening the fds >>>>> [2014-08-05 08:30:37.831566] I >>>>> [client-handshake.c:450:client_set_lk_version_cbk] >>>>> 0-HA-fast-150G-PVE1-client-0: Server lk version = 1 >>>>> >>>>> >>>>> [2014-08-05 08:30:37.830150] I >>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk] 0-glusterfs: No change in >>>>> volfile, continuing >>>>> this line seems weird to me tbh. >>>>> I do not see any traffic on switch interfaces between gluster servers, >>>>> which means, there is no syncing between them. >>>>> I tried to ls -l the files on the client and servers to trigger the >>>>> healing, but seems like no success. Should I wait more? >>>>> >>>>> Yes, it should take around 10-15 minutes. Could you provide 'getfattr >>>>> -d -m. -e hex <file-on-brick>' on both the bricks. >>>>> >>>>> Pranith >>>>> >>>>> >>>>> >>>>> 2014-08-05 11:25 GMT+03:00 Pranith Kumar Karampuri < >>>>> pkara...@redhat.com>: >>>>> >>>>>> >>>>>> On 08/05/2014 01:10 PM, Roman wrote: >>>>>> >>>>>> Ahha! For some reason I was not able to start the VM anymore, Proxmox >>>>>> VE told me, that it is not able to read the qcow2 header due to >>>>>> permission >>>>>> is denied for some reason. So I just deleted that file and created a new >>>>>> VM. And the nex message I've got was this: >>>>>> >>>>>> Seems like these are the messages where you took down the bricks >>>>>> before self-heal. Could you restart the run waiting for self-heals to >>>>>> complete before taking down the next brick? >>>>>> >>>>>> Pranith >>>>>> >>>>>> >>>>>> >>>>>> [2014-08-05 07:31:25.663412] E >>>>>> [afr-self-heal-common.c:197:afr_sh_print_split_brain_log] >>>>>> 0-HA-fast-150G-PVE1-replicate-0: Unable to self-heal contents of >>>>>> '/images/124/vm-124-disk-1.qcow2' (possible split-brain). Please delete >>>>>> the >>>>>> file from all but the preferred subvolume.- Pending matrix: [ [ 0 60 ] [ >>>>>> 11 0 ] ] >>>>>> [2014-08-05 07:31:25.663955] E >>>>>> [afr-self-heal-common.c:2262:afr_self_heal_completion_cbk] >>>>>> 0-HA-fast-150G-PVE1-replicate-0: background data self-heal failed on >>>>>> /images/124/vm-124-disk-1.qcow2 >>>>>> >>>>>> >>>>>> >>>>>> 2014-08-05 10:13 GMT+03:00 Pranith Kumar Karampuri < >>>>>> pkara...@redhat.com>: >>>>>> >>>>>>> I just responded to your earlier mail about how the log looks. The >>>>>>> log comes on the mount's logfile >>>>>>> >>>>>>> Pranith >>>>>>> >>>>>>> On 08/05/2014 12:41 PM, Roman wrote: >>>>>>> >>>>>>> Ok, so I've waited enough, I think. Had no any traffic on switch >>>>>>> ports between servers. Could not find any suitable log message about >>>>>>> completed self-heal (waited about 30 minutes). Plugged out the other >>>>>>> server's UTP cable this time and got in the same situation: >>>>>>> root@gluster-test1:~# cat /var/log/dmesg >>>>>>> -bash: /bin/cat: Input/output error >>>>>>> >>>>>>> brick logs: >>>>>>> [2014-08-05 07:09:03.005474] I [server.c:762:server_rpc_notify] >>>>>>> 0-HA-fast-150G-PVE1-server: disconnecting connectionfrom >>>>>>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0 >>>>>>> [2014-08-05 07:09:03.005530] I >>>>>>> [server-helpers.c:729:server_connection_put] 0-HA-fast-150G-PVE1-server: >>>>>>> Shutting down connection >>>>>>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0 >>>>>>> [2014-08-05 07:09:03.005560] I [server-helpers.c:463:do_fd_cleanup] >>>>>>> 0-HA-fast-150G-PVE1-server: fd cleanup on >>>>>>> /images/124/vm-124-disk-1.qcow2 >>>>>>> [2014-08-05 07:09:03.005797] I >>>>>>> [server-helpers.c:617:server_connection_destroy] >>>>>>> 0-HA-fast-150G-PVE1-server: destroyed connection of >>>>>>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0 >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> 2014-08-05 9:53 GMT+03:00 Pranith Kumar Karampuri < >>>>>>> pkara...@redhat.com>: >>>>>>> >>>>>>>> Do you think it is possible for you to do these tests on the >>>>>>>> latest version 3.5.2? 'gluster volume heal <volname> info' would give >>>>>>>> you >>>>>>>> that information in versions > 3.5.1. >>>>>>>> Otherwise you will have to check it from either the logs, there >>>>>>>> will be self-heal completed message on the mount logs (or) by observing >>>>>>>> 'getfattr -d -m. -e hex <image-file-on-bricks>' >>>>>>>> >>>>>>>> Pranith >>>>>>>> >>>>>>>> >>>>>>>> On 08/05/2014 12:09 PM, Roman wrote: >>>>>>>> >>>>>>>> Ok, I understand. I will try this shortly. >>>>>>>> How can I be sure, that healing process is done, if I am not able >>>>>>>> to see its status? >>>>>>>> >>>>>>>> >>>>>>>> 2014-08-05 9:30 GMT+03:00 Pranith Kumar Karampuri < >>>>>>>> pkara...@redhat.com>: >>>>>>>> >>>>>>>>> Mounts will do the healing, not the self-heal-daemon. The problem >>>>>>>>> I feel is that whichever process does the healing has the latest >>>>>>>>> information about the good bricks in this usecase. Since for VM >>>>>>>>> usecase, >>>>>>>>> mounts should have the latest information, we should let the mounts >>>>>>>>> do the >>>>>>>>> healing. If the mount accesses the VM image either by someone doing >>>>>>>>> operations inside the VM or explicit stat on the file it should do the >>>>>>>>> healing. >>>>>>>>> >>>>>>>>> Pranith. >>>>>>>>> >>>>>>>>> >>>>>>>>> On 08/05/2014 10:39 AM, Roman wrote: >>>>>>>>> >>>>>>>>> Hmmm, you told me to turn it off. Did I understood something >>>>>>>>> wrong? After I issued the command you've sent me, I was not able to >>>>>>>>> watch >>>>>>>>> the healing process, it said, it won't be healed, becouse its turned >>>>>>>>> off. >>>>>>>>> >>>>>>>>> >>>>>>>>> 2014-08-05 5:39 GMT+03:00 Pranith Kumar Karampuri < >>>>>>>>> pkara...@redhat.com>: >>>>>>>>> >>>>>>>>>> You didn't mention anything about self-healing. Did you wait >>>>>>>>>> until the self-heal is complete? >>>>>>>>>> >>>>>>>>>> Pranith >>>>>>>>>> >>>>>>>>>> On 08/04/2014 05:49 PM, Roman wrote: >>>>>>>>>> >>>>>>>>>> Hi! >>>>>>>>>> Result is pretty same. I set the switch port down for 1st server, >>>>>>>>>> it was ok. Then set it up back and set other server's port off. and >>>>>>>>>> it >>>>>>>>>> triggered IO error on two virtual machines: one with local root FS >>>>>>>>>> but >>>>>>>>>> network mounted storage. and other with network root FS. 1st gave an >>>>>>>>>> error >>>>>>>>>> on copying to or from the mounted network disk, other just gave me >>>>>>>>>> an error >>>>>>>>>> for even reading log.files. >>>>>>>>>> >>>>>>>>>> cat: /var/log/alternatives.log: Input/output error >>>>>>>>>> then I reset the kvm VM and it said me, there is no boot >>>>>>>>>> device. Next I virtually powered it off and then back on and it has >>>>>>>>>> booted. >>>>>>>>>> >>>>>>>>>> By the way, did I have to start/stop volume? >>>>>>>>>> >>>>>>>>>> >> Could you do the following and test it again? >>>>>>>>>> >> gluster volume set <volname> cluster.self-heal-daemon off >>>>>>>>>> >>>>>>>>>> >>Pranith >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> 2014-08-04 14:10 GMT+03:00 Pranith Kumar Karampuri < >>>>>>>>>> pkara...@redhat.com>: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 08/04/2014 03:33 PM, Roman wrote: >>>>>>>>>>> >>>>>>>>>>> Hello! >>>>>>>>>>> >>>>>>>>>>> Facing the same problem as mentioned here: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> http://supercolony.gluster.org/pipermail/gluster-users/2014-April/039959.html >>>>>>>>>>> >>>>>>>>>>> my set up is up and running, so i'm ready to help you back >>>>>>>>>>> with feedback. >>>>>>>>>>> >>>>>>>>>>> setup: >>>>>>>>>>> proxmox server as client >>>>>>>>>>> 2 gluster physical servers >>>>>>>>>>> >>>>>>>>>>> server side and client side both running atm 3.4.4 glusterfs >>>>>>>>>>> from gluster repo. >>>>>>>>>>> >>>>>>>>>>> the problem is: >>>>>>>>>>> >>>>>>>>>>> 1. craeted replica bricks. >>>>>>>>>>> 2. mounted in proxmox (tried both promox ways: via GUI and fstab >>>>>>>>>>> (with backup volume line), btw while mounting via fstab I'm unable >>>>>>>>>>> to >>>>>>>>>>> launch a VM without cache, meanwhile direct-io-mode is enabled in >>>>>>>>>>> fstab >>>>>>>>>>> line) >>>>>>>>>>> 3. installed VM >>>>>>>>>>> 4. bring one volume down - ok >>>>>>>>>>> 5. bringing up, waiting for sync is done. >>>>>>>>>>> 6. bring other volume down - getting IO errors on VM guest and >>>>>>>>>>> not able to restore the VM after I reset the VM via host. It says >>>>>>>>>>> (no >>>>>>>>>>> bootable media). After I shut it down (forced) and bring back up, >>>>>>>>>>> it boots. >>>>>>>>>>> >>>>>>>>>>> Could you do the following and test it again? >>>>>>>>>>> gluster volume set <volname> cluster.self-heal-daemon off >>>>>>>>>>> >>>>>>>>>>> Pranith >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Need help. Tried 3.4.3, 3.4.4. >>>>>>>>>>> Still missing pkg-s for 3.4.5 for debian and 3.5.2 (3.5.1 always >>>>>>>>>>> gives a healing error for some reason) >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Best regards, >>>>>>>>>>> Roman. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Gluster-users mailing >>>>>>>>>>> listGluster-users@gluster.orghttp://supercolony.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Best regards, >>>>>>>>>> Roman. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Best regards, >>>>>>>>> Roman. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Best regards, >>>>>>>> Roman. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Best regards, >>>>>>> Roman. >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Best regards, >>>>>> Roman. >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Best regards, >>>>> Roman. >>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> Best regards, >>>> Roman. >>>> >>>> >>>> >>> >>> >>> -- >>> Best regards, >>> Roman. >>> >>> >>> >> >> >> -- >> Best regards, >> Roman. >> >> >> > > > -- > Best regards, > Roman. > -- Best regards, Roman.
_______________________________________________ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users