Re: how to handle (and get out victorious) 1 minute disconnection againts a target.
And to clarify the below tests, we simulate the disconnection against the target with an "iptables" rule to DROP all connections against the target.

On Thursday, April 4, 2013 2:59:21 PM UTC-3, Alejandro Comisario wrote:
>
> Mike, i think something is wrong and i dont seem to know what, because
> still, im getting I/O errors reported to the filesystem layer after 120
> seconds, but i dont know if its from the iscsi side or other configuration
> is involved, this is what we negotiated with the target :
>
> root@DC2-r16-22vms:/var/log# iscsiadm -m node -T iqn.1992-08.com.netapp:sn.1574861693
> # BEGIN RECORD 2.0-871
> node.name = iqn.1992-08.com.netapp:sn.1574861693
> node.tpgt = 2000
> node.startup = automatic
> iface.hwaddress =
> iface.ipaddress =
> iface.iscsi_ifacename = default
> iface.net_ifacename =
> iface.transport_name = tcp
> iface.initiatorname =
> node.discovery_address = 10.1.1.160
> node.discovery_port = 3260
> node.discovery_type = send_targets
> node.session.initial_cmdsn = 0
> node.session.initial_login_retry_max = 8
> node.session.xmit_thread_priority = -20
> node.session.cmds_max = 128
> node.session.queue_depth = 32
> node.session.auth.authmethod = None
> node.session.auth.username =
> node.session.auth.password =
> node.session.auth.username_in =
> node.session.auth.password_in =
> node.session.timeo.replacement_timeout = 600
> node.session.err_timeo.abort_timeout = 15
> node.session.err_timeo.lu_reset_timeout = 20
> node.session.err_timeo.host_reset_timeout = 60
> node.session.iscsi.FastAbort = Yes
> node.session.iscsi.InitialR2T = No
> node.session.iscsi.ImmediateData = Yes
> node.session.iscsi.FirstBurstLength = 262144
> node.session.iscsi.MaxBurstLength = 16776192
> node.session.iscsi.DefaultTime2Retain = 0
> node.session.iscsi.DefaultTime2Wait = 2
> node.session.iscsi.MaxConnections = 1
> node.session.iscsi.MaxOutstandingR2T = 1
> node.session.iscsi.ERL = 0
> node.conn[0].address = 10.1.1.160
> node.conn[0].port = 3260
> node.conn[0].startup = manual
> node.conn[0].tcp.window_size = 524288
> node.conn[0].tcp.type_of_service = 0
> node.conn[0].timeo.logout_timeout = 15
> node.conn[0].timeo.login_timeout = 15
> node.conn[0].timeo.auth_timeout = 45
> node.conn[0].timeo.noop_out_interval = 5
> node.conn[0].timeo.noop_out_timeout = 5
> node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
> node.conn[0].iscsi.HeaderDigest = None
> node.conn[0].iscsi.DataDigest = None
> node.conn[0].iscsi.IFMarker = No
> node.conn[0].iscsi.OFMarker = No
> # END RECORD
>
> We issued a dd against the ext4 LUN mounted, but after 120 seconds we see
> this errors in /var/log/syslog :
>
> Apr 4 10:33:41 DC2-r16-22vms iscsid: Kernel reported iSCSI connection 3:0 error (1011) state (3)
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018502] INFO: task jbd2/sdc1-8:100283 blocked for more than 120 seconds.
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018505] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018507] jbd2/sdc1-8 D 81806240 0 100283 2 0x
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018512] 881015d39ac0 0046 881015d39a60 8103ecf9
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018517] 881015d39fd8 881015d39fd8 881015d39fd8 000137c0
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018521] 881018a7dc00 881016224500 881015d39a90 88203ee14080
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018525] Call Trace:
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018535] [] ? default_spin_lock_flags+0x9/0x10
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018541] [] ? __lock_page+0x70/0x70
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018546] [] schedule+0x3f/0x60
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018548] [] io_schedule+0x8f/0xd0
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018551] [] sleep_on_page+0xe/0x20
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018554] [] __wait_on_bit+0x5f/0x90
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018557] [] wait_on_page_bit+0x78/0x80
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018563] [] ? autoremove_wake_function+0x40/0x40
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018566] [] filemap_fdatawait_range+0x10c/0x1a0
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018570] [] filemap_fdatawait+0x2b/0x30
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018577] [] journal_finish_inode_data_buffers+0x70/0x170
> Apr 4 10:37:17 DC2-r16-22vms kernel: [71981.018580] [] jbd2_journ
Re: how to handle (and get out victorious) 1 minute disconnection againts a target.
not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968926] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968929] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968931] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968934] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968936] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968939] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968942] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968945] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968948] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968950] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968954] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968957] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968960] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968963] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968966] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968969] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968981] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968992] sd 8:0:0:1: [sdc] Unhandled error code
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.968996] sd 8:0:0:1: [sdc] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969005] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969011] sd 8:0:0:1: [sdc] CDB: Write(10): 2a 00 00 1d 38 00 00 04 00 00
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969028] end_request: recoverable transport error, dev sdc, sector 1914880
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969034] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969037] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969040] Buffer I/O error on device sdc1, logical block 239104
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969045] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969047] Buffer I/O error on device sdc1, logical block 239105
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969051] Buffer I/O error on device sdc1, logical block 239106
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969056] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969063] session3: iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969067] Buffer I/O error on device sdc1, logical block 239107
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969070] Buffer I/O error on device sdc1, logical block 239108
Apr 4 10:43:42 DC2-r16-22vms kernel: [72365.969074] Buffer I/O error on device sdc1, logical block 239109

Do you see anything weird? Does something still need to be configured, or is something misconfigured?
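One way to read the excerpts in this thread is to measure how long the initiator actually held the I/O. The sketch below is only arithmetic on two timestamps copied from the logs quoted in this thread (the iscsid connection error and the first "Buffer I/O error" line); the function name and the assumed year are illustrative, not part of any tool.

```python
from datetime import datetime


def syslog_delta_seconds(first: str, second: str, year: int = 2013) -> float:
    """Seconds between two syslog-style timestamps such as 'Apr 4 10:33:41'.

    Syslog lines carry no year, so one is assumed (2013, per the thread dates).
    """
    fmt = "%Y %b %d %H:%M:%S"
    t1 = datetime.strptime(f"{year} {first}", fmt)
    t2 = datetime.strptime(f"{year} {second}", fmt)
    return (t2 - t1).total_seconds()


# Timestamps taken from the log lines quoted in this thread.
conn_error = "Apr 4 10:33:41"      # iscsid: connection 3:0 error (1011)
first_io_error = "Apr 4 10:43:42"  # first "Buffer I/O error on device sdc1"

print(syslog_delta_seconds(conn_error, first_io_error))
```

If that reading is right, the gap is close to the replacement_timeout = 600 shown in the node record, and the "blocked for more than 120 seconds" lines may be the kernel's hung-task warning rather than an I/O failure. That is an interpretation of the excerpt, not something the logs state outright.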
On Tuesday, April 2, 2013 3:14:14 AM UTC-3, Mike Christie wrote:
>
> On Mar 29, 2013, at 4:30 PM, Alejandro Comisario wrote:
> > node.session.timeo.replacement_timeout = 120
> >
> > Mean that after trying every 5 secs a ping against the target, and after
> > having no reply in 5 secs, it will trigger the HR that will wait ( and
> > queue cmds ) replacement_timeout secconds before failing to the upper
> > layers.
> >
> > So, just increasing replacement_timeout seconds, i will get the desired
> > behavior ?
>
> Yes. Just set the replacement_timeout to how long you want the iscsi layer
> to reconnect before it fails IO to upper layers.

--
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscr...@googlegroups.com.
To post to this g
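As a rough illustration of the advice to raise replacement_timeout, the sketch below only assembles an iscsiadm command line that updates the stored node record; it does not run anything. The target name and portal address are copied from the record quoted in this thread, the helper name is hypothetical, and 600 is just an example value. As far as I know, a changed node record is picked up when the session next logs in, so treat this as a starting point to check against the open-iscsi README.

```python
import shlex


def build_update_command(target: str, portal: str, name: str, value: int) -> list[str]:
    """Return the argv for updating one setting in an open-iscsi node record.

    Hypothetical helper: it only builds the argument list, so it is safe to
    run on a machine without iscsiadm installed.
    """
    return [
        "iscsiadm", "-m", "node",
        "-T", target,          # target IQN
        "-p", portal,          # portal as ip:port
        "-o", "update",        # operation: update the stored record
        "-n", name,            # setting name
        "-v", str(value),      # new value
    ]


cmd = build_update_command(
    target="iqn.1992-08.com.netapp:sn.1574861693",   # from the quoted record
    portal="10.1.1.160:3260",                         # from the quoted record
    name="node.session.timeo.replacement_timeout",
    value=600,                                        # example: ride out ~10 minutes
)
print(shlex.join(cmd))
```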
Re: how to handle (and get out victorious) 1 minute disconnection againts a target.
Hi Mike, sorry for the incomplete post.

No, I'm not using multipath; each physical node is connected directly to a LUN on the storage. What I want, I think, is for the iSCSI layer to retry all commands until we reconnect, so that the virtual machines on that host that are using those LUNs just see I/O wait increasing until the reconnection is made.

I have these params on the initiator side:

node.startup = automatic
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 20
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.session.iscsi.FastAbort = Yes

As I read it, having:

node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.timeo.replacement_timeout = 120

means that after sending a ping to the target every 5 secs, and after getting no reply within 5 secs, it will trigger the HR, which will wait (and queue cmds) replacement_timeout seconds before failing to the upper layers.

So, by just increasing replacement_timeout, will I get the desired behavior? I just want to handle a 5 min / 10 min network/storage outage without my LUNs going READ-ONLY on the VM side because all the commands were failed.

Thank you!
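The timing described above can be sketched as simple arithmetic. This is a simplified additive model built only from the three settings named in the message; the function names are hypothetical, and real detection time can differ (for example when outstanding I/O or a TCP error reveals the failure sooner).

```python
def seconds_until_io_fails(noop_out_interval: int,
                           noop_out_timeout: int,
                           replacement_timeout: int) -> int:
    """Rough worst-case time from link loss until I/O is failed upward.

    Assumed model: up to one ping interval passes before the next ping is
    sent, the ping then times out, and only after that does the
    replacement_timeout window (during which commands are queued) begin.
    """
    detection = noop_out_interval + noop_out_timeout
    return detection + replacement_timeout


def covers_outage(outage_seconds: int, replacement_timeout: int) -> bool:
    """True if commands would still be queued when the outage ends.

    Conservative check: it ignores the detection delay and the time the
    re-login itself takes, so leave some margin above the expected outage.
    """
    return replacement_timeout > outage_seconds


# Values from the message: 5 s ping interval, 5 s ping timeout, 120 s window.
print(seconds_until_io_fails(5, 5, 120))
# A 10 minute outage against the current and a larger, example setting.
print(covers_outage(600, 120))
print(covers_outage(600, 900))
```

Under this model the settings in the message give roughly 130 seconds before errors reach the filesystem, which is well short of a 5 or 10 minute outage.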
On Friday, March 29, 2013 4:40:31 PM UTC-3, Mike Christie wrote:
>
> On 03/27/2013 07:17 PM, alejandro...@gmail.com wrote:
> > Hi guys, im new to the list, and i apologise in advance to write this,
> > but i've been reading a lot and i dont seem to find an answer to an
> > specific question i have.
> >
> > We are using Netapp to deploy nova volume to our openstack KVM
> > instances, this means :
> >
> > #1 We have lots of phisical servers running kvm virtual instances
> > #2 this physical servers connect against our netapp appliances through iSCSI
> > #3 we attach this iscsi sessions to the instances for them to see it as
> > a new block device ( the instances dont see it as an scsi session, just
> > the phisical server )
> >
> > The thing is, we want to be able to handle a 1 minute disconnection
> > caused either by a storage reboot or a network outage ( both, no more
> > than a minute o minute and a half )
>
> What do you mean by handle? Retried in the iscsi/scsi layer? Failed to
> upper levels like multipath or some clustering software?
>
> Are you using dm-multipath with iscsi?
>
> Section 8. Advanced Configuration of the iscsi README might be helpful.
>
> > What i want to understand and i dont seem to ( again, sorry ) is ...
> >
>
> Tell me if you are using dm-multipath and what you want to do. I can
> then answer the questions below.
>
> > #1 what parameter/s to touch from the open-iscsi config on the physical
> > host to handle that amount of time the dissconection
> > #2 in the mean time, supposing i've touched thos parameters, all the
> > data that needs to be writen and wont, were is it cached on the physical
> > server side? in ram ? in our local disk ?
> > #3 and those retries, i imagine i will see exactly the same in the
> > physical and virtual side, are reflected till the connection is
> > reestablished, as I/O wait just increasing, and CPU waiting for the
> > write to finish ?
> >
> > Hope i made myself clear, and sorri if all my questions were answered
> > and i wasnt able to find it.
> > I'll wait for all your help to understand a little more.
> >
> > Thank you very much.

--
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscr...@googlegroups.com.
To post to this group, send email to open-iscsi@googlegroups.com.
Visit this group at http://groups.google.com/group/open-iscsi?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
how to handle (and get out victorious) 1 minute disconnection againts a target.
Hi guys, I'm new to the list, and I apologise in advance for writing this, but I've been reading a lot and I can't seem to find an answer to a specific question I have.

We are using NetApp to deploy nova volumes to our OpenStack KVM instances. This means:

#1 We have lots of physical servers running KVM virtual instances.
#2 These physical servers connect to our NetApp appliances through iSCSI.
#3 We attach these iSCSI sessions to the instances so they see them as a new block device (the instances don't see it as a SCSI session, only the physical server does).

The thing is, we want to be able to handle a 1 minute disconnection caused either by a storage reboot or a network outage (both last no more than a minute, or a minute and a half).

What I want to understand and don't seem to (again, sorry) is:

#1 Which parameter(s) in the open-iscsi config on the physical host do I need to touch to ride out a disconnection of that length?
#2 In the meantime, supposing I've touched those parameters, where is all the data that needs to be written (and can't be) cached on the physical server side? In RAM? On our local disk?
#3 And those retries: I imagine I will see exactly the same on the physical and the virtual side until the connection is reestablished, that is, I/O wait just increasing and the CPU waiting for the write to finish?

I hope I made myself clear, and sorry if all my questions were already answered and I wasn't able to find it. I'll wait for your help to understand a little more.

Thank you very much.