Re: how to handle (and get out victorious) 1 minute disconnection against a target.

2013-04-04 Thread Alejandro Comisario
To clarify the tests below: we simulate the disconnection with an 
"iptables" rule that DROPs all connections to the target.
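
For anyone reproducing the test, a rule of this kind could be sketched as below. The exact rule was not posted, so the chain (OUTPUT) is an assumption, the address 10.1.1.160 and port 3260 are taken from the node record in this thread, and the helper name is hypothetical. The function only builds the command line; actually running it requires root on the initiator.

```python
import shlex

def iptables_drop_cmd(target_ip, port=3260, delete=False):
    """Build an iptables command that drops outgoing TCP traffic to the portal.

    Assumption: dropping packets in the OUTPUT chain on the initiator is
    enough to simulate the outage; the original rule was not shown.
    """
    action = "-D" if delete else "-A"  # -A appends the rule, -D removes it
    return ["iptables", action, "OUTPUT", "-d", target_ip,
            "-p", "tcp", "--dport", str(port), "-j", "DROP"]

# Print the commands that would start and end the simulated outage.
print(shlex.join(iptables_drop_cmd("10.1.1.160")))
print(shlex.join(iptables_drop_cmd("10.1.1.160", delete=True)))
```

Removing the rule with the same arguments and `-D` ends the simulated outage.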

On Thursday, April 4, 2013 2:59:21 PM UTC-3, Alejandro Comisario wrote:
>
> Mike, I think something is wrong and I can't tell what. I am still 
> getting I/O errors reported to the filesystem layer after 120 seconds, 
> but I don't know whether they come from the iSCSI side or whether some 
> other configuration is involved. This is what we negotiated with the target:
>
> root@DC2-r16-22vms:/var/log# iscsiadm -m node -T iqn.1992-08.com.netapp:sn.1574861693
> # BEGIN RECORD 2.0-871
> node.name = iqn.1992-08.com.netapp:sn.1574861693
> node.tpgt = 2000
> node.startup = automatic
> iface.hwaddress = 
> iface.ipaddress = 
> iface.iscsi_ifacename = default
> iface.net_ifacename = 
> iface.transport_name = tcp
> iface.initiatorname = 
> node.discovery_address = 10.1.1.160
> node.discovery_port = 3260
> node.discovery_type = send_targets
> node.session.initial_cmdsn = 0
> node.session.initial_login_retry_max = 8
> node.session.xmit_thread_priority = -20
> node.session.cmds_max = 128
> node.session.queue_depth = 32
> node.session.auth.authmethod = None
> node.session.auth.username = 
> node.session.auth.password = 
> node.session.auth.username_in = 
> node.session.auth.password_in = 
> node.session.timeo.replacement_timeout = 600
> node.session.err_timeo.abort_timeout = 15
> node.session.err_timeo.lu_reset_timeout = 20
> node.session.err_timeo.host_reset_timeout = 60
> node.session.iscsi.FastAbort = Yes
> node.session.iscsi.InitialR2T = No
> node.session.iscsi.ImmediateData = Yes
> node.session.iscsi.FirstBurstLength = 262144
> node.session.iscsi.MaxBurstLength = 16776192
> node.session.iscsi.DefaultTime2Retain = 0
> node.session.iscsi.DefaultTime2Wait = 2
> node.session.iscsi.MaxConnections = 1
> node.session.iscsi.MaxOutstandingR2T = 1
> node.session.iscsi.ERL = 0
> node.conn[0].address = 10.1.1.160
> node.conn[0].port = 3260
> node.conn[0].startup = manual
> node.conn[0].tcp.window_size = 524288
> node.conn[0].tcp.type_of_service = 0
> node.conn[0].timeo.logout_timeout = 15
> node.conn[0].timeo.login_timeout = 15
> node.conn[0].timeo.auth_timeout = 45
> node.conn[0].timeo.noop_out_interval = 5
> node.conn[0].timeo.noop_out_timeout = 5
> node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
> node.conn[0].iscsi.HeaderDigest = None
> node.conn[0].iscsi.DataDigest = None
> node.conn[0].iscsi.IFMarker = No
> node.conn[0].iscsi.OFMarker = No
> # END RECORD
>
> We ran a dd against the mounted ext4 LUN, but after 120 seconds we see 
> these errors in /var/log/syslog:
>
> Apr  4 10:33:41 DC2-r16-22vms iscsid: Kernel reported iSCSI connection 3:0 
> error (1011) state (3)
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018502] INFO: task 
> jbd2/sdc1-8:100283 blocked for more than 120 seconds.
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018505] "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018507] jbd2/sdc1-8 D 
> 81806240 0 100283  2 0x
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018512]  881015d39ac0 
> 0046 881015d39a60 8103ecf9
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018517]  881015d39fd8 
> 881015d39fd8 881015d39fd8 000137c0
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018521]  881018a7dc00 
> 881016224500 881015d39a90 88203ee14080
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018525] Call Trace:
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018535]  [] 
> ? default_spin_lock_flags+0x9/0x10
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018541]  [] 
> ? __lock_page+0x70/0x70
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018546]  [] 
> schedule+0x3f/0x60
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018548]  [] 
> io_schedule+0x8f/0xd0
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018551]  [] 
> sleep_on_page+0xe/0x20
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018554]  [] 
> __wait_on_bit+0x5f/0x90
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018557]  [] 
> wait_on_page_bit+0x78/0x80
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018563]  [] 
> ? autoremove_wake_function+0x40/0x40
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018566]  [] 
> filemap_fdatawait_range+0x10c/0x1a0
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018570]  [] 
> filemap_fdatawait+0x2b/0x30
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018577]  [] 
> journal_finish_inode_data_buffers+0x70/0x170
> Apr  4 10:37:17 DC2-r16-22vms kernel: [71981.018580]  [] 
> jbd2_journ

Re: how to handle (and get out victorious) 1 minute disconnection against a target.

2013-04-04 Thread Alejandro Comisario
not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968926]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968929]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968931]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968934]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968936]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968939]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968942]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968945]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968948]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968950]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968954]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968957]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968960]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968963]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968966]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968969]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968981]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968992] sd 8:0:0:1: [sdc] 
Unhandled error code
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.968996] sd 8:0:0:1: [sdc] 
 Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969005]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969011] sd 8:0:0:1: [sdc] CDB: 
Write(10): 2a 00 00 1d 38 00 00 04 00 00
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969028] end_request: 
recoverable transport error, dev sdc, sector 1914880
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969034]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969037]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969040] Buffer I/O error on 
device sdc1, logical block 239104
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969045]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969047] Buffer I/O error on 
device sdc1, logical block 239105
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969051] Buffer I/O error on 
device sdc1, logical block 239106
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969056]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969063]  session3: 
iscsi_queuecommand iscsi: cmd 0x2a is not queued (983040)
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969067] Buffer I/O error on 
device sdc1, logical block 239107
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969070] Buffer I/O error on 
device sdc1, logical block 239108
Apr  4 10:43:42 DC2-r16-22vms kernel: [72365.969074] Buffer I/O error on 
device sdc1, logical block 239109

Do you see anything odd here? Is there something else that needs to be 
configured, or something that is misconfigured?
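
One way to read the timestamps, assuming the iscsid error at 10:33:41 and the kernel errors at 10:43:42 come from the same test run: the "blocked for more than 120 seconds" line looks like the kernel's hung-task warning (the message itself points at hung_task_timeout_secs) rather than a failed I/O, while the first DID_TRANSPORT_FAILFAST and Buffer I/O errors show up about 600 seconds after the connection error, which matches replacement_timeout = 600. A small sketch of that arithmetic (the helper name is hypothetical):

```python
from datetime import datetime

FMT = "%b %d %H:%M:%S"  # syslog timestamps carry no year

def seconds_between(start, end):
    """Elapsed seconds between two syslog timestamps from the same year."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())

conn_error = "Apr 4 10:33:41"  # iscsid: connection 3:0 error (1011)
hung_task = "Apr 4 10:37:17"   # kernel: jbd2 blocked for more than 120 seconds
io_failed = "Apr 4 10:43:42"   # kernel: DID_TRANSPORT_FAILFAST, Buffer I/O error

print(seconds_between(conn_error, hung_task))  # warning only, I/O still queued
print(seconds_between(conn_error, io_failed))  # close to replacement_timeout
```

If that reading is right, the iSCSI layer held the I/O for the configured time, and the 120-second message is a separate watchdog reporting that a task has been waiting.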

On Tuesday, April 2, 2013 3:14:14 AM UTC-3, Mike Christie wrote:
>
>
> On Mar 29, 2013, at 4:30 PM, Alejandro Comisario wrote: 
> > node.session.timeo.replacement_timeout = 120 
> > 
> > means that a ping is sent to the target every 5 seconds and, if there is 
> > no reply within 5 seconds, it will trigger the HR, which will wait (and 
> > queue commands) for replacement_timeout seconds before failing them to 
> > the upper layers. 
> > So, just by increasing replacement_timeout, will I get the desired 
> > behavior? 
>
> Yes. Just set the replacement_timeout to how long you want the iscsi layer 
> to reconnect before it fails IO to upper layers.
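
For reference, a sketch of how the stored node record could be updated with iscsiadm. The target name and portal are the ones shown elsewhere in this thread, the value 600 is just an example, and the helper name is hypothetical. The function only builds the command; as far as I know the new value is picked up at the next login of the session, so treat that part as an assumption to verify on your version.

```python
import shlex

def iscsiadm_update_cmd(target, portal, name, value):
    """Build an 'iscsiadm -m node -o update' command for one setting."""
    return ["iscsiadm", "-m", "node", "-T", target, "-p", portal,
            "-o", "update", "-n", name, "-v", str(value)]

cmd = iscsiadm_update_cmd(
    "iqn.1992-08.com.netapp:sn.1574861693",    # target from this thread
    "10.1.1.160:3260",                         # portal from this thread
    "node.session.timeo.replacement_timeout",
    600,                                       # example value, in seconds
)
print(shlex.join(cmd))
```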

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to open-iscsi+unsubscr...@googlegroups.com.
To post to this g

Re: how to handle (and get out victorious) 1 minute disconnection against a target.

2013-03-29 Thread Alejandro Comisario
Hi Mike, sorry my post was incomplete.
No, I'm not using multipath; each physical node is connected directly to a 
LUN on the storage.
What I want, I think, is for the iSCSI layer to retry all commands until we 
reconnect, so that the virtual machines on that host that are using those 
LUNs just see I/O wait increasing until the connection is re-established.

I have these parameters on the initiator side:

node.startup = automatic
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 20
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.session.iscsi.FastAbort = Yes

As I understand it, having:

node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.timeo.replacement_timeout = 120

means that a ping is sent to the target every 5 seconds and, if there is no 
reply within 5 seconds, it will trigger the HR, which will wait (and queue 
commands) for replacement_timeout seconds before failing them to the upper 
layers.
So, just by increasing replacement_timeout, will I get the desired 
behavior?
I just want to ride out a 5 to 10 minute network or storage outage without 
my LUNs going READ-ONLY on the VM side because all the commands were failed.
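
The timing described above can be sketched roughly as follows. This is only an estimate under my reading of the settings: it assumes the outage is detected by the NOP-Out ping (not earlier by a TCP error) and that I/O is failed upward once replacement_timeout expires. The function names are hypothetical.

```python
def parse_settings(text):
    """Parse 'key = value' lines (as printed by iscsiadm) into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings

def seconds_until_io_fails(settings):
    """Return (worst-case detection time, time until I/O is failed upward)."""
    interval = int(settings["node.conn[0].timeo.noop_out_interval"])
    timeout = int(settings["node.conn[0].timeo.noop_out_timeout"])
    replacement = int(settings["node.session.timeo.replacement_timeout"])
    detection = interval + timeout  # wait for the next ping, then for its reply
    return detection, detection + replacement

CONFIG = """
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.timeo.replacement_timeout = 120
"""

detection, total = seconds_until_io_fails(parse_settings(CONFIG))
print(detection, total)
```

With the values above this gives about 10 seconds to notice the outage and about 130 seconds before commands are failed, so covering a 10 minute outage would need replacement_timeout of at least 600.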

Thank you !

On Friday, March 29, 2013 4:40:31 PM UTC-3, Mike Christie wrote:
>
> On 03/27/2013 07:17 PM, alejandro...@gmail.com  wrote: 
> > Hi guys, I'm new to the list, and I apologize in advance for writing this, 
> > but I've been reading a lot and I can't seem to find an answer to a 
> > specific question I have. 
> > 
> > We are using NetApp to provide nova volumes to our OpenStack KVM 
> > instances, which means: 
> > 
> > #1 We have lots of physical servers running KVM virtual instances. 
> > #2 These physical servers connect to our NetApp appliances through iSCSI. 
> > #3 We attach these iSCSI sessions to the instances so that they see them 
> > as new block devices (the instances don't see a SCSI session, only the 
> > physical server does). 
> > 
> > The thing is, we want to be able to handle a 1 minute disconnection 
> > caused either by a storage reboot or a network outage (in both cases, no 
> > more than a minute or a minute and a half). 
>
> What do you mean by handle? Retried in the iscsi/scsi layer? Failed to 
> upper levels like multipath or some clustering software? 
>
> Are you using dm-multipath with iscsi? 
>
> Section 8. Advanced Configuration of the iscsi README might be helpful. 
>
> > What I want to understand and can't seem to (again, sorry) is: 
> > 
>
> Tell me if you are using dm-multipath and what you want to do. I can 
> then answer the questions below. 
>
> > #1 Which parameter(s) in the open-iscsi config on the physical host 
> > should I change to ride out a disconnection of that length? 
> > #2 In the meantime, supposing I've changed those parameters, where is 
> > all the data that needs to be written, and can't be, cached on the 
> > physical server side? In RAM? On our local disk? 
> > #3 And those retries: I imagine I will see exactly the same on the 
> > physical and virtual sides, with I/O wait just increasing and the CPU 
> > waiting for the write to finish until the connection is re-established? 
> > 
> > I hope I made myself clear, and sorry if all my questions have already 
> > been answered and I wasn't able to find the answers. 
> > I'll wait for your help to understand a little more. 
> > 
> > Thank you very much. 
> > 

how to handle (and get out victorious) 1 minute disconnection against a target.

2013-03-28 Thread alejandro . comisario
Hi guys, I'm new to the list, and I apologize in advance for writing this, 
but I've been reading a lot and I can't seem to find an answer to a specific 
question I have.

We are using NetApp to provide nova volumes to our OpenStack KVM instances, 
which means:

#1 We have lots of physical servers running KVM virtual instances.
#2 These physical servers connect to our NetApp appliances through iSCSI.
#3 We attach these iSCSI sessions to the instances so that they see them as 
new block devices (the instances don't see a SCSI session, only the 
physical server does).

The thing is, we want to be able to handle a 1 minute disconnection caused 
either by a storage reboot or a network outage (in both cases, no more than 
a minute or a minute and a half).
What I want to understand and can't seem to (again, sorry) is:

#1 Which parameter(s) in the open-iscsi config on the physical host should 
I change to ride out a disconnection of that length?
#2 In the meantime, supposing I've changed those parameters, where is all 
the data that needs to be written, and can't be, cached on the physical 
server side? In RAM? On our local disk?
#3 And those retries: I imagine I will see exactly the same on the physical 
and virtual sides, with I/O wait just increasing and the CPU waiting for 
the write to finish until the connection is re-established?

I hope I made myself clear, and sorry if all my questions have already been 
answered and I wasn't able to find the answers.
I'll wait for your help to understand a little more.

Thank you very much.
