Hoang-Nam Nguyen wrote:
> Hi Troy!
>
>> The netpipe code is available with mercurial by:
>> hg clone http://source.scl.ameslab.gov/hg/netpipe3-pvfs-dev
>> Once you have pvfs2-1.5.1 installed, you should be able to do 'make
>> pvfs' in the netpipe3-pvfs-dev directory and build NPpvfs.
>> The command line arguments I used to reproduce this were:
>> ./NPpvfs -d $PVFS_FILE_PATH -l 32768 -u 268435456 -n 100 -o
>> $NETPIPE_OUTPUT_FILE
>>
> Thanks for this. I've been struggling with setting up the systems
> to recreate this problem. Please be patient.
> Can you please send me the output of modinfo ib_ehca (or hcad_mod
> in older version)? Also the firmware code level as explained in
> previous email. How much memory have you assigned to the partition?
> With those data I'd be able to have nearly the same envs like yours.
>
>> This is the dmesg log:
>> PU0001 000e0091:ehca_hcall_7arg_7ret HCAD_ERROR opcode=160
>> ret=fffffffffffffff7 arg1=1000000003000004 arg2=5 arg3=4000f830000
>> arg4=10000 arg5=e0000000000000 arg6=eb6b6920 arg7=0 out1=0 out2=0
>> out3=0 out4=0 out5=0 out6=0 out7=0
>> PU0001 00090454:ehca_reg_mr HCAD_ERROR hipz_alloc_mr failed,
>> h_ret=fffffffffffffff7 hca_hndl=1000000003000004
>> PU0001 00090478:ehca_reg_mr <<< ret=ffffffea shca=c0000000e796b000
>> e_mr=c0000000ce865e80 iova_start=000004000f830000 size=10000 acl=7
>> e_pd=c0000000eb6b6920 pginfo=c0000000dfcb3a70 num_pages=10 num_4k=10
>> PU0001 00090176:ehca_reg_user_mr <<< rc=ffffffffffffffea
>> pd=c0000000eb6b6920 region=c0000000ce861dd0 mr_access_flags=7
>> udata=c0000000dfcb3ba0
>>
> I got this already from you and Kyle. I meant the full log with
> debug traces enabled: modprobe ib_ehca debug_level=1 or for older
> versions modprobe hcad_mod debug_level=9999999999999999999999. If
> possible, try to get it. Anyway I'll do that with my test env.
> Thanks!
> Nam
>
>

I believe we have 8GB allocated on each of these boxes (all memory and CPUs allocated to one partition), and we're running firmware version SF240_233.
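For anyone reading the quoted dmesg lines, the ret/rc values are printed as raw hex. The sketch below shows one way to pull the key=value fields out of a trace line and read those values as signed integers. It is only an illustration: the helper names to_signed and parse_trace are made up here, and the 64-bit versus 32-bit widths are an assumption based on how many hex digits the driver printed.

```python
import re


def to_signed(hex_str, bits=64):
    """Interpret a hex string as a two's-complement integer of the given width."""
    value = int(hex_str, 16)
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def parse_trace(line):
    """Collect the key=hexvalue fields of one trace line (values kept as strings)."""
    return dict(re.findall(r"(\w+)=([0-9a-fA-F]+)", line))


# Shortened copies of two lines from the quoted dmesg log.
hcall = parse_trace(
    "PU0001 000e0091:ehca_hcall_7arg_7ret HCAD_ERROR opcode=160 "
    "ret=fffffffffffffff7 arg1=1000000003000004 arg2=5"
)
reg_mr = parse_trace(
    "PU0001 00090478:ehca_reg_mr <<< ret=ffffffea size=10000 num_pages=10"
)

print(to_signed(hcall["ret"]))            # hcall return value, assumed 64-bit
print(to_signed(reg_mr["ret"], bits=32))  # driver return value, assumed 32-bit
```

Read this way, the hcall returns -9 and ehca_reg_mr returns -22; on Linux, 22 is EINVAL. The meaning of the hcall value depends on the hypervisor's return-code table, which is not reproduced here.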
p5l5:~# modinfo hcad_mod
filename:       /lib/modules/2.6.17/kernel/drivers/infiniband/hw/ehca/hcad_mod.ko
version:        SVNEHCA_0009
description:    IBM eServer HCA InfiniBand Device Driver
author:         Christoph Raisch <[EMAIL PROTECTED]>
license:        Dual BSD/GPL
srcversion:     2B35F7963CEB9E6067F3F92
depends:        ib_core
vermagic:       2.6.17 SMP mod_unload gcc-4.0
parm:           open_aqp1:AQP1 on startup (0: no (default), 1: yes) (int)
parm:           debug_level:debug level (0: node, 6: only errors (default), 9: all) (int)
parm:           hw_level:hardware level (0: autosensing (default), 1: v. 0.20, 2: v. 0.21) (int)
parm:           nr_ports:number of connected ports (default: 2) (int)
parm:           use_hp_mr:high performance MRs (0: no (default), 1: yes) (int)
parm:           port_act_time:time to wait for port activation (default: 30 sec) (int)
parm:           poll_all_eqs:polls all event queues periodically (0: no, 1: yes (default)) (int)
parm:           static_rate:set permanent static rate (default: disabled) (int)

Also, setting the debug_level flag definitely caused the server to stop responding. I rebooted and tried it again with the same result: setting the debug_level flag causes the server to crash. (I can still log in, but cannot execute anything, e.g. 'ls'; it seems all the CPUs are spinning.)

p5l5:~# modprobe hcad_mod nr_ports=1 debug_level=99999999

Console output after the above command hangs the server:

PU0003 000e0252:hipz_h_register_rpage >>> adapter_handle=1000000203000004 pagesize=0 queue_type=0 resource_handle=7000000100018600 logical_address_of_page=e6741000 count=200
PU0003 000e0078:ehca_hcall_7arg_7ret >>> opcode=1ac arg1=1000000203000004 arg2=0 arg3=7000000100018600 arg4=e6741000 arg5=200 arg6=0 arg7=0
PU0003 000e0096:ehca_hcall_7arg_7ret <<< opcode=1ac ret=f out1=50 out2=50 out3=50 out4=50 out5=50 out6=50 out7=50
PU0003 000e0263:hipz_h_register_rpage <<< ret=f
PU0003 000e04ad:hipz_h_register_rpage_mr <<< ret=f
PU0003 0009076c:ehca_set_pagebuf >>> pginfo=c0000000eb7b75e0 type=1 num_pages=1d4000 num_4k=1d4000 next_buf=0 next_4k=30600 number=200 kpage=c0000000e6741000 page_cnt=30600 page_4k_cnt=30600 next_listelem=0 region=0000000000000000 next_chunk=0000000000000000 next_nmap=0
PU0003 00090807:ehca_set_pagebuf <<< ret=0 e_mr=c0000000e1ac2e80 pginfo=c0000000eb7b75e0 type=1 num_pages=1d4000 num_4k=1d4000 next_buf=0 next_4k=30800 number=200 kpage=c0000000e6742000 page_cnt=30800 page_4k_cnt=30800 i=200 next_listelem=0 region=0000000000000000 next_chunk=0000000000000000 next_nmap=0
PU0003 000e049e:hipz_h_register_rpage_mr >>> adapter_handle=1000000203000004 mr=c0000000e1ac2e80 mr_handle=7000000100018600 pagesize=0 queue_type=0 logical_address_of_page=e6741000 count=200
PU0003 000e0252:hipz_h_register_rpage >>> adapter_handle=1000000203000004 pagesize=0 queue_type=0 resource_handle=7000000100018600 logical_address_of_page=e6741000 count=200
PU0003 000e0078:ehca_hcall_7arg_7ret >>> opcode=1ac arg1=1000000203000004 arg2=0 arg3=7000000100018600 arg4=e6741000 arg5=200 arg6=0 arg7=0
PU0003 000e0096:ehca_hcall_7arg_7ret <<< opcode=1ac ret=f out1=50 out2=50 out3=50 out4=50 out5=50 out6=50 out7=50
PU0003 000e0263:hipz_h_register_rpage <<< ret=f
<snip, it repeats forever>

-- Kyle Schochenmaier [EMAIL PROTECTED] Research
Assistant, Dr. Brett Bode AmesLab - US Dept.Energy Scalable Computing Laboratory