Arrghhh, some bad compilers use SIGUSR1 to communicate with themselves. I have had this happen. Just keep typing 'cont' until you hit the actual SEGV.

Matt
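A related gdb convenience (a sketch, assuming a stock gdb): instead of repeatedly typing 'cont', you can tell gdb to pass SIGUSR1 through to the program without stopping, so the only break left is the real fault:

    (gdb) handle SIGUSR1 nostop noprint pass
    (gdb) cont

The 'pass' keyword keeps delivering the signal to the program, so anything in the runtime that actually depends on SIGUSR1 still works.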
On 5/28/06, Randall Mackie <randy at geosystem.us> wrote:
>
> Satish,
>
> Yes, PETSc was compiled in debug mode.
>
> Since I'm simply storing vectors in a temporary file, could I get
> around this by using VecView and writing each vector to the
> same Viewer in binary format, then reading them later?
>
> In other words:
>
> do loop=1,n
>    call VecView (xvec(:,loop).....)
> end do
>
> then later
>
> do loop=1,n
>    call VecLoad (xvec(:,loop)....)
> end do
>
> Randy
>
> ps. I'll try your other suggestions as well. However, this code has worked
> flawlessly until now, with a model much much larger than I've used in the
> past.
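On the VecView/VecLoad question above: successive VecView calls do append to the same binary viewer, and VecLoad reads the objects back in the order they were written. A minimal sketch of that pattern, assuming the usual PETSc Fortran includes are in place, that xvec(1:n) holds already-created Vec handles (adjust the indexing to your layout), and a current-style interface - the exact VecLoad calling sequence has varied between PETSc releases, so check the man page for your version:

      PetscViewer    viewer
      PetscErrorCode ierr

!     write all n vectors, appended to one binary file
      call PetscViewerBinaryOpen(PETSC_COMM_WORLD,'xvec.bin',
     &     FILE_MODE_WRITE,viewer,ierr)
      do loop=1,n
         call VecView(xvec(loop),viewer,ierr)
      end do
      call PetscViewerDestroy(viewer,ierr)

!     ... later: read them back in the same order
      call PetscViewerBinaryOpen(PETSC_COMM_WORLD,'xvec.bin',
     &     FILE_MODE_READ,viewer,ierr)
      do loop=1,n
         call VecLoad(xvec(loop),viewer,ierr)
      end do
      call PetscViewerDestroy(viewer,ierr)

The file name 'xvec.bin' and the loop variable are placeholders; with this calling sequence the vectors on the read side must already be created before VecLoad fills them.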
>
>
> Satish Balay wrote:
> > - Not sure what SIGUSR1 means in this context.
> >
> > - The stack doesn't show any PETSc/user code. Was
> >   this code compiled with a debug version of PETSc?
> >
> > - It could be that gdb is unable to look at the Intel compiler's stack
> >   [normally gdb should work]. If that's the case - you could run with
> >   '-start_in_debugger idb'
> >
> > - It appears that this breakage is from user code which calls Fortran
> >   I/O [for_write_dir_xmit()]. There is no Fortran I/O from the PETSc side
> >   of the code. I think it could still be a bug in the user code.
> >
> >   However, PETSc does try to detect the availability of
> >   _intel_fast_memcpy() and use it from the C side. I don't think this is
> >   the cause. But to verify you could remove the flag
> >   PETSC_HAVE__INTEL_FAST_MEMCPY from petscconf.h and rebuild the libraries.
> >
> > Satish
> >
> > On Sun, 28 May 2006, Randall Mackie wrote:
> >
> >> Satish,
> >>
> >> Thanks, using method (2) worked. However, when I run a bt in gdb,
> >> I get the following output:
> >>
> >> Loaded symbols for /lib/libnss_files.so.2
> >> 0x080b2631 in d3inv_3_3 () at d3inv_3_3.F:2063
> >> 2063       call VecAssemblyBegin(xyz,ierr)
> >> (gdb) cont
> >> Continuing.
> >>
> >> Program received signal SIGUSR1, User defined signal 1.
> >> [Switching to Thread 1082952160 (LWP 23496)]
> >> 0x088cd729 in _intel_fast_memcpy.J ()
> >> Current language:  auto; currently fortran
> >> (gdb) bt
> >> #0  0x088cd729 in _intel_fast_memcpy.J ()
> >> #1  0x40620628 in for_write_dir_xmit ()
> >>     from /opt/intel_fc_80/lib/libifcore.so.5
> >> #2  0xbfffa6b0 in ?? ()
> >> #3  0x00000008 in ?? ()
> >> #4  0xbfff986c in ?? ()
> >> #5  0xbfff9890 in ?? ()
> >> #6  0x406873a8 in __dtors_list_end () from /opt/intel_fc_80/lib/libifcore.so.5
> >> #7  0x00000002 in ?? ()
> >> #8  0x00000000 in ?? ()
> >> (gdb)
> >>
> >> This all makes me think this is an INTEL compiler bug, and has nothing to
> >> do with my code.
> >>
> >> Any ideas?
> >>
> >> Randy
> >>
> >> Satish Balay wrote:
> >>> Looks like you have direct access to all the cluster nodes. Perhaps
> >>> you have admin access? You can do either of the following:
> >>>
> >>> * If the cluster frontend/compute nodes have a common filesystem [i.e.
> >>>   all machines can see the same file for ~/.Xauthority] and you can get
> >>>   the 'sshd' settings on the frontend changed - then:
> >>>
> >>>   - configure sshd with 'X11UseLocalhost no' - this way xterms on the
> >>>     compute nodes can connect to the 'ssh-x11' port on the frontend - run
> >>>     the PETSc app with: '-display frontend:ssh-x11-port'
> >>>
> >>> * However, if the above is not possible - but you can ssh directly to
> >>>   all the compute nodes [perhaps from the frontend] - then you can
> >>>   cascade X11 forwarding with:
> >>>
> >>>   - ssh from the desktop to the frontend
> >>>   - ssh from the frontend to node-9 [if you know which machine is node-9
> >>>     from the machine file]
> >>>   - if you don't know which one is node-9 - then ssh from the frontend
> >>>     to all the nodes :). Most likely all nodes will get a display
> >>>     'localhost:10.0'
> >>>   - so now you can run the executable with the option
> >>>     -display localhost:10.0
> >>>
> >>> The other alternative that might work [for interactive runs] is:
> >>>
> >>>   -start_in_debugger noxterm -debugger_nodes 9
> >>>
> >>> Satish
> >>>
> >>> On Sat, 27 May 2006, Randall Mackie wrote:
> >>>
> >>>> I can't seem to get the debugger to pop up on my screen.
> >>>>
> >>>> When I'm logged into the cluster I'm working on, I can
> >>>> type xterm &, and an xterm pops up on my display. So I know
> >>>> I can get something from the remote cluster.
> >>>>
> >>>> Now, when I try this using PETSc, I'm getting the following error
> >>>> message, for example:
> >>>>
> >>>> ------------------------------------------------------------------------
> >>>> [17]PETSC ERROR: PETSC: Attaching gdb to
> >>>> /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc of pid 3628 on display
> >>>> 24.5.142.138:0.0 on machine compute-0-23.local
> >>>> ------------------------------------------------------------------------
> >>>>
> >>>> I'm using this in my command file:
> >>>>
> >>>> source ~/.bashrc
> >>>> time /opt/mpich/intel/bin/mpirun -np 20 -nolocal -machinefile machines \
> >>>>   /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc \
> >>>>   -start_in_debugger \
> >>>>   -debugger_node 1 \
> >>>>   -display 24.5.142.138:0.0 \
> >>>>   -em_ksp_type bcgs \
> >>>>   -em_sub_pc_type ilu \
> >>>>   -em_sub_pc_factor_levels 8 \
> >>>>   -em_sub_pc_factor_fill 4 \
> >>>>   -em_sub_pc_factor_reuse_ordering \
> >>>>   -em_sub_pc_factor_reuse_fill \
> >>>>   -em_sub_pc_factor_mat_ordering_type rcm \
> >>>>   -divh_ksp_type cr \
> >>>>   -divh_sub_pc_type icc \
> >>>>   -ppc_sub_pc_type ilu \
> >>>>   << EOF
> >>
> >
>
> --
> Randall Mackie
> GSY-USA, Inc.
> PMB# 643
> 2261 Market St.,
> San Francisco, CA 94114-1600
> Tel (415) 469-8649
> Fax (415) 469-5044
>
> California Registered Geophysicist
> License No. GP 1034
>

-- 
"Failure has a thousand explanations. Success doesn't need one"
     -- Sir Alec Guinness
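To make the cascaded X11-forwarding route Satish describes above concrete, a sketch assuming OpenSSH on the desktop, frontend, and nodes (user and host names are placeholders; the rest is taken from the command file quoted above):

    # step 1 - from the desktop: log in to the frontend with X11 forwarding
    ssh -X user@frontend

    # step 2 - from the frontend: open a forwarded session to the debug node
    #          (keep this session open; it provides the tunnel)
    ssh -X node-9
    echo $DISPLAY            # typically something like localhost:10.0

    # step 3 - back on the frontend, launch the job as before, but point the
    #          debugger xterm at the forwarded display on the node:
    time /opt/mpich/intel/bin/mpirun -np 20 -nolocal -machinefile machines \
        /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc \
        -start_in_debugger \
        -debugger_nodes 9 \
        -display localhost:10.0 \
        ...                  # remaining solver options as in the command file

Note that Satish's example uses '-debugger_nodes' (plural), while the command file above has '-debugger_node'; that spelling is worth double-checking, and the display string should match whatever $DISPLAY reports on the node.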
