Hi Roger,

On Dec 8, 2010, at 11:41 AM, Roger Martin wrote:

> Hi Quincey,       [got the 'e' this time]
> 
> The problem cannot be repeated in a C example program because the problem is 
> upstream use of vectors with a -= operation.  The failure in this area did 
> not show up except in this MPI application, even though the same library and 
> code is used in a single-process application.
> ...............
> scores[qbChemicalShiftTask->getNMRAtomIndices()[index]]-=qbChemicalShiftTask->getNMRTrace()[index];
> .............
> where scores etc. are std::vector<double> and std::vector<int> typedefs.  In 
> certain runs the indices were incorrect, so this code was badly constructed and 
> needed to be rewritten to align its indices with the earlier construction of 
> the scores vector and to add better checking.
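> 
> For reference, a minimal sketch of the kind of bounds checking such a rewrite 
> might add; the qbChemicalShiftTask accessors are the ones quoted above, while 
> the loop bounds and the at() checks are illustrative assumptions, not the 
> actual fix:
> 
> const std::vector<int>&    indices = qbChemicalShiftTask->getNMRAtomIndices();
> const std::vector<double>& trace   = qbChemicalShiftTask->getNMRTrace();
> for (std::size_t index = 0; index < indices.size() && index < trace.size(); ++index)
> {
>     // at() throws std::out_of_range instead of silently stepping on other
>     // memory (e.g. the H5SL skip list) when an index is out of range.
>     scores.at(indices[index]) -= trace[index];
> }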
> 
> These vectors were not used as input to any HDF5 interfaces; HDF5 looks clean 
> and stable in the file closing.  The problem was entirely in non-HDF5 code, 
> but it resulted in stepping on the H5SL skip list in my build and run of 
> the integrated system.
> 
> Thank you for being ready to look into it if it could be duplicated and shown 
> to be in the H5SL remove area.  The problem wasn't in HDF5 code.

        Ah, that's good to hear, thanks!

                Quincey

> On 12/08/2010 10:12 AM, Roger Martin wrote:
>> Hi Quincy,
>> 
>> I'll be pulling pieces out of the large C++ project into a small test C 
>> program to see if the segfault can be duplicated in a manageable example and, 
>> if I succeed, I will send it to you.
>> 
>> MemoryScape and gdb (NetBeans) don't show any memory issues in our library 
>> code and HDF5.  MemoryScape doesn't expand through the H5SL_REMOVE 
>> macro, so in another working copy I'm rewriting a copy of it as a 
>> function that can be stepped through.
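>> 
>> To illustrate the idea with a toy macro (not the real H5SL_REMOVE body): the 
>> debugger sees a macro as one opaque expanded statement, while the same body 
>> wrapped in a static function gets line-by-line stepping and visible locals.
>> 
>> /* toy macro: MemoryScape/gdb cannot step inside the expansion */
>> #define SQUARE_INPLACE(x) do { (x) = (x) * (x); } while (0)
>> 
>> /* same body as a function: breakpoints and single-stepping work */
>> static void square_inplace(double *x) { *x = (*x) * (*x); }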
>> 
>> On 12/07/2010 05:20 PM, Quincey Koziol wrote:
>>> Hi Roger,
>>> 
>>> On Dec 7, 2010, at 2:06 PM, Roger Martin wrote:
>>> 
>>>> Further:
>>>> 
>>>> Debugging with MemoryScape reveals a segfault in H5SL.c (1.8.5) at line 1068:
>>>> ...1068....
>>>>            H5SL_REMOVE(SCALAR, slist, x, const haddr_t, key, -)  // H5SL_TYPE_HADDR case
>>>> ....
>>>> 
>>>> The stack trace is:
>>>> H5SL_remove               1068
>>>> H5C_flush_single_entry    7993
>>>> H5C_flush_cache           1395
>>>> H5AC_flush                 941
>>>> H5F_flush                 1673
>>>> H5F_dest                   996
>>>> H5F_try_close             1900
>>>> H5F_close                 1750
>>>> H5I_dec_ref               1490
>>>> H5F_close                 1951
>>>> 
>>>> I'll be adding printouts to see which variable/pointer is causing the 
>>>> segfault.  The MemoryScape Frame shows:
>>>> ..............
>>>> Stack Frame
>>>> Function "H5SL_remove":
>>>>  slist:                       0x0b790fc0 (Allocated) ->  (H5SL_t)
>>>>  key:                         0x0b9853f8 (Allocated Interior) ->  
>>>> 0x000000000001affc (110588)
>>>> Block "$b8":
>>>>  _last:                       0x0b772270 (Allocated) ->  (H5SL_node_t)
>>>>  _llast:                      0x0001affc ->  (H5SL_node_t)
>>>>  _next:                       0x0b9855c0 (Allocated) ->  (H5SL_node_t)
>>>>  _drop:                       0x0b772270 (Allocated) ->  (H5SL_node_t)
>>>>  _ldrop:                      0x0b772270 (Allocated) ->  (H5SL_node_t)
>>>>  _count:                      0x00000000 (0)
>>>>  _i:<Bad address: 0x00000000>
>>>> Local variables:
>>>>  x:<Bad address: 0x00000000>
>>>>  hashval:<Bad address: 0x00000000>
>>>>  ret_value:<Bad address: 0x00000000>
>>>>  FUNC:                        "H5SL_remove"
>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>> ................
>>>> 
>>>> Some of the variables have bad addresses, such as x, which was set by "x 
>>>> = slist->header;" (slist being the skip list).
>>>> 
>>>> These appear to be internal API functions, and I'm wondering how I could be 
>>>> offending them from high-level API calls and file interfaces.  What could 
>>>> be in the H5C cache when
>>>> H5Fget_obj_count(fileID, H5F_OBJ_ALL) = 1
>>>> and H5Fget_obj_count(fileID, H5F_OBJ_DATASET | H5F_OBJ_GROUP | 
>>>> H5F_OBJ_DATATYPE | H5F_OBJ_ATTR) = 0
>>>> for the file the code is trying to close?
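>>>> 
>>>> For context, a minimal sketch of how that pre-close check looks, built from 
>>>> the two H5Fget_obj_count calls above; variable names are illustrative and 
>>>> error handling is omitted:
>>>> 
>>>> ssize_t allCount     = H5Fget_obj_count(fileID, H5F_OBJ_ALL);
>>>> ssize_t nonFileCount = H5Fget_obj_count(fileID, H5F_OBJ_DATASET | H5F_OBJ_GROUP |
>>>>                                                 H5F_OBJ_DATATYPE | H5F_OBJ_ATTR);
>>>> // allCount comes back as 1 (the file handle itself) and nonFileCount as 0,
>>>> // so nothing but the file itself should be left to flush at close time.
>>>> if (allCount == 1 && nonFileCount == 0)
>>>>     H5Fclose(fileID);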
>>>    Yes, you are correct, that shouldn't happen. :-/  Do you have a simple C 
>>> program you can send to show this failure?
>>> 
>>>    Quincey
>>> 
>>>> On 12/03/2010 11:33 AM, Roger Martin wrote:
>>>>> Hi,
>>>>> 
>>>>> Using HDF5 1.8.5 and 1.8.6-pre2; OpenMPI 1.4.3 on Linux (RHEL 4 and RHEL 5)
>>>>> 
>>>>> 
>>>>> In a case where the HDF5 operations aren't using MPI but build an .h5 file 
>>>>> exclusive to each individual MPI job/process:
>>>>> 
>>>>> The create:
>>>>> currentFileID = H5Fcreate(filePath.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, 
>>>>> H5P_DEFAULT);
>>>>> 
>>>>> and many file operations using the high-level (HL) APIs, including packet 
>>>>> tables, tables, datasets, etc., all perform successfully.
>>>>> 
>>>>> Then, near the end of each individual process,
>>>>> H5Fclose(currentFileID);
>>>>> is called but doesn't return.  A check for open objects says only one 
>>>>> file object is open and no other objects (groups, datasets, etc.).  No other 
>>>>> software or process is acting on this .h5 file; it is named exclusively for the 
>>>>> one job it is associated with.
>>>>> 
>>>>> This isn't an attempt at parallel HDF5 under MPI.  In another scenario, parallel 
>>>>> HDF5 is working the collective way just fine.  This current issue is for 
>>>>> people who don't have or don't want a parallel file system, so I made a coarse-
>>>>> grained MPI setup that runs independent jobs for these folks.  Each job has its 
>>>>> own .h5 file opened with H5Fcreate(filePath.c_str(), H5F_ACC_TRUNC, 
>>>>> H5P_DEFAULT, H5P_DEFAULT);
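>>>>> 
>>>>> A minimal sketch of that per-job pattern, assuming a hypothetical file name 
>>>>> built from the MPI rank (the real filePath comes from each job's own naming), 
>>>>> with the HL writes elided:
>>>>> 
>>>>> int rank = 0;
>>>>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);        // MPI only partitions the jobs
>>>>> std::ostringstream filePath;
>>>>> filePath << "job_" << rank << ".h5";         // hypothetical per-rank name
>>>>> hid_t currentFileID = H5Fcreate(filePath.str().c_str(), H5F_ACC_TRUNC,
>>>>>                                 H5P_DEFAULT, H5P_DEFAULT);  // serial HDF5, no MPI-IO driver
>>>>> /* ... packet table, table and dataset writes via the HL APIs ... */
>>>>> H5Fclose(currentFileID);                     // the call that never returns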
>>>>> 
>>>>> Where should I look?
>>>>> 
>>>>> I'll try to make a small example test case for show and tell.
>>>>> 
>>>> 
>>> 
>> 
> 


_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
