Hi Randy,

It looks like one of the entries in the attributes database may be damaged, but the error output doesn't give much detail.

Any chance you could apply the attached patch and send the log output again? It doesn't change the server's behavior other than printing more detailed error messages in the code paths that (I think) you are hitting.

thanks,
-Phil

Randall Martin wrote:
I noticed a similar thread where someone ran fsck and recovered. I tried fsck with no luck. I also ran db_verify on all of the .db files, and it didn’t report any problems. Below is the server's debug output:

[D 06/29 15:29] Passing tcp://oss004-4:3337 as BMI listen address.
[D 06/29 15:29] BMI_tcp_initialize: Initializing TCP/IP module.
[D 06/29 15:29] BMI_tcp_initialize: TCP/IP module successfully initialized.
[D 06/29 15:29] Server using shm key hint: 373672738
[D 06/29 15:29] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 11
[D 06/29 15:29] Default socket buffers send:16384 receive:87380
[D 06/29 15:29] Setting socket buffer size for send:0 receive:0
[D 06/29 15:29] Reread socket buffers send:16384 receive:87380
[D 06/29 15:29] [BMI CONTROL]: BMI_set_info: set_info: 0 option: 12
[D 06/29 15:29] Default socket buffers send:16384 receive:87380
[D 06/29 15:29] Setting socket buffer size for send:0 receive:0
[D 06/29 15:29] Reread socket buffers send:16384 receive:87380
[D 06/29 15:29] dbpf_thread_initialize: initialized
[D 06/29 15:29] [SYNC_COALESCE]: dbpf_sync_context_init for context 0 called
[D 06/29 15:29] dbpf_collection_lookup of coll: pvfs2-fs
[D 06/29 15:29] dbpf using default db cache size.
[D 06/29 15:29] dbpf using shm key: 1020239961
[D 06/29 15:29] collection lookup: version is 0.1.4
[D 06/29 15:29] [SYNC_COALESCE]: dbpf_sync_context_init for context 1 called
[D 06/29 15:29] dbpf collection 373672578 - Setting handle timeout to 360000000 microseconds
[D 06/29 15:29] - set handle re-use timeout to 360 seconds (ret=0)
[D 06/29 15:29] dbpf collection 373672578 - Setting cache keywords of attribute cache to dh,
[D 06/29 15:29] Setting dbpf_attr_cache keywords to:
dh,
[D 06/29 15:29] dbpf collection 373672578 - Setting cache size of attribute cache to 511
[D 06/29 15:29] dbpf collection 373672578 - Setting maximum elements of attribute cache to 1024
[D 06/29 15:29] dbpf collection 373672578 - Initialize collection attr. cache
[D 06/29 15:29] There are 1 cacheable keywords registered
[D 06/29 15:29] dbpf_attr_cache_initialize: initialized
[D 06/29 15:29] dbpf collection 373672578 - Setting collection handle ranges to 4323455642275676148-4611686018427387890,8935141660703064036-9223372036854775778
[D 06/29 15:29] op_queue add: 0x9f96380
[D 06/29 15:29] dbpf_thread_function started
[D 06/29 15:29] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES)
[D 06/29 15:29] handle_new_connection: Assigning socket 11 to new method addr.
[D 06/29 15:29] tcp_do_work_recv: Reading header for new op.
[D 06/29 15:29] tcp_do_work_recv: Received new message; mode: 2.
[D 06/29 15:29] tcp_do_work_recv: tag: 5865658
[D 06/29 15:29] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES) (ret: 1)
[D 06/29 15:29] op_queue add: 0x9f96380
[D 06/29 15:29] handle_new_connection: Assigning socket 12 to new method addr.
[D 06/29 15:29] op_queue add: 0x9f9da50
[D 06/29 15:29] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES)
[D 06/29 15:29] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES) (ret: 1)
[D 06/29 15:29] op_queue add: 0x9f9da50
[D 06/29 15:29] op_queue add: 0x9fa63d0
[D 06/29 15:29] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES)
[D 06/29 15:29] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES) (ret: 1)
[D 06/29 15:29] op_queue add: 0x9fa63d0
[D 06/29 15:29] op_queue add: 0x9fad360
[D 06/29 15:29] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES)
[D 06/29 15:29] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES) (ret: 1)
[D 06/29 15:29] op_queue add: 0x9fad360
[D 06/29 15:29] op_queue add: 0x9fb0bf0
[D 06/29 15:29] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES)
[D 06/29 15:29] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES) (ret: 1)
[D 06/29 15:29] op_queue add: 0x9fb0bf0
[D 06/29 15:29] op_queue add: 0x9fb2f90
[D 06/29 15:29] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES)
[D 06/29 15:29] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES) (ret: 1)
[D 06/29 15:29] op_queue add: 0x9fb2f90
[D 06/29 15:29] op_queue add: 0x9fb5ab0
[D 06/29 15:29] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES)
[D 06/29 15:29] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES) (ret: 1)
[D 06/29 15:29] op_queue add: 0x9fb5ab0
[D 06/29 15:29] op_queue add: 0x9fc7a30
[D 06/29 15:29] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES)
[D 06/29 15:29] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES) (ret: 1)
[D 06/29 15:29] op_queue add: 0x9fc7a30
[D 06/29 15:29] op_queue add: 0x9fca500
[D 06/29 15:29] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES)
[D 06/29 15:29] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES) (ret: 1)
[D 06/29 15:29] op_queue add: 0x9fca500
[D 06/29 15:29] op_queue add: 0x9fca690
[D 06/29 15:29] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES)
[D 06/29 15:29] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES) (ret: 1)
[D 06/29 15:29] op_queue add: 0x9fca690
[D 06/29 15:29] op_queue add: 0x9fe1980
[D 06/29 15:29] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES)
[D 06/29 15:29] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES) (ret: 1)
[D 06/29 15:29] op_queue add: 0x9fe1980
[D 06/29 15:29] op_queue add: 0x9fe2330
[D 06/29 15:29] [DBPF THREAD]: STARTING TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES)
[E 06/29 15:29] dbpf_dspace_iterate_handles_op_svc: Invalid argument
[D 06/29 15:29] [DBPF THREAD]: FINISHED TROVE SERVICE ROUTINE (DSPACE_ITERATE_HANDLES) (ret: -1073742095)
[D 06/29 15:29] op_queue add: 0x9fe2330
[D 06/29 15:29] trove_dspace_iterate_handles failed
[E 06/29 15:29] Error adding handle range 4323455642275676148-4611686018427387890,8935141660703064036-9223372036854775778 to filesystem pvfs2-fs
[E 06/29 15:29] Error: Could not initialize server interfaces; aborting.
[E 06/29 15:29] Error: Could not initialize server; aborting.
[D 06/29 15:29] *** server shutdown in progress ***


-Randy

------------------------------------------------------------------------
*From: *Randall Martin <[email protected]>
*Date: *Mon, 29 Jun 2009 14:05:33 -0400
*To: *<[email protected]>
*Subject: *[Pvfs2-users] PVFS server won't start

One of our PVFS servers crashed, and now it won’t start again. It had been working since June 2, until today’s crash. Any ideas on how to fix it? I was running the 2.8.1 release, but I also tried the HEAD version with no change in symptoms.

From the server log:

[D 06/29 13:49] PVFS2 Server version 2.8.1pre1-2009-06-26-182521 starting.
[E 06/29 13:49] dbpf_dspace_iterate_handles_op_svc: Invalid argument
[E 06/29 13:49] Error adding handle range 4323455642275676148-4611686018427387890,8935141660703064036-9223372036854775778 to filesystem pvfs2-fs
[E 06/29 13:49] Error: Could not initialize server interfaces; aborting.
[E 06/29 13:49] Error: Could not initialize server; aborting.

My config file:


<Defaults>
    UnexpectedRequests 50
    EventLogging none
    EnableTracing no
    LogStamp datetime
    BMIModules bmi_tcp
    FlowModules flowproto_multiqueue
    PerfUpdateInterval 1000
    ServerJobBMITimeoutSecs 30
    ServerJobFlowTimeoutSecs 30
    ClientJobBMITimeoutSecs 300
    ClientJobFlowTimeoutSecs 300
    ClientRetryLimit 60
    ClientRetryDelayMilliSecs 10000
    PrecreateBatchSize 512
    PrecreateLowThreshold 256
</Defaults>

<Aliases>
    Alias oss001-1 tcp://oss001-1:3334
    Alias oss001-2 tcp://oss001-2:3335
    Alias oss001-3 tcp://oss001-3:3336
    Alias oss001-4 tcp://oss001-4:3337

    Alias oss002-1 tcp://oss002-1:3334
    Alias oss002-2 tcp://oss002-2:3335
    Alias oss002-3 tcp://oss002-3:3336
    Alias oss002-4 tcp://oss002-4:3337

    Alias oss003-1 tcp://oss003-1:3334
    Alias oss003-2 tcp://oss003-2:3335
    Alias oss003-3 tcp://oss003-3:3336
    Alias oss003-4 tcp://oss003-4:3337

    Alias oss004-1 tcp://oss004-1:3334
    Alias oss004-2 tcp://oss004-2:3335
    Alias oss004-3 tcp://oss004-3:3336
    Alias oss004-4 tcp://oss004-4:3337
</Aliases>


<ServerOptions>
    Server oss001-1
    StorageSpace /ost1
    LogFile /var/log/pvfs2-server.oss001-1.log
</ServerOptions>
<ServerOptions>
    Server oss001-2
    StorageSpace /ost2
    LogFile /var/log/pvfs2-server.oss001-2.log
</ServerOptions>
<ServerOptions>
    Server oss001-3
    StorageSpace /ost3
    LogFile /var/log/pvfs2-server.oss001-3.log
</ServerOptions>
<ServerOptions>
    Server oss001-4
    StorageSpace /ost4
    LogFile /var/log/pvfs2-server.oss001-4.log
</ServerOptions>


<ServerOptions>
    Server oss002-1
    StorageSpace /ost5
    LogFile /var/log/pvfs2-server.oss002-1.log
</ServerOptions>
<ServerOptions>
    Server oss002-2
    StorageSpace /ost6
    LogFile /var/log/pvfs2-server.oss002-2.log
</ServerOptions>
<ServerOptions>
    Server oss002-3
    StorageSpace /ost7
    LogFile /var/log/pvfs2-server.oss002-3.log
</ServerOptions>
<ServerOptions>
    Server oss002-4
    StorageSpace /ost8
    LogFile /var/log/pvfs2-server.oss002-4.log
</ServerOptions>


<ServerOptions>
    Server oss003-1
    StorageSpace /ost9
    LogFile /var/log/pvfs2-server.oss003-1.log
</ServerOptions>
<ServerOptions>
    Server oss003-2
    StorageSpace /ost10
    LogFile /var/log/pvfs2-server.oss003-2.log
</ServerOptions>
<ServerOptions>
    Server oss003-3
    StorageSpace /ost11
    LogFile /var/log/pvfs2-server.oss003-3.log
</ServerOptions>
<ServerOptions>
    Server oss003-4
    StorageSpace /ost12
    LogFile /var/log/pvfs2-server.oss003-4.log
</ServerOptions>


<ServerOptions>
    Server oss004-1
    StorageSpace /ost13
    LogFile /var/log/pvfs2-server.oss004-1.log
</ServerOptions>
<ServerOptions>
    Server oss004-2
    StorageSpace /ost14
    LogFile /var/log/pvfs2-server.oss004-2.log
</ServerOptions>
<ServerOptions>
    Server oss004-3
    StorageSpace /ost15
    LogFile /var/log/pvfs2-server.oss004-3.log
</ServerOptions>
<ServerOptions>
    Server oss004-4
    StorageSpace /ost16
    LogFile /var/log/pvfs2-server.oss004-4.log
</ServerOptions>

<Filesystem>
    Name pvfs2-fs
    ID 373672578
    RootHandle 1048576
    FileStuffing yes
    <MetaHandleRanges>
        Range oss001-1 3-288230376151711745
        Range oss001-2 288230376151711746-576460752303423488
        Range oss001-3 576460752303423489-864691128455135231
        Range oss001-4 864691128455135232-1152921504606846974
        Range oss002-1 1152921504606846975-1441151880758558717
        Range oss002-2 1441151880758558718-1729382256910270460
        Range oss002-3 1729382256910270461-2017612633061982203
        Range oss002-4 2017612633061982204-2305843009213693946
        Range oss003-1 2305843009213693947-2594073385365405689
        Range oss003-2 2594073385365405690-2882303761517117432
        Range oss003-3 2882303761517117433-3170534137668829175
        Range oss003-4 3170534137668829176-3458764513820540918
        Range oss004-1 3458764513820540919-3746994889972252661
        Range oss004-2 3746994889972252662-4035225266123964404
        Range oss004-3 4035225266123964405-4323455642275676147
        Range oss004-4 4323455642275676148-4611686018427387890
    </MetaHandleRanges>
    <DataHandleRanges>
        Range oss001-1 4611686018427387891-4899916394579099633
        Range oss001-2 4899916394579099634-5188146770730811376
        Range oss001-3 5188146770730811377-5476377146882523119
        Range oss001-4 5476377146882523120-5764607523034234862
        Range oss002-1 5764607523034234863-6052837899185946605
        Range oss002-2 6052837899185946606-6341068275337658348
        Range oss002-3 6341068275337658349-6629298651489370091
        Range oss002-4 6629298651489370092-6917529027641081834
        Range oss003-1 6917529027641081835-7205759403792793577
        Range oss003-2 7205759403792793578-7493989779944505320
        Range oss003-3 7493989779944505321-7782220156096217063
        Range oss003-4 7782220156096217064-8070450532247928806
        Range oss004-1 8070450532247928807-8358680908399640549
        Range oss004-2 8358680908399640550-8646911284551352292
        Range oss004-3 8646911284551352293-8935141660703064035
        Range oss004-4 8935141660703064036-9223372036854775778
    </DataHandleRanges>
    <StorageHints>
        TroveSyncMeta no
        TroveSyncData no
        TroveMethod alt-aio
    </StorageHints>
    <Distribution>
        Name simple_stripe
        Param strip_size
        Value 1048576
    </Distribution>
</Filesystem>


Thanks,
Randy

------------------------------------------------------------------------
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users


------------------------------------------------------------------------


Index: src/io/trove/trove-dbpf/dbpf-dspace.c
===================================================================
RCS file: /projects/cvsroot/pvfs2-1/src/io/trove/trove-dbpf/dbpf-dspace.c,v
retrieving revision 1.163
diff -a -u -p -r1.163 dbpf-dspace.c
--- src/io/trove/trove-dbpf/dbpf-dspace.c	30 Jan 2009 15:41:08 -0000	1.163
+++ src/io/trove/trove-dbpf/dbpf-dspace.c	29 Jun 2009 20:39:39 -0000
@@ -886,6 +886,12 @@ static int dbpf_dspace_iterate_handles_o
             if(sizeof_handle != sizeof(TROVE_handle) ||
                sizeof_attr != sizeof(attr))
             {
+                gossip_err("Error: got handle size %llu when expecting %llu\n",
+                    llu(sizeof_handle), llu(sizeof(TROVE_handle)));
+                gossip_err("Error: got attr size %llu when expecting %llu\n",
+                    llu(sizeof_attr), llu(sizeof(attr)));
+                if(sizeof_handle == sizeof(TROVE_handle))
+                    gossip_err("Error iterating on handle %llu\n", llu(*(TROVE_handle *)tmp_handle));
                 /* something is wrong with the result */
                 ret = -TROVE_EINVAL;
                 goto return_error;
@@ -916,6 +922,12 @@ static int dbpf_dspace_iterate_handles_o
         if(sizeof_handle != sizeof(TROVE_handle) ||
            sizeof_attr != sizeof(attr))
         {
+            gossip_err("Error: got handle size %llu when expecting %llu\n",
+                llu(sizeof_handle), llu(sizeof(TROVE_handle)));
+            gossip_err("Error: got attr size %llu when expecting %llu\n",
+                llu(sizeof_attr), llu(sizeof(attr)));
+            if(sizeof_handle == sizeof(TROVE_handle))
+                gossip_err("Error iterating on handle %llu\n", llu(*(TROVE_handle *)tmp_handle));
             /* something is wrong with the result */
             ret = -TROVE_EINVAL;
             goto return_error;