Phil, that last log is definitely munged - and yeah it does look like two overlapping logs.

It's probably worth looking at Sam's suggestion first, as that has been at the center of a few similar bugs. If that's not it I'm ready to help out debugging the request code if you can get a clean log.

As an asside, someone might want to think about an upgrade to the logging to deal with multiple threads.

Walt

Phil Carns wrote:
Hi all,

I am looking at an I/O problem that I don't completely understand. The setup is that there are 15 servers and 20 clients (all RHEL3 SMP). The clients are running a proprietary application. At the end of the run they each write their share of a data set into a 36 GB file. So each is doing a contiguous write of 36/20 GB of data.

Roughly 80-90% of the time, one or more of the clients and servers hang, though it isn't predictable which one will do it. After quite a bit of digging through logs, it turns out that there is a flow that never completes on the server side. There are two problems contributing here:

1) The flow timeout mechanism doesn't seem to be working right for me (the timer for the flow pops and checks progress one time, then never gets triggered again). This isn't the core problem, but it is causing the flow to get stuck forever and hold up the request scheduler rather than giving up eventually.

2) For this particular flow, the client and server appear to disagree on how much data to exchange. The client sends one 256K buffer and completes the flow. The server receives that 256K buffer but still expects at least one more to arrive (which never does).

I am pretty sure I can get to the bottom of the first problem. The second one is trickier.

I attached a few log file excerpts:

client-1st.txt: This is the request processing log output from the client side when it is working on the flow in question

server-1st.txt: request processing log output from the first iteration of the server side flow

server-2nd.txt: request processing log output from the second iteration of the server side flow

Anyone have any suggestions, or see any indication from those log portions about why the server and client arn't on the same page?

Some of this log output (the server-2nd.txt in particular?) looks a little jumbled, like maybe there are two flows doing request processing at once, but I don't really know how to interpret it.

I may try adding a mutex around PINT_process_request() next, just to see if I can make the logs clearer by making sure only one invocation is going at once.

-Phil


------------------------------------------------------------------------

[D 06/09 15:49] =========================================================
[D 06/09 15:49] PINT_process_request
[D 06/09 15:49]         tiling 5484 copies
[D 06/09 15:49]         skipping ahead to target_offset
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0 [D 06/09 15:49] lvl 0 el 0 blk 0 by 0
[D 06/09 15:49]                 to 0 ta 22993719939 fi 22997914243
[D 06/09 15:49]                 mto 0 mta 0 mfi 4194304
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 0 contig_size = 23001563136 lvl_flag =
1
[D 06/09 15:49]         process logical skip
[D 06/09 15:49] retval = 22993719939 [D 06/09 15:49] exiting logical skip because distribute indicates done
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0 [D 06/09 15:49] lvl 0 el 0 blk 0 by 22993719939
[D 06/09 15:49]                 to 22993719939 ta 22993719939 fi 22997914243
[D 06/09 15:49]                 mto 0 mta 0 mfi 4194304
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 22993719939 contig_size = 7843197
lvl_flag = 1
[D 06/09 15:49]         calling distribute
[D 06/09 15:49]         PINT_distribute
[D 06/09 15:49]                 of 22993719939 sz 4194304 ix 0 sm 64 by 0 bm
262144 fsz 0 exfl 1
[D 06/09 15:49]                 begin iteration loff: 22994157568
[D 06/09 15:49]                 frag extends beyond strip
[D 06/09 15:49]                 file being extended
[D 06/09 15:49]                 client lstof 0 diff 437629 sgof 437629
[D 06/09 15:49] **********CALL***PROCESS*********
[D 06/09 15:49]                 segment of 437629 sz 65536
[D 06/09 15:49] PINT_process_request
[D 06/09 15:49]         skipping ahead to target_offset
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0
[D 06/09 15:49]                 lvl 0 el 0 blk 0 by 0
[D 06/09 15:49]                 to 0 ta 437629 fi 503165
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 0 contig_size = 4194304 lvl_flag = 1
[D 06/09 15:49]         process logical skip
[D 06/09 15:49]         retval = 437629
[D 06/09 15:49]         exiting logical skip because distribute indicates done
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0
[D 06/09 15:49]                 lvl 0 el 0 blk 0 by 437629
[D 06/09 15:49]                 to 437629 ta 437629 fi 503165
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 437629 contig_size = 3756675 lvl_flag
= 1
[D 06/09 15:49]         process a segment
[D 06/09 15:49]                         of 437629 sz 65536
[D 06/09 15:49]         add a segment 0
[D 06/09 15:49]         retval = 65536
[D 06/09 15:49]         exiting distribute returned less than expected
[D 06/09 15:49]         done sg 1 sm 64 by 65536 bm 262144 ta 437629 to 503165
fo 503165 eof 0
[D 06/09 15:49] *****RETURN***FROM***PROCESS*****
[D 06/09 15:49]                 end iteration
[D 06/09 15:49]                 begin iteration loff: 22995140608
[D 06/09 15:49]                 frag extends beyond strip
[D 06/09 15:49]                 file being extended
[D 06/09 15:49]                 client lstof 503165 diff 917504 sgof 1420669
[D 06/09 15:49] **********CALL***PROCESS*********
[D 06/09 15:49]                 segment of 1420669 sz 65536
[D 06/09 15:49] PINT_process_request
[D 06/09 15:49]         skipping ahead to target_offset
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0
[D 06/09 15:49]                 lvl 0 el 0 blk 0 by 503165
[D 06/09 15:49]                 to 503165 ta 1420669 fi 1486205
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 503165 contig_size = 3691139 lvl_flag
= 1
[D 06/09 15:49]         process logical skip
[D 06/09 15:49]         retval = 917504
[D 06/09 15:49]         exiting logical skip because distribute indicates done
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0
[D 06/09 15:49]                 lvl 0 el 0 blk 0 by 1420669
[D 06/09 15:49]                 to 1420669 ta 1420669 fi 1486205
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 1420669 contig_size = 2773635 lvl_flag
= 1
[D 06/09 15:49]         process a segment
[D 06/09 15:49]                         of 1420669 sz 65536
[D 06/09 15:49]         add a segment 1
[D 06/09 15:49]         retval = 65536
[D 06/09 15:49]         exiting distribute returned less than expected
[D 06/09 15:49]         done sg 2 sm 64 by 131072 bm 262144 ta 1420669 to
1486205 fo 1486205 eof 0
[D 06/09 15:49] *****RETURN***FROM***PROCESS*****
[D 06/09 15:49]                 end iteration
[D 06/09 15:49]                 begin iteration loff: 22996123648
[D 06/09 15:49]                 frag extends beyond strip
[D 06/09 15:49]                 file being extended
[D 06/09 15:49]                 client lstof 1486205 diff 917504 sgof 2403709
[D 06/09 15:49] **********CALL***PROCESS*********
[D 06/09 15:49]                 segment of 2403709 sz 65536
[D 06/09 15:49] PINT_process_request
[D 06/09 15:49]         skipping ahead to target_offset
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0
[D 06/09 15:49]                 lvl 0 el 0 blk 0 by 1486205
[D 06/09 15:49]                 to 1486205 ta 2403709 fi 2469245
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 1486205 contig_size = 2708099 lvl_flag
= 1
[D 06/09 15:49]         process logical skip
[D 06/09 15:49]         retval = 917504
[D 06/09 15:49]         exiting logical skip because distribute indicates done
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0
[D 06/09 15:49]                 lvl 0 el 0 blk 0 by 2403709
[D 06/09 15:49]                 to 2403709 ta 2403709 fi 2469245
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 2403709 contig_size = 1790595 lvl_flag
= 1
[D 06/09 15:49]         process a segment
[D 06/09 15:49]                         of 2403709 sz 65536
[D 06/09 15:49]         add a segment 2
[D 06/09 15:49]         retval = 65536
[D 06/09 15:49]         exiting distribute returned less than expected
[D 06/09 15:49]         done sg 3 sm 64 by 196608 bm 262144 ta 2403709 to
2469245 fo 2469245 eof 0
[D 06/09 15:49] *****RETURN***FROM***PROCESS*****
[D 06/09 15:49]                 end iteration
[D 06/09 15:49]                 begin iteration loff: 22997106688
[D 06/09 15:49]                 frag extends beyond strip
[D 06/09 15:49]                 file being extended
[D 06/09 15:49]                 client lstof 2469245 diff 917504 sgof 3386749
[D 06/09 15:49] **********CALL***PROCESS*********
[D 06/09 15:49]                 segment of 3386749 sz 65536
[D 06/09 15:49] PINT_process_request
[D 06/09 15:49]         skipping ahead to target_offset
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0
[D 06/09 15:49]                 lvl 0 el 0 blk 0 by 2469245
[D 06/09 15:49]                 to 2469245 ta 3386749 fi 3452285
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 2469245 contig_size = 1725059 lvl_flag
= 1
[D 06/09 15:49]         process logical skip
[D 06/09 15:49]         retval = 917504
[D 06/09 15:49]         exiting logical skip because distribute indicates done
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0
[D 06/09 15:49]                 lvl 0 el 0 blk 0 by 3386749
[D 06/09 15:49]                 to 3386749 ta 3386749 fi 3452285
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 3386749 contig_size = 807555 lvl_flag
= 1
[D 06/09 15:49]         process a segment
[D 06/09 15:49]                         of 3386749 sz 65536
[D 06/09 15:49]         add a segment 3
[D 06/09 15:49]         retval = 65536
[D 06/09 15:49]         exiting distribute returned less than expected
[D 06/09 15:49]         done sg 4 sm 64 by 262144 bm 262144 ta 3386749 to
3452285 fo 3452285 eof 0
[D 06/09 15:49] *****RETURN***FROM***PROCESS*****
[D 06/09 15:49]                 end iteration
[D 06/09 15:49]                 done with segments or bytes
[D 06/09 15:49]                         of 22997172224 sz 742019 sg 4 sm 64 by
262144 bm 262144
[D 06/09 15:49]                         next loff: 22998089728 next poff:
1533149184
[D 06/09 15:49]                         (return value) 4194304
[D 06/09 15:49]                 finished
[D 06/09 15:49]         retval = 4194304
[D 06/09 15:49]         exiting distribute returned less than expected
[D 06/09 15:49]         done sg 4 sm 64 by 262144 bm 262144 ta 22993719939 to
22997914243 fo 22997914243 eof 0


------------------------------------------------------------------------

[D 06/09 15:49] PINT_process_request
[D 06/09 15:49]         tiling 9156 copies
[D 06/09 15:49]         skipping ahead to target_offset
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0
[D 06/09 15:49]                 lvl 0 el 0 blk 0 by 0
[D 06/09 15:49]                 to 0 ta 38396548819 fi 38400743123
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 0 contig_size = 38403047424 lvl_flag =
1
[D 06/09 15:49]         process logical skip
[D 06/09 15:49]         retval = 38396548819
[D 06/09 15:49]         exiting logical skip because distribute indicates done
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0
[D 06/09 15:49]                 lvl 0 el 0 blk 0 by 38396548819
[D 06/09 15:49]                 to 38396548819 ta 38396548819 fi 38400743123
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 38396548819 contig_size = 6498605
lvl_flag = 1
[D 06/09 15:49]         calling distribute
[D 06/09 15:49]         PINT_distribute
[D 06/09 15:49]                 of 38396548819 sz 4194304 ix 0 sm 64 by 0 bm
262144 fsz 2559770624 exfl 1
[D 06/09 15:49]                 begin iteration loff: 38396690432
[D 06/09 15:49]                 frag extends beyond strip
[D 06/09 15:49]                 file being extended
[D 06/09 15:49]         process a segment
[D 06/09 15:49]                         of 2559770624 sz 65536
[D 06/09 15:49]         add a segment 0
[D 06/09 15:49]                 end iteration
[D 06/09 15:49]                 begin iteration loff: 38397673472
[D 06/09 15:49]                 frag extends beyond strip
[D 06/09 15:49]                 file being extended
[D 06/09 15:49]         process a segment
[D 06/09 15:49]                         of 2559836160 sz 65536
[D 06/09 15:49]         combine a segment 0
[D 06/09 15:49]                 end iteration
[D 06/09 15:49]                 begin iteration loff: 38398656512
[D 06/09 15:49]                 frag extends beyond strip
[D 06/09 15:49]                 file being extended
[D 06/09 15:49]         process a segment
[D 06/09 15:49]                         of 2559901696 sz 65536
[D 06/09 15:49]         combine a segment 0
[D 06/09 15:49]                 end iteration
[D 06/09 15:49]                 begin iteration loff: 38399639552
[D 06/09 15:49]                 frag extends beyond strip
[D 06/09 15:49]                 file being extended
[D 06/09 15:49]         process a segment
[D 06/09 15:49]                         of 2559967232 sz 65536
[D 06/09 15:49]         combine a segment 0
[D 06/09 15:49]                 end iteration
[D 06/09 15:49]                 done with segments or bytes
[D 06/09 15:49]                         of 38399705088 sz 1038035 sg 1 sm 64
by 262144 bm 262144
[D 06/09 15:49]                         next loff: 38400622592 next poff:
2560032768
[D 06/09 15:49]                         return value 3156269
[D 06/09 15:49]                 finished
[D 06/09 15:49]         retval = 3156269
[D 06/09 15:49]         exiting distribute returned less than expected
[D 06/09 15:49]         done sg 1 sm 64 by 262144 bm 262144 ta 38396548819 to
38399705088 fo 38400743123 eof 0


------------------------------------------------------------------------

[D 06/09 15:49] =========================================================
[D 06/09 15:49] PINT_process_request
[D 06/09 15:49]         tiling 8489 copies
[D 06/09 15:49]         skipping ahead to target_offset
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0
[D 06/09 15:49]                 lvl 0 el 0 blk 0 by 0
[D 06/09 15:49]                 to 0 ta 35599341710 fi 35603536014
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 0 contig_size = 35605446656 lvl_flag =
1
[D 06/09 15:49]         process logical skip
[D 06/09 15:49]         retval = 35599341710
[D 06/09 15:49]         exiting logical skip because distribute indicates done
[D 06/09 15:49] =========================================================
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0
[D 06/09 15:49] PINT_process_request
[D 06/09 15:49]                 lvl 0 el 0 blk 0 by 35599341710
[D 06/09 15:49]         tiling 9156 copies
[D 06/09 15:49]                 to 35599341710 ta 35599341710 fi 35603536014
[D 06/09 15:49]         Do seq of 0 ne 4194304 st 0 nb 1 ub 4194304 lb 0 as
4194304 co 0
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]                 lvl 0 el 0 blk 0 by 38399705088
[D 06/09 15:49]         Request disp
[D 06/09 15:49]         contig_offset = 35599341710 contig_size = 6104946
lvl_flag = 1
[D 06/09 15:49]                 to 38399705088 ta 38396548819 fi 38400743123
[D 06/09 15:49]         calling distribute
[D 06/09 15:49]         PINT_distribute
[D 06/09 15:49]         basic type or contiguous data
[D 06/09 15:49]                 of 35599341710 sz 4194304 ix 0 sm 64 by 0 bm
262144 fsz 2559770624 exfl 1
[D 06/09 15:49]         Request disp
[D 06/09 15:49]                 begin iteration loff: 35599941632
[D 06/09 15:49]         contig_offset = 38399705088 contig_size = 3342336
lvl_flag = 1
[D 06/09 15:49]                 frag extends beyond strip
[D 06/09 15:49]         process a segment
[D 06/09 15:49]         calling distribute
[D 06/09 15:49]                         of 2373320704 sz 65536
[D 06/09 15:49]         add a segment 0
[D 06/09 15:49]         PINT_distribute
[D 06/09 15:49]                 end iteration
[D 06/09 15:49]                 begin iteration loff: 35600924672
[D 06/09 15:49]                 of 38399705088 sz 1038035 ix 0 sm 64 by 0 bm
262144 fsz 2560032768 exfl 1
[D 06/09 15:49]                 frag extends beyond strip
[D 06/09 15:49]         process a segment
[D 06/09 15:49]                         of 2373386240 sz 65536
[D 06/09 15:49]         combine a segment 0
[D 06/09 15:49]                 end iteration
[D 06/09 15:49]                 begin iteration loff: 38400622592
[D 06/09 15:49]                 begin iteration loff: 35601907712
[D 06/09 15:49]                 frag extends beyond strip
[D 06/09 15:49]                 frag extends beyond strip
[D 06/09 15:49]         process a segment
[D 06/09 15:49]                         of 2373451776 sz 65536
[D 06/09 15:49]                 file being extended
[D 06/09 15:49]         combine a segment 0
[D 06/09 15:49]                 end iteration
[D 06/09 15:49]         process a segment
[D 06/09 15:49]                 begin iteration loff: 35602890752
[D 06/09 15:49]                 frag extends beyond strip
[D 06/09 15:49]                         of 2560032768 sz 65536
[D 06/09 15:49]         process a segment
[D 06/09 15:49]                         of 2373517312 sz 65536
[D 06/09 15:49]         add a segment 0
[D 06/09 15:49]         combine a segment 0
[D 06/09 15:49]                 end iteration
[D 06/09 15:49]                 end iteration
[D 06/09 15:49]                 done with segments or bytes
[D 06/09 15:49]                         of 35602956288 sz 579726 sg 1 sm 64 by
262144 bm 262144
[D 06/09 15:49]                         of 38400688128 sz 54995 sg 1 sm 64 by
65536 bm 262144
[D 06/09 15:49]                         next loff: 35603873792 next poff:
2373582848
[D 06/09 15:49]                         (return value) 4194304
[D 06/09 15:49]                         next loff: 38401605632 next poff:
2560098304
[D 06/09 15:49]                 finished
[D 06/09 15:49]                         (return value) 1038035
[D 06/09 15:49]         retval = 4194304
[D 06/09 15:49]         exiting distribute returned less than expected
[D 06/09 15:49]                 finished
[D 06/09 15:49]         done sg 1 sm 64 by 262144 bm 262144 ta 35599341710 to
35603536014 fo 35603536014 eof 0
[D 06/09 15:49]         retval = 1038035
[D 06/09 15:49] =========================================================
[D 06/09 15:49]         exiting distribute returned less than expected
[D 06/09 15:49]         done sg 1 sm 64 by 65536 bm 262144 ta 38396548819 to
38400743123 fo 38400743123 eof 0


------------------------------------------------------------------------

_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

--
Dr. Walter B. Ligon III
Associate Professor
ECE Department
Clemson University
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

Reply via email to