Hi Christina,
If this error is reproducible, could you enable debug logging for the
cancellation path? In fs.conf, set:
EventLogging cancel
then restart all the servers and re-run your test.
This will provide more information about how much of the IO request
is completing before it gets cancelled.
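For reference, a sketch of where that setting might go; I'm assuming it sits in the <Defaults> section of fs.conf alongside any existing logging mask (the "server" mask shown next to it is just an illustrative placeholder, not something from your config):

```
<Defaults>
    # append "cancel" to whatever event mask is already set, e.g.:
    EventLogging server,cancel
</Defaults>
```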
I'm not sure why changing your setup would suddenly cause this error
to occur. Sometimes servers can get overloaded with larger numbers
of clients, causing some of the IO requests to time out before
completing. You mention that your current setup is with 8 servers
and 8 clients, but the problem only occurs with 16 or 32 instances (2
or 4 processes per node). What was the previous configuration that
did not produce this timeout error?
I didn't initially realize that the output you were sending was from
the server logs. In that case, you may want to increase the server
timeouts with ServerJobFlowTimeoutSecs.
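In case it's useful, a sketch of that change; I'm assuming the option goes in the <Defaults> section of fs.conf, and the 900-second value is just an arbitrary example rather than a recommendation:

```
<Defaults>
    # server-side flow timeout in seconds (pick a value suited to your system)
    ServerJobFlowTimeoutSecs 900
</Defaults>
```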
-sam
On Aug 15, 2007, at 4:41 PM, Christina Patrick wrote:
Hi,
I verified that all the servers are running (at least the daemons are
running). I checked the server log files and saw that 4 of them
contain the same error messages as the ones pasted below. I checked
the pvfs2 filesystem using the pvfs2-ping command and it seems to be
working fine from all nodes.
I am willing to change the timeout value to something larger. However,
my concern is what may have caused this to start failing now when it
worked perfectly fine before. Is there any way I can check the
health of my servers or find the root cause of this?
Thanks and Warm Regards,
Christina.
On 8/15/07, Sam Lang <[EMAIL PROTECTED]> wrote:
Hi Christina,
Sometimes job timeouts are due to timeout values being set too low
for the particular system, especially with older setups. You can try
to increase the timeouts in the fs.conf (ClientJobFlowTimeoutSecs),
which usually defaults to 300 (5 minutes). It may also be that there
are failures on the servers and they're not returning responses back
to the client. Do you see any messages in the server logs? Can you
verify that the servers are still running after seeing this error?
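For example, something like this in fs.conf (a sketch; I'm assuming the option belongs in the <Defaults> section, and 900 is just an arbitrary value larger than the 300-second default):

```
<Defaults>
    # default is 300 (5 minutes); try a larger value
    ClientJobFlowTimeoutSecs 900
</Defaults>
```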
-sam
On Aug 15, 2007, at 3:17 PM, Christina Patrick wrote:
Hi Everybody,
I have been facing some problems recently when using mpich2 and pvfs2.
My programs worked fine earlier and I did not face any problems before
while executing them. All of a sudden, when I run my programs now on a
reconfigured setup (8 IO servers, 8 clients and 4 metadata servers), I
get the error messages below. I have browsed through the forums and
there have been similar reports before. However, I couldn't really
figure out if anybody got a solution to the problem. I generally get
the error when I scale the number of running instances to 16 or 32.
6: [E 06:13:46.025685] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 67.
6: [E 06:13:46.025976] fp_multiqueue_cancel: flow proto cancel called on 0x8cebcac
6: [E 06:13:46.026004] handle_io_error: flow proto error cleanup started on 0x8cebcac, error_code: -1610613121
6: [E 06:13:46.026099] handle_io_error: flow proto 0x8cebcac canceled 1 operations, will clean up.
6: [E 06:13:46.026138] handle_io_error: flow proto 0x8cebcac error cleanup finished, error_code: -1610613121
11: [E 06:13:46.075671] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 71.
11: [E 06:13:46.075994] fp_multiqueue_cancel: flow proto cancel called on 0x96f3aac
11: [E 06:13:46.076022] handle_io_error: flow proto error cleanup started on 0x96f3aac, error_code: -1610613121
11: [E 06:13:46.076117] handle_io_error: flow proto 0x96f3aac canceled 1 operations, will clean up.
11: [E 06:13:46.076152] handle_io_error: flow proto 0x96f3aac error cleanup finished, error_code: -1610613121
14: [E 06:19:45.563289] handle_io_error: flow proto error cleanup started on 0x9c6349c, error_code: -1073741973
I would appreciate any help and suggestions that you all can offer,
Regards,
Christina.
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers