On Tue, Jun 24, 2014 at 07:36:49PM -0400, Bhupesh Purandare wrote:
> Hello,
> I am trying to run netperf2 (parallel) in autotest 0.15.1. (Similar results in
> autotest 0.16.0)
What do you mean by "parallel" here?
There is a single control file (control.parallel) to lunch both client
and server & control the tests. Did you _only_ use it?
> But the test keeps failing because the barrier slave keeps sending a Bad
> connection request to master.
> I am trying to setup machine 172.25.43.234 as client and 172.25.43.226 as the
> server in the netperf2(parallel) control file.
>
> The client and server jobs pass on the server machine (172.25.43.226), but
> fail
> on the client machine (172.25.43.234) with an exception:
>
> BarrierError: timeout waiting for barrier: start_1
>
> I am trying to understand why the job fails on the client machine and not on
> the server machine.
>
> Here are the relevant lines from /results/19-debug_user/172.25.43.234/debug/
> client.0.DEBUG (entire log directory is attached)
>
> 06/24 22:51:01 DEBUG|base_utils:0077| Running
> '/ghostcache/autotest/tmp/netperf2/src/src/netserver&'
> 06/24 22:51:01 DEBUG|base_utils:0114| [stdout] Starting netserver with host
> 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
> 06/24 22:51:01 INFO |base_barri:0163| tag=start_1 port=11922 timeout=600
> hostid=172.25.43.226#netperf-server
> 06/24 22:51:01 INFO |base_barri:0497| masterid: 172.25.43.226#netperf-server
> 06/24 22:51:01 INFO |base_barri:0503| members: 172.25.43.234#netperf-client
> 06/24 22:51:01 INFO |base_barri:0510| selected as master: self._hostid =
> 172.25.43.226#netperf-server , self._masterid = 172.25.43.226#netperf-server
> 06/24 22:51:01 INFO |base_barri:0189| seconds remaining: 599
> 06/24 22:51:01 INFO |base_barri:0163| tag=start_1 port=11922 timeout=600
> hostid=172.25.43.234#netperf-client
> 06/24 22:51:01 INFO |base_barri:0497| masterid: 172.25.43.226#netperf-server
> 06/24 22:51:01 INFO |base_barri:0503| members: 172.25.43.234#netperf-client
> 06/24 22:51:01 INFO |base_barri:0513| selected as slave
> 06/24 22:51:01 INFO |base_barri:0189| seconds remaining: 599
> 06/24 22:51:01 INFO |base_barri:0189| seconds remaining: 599
> 06/24 22:51:01 INFO |base_barri:0406| calling master
> 06/24 22:51:11 INFO |base_barri:0189| seconds remaining: 589
> 06/24 22:51:11 INFO |base_barri:0189| seconds remaining: 589
> 06/24 22:51:11 INFO |base_barri:0406| calling master
> 06/24 22:51:11 INFO |base_barri:0263| master said: !tag
> 06/24 22:51:11 WARNI|base_barri:0267| Bad connection request to master
> 06/24 22:51:11 INFO |base_barri:0189| seconds remaining: 589
> 06/24 22:51:11 INFO |base_barri:0189| seconds remaining: 589
> 06/24 22:51:11 INFO |base_barri:0406| calling master
> 06/24 22:51:11 INFO |base_barri:0263| master said: !tag
> ....
> ....
> ....
> 06/24 22:51:21 WARNI|base_barri:0267| Bad connection request to master
> 06/24 22:51:21 INFO |base_barri:0189| seconds remaining: 579
> 06/24 22:51:21 INFO |base_barri:0189| seconds remaining: 579
> 06/24 22:51:21 INFO |base_barri:0406| calling master
> 06/24 22:51:21 ERROR| parallel:0026| child process failed
> 06/24 22:51:21 DEBUG| parallel:0030| Traceback (most recent call last):
> 06/24 22:51:21 DEBUG| parallel:0030| File
> "/ghostcache/autotest/parallel.py", line 18, in fork_start
> 06/24 22:51:21 DEBUG| parallel:0030| l()
> 06/24 22:51:21 DEBUG| parallel:0030| File "/ghostcache/autotest/job.py",
> line 529, in <lambda>
> ....
> ....
> 06/24 22:51:21 DEBUG| parallel:0030| self._slave_hello(connection)
> 06/24 22:51:21 DEBUG| parallel:0030| File
> "/ghostcache/autotest/shared/base_barrier.py", line 261, in _slave_hello
> 06/24 22:51:21 DEBUG| parallel:0030| reply = client.recv(4)
> 06/24 22:51:21 DEBUG| parallel:0030| error: [Errno 104] Connection reset by
> peer
> 06/24 22:51:21 DEBUG| parallel:0030|
> 06/24 22:51:21 INFO | job:0212| FAIL netperf2.client
> netperf2.client timestamp=1403650281 localtime=Jun 24 22:51:21
> Unhandled error: [Errno 104] Connection reset by peer
> ...
> ...
> ...
> 06/24 23:01:01 DEBUG| parallel:0030| BarrierError: timeout waiting for
> barrier: start_1
>
>
>
> 06/24 23:01:01 DEBUG| parallel:0030| BarrierError: timeout waiting for
> barrier: start_1
>
> 06/24 22:51:01 DEBUG|base_utils:0077| Running 'killall netserver'
> 06/24 22:51:01 DEBUG|base_utils:0077| Running
> '/ghostcache/autotest/tmp/netperf2/src/src/netserver&'
> 06/24 22:51:01 DEBUG|base_utils:0114| [stdout] Starting netserver with host
> 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
> 06/24 22:51:01 INFO |base_barri:0163| tag=start_1 port=11922 timeout=600
> hostid=172.25.43.226#netperf-server
> 06/24 22:51:01 INFO |base_barri:0497| masterid: 172.25.43.226#netperf-server
> 06/24 22:51:01 INFO |base_barri:0503| members: 172.25.43.234#netperf-client
> 06/24 22:51:01 INFO |base_barri:0510| selected as master: self._hostid =
> 172.25.43.226#netperf-server , self._masterid = 172.25.43.226#netperf-server
> 06/24 22:51:01 INFO |base_barri:0189| seconds remaining: 599
> 06/24 22:51:01 INFO |base_barri:0163| tag=start_1 port=11922 timeout=600
> hostid=172.25.43.234#netperf-client
> 06/24 22:51:01 INFO |base_barri:0497| masterid: 172.25.43.226#netperf-server
> 06/24 22:51:01 INFO |base_barri:0503| members: 172.25.43.234#netperf-client
> 06/24 22:51:01 INFO |base_barri:0513| selected as slave
> 06/24 22:51:01 INFO |base_barri:0189| seconds remaining: 599
> 06/24 22:51:01 INFO |base_barri:0189| seconds remaining: 599
> 06/24 22:51:01 INFO |base_barri:0406| calling master
> 06/24 22:51:01 INFO |base_barri:0211| new client tag=start_1,
> name=172.25.43.234#netperf-client, hostid=172.25.43.226#netperf-server
> 06/24 22:51:01 INFO |base_barri:0246| client now waiting:
> 172.25.43.234#netperf-client (172.25.43.226:56000)
> 06/24 22:51:01 INFO |base_barri:0263| master said: wait
> 06/24 22:51:01 INFO |base_barri:0281| slave now waiting: (172.25.43.226:11922)
> 06/24 22:51:01 INFO |base_barri:0372| master seen 1 of 1
> 06/24 22:51:01 INFO |base_barri:0430| slave connected to master
> 06/24 22:51:01 INFO |base_barri:0296| checking client present:
> 172.25.43.234#netperf-client
> 06/24 22:51:01 INFO |base_barri:0189| seconds remaining: 599
> 06/24 22:51:01 INFO |base_barri:0449| master said: ping
> 06/24 22:51:01 INFO |base_barri:0461| pong
> 06/24 22:51:01 INFO |base_barri:0189| seconds remaining: 19
> 06/24 22:51:01 INFO |base_barri:0320| Releasing clients
> 06/24 22:51:01 INFO |base_barri:0189| seconds remaining: 19
> 06/24 22:51:01 INFO |base_barri:0449| master said: rlse
> 06/24 22:51:01 INFO |base_barri:0344| closing client:
> 172.25.43.234#netperf-client
> 06/24 22:51:01 INFO |base_barri:0470| was released, waiting for close
> 06/24 22:51:01 INFO |base_barri:0163| tag=stop_1 port=11922 timeout=315
> hostid=172.25.43.226#netperf-server
> 06/24 22:51:01 INFO |base_barri:0189| seconds remaining: 19
> 06/24 22:51:01 INFO |base_barri:0497| masterid: 172.25.43.226#netperf-server
> 06/24 22:51:01 INFO |base_barri:0344| closing client:
> 172.25.43.234#netperf-client
> 06/24 22:51:01 INFO |base_barri:0503| members: 172.25.43.234#netperf-client
> 06/24 22:51:01 DEBUG|base_utils:0077| Running 'sleep 5 && mpstat -P ALL 3 5'
> 06/24 22:51:01 INFO |base_barri:0510| selected as master: self._hostid =
> 172.25.43.226#netperf-server , self._masterid = 172.25.43.226#netperf-server
> 06/24 22:51:01 INFO |base_barri:0189| seconds remaining: 314
> 06/24 22:51:01 DEBUG|base_utils:0077| Running
> '/ghostcache/autotest/tmp/netperf2/src/src/netperf -H 172.25.43.226 -t
> TCP_STREAM -l 15 -s 5 '
> 06/24 22:51:11 INFO |base_barri:0211| new client tag=start_1,
> name=172.25.43.234#netperf-client, hostid=172.25.43.226#netperf-server
> 06/24 22:51:11 WARNI|base_barri:0220| client arriving for the wrong barrier:
> stop_1 != start_1
> 06/24 22:51:11 INFO |base_barri:0372| master seen 0 of 1
> 06/24 22:51:11 INFO |base_barri:0189| seconds remaining: 305
> 06/24 22:51:11 INFO |base_barri:0211| new client tag=start_1,
> name=172.25.43.234#netperf-client, hostid=172.25.43.226#netperf-server
> 06/24 22:51:11 WARNI|base_barri:0220| client arriving for the wrong barrier:
> stop_1 != start_1
> 06/24 22:51:11 INFO |base_barri:0372| master seen 0 of 1
> 06/24 22:51:11 INFO |base_barri:0189| seconds remaining: 305
> ...
> ...
> ...
> ...
> 06/24 22:51:21 INFO |base_barri:0211| new client tag=start_1,
> name=172.25.43.234#netperf-client, hostid=172.25.43.226#netperf-server
> 06/24 22:51:21 WARNI|base_barri:0220| client arriving for the wrong barrier:
> stop_1 != start_1
> 06/24 22:51:21 INFO |base_barri:0372| master seen 0 of 1
> 06/24 22:51:21 INFO |base_barri:0189| seconds remaining: 294
> 06/24 22:51:21 INFO | netperf2:0179| [* Command:
> sleep 5 && mpstat -P ALL 3 5
> Exit status: 0
> Duration: 20.0016210079
> ...
> ...
> ...
> ...
> /ghostcache/autotest/tmp/netperf2/src/src/netperf -H 172.25.43.226 -t
> TCP_STREAM -l 15 -s 5
> Exit status: 1
> Duration: 19.0096800327
>
> stdout:
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 172.25.43.226 () port 0 AF_INET
> send_data: data send error: errno 104
> stderr:
> netperf: send_omni: send_data failed: Connection reset by peer]
> 06/24 22:51:21 INFO |base_barri:0211| new client tag=start_1,
> name=172.25.43.234#netperf-client, hostid=172.25.43.226#netperf-server
> 06/24 22:51:21 WARNI|base_barri:0220| client arriving for the wrong barrier:
> stop_1 != start_1
> 06/24 22:51:21 INFO |base_barri:0372| master seen 0 of 1
> 06/24 22:51:21 INFO |base_barri:0189| seconds remaining: 294
> 06/24 22:51:21 INFO |base_barri:0163| tag=stop_1 port=11922 timeout=300
> hostid=172.25.43.234#netperf-client
> 06/24 22:51:21 INFO |base_barri:0497| masterid: 172.25.43.226#netperf-server
> 06/24 22:51:21 INFO |base_barri:0503| members: 172.25.43.234#netperf-client
> 06/24 22:51:21 INFO |base_barri:0513| selected as slave
> 06/24 22:51:21 INFO |base_barri:0189| seconds remaining: 299
> 06/24 22:51:21 INFO |base_barri:0211| new client tag=start_1,
> name=172.25.43.234#netperf-client, hostid=172.25.43.226#netperf-server
> 06/24 22:51:21 INFO |base_barri:0189| seconds remaining: 299
> 06/24 22:51:21 WARNI|base_barri:0220| client arriving for the wrong barrier:
> stop_1 != start_1
> 06/24 22:51:21 INFO |base_barri:0406| calling master
> 06/24 22:51:21 INFO |base_barri:0372| master seen 0 of 1
> 06/24 22:51:21 INFO |base_barri:0189| seconds remaining: 294
> 06/24 22:51:21 INFO |base_barri:0211| new client tag=stop_1,
> name=172.25.43.234#netperf-client, hostid=172.25.43.226#netperf-server
> 06/24 22:51:21 INFO |base_barri:0246| client now waiting:
> 172.25.43.234#netperf-client (172.25.43.226:56048)
> 06/24 22:51:21 INFO |base_barri:0263| master said: wait
> 06/24 22:51:21 INFO |base_barri:0281| slave now waiting: (172.25.43.226:11922)
> 06/24 22:51:21 INFO |base_barri:0372| master seen 1 of 1
> 06/24 22:51:21 INFO |base_barri:0430| slave connected to master
> 06/24 22:51:21 INFO |base_barri:0296| checking client present:
> 172.25.43.234#netperf-client
> 06/24 22:51:21 INFO |base_barri:0189| seconds remaining: 299
> 06/24 22:51:21 INFO |base_barri:0449| master said: ping
> 06/24 22:51:21 INFO |base_barri:0461| pong
> 06/24 22:51:21 INFO |base_barri:0189| seconds remaining: 19
> 06/24 22:51:21 INFO |base_barri:0189| seconds remaining: 19
> 06/24 22:51:21 INFO |base_barri:0320| Releasing clients
> 06/24 22:51:21 INFO |base_barri:0449| master said: rlse
> 06/24 22:51:21 INFO |base_barri:0344| closing client:
> 172.25.43.234#netperf-client
> 06/24 22:51:21 INFO |base_barri:0470| was released, waiting for close
> 06/24 22:51:21 DEBUG|base_utils:0077| Running 'killall netserver'
> 06/24 22:51:21 INFO |base_barri:0189| seconds remaining: 19
> 06/24 22:51:21 INFO |base_barri:0344| closing client:
> 172.25.43.234#netperf-client
> 06/24 22:51:21 INFO | net_utils:0069| Linux 3.10.24-3.10.0.2-amd64-12321334
> (a172-25-43-226.deploy.akamaitechnologies.com) 06/24/14 _x86_64_
> (8 CPU)
> ...
> ...
> ...
> ...
>
>
>
>
> Eventually netperf itself seems to have failed
>
>
> /usr/local/autotest/results/19-debug_user$ cat
> ./172.25.43.226/netperf2.client
> /results/keyval
> stream_count{attr}=1
> CPU_C_HI{perf}=0.0
> CPU_C_SI{perf}=2.30666666667
> CPU_C_SYS{perf}=20.0366666667
> CPU_C{perf}=100.0
> Efficieny_C{perf}=0.0
> Failed_streams_count{perf}=1
> INTRS_C{perf}=71.3366666667
> Throughput{perf}=0
> actual_time{perf}=15.0036661625
> Any help in understanding these barrier failures will be very helpful.
> Thanks,
> Bhupesh
>
--
Amos.
_______________________________________________
Autotest-kernel mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/autotest-kernel