On Wed, Aug 11, 2010 at 05:13:22PM +0530, Vipul Agrawal wrote:
> On Wed, Aug 11, 2010 at 4:45 PM, Olaf Till <[email protected]> wrote:
> 
> > On Wed, Aug 11, 2010 at 11:51:43AM +0530, Vipul Agrawal wrote:
> > > >On Sat, Aug 07, 2010 at 05:31:24PM +1100, Lutaev D. A. wrote:
> > > >> We used parallel-2.0.2 and we have problems with such code:
> > > >>
> > > >> clear;
> > > >>
> > > >> hosts = [];
> > > >>
> > > >> for i = 1:nargin
> > > >>         hosts = [hosts; argv(){i, 1}];
> > > >> end
> > > >>
> > > >> hosts
> > > >>
> > > >> sockets = connect(hosts)
> > > >>
> > > >> x = rand(50, 1000);
> > > >>
> > > >> send(x, sockets(2, :));
> > > >> reval("x = recv(sockets(1, :))", sockets(2, :));
> > > >> scloseall(sockets);
> > > >>
> > > >> Programm stucks when it's trying to send x from sockets(1, :) (master) to
> > > >> slave (sockets(2, :)).
> > > >
> > > >As I said, I'm unable to reproduce the problem. Maybe it won't help,
> > > >but why don't you send a real session transcript (cut-and-paste from
> > > >your terminal running Octave) and indicate exactly the command which
> > > >"stucks"? Commands which you only intended to give are of no use to
> > > >me. Since I have no notion as yet what the cause of the problem is,
> > > >the contents of the variables "host" and "sockets" may be important;
> > > > >why don't you show it? Of course you should hide the real hostnames,
> > > >but I have to see whether they are different, whether the local
> > > >machine is among the servers, and what is the length of the hostnames.
> > > >
> > > >Do you use Octave-3.2.4 and parallel-2.0.2 on _all_ machines?
> > > >
> > > >What you still can do is to check whether the server process and child
> > > >process are running before and after the "stucking" command (on each
> > > >server machine: ps ax | grep octave    and post the output (replacing
> > > > >hostnames, of course)).
> > > >
> > > >Olaf
> > > >
> > >
> > > I am using octave-3.2.4 from maverick repo and parallel-2.0.2 build from
> > > source. I am also getting the same issue with big matrices.
> > > I could not send more than 32767 elements (2^15 - 1) of type double
> > > (8 bytes each), i.e. 262136 bytes.
> > > The reason may be an incorrect buffer size.
> > > the bufsize in pserver.cc in line 507:
> > >     int bufsize = 262144;
> > > A possible solution is to change to
> > >     int bufsize = BUFF_SIZE;
> > >
> > > Now, the no. of elements increases to about 46k which interestingly comes
> > > out to be a magic number equal to 2^15 * sqrt(2). Quite Amazing!
> > > I think there is still some other issue which stalls sending matrices larger
> > > than this size.
> > >
> > > -Vipul
> >
> > "send" does not return until the whole value is written to the
> > socket. If the value's length exceeds the socket's buffer size, a
> > process at the far end of the connection must read data for "send"
> > to be able to return. So before "send" in the master process, one must
> > first start "recv" on the other end, e.g.:
> >
> > octave:13> reval ("send (recv (sockets(1, :)), sockets(1, :))", sockets(2, :))
> > octave:14> send (ones (100, 1000000), sockets(2, :))
> > octave:15> size (recv (sockets(2, :)))
> > ans =
> >
> >       100   1000000
> >
> > octave:16>
> >
> > I don't know why the socket buffer size for outgoing connections has
> > been set to a lower value than for incoming connections in pserver.cc;
> > this probably should be corrected, since BUFF_SIZE (the higher value)
> > is considered by send.cc. But this should not be essential (only a
> > matter of efficiency).
> >
> > Thanks for the report.
> >
> > Olaf
> >
> Hi Olaf,
> Your trick works fine. Now, I am able to send and receive large matrices.
> But it seems that the data becomes corrupted during transfer.
> For example:
> hosts = ['host1';'host2';'host3'];
> sockets = connect(hosts)
> 
> sockets =
> 
>       0      0      0
>      13     11   1234
>      14     12   1234
> 
> a = rand(50);
> reval("temp = recv(sockets(1,:));",sockets(2,:));
> send(a,sockets(2,:));
> reval("send(temp,sockets(1,:))",sockets(2,:));
> b = recv(sockets(2,:));
> isequal(a,b)
> 
> ans = 0
> 
> octave:> a - b
> columns 1-7 are all zeros. columns 9-50 are all non-zero.
> Columns 6 through 10:
> 
>     0.0000e+00    0.0000e+00    0.0000e+00    6.1764e+21    2.0570e-01
>     0.0000e+00    0.0000e+00    0.0000e+00    3.4265e-02   -2.3695e+35
>     0.0000e+00    0.0000e+00    0.0000e+00    2.0374e+20  -1.0870e+267
>     0.0000e+00    0.0000e+00    0.0000e+00    9.0336e-01  -2.9740e+207
>     0.0000e+00    0.0000e+00    0.0000e+00    5.2625e-01    2.1413e-01
>     0.0000e+00    0.0000e+00    0.0000e+00    8.7713e-01  -1.8943e+280
>     0.0000e+00    0.0000e+00    0.0000e+00   -2.1483e+70    9.9818e+50
>     0.0000e+00    0.0000e+00    0.0000e+00    7.4401e-01    6.9315e+09
>     0.0000e+00    0.0000e+00    0.0000e+00    5.0168e-01    7.7556e-01
>     0.0000e+00    0.0000e+00    0.0000e+00  -1.1288e+161    1.0402e-01
>     0.0000e+00    0.0000e+00   -4.1461e+32    8.0617e-01   1.1567e+134
>     0.0000e+00    0.0000e+00   1.7965e+251  -2.4074e+254   3.1141e+147
>     0.0000e+00    0.0000e+00    1.5704e+79    4.5556e+69    8.1072e-01
>     0.0000e+00    0.0000e+00   -6.7451e+37    1.1328e-01    1.0592e+83
>     0.0000e+00    0.0000e+00    7.5392e-01   -4.6222e+83    3.1915e-01
>     .....
> 
> Has anybody else reproduced a similar problem?

Not me --- for me, "a" and "b" are identical.

Several thoughts:

- While Octave, when saving and loading data, is AFAIK supposed to
  take care of endianness and possible differences in floating point
  format, "send" and "recv" do not use the saving and loading
  functionality of Octave, and only consider endianness.

- It is difficult to rewrite "send" and "recv" to use the above
  functionality of Octave, since the latter is based on Octave's stream
  ids, which know nothing of the "externally" allocated sockets.

- However, if the first issue is the reason for the corruption, I
  wonder why the first columns of "a - b" should be zero.

- Does your slave machine have a different architecture than the
  master? If yes, you could test the above with

hosts = ["localhost"; "localhost"];

  or, if possible, with a slave machine of architecture identical to
  master.
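
To narrow this down, it may help to report the architecture and
endianness of each machine and the exact position where "a" and "b"
start to differ. The following is only a rough sketch: in a real
session "a" and "b" would come from the send/recv round trip quoted
above, while here "b" is a simulated, hypothetical corrupted copy so
that the snippet runs on its own. The cut-off at element 360 is an
assumption chosen to resemble the posted output, not a measured value.

```octave
% Sketch: report this machine's architecture and locate the first
% element in which a received matrix differs from the original.
a = rand (50);
b = a;
b(361:end) = 0;   % hypothetical corruption, for illustration only

% computer () returns the architecture string and the byte order
% ("L" for little endian, "B" for big endian); run on every machine.
[c, maxsize, endian] = computer ();
printf ("architecture: %s, endianness: %s\n", c, endian);

idx = find (a(:) != b(:), 1);   % linear index of first difference
if (isempty (idx))
  disp ("a and b are identical");
else
  [r, col] = ind2sub (size (a), idx);
  printf ("first difference at row %d, column %d\n", r, col);
  % a double occupies 8 bytes, so this many bytes arrived intact
  printf ("bytes transferred correctly: %d\n", (idx - 1) * 8);
endif
```

If the number of intact bytes turns out to be the same on every run,
that would point to a buffer or packet boundary rather than to a
difference in floating point format.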

Olaf

_______________________________________________
Octave-dev mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/octave-dev