Hi Charles,

Thanks for the reply.
(1) The error and GLOBUS_TCP_PORT_RANGE

I have tried "globus-job-run grid2.ramscommunity.org/jobmanager-condor /bin/hostname" with iptables temporarily stopped everywhere, but it comes back with the same errors. With all ports open, GLOBUS_TCP_PORT_RANGE should be irrelevant.
Also, I can observe which connections are established with netstat. I think that something is wrong with the internal workings of the gatekeeper or jobmanager, which seems to fail to access STDOUT twice.
First, it fails to write to it to return the jobID when the job is submitted. (Then, it goes on to poll Condor.) Second, it fails to write to it to return the results when Condor hands the results back to the jobmanager.

Can you think of what could be wrong here? Could some configuration of mine be causing this?
(2) How to define GLOBUS_TCP_PORT_RANGE for the "client"

What do you mean by defining GLOBUS_TCP_PORT_RANGE for the client? Which client? For Condor-G? Where should I define it? It is defined for GridFTP. Is that enough?
Or do you mean the end user's environment? The end user has GLOBUS_TCP_PORT_RANGE=40000,41000 defined in his/her environment:

$ env | grep GLOBUS
GLOBUS_PATH=/usr/local/globus
GLOBUS_LOCATION=/usr/local/globus
GLOBUS_TCP_PORT_RANGE=40000,41000

The Condor user is defined as a normal user rather than a system user, so it gets this too, but it does not seem to have any effect.
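To double-check that the variable really reaches the jobmanager process (xinetd's env handling being my main suspect), I read the environment back from /proc. This is just my own quick check; it assumes Linux /proc and that pgrep is available:

```shell
# Read back the environment a running jobmanager was actually started with.
# /proc/<pid>/environ is the NUL-separated environment at exec time,
# i.e. exactly what xinetd passed down to the process.
pid=$(pgrep -f globus-job-manager | head -n 1)
if [ -n "$pid" ]; then
    tr '\0' '\n' < "/proc/$pid/environ" | grep GLOBUS_TCP_PORT_RANGE \
        || echo "GLOBUS_TCP_PORT_RANGE not set for pid $pid"
else
    echo "no globus-job-manager process found"
fi
```

If the variable does not show up there, then however it is set in the config files, the jobmanager never saw it.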
(3) Actual port connections opened

grid1 (Submit node): after globus-job-run was executed, the Globus-related port connections seem to be:
Proto Recv-Q Send-Q Local Address                Foreign Address              State
tcp        0      0 grid1.ramscommunity.o:40001  grid2.ramscommunity.o:47175  FIN_WAIT2    (outside range?)
tcp        0      0 grid1.ramscommunity.o:52990  grid2.ramscom:gsigatekeeper  FIN_WAIT2    (allowed, outside range?)
tcp        0      0 grid1.ramscommunity.o:40001  grid2.ramscommunity.o:47176  FIN_WAIT2    (outside range?)
tcp        0      0 grid1.ramscomm:pcsync-https  grid2.ramscommunity.o:60911  ESTABLISHED  (outside range?)
tcp        0      0 grid1.ramscommunity.o:40585  grid2.ramscomm:pcsync-https  ESTABLISHED  (has been open before)
After the GRAM job has finished, all connections are closed except for the Condor port.
grid2 (GRAM2 gatekeeper and jobmanager):

Proto Recv-Q Send-Q Local Address                Foreign Address              State
tcp        0      0 grid2.ramscommunity.o:47176  grid1.ramscommunity.o:40001  CLOSE_WAIT   (allowed, outside range?)
tcp        0      0 grid2.ramscommunity.o:47175  grid1.ramscommunity.o:40001  CLOSE_WAIT   (allowed, outside range?)
tcp        1      0 grid2.ramscom:gsigatekeeper  grid1.ramscommunity.o:52990  CLOSE_WAIT   (NOT allowed, outside range?)
tcp        0      0 grid2.ramscommunity.o:60911  grid1.ramscomm:pcsync-https  ESTABLISHED  (allowed, outside range?)
tcp        0      0 grid2.ramscomm:pcsync-https  grid2.ramscommunity.o:35508  ESTABLISHED  (allowed, internal)
tcp        0      0 grid2.ramscomm:pcsync-https  grid1.ramscommunity.o:40585  ESTABLISHED  (allowed)
tcp        0      0 grid2.ramscommunity.o:35508  grid2.ramscomm:pcsync-https  ESTABLISHED  (allowed, internal)
After the GRAM job has finished, all connections are closed except the following. This one was transient:

tcp        0      0 grid2.ramscommunity.o:35508  grid2.ramscomm:pcsync-https  TIME_WAIT
Then, these port connections remain:

Proto Recv-Q Send-Q Local Address                Foreign Address              State
tcp        0      0 grid2.ramscommunity.o:47176  grid1.ramscommunity.o:40001  CLOSE_WAIT
tcp        0      0 grid2.ramscommunity.o:47175  grid1.ramscommunity.o:40001  CLOSE_WAIT
tcp        1      0 grid2.ramscom:gsigatekeeper  grid1.ramscommunity.o:52990  CLOSE_WAIT   (Is this the problem?)
Then, much later (not sure why this is invoked):

tcp        0      0 grid2.ramscommunity.o:55963  grid2.ramscomm:pcsync-https  ESTABLISHED  (allowed, internal)
tcp        0      0 grid2.ramscommunity.o:51734  grid1.ramscomm:pcsync-https  ESTABLISHED  (allowed, outside range?)
tcp        0      0 grid2.ramscomm:pcsync-https  grid1.ramscommunity.o:32902  ESTABLISHED  (NOT allowed)
tcp        0      0 grid2.ramscomm:pcsync-https  grid2.ramscommunity.o:55963  ESTABLISHED  (allowed, internal)
Then, all of these are closed after a while?

grid4 (Condor Execute node): after globus-job-run:

tcp        0      0 grid4.ramscommunity.o:45203  grid2.ramscomm:pcsync-https  ESTABLISHED  (allowed, outside range?)
tcp        0      0 grid4.ramscomm:pcsync-https  grid2.ramscommunity.o:36708  ESTABLISHED  (NOT allowed)
Then:

tcp        0      0 grid4.ramscommunity.o:46969  grid2.ramscomm:pcsync-https  TIME_WAIT    (allowed, outside range?)
tcp        0      0 grid4.ramscommunity.o:52203  grid4.ramscomm:pcsync-https  TIME_WAIT    (allowed, outside range?)
Then, after the job was done, all were closed once. Much later, these are opened again for no obvious reason (still trying to write out the results?):

tcp        0      0 grid4.ramscomm:pcsync-https  grid2.ramscommunity.o:35554  ESTABLISHED  (NOT allowed)
tcp        0      0 grid4.ramscomm:pcsync-https  grid4.ramscommunity.o:44558  ESTABLISHED  (allowed, internal)
tcp        0      0 grid4.ramscommunity.o:44558  grid4.ramscomm:pcsync-https  ESTABLISHED  (allowed, internal)
tcp        0      0 grid4.ramscommunity.o:58535  grid2.ramscomm:pcsync-https  ESTABLISHED  (allowed, outside range?)
Overall, GLOBUS_TCP_PORT_RANGE is enforced for the jobID and most connections, but not all connections use the specified range. Is this normal?
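My understanding (please correct me if this is wrong) is that GLOBUS_TCP_PORT_RANGE constrains only the ports Globus listens on; the source ports of outbound connections are ephemeral ports picked by the kernel unless GLOBUS_TCP_SOURCE_RANGE is also set, which would explain many of the out-of-range entries above. To separate the two cases mechanically, I use a small filter of my own (not a Globus tool; the 40000-41000 bounds are hard-coded here) over netstat -tn output:

```shell
# Flag TCP connections whose LOCAL port is outside the assumed
# GLOBUS_TCP_PORT_RANGE of 40000-41000.  Foreign ports outside the
# range are expected (they are the other host's ephemeral ports).
flag_out_of_range() {
    awk -v lo=40000 -v hi=41000 '
        $1 == "tcp" {
            n = split($4, a, ":")      # local address:port is column 4
            port = a[n] + 0
            if (port < lo || port > hi)
                print $4, "->", $5, $6, "(local port outside range)"
        }'
}
```

Usage: netstat -tn | flag_out_of_range (numeric output only, so service names like gsigatekeeper do not confuse the port parsing).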
Could all the NOT-allowed connections be from previous jobs that were never cleaned up? globus-job-clean does not work for me (because writing to grid1:STDOUT from the grid2 jobmanager fails), and the jobs do not appear in condor_q. Is there a way to view the Globus queue and clean it manually?
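In the meantime, my own workaround sketch for spotting leftover job state is to look in the directory named by -state-file-dir in globus-job-manager.conf (the one-day threshold below is just my guess at "stale", not a Globus convention):

```shell
# List GRAM2 job state files older than one day; these most likely
# belong to jobs that were never cleaned up.
list_stale_gram_state() {
    # $1: the jobmanager state directory (from -state-file-dir)
    find "$1" -type f -mmin +1440 -print 2>/dev/null
}

list_stale_gram_state /usr/local/globus/tmp/gram_job_state || true
```

If a job contact URL is known, globus-job-clean with that contact should remove the job's state; when the contact was never returned (as in my case), inspecting and removing the stale state files by hand, after making sure no jobmanager is still running for them, seems to be the only fallback, as far as I can tell.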
Your help with these is much appreciated.
Thanks,
Yoichi
P.S. Attached below are the configurations of GRAM2.
----------------------------------------------------------------------------------------------------------------
GRAM2 related configurations
----------------------------------------------------------------------------------------------------------------
# cat /usr/local/globus/etc/globus-gatekeeper.conf
-x509_cert_dir /etc/grid-security/certificates
-x509_user_cert /etc/grid-security/hostcert.pem
-x509_user_key /etc/grid-security/hostkey.pem
-gridmap /etc/grid-security/grid-mapfile
-home /usr/local/globus
-e libexec
-logfile var/globus-gatekeeper.log
-port 2119
-grid_services etc/grid-services
-inetd
# cat /etc/xinetd.d/globus-gatekeeper
service gsigatekeeper
{
socket_type = stream
protocol = tcp
wait = no
user = root
env = LD_LIBRARY_PATH=/usr/local/globus/lib
server = /usr/local/globus/sbin/globus-gatekeeper
server_args = -conf /usr/local/globus/etc/globus-gatekeeper.conf
disable = no
env += GLOBUS_TCP_PORT_RANGE=40000,41000
}
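(In case xinetd's handling of env += were ever in doubt, the same stanza can also be written with both variables on a single env = line. This is just an equivalent sketch, not my current config; xinetd needs a restart after editing either way.)

service gsigatekeeper
{
    socket_type = stream
    protocol    = tcp
    wait        = no
    user        = root
    env         = LD_LIBRARY_PATH=/usr/local/globus/lib GLOBUS_TCP_PORT_RANGE=40000,41000
    server      = /usr/local/globus/sbin/globus-gatekeeper
    server_args = -conf /usr/local/globus/etc/globus-gatekeeper.conf
    disable     = no
}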
# cat $GLOBUS_LOCATION/etc/globus-job-manager.conf
-home "/usr/local/globus"
-globus-gatekeeper-host grid2.ramscommunity.org
-globus-gatekeeper-port 2119
-globus-gatekeeper-subject "/O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/CN=host/grid2.ramscommunity.org"
-globus-host-cputype i686
-globus-host-manufacturer pc
-globus-host-osname Linux
-globus-host-osversion 2.6.18-92.1.10.el5
-globus-toolkit-version 4.2.0
-save-logfile on_error
-state-file-dir /usr/local/globus/tmp/gram_job_state
-machine-type unknown
# cat $GLOBUS_LOCATION/etc/grid-services/jobmanager-condor
stderr_log,local_cred - /usr/local/globus/libexec/globus-job-manager
globus-job-manager -conf /usr/local/globus/etc/globus-job-manager.conf
-type condor -rdn jobmanager-condor -machine-type unknown -publish-jobs -condor-arch INTEL -condor-os LINUX
$ ps -ef
...
globus    5250  5208  0 17:46 pts/1  00:00:00 perl /usr/local/globus/sbin/globus-job-manager-event-generator -s condor
globus    5251  5250  0 17:46 pts/1  00:00:00 /usr/local/globus/libexec/globus-scheduler-event-generator -s condor -t 1223444972
--------------------------------------------------------------------------
Yoichi Takayama, PhD
Senior Research Fellow, RAMP Project
MELCOE (Macquarie E-Learning Centre of Excellence)
MACQUARIE UNIVERSITY
Phone: +61 (0)2 9850 9073
Fax: +61 (0)2 9850 6527
www.mq.edu.au
www.melcoe.mq.edu.au/projects/RAMP/
--------------------------------------------------------------------------
MACQUARIE UNIVERSITY: CRICOS Provider No 00002J

This message is intended for the addressee named and may contain confidential information. If you are not the intended recipient, please delete it and notify the sender. Views expressed in this message are those of the individual sender, and are not necessarily the views of Macquarie E-Learning Centre Of Excellence (MELCOE) or Macquarie University.
On 23/10/2008, at 1:28 AM, Charles Bacon wrote:
From the last message you sent yesterday, it looked to me like you weren't setting GLOBUS_TCP_PORT_RANGE in your client's environment. That's a requirement too.

No, the host/ in the DN is irrelevant.

You got back grid2 because you submitted to fork, which always runs on the same machine as the gatekeeper.

Charles

On Oct 22, 2008, at 1:54 AM, Yoichi Takayama wrote:

Hi Charles,

Thanks for the help you have been giving. I don't think that it has anything to do with GLOBUS_TCP_PORT_RANGE after it was fixed in /etc/xinetd.d/globus-gatekeeper. It seems it is something else that stops the jobmanager (?) from writing back the results.

I found something weird:
------------------------------------------------------------------------------
$ globus-job-run grid2.ramscommunity.org/jobmanager-fork /bin/hostname
grid2.ramscommunity.org
GRAM Job submission failed because data transfer to the server failed (error code 10)
------------------------------------------------------------------------------
As you can see, it returned the result but also reported the error. Also, this is not the same as the condor_submit results, which come back as grid1 or grid4 (i.e. those are Execute nodes and return their own host names, since grid2 is not an Execute node).

But this happened only the first time. From the 2nd time onwards, and for any other command (e.g. /bin/date), it always returns the same error code 10 and no result is printed out. Does that mean the normal STDOUT is blocked after it is used once?

If I use jobmanager-condor, although it reports the error, it goes ahead and submits the job, keeps polling the progress, and the job returns with success, only to fail to write out the results again.

Why do you think this happens? How can I eradicate it? Is it some lock problem? The GRAM log seems to show that an NFS sync has been attempted. Is it an NFS problem? Do I need to remove sync,no_wdelay, for example?
It does not seem to be a permission problem:

/home *.ramscommunity.org(rw,insecure,sync,no_wdelay,no_subtree_check,nohide,mp,no_root_squash)

If it is some kind of bug, should I upgrade to GT 4.2.1 just in case? Can I use the binary on CentOS 5.2, or do I have to build from source?

Also, I noticed that the MyProxy perl script generates the CA certs with CN=host/grid2..... on the MyProxy server, but following the QuickStart, the other host certs are issued with CN=grid1.... or CN=grid4... (i.e. without the host/). Does this matter?

Thanks,
Yoichi
