Re: [OMPI devel] thread model

2007-09-05 Thread Ralph Castain
See below


On 9/5/07 7:04 PM, "Jeff Squyres"  wrote:

> Greg: sorry for the delay in replying...
> 
> I am not the authority on this stuff; can George / Brian / Terry /
> Brad / Gleb reply on this issue?
> 
> Thanks.
> 
> On Aug 28, 2007, at 12:57 PM, Greg Watson wrote:
> 
>>> Note that this is *NOT* well tested.  There is work going on right
>>> now to make the OMPI layer able to support MPI_THREAD_MULTIPLE
>>> (support was designed in from the beginning, but we have never done
>>> any kind of comprehensive testing/stressing of multi-threaded
>>> support, so it is pretty much guaranteed not to work), but that work
>>> is occurring on the trunk (i.e., what will eventually become v1.3) --
>>> not the v1.2 branch.
>>> 
>>>> The interfaces I'm calling are:
>>>>
>>>> opal_event_loop()
>>> 
>>> Brian or George will have to answer about that one...
>>> 
>>>> opal_path_findv()
>>> 
>>> This guy should be multi-thread safe (disclaimer: haven't tested it
>>> myself); it doesn't rely on any global state.
>>> 
>>>> orte_init()
>>>> orte_ns.create_process_name()
>>>> orte_iof.iof_subscribe()
>>>> orte_iof.iof_unsubscribe()
>>>> orte_schema.get_job_segment_name()
>>>> orte_gpr.get()
>>>> orte_dss.get()
>>>> orte_rml.send_buffer()
>>>> orte_rmgr.spawn_job()
>>>> orte_pls.terminate_job()
>>>> orte_rds.query()
>>>> orte_smr.job_stage_gate_subscribe()
>>>> orte_rmgr.get_vpid_range()
>>> 
>>> Note that all of ORTE is *NOT* thread safe, nor is it planned to be
>>> (it just seemed way more trouble than it was worth).  You need to
>>> serialize access to it.
>> 
>> Does that mean just calling OPAL_THREAD_LOCK() and
>> OPAL_THREAD_UNLOCK() around each?

We actually do thread locks inside of these - just a big LOCK when you
enter, with a corresponding UNLOCK when you leave - so I'm not sure how much
good you'll get from adding locks around the calls themselves. The majority
of threading issues in this area have to do with the progress engine and our
interactions with that beast - I'm not sure we entirely understand those
issues yet.
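
For concreteness, the caller-side serialization Greg asks about would look
roughly like the sketch below. This is illustrative only and not from the
original thread: opal_mutex_t and opal_mutex_lock()/opal_mutex_unlock() come
from opal/threads/mutex.h, but the wrapper function and the single-global-lock
policy are assumptions, not ORTE API. OPAL_THREAD_LOCK()/OPAL_THREAD_UNLOCK()
would also work, but note that they only take the lock when
opal_using_threads() is true.

/* Sketch only: funnel every ORTE call through one caller-owned mutex. */
#include "opal/threads/mutex.h"

static opal_mutex_t orte_serialize_lock;  /* OBJ_CONSTRUCT(&orte_serialize_lock, opal_mutex_t) once at startup */

/* Run an arbitrary ORTE entry point while holding the lock. */
static int call_orte_serialized(int (*orte_call)(void *arg), void *arg)
{
    int rc;
    opal_mutex_lock(&orte_serialize_lock);
    rc = orte_call(arg);   /* e.g. a thin wrapper around orte_gpr.get() or orte_rml.send_buffer() */
    opal_mutex_unlock(&orte_serialize_lock);
    return rc;
}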

Ralph





Re: [MTT devel] [MTT users] Test runs not getting into database

2007-09-05 Thread Jeff Squyres

Josh / Ethan --

Not getting a serial means that the client is not getting a value  
back from the server that it can parse into a serial.


Can you guys dig into this and see why the mtt dbdebug file that Tim  
has at the end of this message is not getting a serial?


Thanks...


On Sep 5, 2007, at 9:24 AM, Tim Prins wrote:


Here is the smallest one. Let me know if you need anything else.

Tim

Jeff Squyres wrote:
Can you send any one of those mtt database files?  We'll need to   
figure out if this is a client or a server problem.  :-(

On Sep 5, 2007, at 7:40 AM, Tim Prins wrote:

Hi,

BigRed has not gotten its test results into the database for a  
while.

This is running the ompi-core-testers branch. We run by passing the
results through the mtt-relay.

The mtt-output file has lines like:

*** WARNING: MTTDatabase did not get a serial; phases will be isolated
from each other in the reports

Reported to MTTDatabase: 1 successful submit, 0 failed submits
(total of 1 result)

I have the database submit files if they would help.

Thanks,

Tim



$VAR1 = {
  'exit_signal_1' => -1,
  'duration_1' => '5 seconds',
  'mpi_version' => '1.3a1r16038',
  'trial' => 0,
  'mpi_install_section_name_1' => 'bigred 32 bit gcc',
  'client_serial' => undef,
  'hostname' => 's1c2b12',
  'result_stdout_1' => '/bin/rm -f *.o *~ PI* core IMB-IO IMB-EXT IMB-MPI1 exe_io exe_ext exe_mpi1

touch IMB_declare.h
touch exe_mpi1 *.c; rm -rf exe_io exe_ext
make MPI1 CPP=MPI1
make[1]: Entering directory `/N/ptl01/mpiteam/bigred/20070905-Wednesday/pb_0/installs/d7Ri/tests/imb/IMB_2.3/src\'

mpicc  -I.  -DMPI1 -O -c IMB.c
mpicc  -I.  -DMPI1 -O -c IMB_declare.c
mpicc  -I.  -DMPI1 -O -c IMB_init.c
mpicc  -I.  -DMPI1 -O -c IMB_mem_manager.c
mpicc  -I.  -DMPI1 -O -c IMB_parse_name_mpi1.c
mpicc  -I.  -DMPI1 -O -c IMB_benchlist.c
mpicc  -I.  -DMPI1 -O -c IMB_strgs.c
mpicc  -I.  -DMPI1 -O -c IMB_err_handler.c
mpicc  -I.  -DMPI1 -O -c IMB_g_info.c
mpicc  -I.  -DMPI1 -O -c IMB_warm_up.c
mpicc  -I.  -DMPI1 -O -c IMB_output.c
mpicc  -I.  -DMPI1 -O -c IMB_pingpong.c
mpicc  -I.  -DMPI1 -O -c IMB_pingping.c
mpicc  -I.  -DMPI1 -O -c IMB_allreduce.c
mpicc  -I.  -DMPI1 -O -c IMB_reduce_scatter.c
mpicc  -I.  -DMPI1 -O -c IMB_reduce.c
mpicc  -I.  -DMPI1 -O -c IMB_exchange.c
mpicc  -I.  -DMPI1 -O -c IMB_bcast.c
mpicc  -I.  -DMPI1 -O -c IMB_barrier.c
mpicc  -I.  -DMPI1 -O -c IMB_allgather.c
mpicc  -I.  -DMPI1 -O -c IMB_allgatherv.c
mpicc  -I.  -DMPI1 -O -c IMB_alltoall.c
mpicc  -I.  -DMPI1 -O -c IMB_sendrecv.c
mpicc  -I.  -DMPI1 -O -c IMB_init_transfer.c
mpicc  -I.  -DMPI1 -O -c IMB_chk_diff.c
mpicc  -I.  -DMPI1 -O -c IMB_cpu_exploit.c
mpicc   -o IMB-MPI1 IMB.o IMB_declare.o IMB_init.o IMB_mem_manager.o IMB_parse_name_mpi1.o IMB_benchlist.o IMB_strgs.o IMB_err_handler.o IMB_g_info.o IMB_warm_up.o IMB_output.o IMB_pingpong.o IMB_pingping.o IMB_allreduce.o IMB_reduce_scatter.o IMB_reduce.o IMB_exchange.o IMB_bcast.o IMB_barrier.o IMB_allgather.o IMB_allgatherv.o IMB_alltoall.o IMB_sendrecv.o IMB_init_transfer.o IMB_chk_diff.o IMB_cpu_exploit.o
make[1]: Leaving directory `/N/ptl01/mpiteam/bigred/20070905-Wednesday/pb_0/installs/d7Ri/tests/imb/IMB_2.3/src\'

',
  'mpi_name' => 'ompi-nightly-trunk',
  'number_of_results' => '1',
  'phase' => 'Test Build',
  'compiler_version_1' => '3.3.3',
  'exit_value_1' => 0,
  'result_message_1' => 'Success',
  'start_timestamp_1' => 'Wed Sep  5 04:16:52 2007',
  'compiler_name_1' => 'gnu',
  'suite_name_1' => 'imb',
  'test_result_1' => 1,
  'mtt_client_version' => '2.1devel',
  'fields' => 'compiler_name,compiler_version,duration,exit_signal,exit_value,mpi_get_section_name,mpi_install_id,mpi_install_section_name,mpi_name,mpi_version,phase,result_message,result_stdout,start_timestamp,suite_name,test_result',

  'mpi_install_id' => undef,
  'platform_name' => 'IU_BigRed',
  'local_username' => 'mpiteam',
  'mpi_get_section_name_1' => 'ompi-nightly-trunk'
};



--
Jeff Squyres
Cisco Systems



[OMPI devel] opal_atomic_lifo is not really atomic.

2007-09-05 Thread Gleb Natapov
Hi,

  The opal_atomic_lifo implementation suffers from the ABA problem.
Here is the code for opal_atomic_lifo_pop:

 1    do {
 2        item = lifo->opal_lifo_head;
 3        if( opal_atomic_cmpset_ptr( &(lifo->opal_lifo_head),
 4                                    item,
 5                                    (void*)item->opal_list_next ) )
 6            break;
 7        /* Do some kind of pause to release the bus */
 8    } while( 1 );
 9    if( item == &(lifo->opal_lifo_ghost) ) return NULL;
10    item->opal_list_next = NULL;
11    return item;

If the following happens:

        Thread1:                                  Thread2:
1   executes line 2
2                                                 executes lines 1-11 and acquires item
3   enters line 3 but is preempted before the cmpxchg
    NOTE: the third parameter passed to cmpset is
    NULL because item is in use by thread2
4                                                 executes lifo_push(item)
5   successfully executes the cmpxchg since the old
    value is equal to the current value (ABA problem),
    but places NULL into lifo->opal_lifo_head!
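
For readers less familiar with the hazard, here is a minimal, self-contained
sketch of the same ABA-prone pop pattern outside OMPI. GCC's
__sync_bool_compare_and_swap stands in for opal_atomic_cmpset_ptr; the types
and names are illustrative, not OMPI code.

/* Illustrative only: a pointer-only CAS pop, like opal_atomic_lifo_pop. */
#include <stddef.h>

struct node { struct node *next; };

static struct node *head;           /* shared LIFO head */

struct node *pop(void)
{
    struct node *item, *next;
    do {
        item = head;
        if (item == NULL)
            return NULL;
        next = item->next;          /* captured now, may be stale at CAS time */
        /* If another thread pops 'item', does other pushes/pops, and then
         * pushes 'item' back, 'head' equals 'item' again, so the CAS below
         * succeeds and installs the stale 'next' even though the list has
         * changed underneath -- the ABA problem.  (In the interleaving shown
         * above, the stale value is NULL, so NULL ends up in the head.) */
    } while (!__sync_bool_compare_and_swap(&head, item, next));
    item->next = NULL;
    return item;
}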

The included patch seems to fix this problem, but I am not really sure
whether this is the right way to solve this kind of problem.


diff --git a/opal/class/opal_atomic_lifo.h b/opal/class/opal_atomic_lifo.h
index caf35b1..4e8148c 100644
--- a/opal/class/opal_atomic_lifo.h
+++ b/opal/class/opal_atomic_lifo.h
@@ -71,8 +71,10 @@ static inline opal_list_item_t* opal_atomic_lifo_push( opal_atomic_lifo_t* lifo,
         item->opal_list_next = lifo->opal_lifo_head;
         if( opal_atomic_cmpset_ptr( &(lifo->opal_lifo_head),
                                     (void*)item->opal_list_next,
-                                    item ) )
+                                    item ) ) {
+            opal_atomic_cmpset_32((volatile int32_t*)&item->item_free, 1, 0);
             return (opal_list_item_t*)item->opal_list_next;
+        }
         /* DO some kind of pause to release the bus */
     } while( 1 );
 #else
@@ -89,14 +91,17 @@ static inline opal_list_item_t* opal_atomic_lifo_pop( opal_atomic_lifo_t* lifo )
 {
     opal_list_item_t* item;
 #if OMPI_HAVE_THREAD_SUPPORT
-    do {
-        item = lifo->opal_lifo_head;
+    while((item = lifo->opal_lifo_head) != &(lifo->opal_lifo_ghost))
+    {
+        if(!opal_atomic_cmpset_32((volatile int32_t*)&item->item_free, 0, 1))
+            continue;
         if( opal_atomic_cmpset_ptr( &(lifo->opal_lifo_head),
                                     item,
                                     (void*)item->opal_list_next ) )
             break;
+        opal_atomic_cmpset_32((volatile int32_t*)&item->item_free, 1, 0);
         /* Do some kind of pause to release the bus */
-    } while( 1 );
+    }
 #else
     item = lifo->opal_lifo_head;
     lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
diff --git a/opal/class/opal_list.c b/opal/class/opal_list.c
index c8a5568..715715e 100644
--- a/opal/class/opal_list.c
+++ b/opal/class/opal_list.c
@@ -55,6 +55,7 @@ OBJ_CLASS_INSTANCE(
 static void opal_list_item_construct(opal_list_item_t *item)
 {
     item->opal_list_next = item->opal_list_prev = NULL;
+    item->item_free = 1;
 #if OMPI_ENABLE_DEBUG
     item->opal_list_item_refcount = 0;
     item->opal_list_item_belong_to = NULL;
diff --git a/opal/class/opal_list.h b/opal/class/opal_list.h
index 83fa57b..3a45f4e 100644
--- a/opal/class/opal_list.h
+++ b/opal/class/opal_list.h
@@ -102,6 +102,7 @@ struct opal_list_item_t
     /**< Pointer to next list item */
     volatile struct opal_list_item_t *opal_list_prev;
     /**< Pointer to previous list item */
+    int32_t item_free;
 
 #if OMPI_ENABLE_DEBUG
     /** Atomic reference count for debugging */
--
Gleb.