I'm running out of places in the code to add log messages :)

I see a possible cause that I missed before, but we should be able to check this one without a patch. Can you do a "diff" of the two configuration files and see if they are different in any way? In particular do the ID values match?
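A quick way to run that check (the real files would be /etc/pvfs2-fs.conf on wn140 and wn141, per the config shown later in the thread; the two sample files below are stand-ins so the commands can run anywhere):

```shell
# Stand-in config fragments; replace with the real /etc/pvfs2-fs.conf
# from each server when running this for real.
cat > /tmp/fs-wn140.conf <<'EOF'
Name pvfs2-fs
ID 320870944
EOF
cat > /tmp/fs-wn141.conf <<'EOF'
Name pvfs2-fs
ID 320870945
EOF
# Any mismatch (including the ID line) shows up in the diff output.
# diff exits non-zero when the files differ, hence the "|| true".
diff /tmp/fs-wn140.conf /tmp/fs-wn141.conf || true
# Pull out just the ID values for a quick side-by-side comparison:
grep '^ID' /tmp/fs-wn140.conf /tmp/fs-wn141.conf
```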

thanks,
-Phil

Asterios Katsifodimos wrote:
No, the systems are identical :)

[r...@wn140 ~]# hostname
wn140.grid.ucy.ac.cy
[r...@wn140 ~]# uname -a
Linux wn140.grid.ucy.ac.cy 2.6.9-78.0.13.ELsmp #1 SMP Wed Jan 14 19:07:47 CST 2009 i686 athlon i386 GNU/Linux
[r...@wn140 ~]# cat /etc/redhat-release
Scientific Linux SL release 4.7 (Beryllium)
[r...@wn140 pvfs-2.8.1]# more /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2214
stepping        : 2
cpu MHz         : 2200.000
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht pni syscall nx mmxext fxsr_opt rdtscp l



[r...@wn141 ~]# hostname
wn141.grid.ucy.ac.cy
[r...@wn141 ~]# uname -a
Linux wn141.grid.ucy.ac.cy 2.6.9-78.0.13.ELsmp #1 SMP Wed Jan 14 19:07:47 CST 2009 i686 athlon i386 GNU/Linux
[r...@wn141 ~]# cat /etc/redhat-release
Scientific Linux SL release 4.7 (Beryllium)
[r...@wn141 pvfs-2.8.1]# more /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2214
stepping        : 2
cpu MHz         : 2200.000
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht pni syscall nx mmxext fxsr_opt rdtscp lm


Patch applied, logs updated!
http://grid.ucy.ac.cy/file/pvfs_logwn140.grid.ucy.ac.cy
http://grid.ucy.ac.cy/file/pvfs_logwn141.grid.ucy.ac.cy

thanks,
Asterios Katsifodimos
High Performance Computing systems Lab
Department of Computer Science, University of Cyprus
http://grid.ucy.ac.cy


On Mon, Apr 6, 2009 at 10:03 PM, Phil Carns <[email protected]> wrote:

    That didn't show what I expected at all.  It must have hit a safety
    check on the request parameters.  Could you try adding in the
    attached patch as well?

    What kind of systems are these?  Are the two servers different
    architectures by any chance?


    thanks,
    -Phil

    Asterios Katsifodimos wrote:

        Thanks!
        I have applied the patch.

        I have replaced the old logs with the new ones. Just use the
        previous links.
        http://grid.ucy.ac.cy/file/pvfs_logwn140.grid.ucy.ac.cy
        http://grid.ucy.ac.cy/file/pvfs_logwn141.grid.ucy.ac.cy

        thanks a lot for your help,
        On Mon, Apr 6, 2009 at 8:41 PM, Phil Carns <[email protected]> wrote:

           Thanks for posting the logs.  It looks like the create_list
           function within Trove actually generated the EINVAL error, but
           there aren't enough log messages in that path to know why.

           Any chance you could apply the patch attached to this email and
           retry this scenario (with verbose logging)?  I'm hoping for some
           extra output after the line that looks like this:

           (0x8d4f020) batch_create (prelude sm) state: perm_check (status = 0)


           thanks,
           -Phil


           Asterios Katsifodimos wrote:

                Yes, both of them, because both are now metadata servers.
                When I had one metadata server and one IO server, the
                metadata server did not produce the errors until the IO
                server came up.  From the moment the IO server comes up,
                the metadata server goes crazy...

               I have uploaded the log files here:
               http://grid.ucy.ac.cy/file/pvfs_logwn140.grid.ucy.ac.cy
               http://grid.ucy.ac.cy/file/pvfs_logwn141.grid.ucy.ac.cy

               have a look!

               thanks!
               On Mon, Apr 6, 2009 at 7:00 PM, Phil Carns <[email protected]> wrote:

                   Ok.  Could you try "verbose" now as the log level?  It is
                   close to the "all" level but should only print information
                   while the server is busy.

                   Are both wn140 and wn141 showing the same batch create
                   errors, or just one of them?


                  thanks,
                  -Phil

                  Asterios Katsifodimos wrote:

                       Hello Phil,

                       Thanks for your answer.
                       Yes, I delete the storage directory every time I make
                       a new configuration, and I run the pvfs2-server -f
                       command before starting the daemons.

                       The only thing I get from the servers is the
                       batch_create error, the server-startup message, and
                       the "PVFS2 server got signal 15 (server_status_flag:
                       507903)" error message.  Do you want me to try
                       another log level?

                       Also, this is how the server is configured:
                       ***** Displaying PVFS Configuration Information *****
                       ------------------------------------------------------
                       PVFS2 configured to build karma gui              : no
                       PVFS2 configured to perform coverage analysis    : no
                       PVFS2 configured for aio threaded callbacks      : yes
                       PVFS2 configured to use FUSE                     : no
                       PVFS2 configured for the 2.6.x kernel module     : no
                       PVFS2 configured for the 2.4.x kernel module     : no
                       PVFS2 configured for using the mmap-ra-cache     : no
                       PVFS2 will use workaround for redhat 2.4 kernels : no
                       PVFS2 will use workaround for buggy NPTL         : no
                       PVFS2 server will be built                       : yes

                       PVFS2 version string: 2.8.1


                      thanks again,
                       On Mon, Apr 6, 2009 at 5:21 PM, Phil Carns <[email protected]> wrote:

                         Hello,

                          I'm not sure what would cause that "Invalid
                          argument" error.

                         Could you try the following steps:

                          - kill both servers
                          - modify your configuration files to set
                            "EventLogging" to "none"
                          - delete your old log files (or move them to
                            another directory)
                          - start the servers
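The steps above can be sketched as shell commands. This runs against a temporary copy of the config so it is safe to try anywhere; the real commands for each node are shown in comments, and the config/log paths are taken from the thread but should be adjusted to the actual setup:

```shell
# Demo copy of the config fragment we want to edit (stand-in, not the
# real /etc/pvfs2-fs.conf).
demo_conf=/tmp/pvfs2-fs.conf.demo
printf 'EventLogging all\nLogFile /tmp/pvfs2-server.log\n' > "$demo_conf"

# 1. kill both servers (run on each node):
#      killall pvfs2-server
# 2. set EventLogging to "none" (GNU sed in-place edit):
sed -i 's/^EventLogging .*/EventLogging none/' "$demo_conf"
# 3. move the old log aside rather than deleting it:
#      mv /tmp/pvfs2-server.log /tmp/pvfs2-server.log.old
# 4. restart the servers (run on each node):
#      pvfs2-server /etc/pvfs2-fs.conf
grep EventLogging "$demo_conf"
```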

                          You can then send us the complete contents of both
                          log files and we can go from there.  The "all"
                          level is a little hard to interpret because it
                          generates a lot of information even when servers
                          are idle.

                          Also, when you went from one server to two, did
                          you delete your old storage space (/pvfs) and
                          start over, or are you trying to keep that data
                          and add servers to it?

                         thanks!
                         -Phil

                         Asterios Katsifodimos wrote:

                             Hello all,

                             I have been trying to install PVFS 2.8.1 on
                             Ubuntu Server, CentOS 4, and Scientific Linux 4.
                             I compile it and can run it in a "single host"
                             configuration without any problems.

                             However, when I add more nodes to the
                             configuration (always using the pvfs2-genconfig
                             defaults) I have the following problem:

                             *On the metadata node I get these messages:*
                             [E 04/02 20:16] batch_create request got: Invalid argument
                             [E 04/02 20:16] batch_create request got: Invalid argument
                             [E 04/02 20:16] batch_create request got: Invalid argument
                             [E 04/02 20:16] batch_create request got: Invalid argument


                              *In the IO nodes I get:*
                              [r...@wn140 ~]# tail -50 /tmp/pvfs2-server.log
                              [D 04/02 23:53] BMI_testcontext completing: 18446744072456767880
                              [D 04/02 23:53] [SM Entering]: (0x88f8b00) msgpairarray_sm:complete (status: 1)
                              [D 04/02 23:53] [SM frame get]: (0x88f8b00) op-id: 37 index: 0 base-frm: 1
                              [D 04/02 23:53] msgpairarray_complete: sm 0x88f8b00 status_user_tag 1 msgarray_count 1
                              [D 04/02 23:53]   msgpairarray: 1 operations remain
                              [D 04/02 23:53] [SM Exiting]: (0x88f8b00) msgpairarray_sm:complete (error code: -1073742006), (action: DEFERRED)
                              [D 04/02 23:53] [SM Entering]: (0x88f8b00) msgpairarray_sm:complete (status: 0)
                              [D 04/02 23:53] [SM frame get]: (0x88f8b00) op-id: 37 index: 0 base-frm: 1
                              [D 04/02 23:53] msgpairarray_complete: sm 0x88f8b00 status_user_tag 0 msgarray_count 1
                              [D 04/02 23:53]   msgpairarray: all operations complete
                              [D 04/02 23:53] [SM Exiting]: (0x88f8b00) msgpairarray_sm:complete (error code: 190), (action: COMPLETE)
                              [D 04/02 23:53] [SM Entering]: (0x88f8b00) msgpairarray_sm:completion_fn (status: 0)
                              [D 04/02 23:53] [SM frame get]: (0x88f8b00) op-id: 37 index: 0 base-frm: 1
                              [D 04/02 23:53] (0x88f8b00) msgpairarray state: completion_fn
                              [E 04/02 23:53] Warning: msgpair failed to tcp://wn141:3334, will retry: Connection refused
                              [D 04/02 23:53] *** msgpairarray_completion_fn: msgpair 0 failed, retry 1
                              [D 04/02 23:53] *** msgpairarray_completion_fn: msgpair retrying after delay.
                              [D 04/02 23:53] [SM Exiting]: (0x88f8b00) msgpairarray_sm:completion_fn (error code: 191), (action: COMPLETE)
                              [D 04/02 23:53] [SM Entering]: (0x88f8b00) msgpairarray_sm:post_retry (status: 0)
                              [D 04/02 23:53] [SM frame get]: (0x88f8b00) op-id: 37 index: 0 base-frm: 1
                              [D 04/02 23:53] msgpairarray_post_retry: sm 0x88f8b00, wait 2000 ms
                              [D 04/02 23:53] [SM Exiting]: (0x88f8b00) msgpairarray_sm:post_retry (error code: 0), (action: DEFERRED)
                              [D 04/02 23:53] [SM Entering]: (0x89476c0) perf_update_sm:do_work (status: 0)
                              [P 04/02 23:53] Start times (hr:min:sec):  23:53:11.330  23:53:10.310  23:53:09.287  23:53:08.268  23:53:07.245  23:53:06.225
                              [P 04/02 23:53] Intervals (hr:min:sec)  :  00:00:01.026  00:00:01.020  00:00:01.023  00:00:01.019  00:00:01.023  00:00:01.020
                              [P 04/02 23:53] -------------------------------------------------------------------------------------------------------------
                              [P 04/02 23:53] bytes read          : 0 0 0 0 0 0
                              [P 04/02 23:53] bytes written       : 0 0 0 0 0 0
                              [P 04/02 23:53] metadata reads      : 0 0 0 0 0 0
                              [P 04/02 23:53] metadata writes     : 0 0 0 0 0 0
                              [P 04/02 23:53] metadata dspace ops : 0 0 0 0 0 0
                              [P 04/02 23:53] metadata keyval ops : 1 1 1 1 1 1
                              [P 04/02 23:53] request scheduler   : 0 0 0 0 0 0
                              [D 04/02 23:53] [SM Exiting]: (0x89476c0) perf_update_sm:do_work (error code: 0), (action: DEFERRED)
                              [D 04/02 23:53] [SM Entering]: (0x8948810) job_timer_sm:do_work (status: 0)
                              [D 04/02 23:53] [SM Exiting]: (0x8948810) job_timer_sm:do_work (error code: 0), (action: DEFERRED)
                              [D 04/02 23:53] [SM Entering]: (0x89476c0) perf_update_sm:do_work (status: 0)
                              [P 04/02 23:53] Start times (hr:min:sec):  23:53:12.356  23:53:11.330  23:53:10.310  23:53:09.287  23:53:08.268  23:53:07.245
                              [P 04/02 23:53] Intervals (hr:min:sec)  :  00:00:01.020  00:00:01.026  00:00:01.020  00:00:01.023  00:00:01.019  00:00:01.023
                              [P 04/02 23:53] -------------------------------------------------------------------------------------------------------------
                              [P 04/02 23:53] bytes read          : 0 0 0 0 0 0
                              [P 04/02 23:53] bytes written       : 0 0 0 0 0 0
                              [P 04/02 23:53] metadata reads      : 0 0 0 0 0 0
                              [P 04/02 23:53] metadata writes     : 0 0 0 0 0 0
                              [P 04/02 23:53] metadata dspace ops : 0 0 0 0 0 0
                              [P 04/02 23:53] metadata keyval ops : 1 1 1 1 1 1
                              [P 04/02 23:53] request scheduler   : 0 0 0 0 0 0
                              [D 04/02 23:53] [SM Exiting]: (0x89476c0) perf_update_sm:do_work (error code: 0), (action: DEFERRED)
                              [D 04/02 23:53] [SM Entering]: (0x8948810) job_timer_sm:do_work (status: 0)
                              [D 04/02 23:53] [SM Exiting]: (0x8948810) job_timer_sm:do_work (error code: 0), (action: DEFERRED)


                              The metadata node keeps asking for something
                              that the IO nodes cannot provide correctly, so
                              it complains.  As a result, neither the IO
                              nodes nor the metadata node works.

                              I have installed these services many times.  I
                              have tested this using Berkeley DB 4.2 and 4.3
                              on Red Hat systems (CentOS, Scientific Linux)
                              and on Ubuntu Server.

                              I have also tried PVFS version 2.6.3 and I get
                              the same problem.

                             *My config files look like:*
                             [r...@wn140 ~]# more /etc/pvfs2-fs.conf
                             <Defaults>
                                UnexpectedRequests 50
                                EventLogging all
                                EnableTracing no
                                LogStamp datetime
                                BMIModules bmi_tcp
                                FlowModules flowproto_multiqueue
                                PerfUpdateInterval 1000
                                ServerJobBMITimeoutSecs 30
                                ServerJobFlowTimeoutSecs 30
                                ClientJobBMITimeoutSecs 300
                                ClientJobFlowTimeoutSecs 300
                                ClientRetryLimit 5
                                ClientRetryDelayMilliSecs 2000
                                PrecreateBatchSize 512
                                PrecreateLowThreshold 256

                                StorageSpace /pvfs
                                LogFile /tmp/pvfs2-server.log
                             </Defaults>

                             <Aliases>
                                Alias wn140 tcp://wn140:3334
                                Alias wn141 tcp://wn141:3334
                             </Aliases>

                             <Filesystem>
                                Name pvfs2-fs
                                ID 320870944
                                RootHandle 1048576
                                FileStuffing yes
                                 <MetaHandleRanges>
                                     Range wn140 3-2305843009213693953
                                     Range wn141 2305843009213693954-4611686018427387904
                                 </MetaHandleRanges>
                                 <DataHandleRanges>
                                     Range wn140 4611686018427387905-6917529027641081855
                                     Range wn141 6917529027641081856-9223372036854775806
                                 </DataHandleRanges>
                                <StorageHints>
                                    TroveSyncMeta yes
                                    TroveSyncData no
                                    TroveMethod alt-aio
                                </StorageHints>
                             </Filesystem>
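As a quick sanity check on the config above, the four handle ranges can be verified to be non-empty and non-overlapping (the values are taken verbatim from the config; the check itself is just an illustration — bash integer comparison is 64-bit, so these values compare safely):

```shell
# Ranges are listed in ascending order as lo:hi pairs, exactly as they
# appear in MetaHandleRanges and DataHandleRanges above.
prev_hi=2
for r in 3:2305843009213693953 \
         2305843009213693954:4611686018427387904 \
         4611686018427387905:6917529027641081855 \
         6917529027641081856:9223372036854775806; do
  lo=${r%%:*}
  hi=${r##*:}
  # Each range must start after the previous one ended...
  [ "$lo" -gt "$prev_hi" ] || { echo "overlap at $r"; exit 1; }
  # ...and must not be empty.
  [ "$lo" -le "$hi" ]      || { echo "empty range $r"; exit 1; }
  prev_hi=$hi
done
echo "handle ranges OK"
```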


                              My setup is made of two nodes that are both IO
                              and metadata nodes.  I have also tried a 4-node
                              setup with 2 IO and 2 MD nodes, resulting in
                              the same thing.

                             Any suggestions?

                             thank you in advance,
                             --
                             Asterios Katsifodimos
                             High Performance Computing systems Lab
                              Department of Computer Science, University of Cyprus
                              http://www.asteriosk.gr


------------------------------------------------------------------------

                             _______________________________________________
                             Pvfs2-users mailing list
                              [email protected]

http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users

