Hi Rob,
Thanks for taking time to look at this.

I'm still wondering if the warnings I saw when compiling PVFS 2.6.2
are related (see my earlier post on upgrading from 1.5.1 to 2.6.2).

Are you seeing any block device errors in your dmesg/syslog output?
Given the fsck that you had to perform previously, it is possible that
there's a problem with the local disk or FS.

I'm not sure what lay behind those disk errors - I've changed a power
supply and haven't seen any more.  I've also completely rebuilt the
cluster since then, so PVFS 2.6.2 was upgraded from a fresh
install of 1.5.1.

No errors in dmesg, messages or syslog - I now see some in the
pvfs2-client.log (attached, see below for a description of what I do
on the machines during this time).

What local FS are you using, by the way?

ext3 on a raid 0, all nodes are partitioned identically.

To elaborate/clarify my earlier post:

I take 3 DVDs that have previously been copied to the PVFS area: DVD A,
B, C. The files on these all passed the diff verification after they
were copied.

I boot the cluster - the two PVFS2 I/O compute nodes start fine, but
the frontend hangs when loading the PVFS2 client (the server loads
[ OK ]).  After a hard reset and a filesystem check, the frontend
boots fine.

Load DVD-A on to the frontend and run the script
copy-taq-dvd-monitor.sh, which calls copy-taq-dvd.sh (both attached).
Two binary files fail the verification check; they are deleted,
copied and verified - both pass.
Now all the files on DVD-A have been verified once.

Reinsert DVD-A on the frontend and rerun the script - all files pass
the verification - no files need to be deleted/recopied/verified.
Now all the files on the PVFS2 area have been _twice_ verified as the
same as the DVD.

Insert DVD-B and DVD-C on the two nodes and DVD-A on the frontend, run
the same script on all three machines.
Now 3 files on DVD-A fail the initial verification.  One of these is
the same file that failed in the first 'run'.
DVD-B and C have 3-4 binary files that fail.
All failed files are deleted/copied/verified and pass the verification.

Stop the script running on the two compute nodes (DVD-B, C), restart
the script on the frontend to re-process DVD-A.
The diff fails, reporting a 'broken pipe'; attempts to restart the
script result in the `cp` operation for this file hanging.
At this point I stopped.
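
For anyone who wants to repeat the steps above without the attachments, here is a minimal sketch of the copy/verify cycle as I described it. It is not the attached copy-taq-dvd.sh; the function name verify_or_recopy and the output labels are made up for illustration, and /dvd and /mnt/pvfs2 in the usage comment are simply the paths used elsewhere in this thread.

```shell
#!/bin/sh
# Hypothetical sketch of the copy/verify cycle; NOT the attached scripts.
# Usage (paths as used in this thread): verify_or_recopy /dvd /mnt/pvfs2
verify_or_recopy() {
    src_dir=$1
    dest_dir=$2
    status=0
    for src in "$src_dir"/*; do
        # Skip directories and the unexpanded glob of an empty source dir
        [ -f "$src" ] || continue
        dest="$dest_dir/$(basename "$src")"
        if [ -f "$dest" ]; then
            # Existing copy: this is the 'initial' check that fails under load
            if cmp -s "$src" "$dest"; then
                echo "OK       $dest"
                continue
            fi
            echo "MISMATCH $dest (existing copy), recopying"
            rm -f "$dest"
        fi
        # Fresh copy followed by an immediate verification
        cp "$src" "$dest"
        if cmp -s "$src" "$dest"; then
            echo "COPIED   $dest"
        else
            echo "FAILED   $dest (fresh copy differs)"
            status=1
        fi
    done
    return $status
}
```

Running the same function on all three machines at once, each against its own DVD, would correspond to the 'under load' case.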

I have 3 terminals open showing:
tail -f /var/log/messages
tail -f /var/log/dmesg
tail -f /var/log/syslog
I see no output to these files throughout this exercise.

The pvfs2-client.log files from each machine are attached.

It seems this problem occurs when PVFS2 is under load?
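
One way to probe this further might be to read the same PVFS2 file several times while the other nodes are busy and see whether its checksum stays the same from one read to the next. The sketch below assumes md5sum (GNU coreutils) is installed; the function names checksum_passes and distinct_checksums are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, assuming md5sum from GNU coreutils is available.

# checksum_passes FILE COUNT: print "pass N <md5>" for COUNT reads of FILE
checksum_passes() {
    file=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        sum=$(md5sum "$file" | cut -d' ' -f1)
        echo "pass $i $sum"
        i=$((i + 1))
    done
}

# distinct_checksums FILE COUNT: number of different checksums seen
distinct_checksums() {
    checksum_passes "$1" "$2" | cut -d' ' -f3 | sort -u | wc -l | tr -d ' '
}
```

A result greater than 1 would suggest that reads of an unchanged file are returning different data, rather than the stored copy being wrong. A result of 1 proves less, since repeated reads may be served from a cache.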

Hope this helps.
Mark

Thanks,

Rob

Mark Van De Vyver wrote:
> Thanks Steve,
> I don't see any problem until I run the diff or cmp and even then
> these indicate the files are identical if the cmp is run _immediately_
> after the file copy.
> cmp and diff only indicate a difference when a file is 'checked' after
> some other files have been copied-checked.
>
> The files are from the NYSE trade and quote (TAQ) DVD's, so they are
> text stored as binary.
>
> You might be able to try the following with a dozen or so large binary
> files, I have approx 300-400GB stored in the PVFS area.
>
> Ideally the following should be run on two or more PVFS2 servers at
> the same time, apply this to several DVD's that have not been copied
> to the PVFS area, then reapply the script to the same DVD's after they
> have been copied.
> The following is a slightly simplified version of my script - here I
> don't delete and re-copy when an existing file fails the cmp
> verification:
>
> # untested script start
> for src in /dvd/*large.bin
>  do
>      # Strip the /dvd/ prefix to get the bare file name
>      fn=`basename "${src}"`
>      if [ -f "/mnt/pvfs2/${fn}" ]
>        then
>          # This should 'fail' more frequently than the cmp in the else clause
>          cmp "${src}" "/mnt/pvfs2/${fn}"
>          if [ $? != 0 ]
>            then
>              echo "Preexisting copy not exact - more frequent and random?"
>          fi
>        else
>          cp "${src}" "/mnt/pvfs2/${fn}"
>          cmp "${src}" "/mnt/pvfs2/${fn}"
>          if [ $? != 0 ]
>            then
>              echo "    Initial copy not exact - less frequent and random"
>          fi
>      fi
>  done
> # untested script end
>
> Regards
> Mark
>
> On 3/2/07, Steve <[EMAIL PROTECTED]> wrote:
>> My setup is a little different in that at the moment I have 2 I/O
>> services running on one box, a metadata service on another and a
>> client/samba server on a third. I have moved in the data via samba. We
>> have copied in mp3's and avi/mpg's as well as large ISO's plus software
>> exe's. Surely after several weeks of use we would notice some problem?
>>
>> I do have another box set up as a client that happens to have a dvd ROM
>> drive in it.
>>
>> What type of files? A vob?
>>
>> What sequence of commands would I need to run to test your problem?
>>
>> If I get a little spare time I could try for you?
>>
>> Steve
>>
>> -------Original Message-------
>>
>> From: Mark Van De Vyver
>> Date: 02/03/2007 08:18:11
>> To: Steve
>> Subject: Re: [Pvfs2-users] PVFS 2.6.2 intermittent cmp/diff failure
>>
>> Hi Steve,
>>
>> > Not sure if this helps any but I have copied over 500gb of media
>> > files to pvfs2 running on old dell's 533 to 866 CPU with very little
>> > ram running on caos3 beta 3. Although I haven't done any checks other
>> > than using the media I haven't noticed any problems.
>>
>> The failures might be spurious....?
>>
>> > Could you have problems with the dvd device ?
>>
>> I doubt it - but it may not be impossible?
>> This happens with the DVD drives on all three nodes, and when I just
>> have one node 'working' the diff/cmp failures either don't occur or
>> occur very, very rarely. Start all three nodes 'working' and I see
>> roughly 1 out of 2 binary files fail the initial diff/cmp check, but
>> very, very few (one every couple of DVD's) fail the cmp/diff check
>> immediately after the copy is done.....
>>
>> Thanks
>> Mark
>>
>> > -------Original Message-------
>> >
>> > From: Mark Van De Vyver
>> > Date: 02/03/2007 03:26:40
>> > To: [email protected]
>> > Subject: [Pvfs2-users] PVFS 2.6.2 intermittent cmp/diff failure
>> >
>> > Hi,
>> >
>> > This is a follow up on an earlier email where I reported that PVFS
>> > 1.5.1 failed to copy binary files from several DVD's.
>> >
>> > I'm running a 3 node Rocks 4.2.1 Cluster, CentOS4.4, x86_64, nodes are
>> > connected via an unmanaged switch.
>> >
>> > I have reinstalled the Rocks Cluster (all nodes), including the
>> > PVFS2 Roll.
>> > The cluster is set up with the frontend as the metadata server and the
>> > other two nodes are PVFS2 I/O servers and clients. The /mnt/pvfs2
>> > area is on a 3 disk RAID 0 partition formatted as ext3.
>> > After installing I ran the test steps in the "PVFS2 Quick Start
>> > Guide". The test steps ran without error.
>> > I upgraded to PVFS 2.6.2 on all nodes and re-ran the test steps, again
>> > no errors or problems.
>> >
>> > I build PVFS 2.6.2 with the following:
>> >
>> > ./configure --with-kernel=</path/to/kernel26/>
>> > --enable-kernel-sendfile --prefix=/usr/local/pvfs2/
>> > Then type
>> > make all
>> > make kmod_install
>> > make install
>> >
>> > On each node I have a script that lists the files on the DVD disc
>> > loaded on that node.
>> > Each file is copied if it does not exist on the HDD (PVFS area) and
>> > the copy is immediately verified:
>> >
>> > cp /dvd/file1 /mnt/pvfs2/file1
>> > cmp /dvd/file1 /mnt/pvfs2/file1
>> >
>> > `cmp` does not report any error.
>> > This has been done for 60-70 DVDs.
>> >
>> > If I insert a DVD that has previously been copied my script finds that
>> > a file exists in the PVFS area and does a `cmp` with the DVD file; if
>> > the file fails this comparison the file is deleted, copied, verified
>> > (cmp).
>> >
>> > I notice that frequently and randomly the previously copied files will
>> > fail the _initial_ `cmp` check if more than one node is 'active', i.e.
>> > processing a DVD.
>> > Once deleted and copied the second `cmp` check is passed.
>> >
>> > Some details:
>> > The files do not fail the `cmp` check immediately after being copied -
>> > only when checking a previously copied file.
>> > The `cmp` result indicates a different byte at which the files differ.
>> > Re-inserting the same DVD several times results in different files
>> > failing the first `cmp` check.
>> > The second check (immediately after the copy is finished) is always
>> > passed.
>> > This occurs rarely, if at all (i.e. I haven't noticed it), when only
>> > one node is processing a DVD.
>> > This only occurs with binary files - which are relatively large,
>> > 200MB - 2GB.
>> > This never occurs with text files - which are also small, 100's of KB.
>> > The pvfs2-client.log file is empty on each node.
>> > I have tried using diff and experience the same results.
>> >
>> > This is similar to an error I was seeing in PVFS 1.5.1 - hence the
>> > upgrade. I've also changed my previous script which `dd` copied the
>> > DVD to memory (approx 8GB), then wrote this ISO file to the PVFS2 area
>> > - this worked fine for initial copies, but failed for re-copies. At
>> > that time I wasn't verifying the copy, so it was the copy to the
>> > PVFS2 area that failed.....
>> >
>> > Finally, on one occasion when manually running `cmp` on a file I
>> > noticed the following sequence.
>> > cmp file1 file2 (pass)
>> > cmp file1 file2 (pass)
>> > diff file1 file2 (fail)
>> > cmp file1 file2 (fail)
>> >
>> > Is this known behavior with a known workaround/configuration setting?
>> > The behavior I see made me guess a caching or network issue (there are
>> > no other machines on the cluster network).
>> > Can anyone suggest PVFS configuration settings that will make PVFS more
>> > robust?
>> >
>> > I'm not a programmer or linux guru - I just spent this summer
>> > converting from winxp...
>> > I'm happy to explore some possible fixes, but don't assume too much :)
>> >
>> > Thanks in advance
>> > Mark
>> > _______________________________________________
>> > Pvfs2-users mailing list
>> > [email protected]
>> > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>> >
>>

Attachment: copy-taq-dvd-monitor.sh
Description: Bourne shell script

Attachment: copy-taq-dvd.sh
Description: Bourne shell script

Attachment: pvfs2-client.log.frontend
Description: Binary data

Attachment: pvfs2-client.log.compute-0-0
Description: Binary data

Attachment: pvfs2-client.log.compute-0-1
Description: Binary data

