Re: [reiserfs-list] 2.4.19-pre7 / corruption on unwanted reboot

2002-05-06 Thread Hans Reiser

Dirk Mueller wrote:

>Hi, 
>
>I've seen HEAVY file corruption on unwanted reboots (like pressing the reset 
>button accidentally) on reiserfs with this kernel on 3 machines now. 
>
>The symptom is that it finds a LOT of files to "unlink" on journal replay, 
>which I find suspicious as those machines are lightly loaded. 
>
>I didn't follow the development too closely the last few weeks, but I 
>believe that something turned worse in this respect lately. 
>
>Note that reiserfsck doesn't find any error in the file system structure 
>before and after the journal replay on reboot, 
>still many files (especially those that were not touched for several hours 
>before the reboot) contain complete garbage after the journal replay. 
>
>
>Dirk
>
>
>  
>
Were these files being written to near the time of the reboot?

hans





Re: [reiserfs-list] Performance question

2002-05-06 Thread Hans Reiser

glob is implemented by the shell not the filesystem.  This is not for 
good reason, it just is.  We could write something for you to do it in 
the filesystem and it would be faster.  Is your need for speed critical 
enough to justify writing something special for it?

Hans


Oleg Drokin wrote:

>Hello!
>
>On Sun, May 05, 2002 at 04:20:13PM +0200, Philipp Gühring wrote:
>
>  
>
>>Let's say I have a directory with 100.000 files in it.
>>The filenames look like
>>name1_name2_name3_id
>>So I have
>>001_41052_50125_1
>>001_63216_1212_1
>>I have to create a search engine, that serves for example the 4th block of 10 
>>files that match the query "001_*_1212_1". The whole query would result in 100 
>>files, that are spread across the directory.
>>Now my question:
>>Is it faster with ReiserFS to do a bsd_glob("001_*_1212_1") first, which 
>>should result in about 100 entries, and then take the entries 40 to 49 from 
>>the resulting array? 
>>(Is ReiserFS able to directly return 100 files out of 100.000 with the 
>>globbing function, or is it an iteration over all files in the directory?)
>>
>>
>
>*glob functions are implemented by various library functions, that do full
>readdir scans at least once, I believe.
>
>  
>
>>Or should I do 2 opendir-readdir loops, one to read over the first 39 
>>results, that I do not need, and the second one to get the results 40 to 49?
>>
>>
>
>In fact I do not see why you need to do 2 opendir-readdir loops.
>One loop should be enough.
>You just compare each filename returned against your query and if it matches,
>remember it in a separate list. So at the end of the readdir loop you have a list of
>all names in the directory that match your query. And you can apply any additional
>check in place so as not to remember unnecessary files.
>
>  
>
>>The problem here is that I have to readdir about 50.000 files (40.000 to get 
>>through the unneeded results, and 10.000 to get the 10 results I need)
>>But on the other hand, I do not have to remember 100 files, from which I only 
>>need 10.
>>
>>
>
>I am completely missing the idea on where these numbers are from. Can you
>explain in more detail?
>
>  
>
>>If ReiserFS has to iterate over 100.000 files (the whole directory) to do a 
>>"001_*_1212_1" glob, because the binary tree only speeds up known files, but 
>>not patterns, then opendir-readdir should be faster, I guess.
>>
>>
>
>A binary tree only helps when you know the filename, I believe. You calculate
>a hash and from that hash you can quickly find the desired location.
>If you come up with a hash that places all filenames like yours near one another,
>this will help, then.
>
>  
>
>>Another option would be to use subdirectories like
>>name1/name2/name3/id
>>So the glob would be "001/*/1212/1", which should be faster, anyway.
>>But on the other hand, I would have to do a lot more directory management, 
>>creating and deleting directories ...
>>And implementing an opendir-readdir search through "001/*/1212/1" will be 
>>more work too.
>>
>>
>
>Readdir would require fewer iterations through 001/*, because the number of
>entries will be only 100 as you described above.
>You get all these 100 entries and then loop 100 times trying to open
>001/${next_name}/1212/1 and deciding whether you need this file or not.
>(If it exists of course, or you might get -ENOENT and proceed to the next
>directory).
>Also deleting directories would be overkill.
>I think this might be faster in many circumstances.
>Also what you've described looks very much like what squid does. And the squid people
>went to the reiserfs-raw interface and are quite happy with it.
>
>
>Bye,
>Oleg
>
>
>  
>






Re: [reiserfs-list] Error Code 255 and "Permission Denied"

2002-05-06 Thread Oleg Drokin

Hello!

On Sun, May 05, 2002 at 12:18:28PM -0400, Daniel Christiansen wrote:

> The message in the logfile was "Node (8272) with wrong level (0) found in
> the tree (should be 1)." [There was also a message on the screen of
> "pass_through_tree: unable to read 2949119 block on device 0x4."]  The
> --fix-fixable option didn't seem to do anything.
> [Although, if I recall correctly, there was a message having to do with
> resizing that I didn't understand.]

Can you find that message in your logs? It may be an important one.

> I started to use the --rebuild-tree option but was dissuaded by the
> message about only using it if I was desperate.  Perhaps I need to take
> a deep breath and use this option.

Make sure to get latest reiserfsprogs (v3.x.1b) from namesys ftp site.

> Other indications of a problem:  I ran the dmesg program from /bin and
> got "Warning log replay starting on readonly filesystem" and lots of
> "i/o failure trying to find stat data" messages.

Your filesystem was corrupted by something. Do you have Windows on that box,
too?

> "kernel: is_tree_node: node level 0 does not match to the expected one 1
> kernel: vs-5150: search_by_key: invalid format found in block 8272.
> Fsck?"

This message confirms damaged blocks theory.

> I would appreciate any suggestions as to how fix my problem.

Get latest reiserfsprogs package and run reiserfsck --rebuild-tree.

Bye,
Oleg



Re: [reiserfs-list] 2.4.19-pre7 / corruption on unwanted reboot

2002-05-06 Thread Dirk Mueller

On Sat, 04 May 2002, Chris Mason wrote:

> Hmmm, not good at all.  Are these 3 systems IDE or scsi?  Do they run
> additional patches on top of pre7?  What kernels < pre7 have you tried
> that didn't show this problem?

All IDE. The kernel that didn't show this problem was 2.4.16 (plain). No 
additional patches on 2.4.19-pre7. 


Dirk



Re: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread Chris Mason

On Sat, 2002-05-04 at 10:59, Hans Reiser wrote:
>
> So how about if you revise fsync so that it always sends data blocks to 
> the journal not to the main disk?

This gets a little sticky.

Once you log a block, it might be replayed after a crash.  So, you have
to protect against corner cases like this:

write(file)
fsync(file) ; /* logs modified data blocks */
write(file) ; /* write the same blocks without fsync */
sync ; /* user expects new version of the blocks on disk */


During replay, the logged data blocks overwrite the blocks sent to disk
via sync().

This isn't hard to correct for, every time a buffer is marked dirty, you
check the journal hash tables to see if it is replayable, and if so you
log it instead (the 2.2.x code did this due to tails).  This translates
to increased CPU usage for every write.
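
A toy model may make the corner case and the check easier to see. This is only a sketch, not reiserfs code: the class, the `replayable` set (standing in for the journal hash tables) and the `check_on_dirty` switch are all made up for illustration.

```python
class ToyJournal:
    """Toy model of a data-logging journal; hypothetical, not reiserfs code."""

    def __init__(self, check_on_dirty):
        self.disk = {}            # block number -> contents in the main area
        self.log = []             # (block number, contents) in commit order
        self.replayable = set()   # stand-in for the journal hash tables
        self.check_on_dirty = check_on_dirty

    def fsync_write(self, block, data):
        # fsync path: the data block goes to the journal, not the main disk.
        self.log.append((block, data))
        self.replayable.add(block)

    def plain_write(self, block, data):
        # Buffer marked dirty: with the check enabled, a block that is still
        # replayable is logged again so that replay ends with the new data.
        if self.check_on_dirty and block in self.replayable:
            self.log.append((block, data))
        else:
            self.disk[block] = data   # written in place by sync()

    def replay(self):
        # Crash recovery: logged blocks overwrite the main area, in order.
        for block, data in self.log:
            self.disk[block] = data
        return self.disk


def run(check_on_dirty):
    """write + fsync, write again + sync, then crash and replay block 7."""
    journal = ToyJournal(check_on_dirty)
    journal.fsync_write(7, "v1")
    journal.plain_write(7, "v2")
    return journal.replay()[7]
```

Without the check, `run(False)` ends with the stale "v1" after replay; with it, `run(True)` ends with "v2". The set lookup on every `plain_write` is the per-write cost mentioned above.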

I'd rather not put it back in because it adds yet another corner case to
maintain for all time.  Most of the fsync/O_SYNC bound applications are
just given their own partition anyway, so most users that need data
logging need it for every write.

-chris







Re: [reiserfs-list] Performance question

2002-05-06 Thread Philipp Gühring

Hello!

Thank you Oleg for your answers.

> *glob functions are implemented by various library functions, that do full
> readdir scans at least once, I believe.

I thought I heard about a syscall, that makes it possible to pass the glob to 
the filesystem, so that the filesystem can optimize globbings as it likes, 
and pass the result back to the application, but ok.

> > Or should I do 2 opendir-readdir loops, one to read over the first 39
> > results, that I do not need, and the second one to get the results 40 to
> > 49?
>
> In fact I do not see why you need to do 2 opendir-readdir loops.
> One loop should be enough.

Yeah. Sure. My mistake. One opendir, and 2 readdir loops. The first one skips 
over unneeded results and the second one serves the data.
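
A minimal sketch of that idea, assuming Python's `os.scandir` as the readdir loop and `fnmatch` for the pattern; the function names and the page numbering (page 4 = results 40-49) are only illustrative. The skipped matches are thrown away as they go by, so no more than one page of names is kept.

```python
import fnmatch
import itertools
import os


def matching_names(directory, pattern):
    """Yield names in `directory` that match the shell-style pattern."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if fnmatch.fnmatchcase(entry.name, pattern):
                yield entry.name


def page_of_matches(directory, pattern, page, page_size=10):
    """Return one page of matches (page 0 is the first), in readdir order."""
    start = page * page_size
    matches = matching_names(directory, pattern)
    # islice discards the first `start` matches and stops after the page,
    # so at most page_size names are held in memory.
    return list(itertools.islice(matches, start, start + page_size))
```

`page_of_matches("/some/dir", "001_*_1212_1", 4)` would then give results 40-49. The order is whatever the directory scan returns, so the pages only line up between requests while the directory does not change.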

> You just compare each filename returned against your query and if it
> matches, remember it in a separate list. So at the end of the readdir loop you
> have a list of all names in the directory that match your query. And you can
> apply any additional check in place so as not to remember unnecessary files.
>
> > The problem here is that I have to readdir about 50.000 files (40.000 to
> > get through the unneeded results, and 10.000 to get the 10 results I need)
> > But on the other hand, I do not have to remember 100 files, from which I
> > only need 10.
>
> I am completely missing the idea on where these numbers are from. Can you
> explain in more detail?

I will try.
I have a table with 100.000 files. A complete search would return, for example, 
100 files, which are spread across the whole directory.
About every thousand files, there is one file, that matches the query.
Since the client does not want to get 100 files at once, at first I return 
only 10 results for the first page, and the user can navigate page-wise.

So I built up the scenario where the user now wants to see results 40-49 
from the query "001_*_1212_1", 
which I assume as normal behaviour for my application.

> A binary tree only helps when you know the filename, I believe. 

Ok.

> Readdir would require fewer iterations through 001/*, because the number of
> entries will be only 100 as you described above.
> You get all these 100 entries and then loop 100 times trying to open
> 001/${next_name}/1212/1 and deciding whether you need this file or not.
> (If it exists of course, or you might get -ENOENT and proceed to the next
> directory).
> Also deleting directories would be overkill.

So the question is, how big that overkill is.
Is there perhaps a benchmark that tested it already?

> I think this might be faster in many circumstances.
> Also what you've described looks very much like what squid does. And the squid
> people went to the reiserfs-raw interface and are quite happy with it.

I think the difference to squid is that they only need one result, not a part 
of a search, with more than one result.
But I am thinking about using reiserfs-raw too ...
(At the moment flexibility has still more priority for me than raw 
performance)

Many greetings,
-- 
~ Philipp Gühring  [EMAIL PROTECTED]
~ http://www.livingxml.net/   ICQ UIN: 6588261



Re: [reiserfs-list] 33 bad sectors kills 20GB

2002-05-06 Thread Kuba Ober

On Saturday 04 May 2002 10:58 am, Dax Kelson wrote:
> This is a 20GB filesystem, used dd_rescue to make an image of the drive:
>
> Summary for /dev/hda3 -> hda3.img:
> dd_rescue: (info):
> ipos:      19711944.0k
> opos:      19711944.0k
> xferd:     19711944.0k
> errs:      33
> errxfer:   16.5k
> succxfer:  19711927.5k
> avg.rate:  11743kB/s
>
> I then ran reiserfsck on the image file:
>
> look_for_lost: 6 files seem to left not linked to lost+found
> Objects without names 1377
> Empty lost dirs removed 10
> Dirs linked to /lost+found: 149
> Dirs without stat data found 5
> Files linked to /lost+found 1228
> Pass 4 - done left 298, 0 /sec
> Deleted unreachable items 117
> Syncing..done
> Done
>
> (That took 3+ hours on a PIII 900, 512MB ram box)
>
> I mounted the image, and pretty much everything is gone.
>
> Is it normal for 33 bad sectors (16.5k) of data (out of 19711944k) to
> completely kill the fs?
>
> Would I have better luck if I used the -A option on dd_rescue?
>
> "-A Always write blocks, zeroed if err (def=no);"

What you did without using -A is the following:

fs with errors:

xpbbxxx

after dd:

xpxxx--

Now assume that p is a pointer in the fs data structures that pointed to the 
last block (the last x). Now it points nowhere.

You have reorganized your filesystem without updating all the pointers there 
are in the metadata. Nothing is going to rescue that wreck, short of some 
by-hand hex editing.

You *absolutely* need to use -A. Or at least that's how I understand things.
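
The picture above can be written out as a toy model. This only models the diagram in this message; it does not call dd_rescue, and whether dd_rescue really compacts the image without -A is the assumption hedged above. The block labels and both function names are made up for illustration.

```python
def copy_dropping_bad(blocks):
    """Copy that skips unreadable blocks, so later blocks shift forward."""
    good = [b for b in blocks if b is not None]
    return good + ["-"] * (len(blocks) - len(good))


def copy_zero_filling(blocks):
    """Copy that writes a zeroed block for each unreadable one (the -A idea)."""
    return ["0" if b is None else b for b in blocks]


# "xpbbxxx": None marks an unreadable block; "p" holds a pointer to block 6.
source = ["x0", "p", None, None, "x4", "x5", "x6"]
pointer_target = 6

shifted = copy_dropping_bad(source)    # the "xpxxx--" case
filled = copy_zero_filling(source)     # block positions preserved
```

In `shifted`, position 6 no longer holds "x6", so the pointer in `p` is dangling; in `filled`, position 6 is still "x6" and only the two bad blocks are lost.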

You may also need to run reiserfsck with --rebuild-tree, although try without it 
first. I hope you have the latest version of reiserfsprogs -- if not, you 
*have* to get the latest one.

Cheers, Kuba



Re: [reiserfs-list] 2.4.19-pre7 / corruption on unwanted reboot

2002-05-06 Thread Chris Mason

On Mon, 2002-05-06 at 08:36, Dirk Mueller wrote:
> On Sat, 04 May 2002, Chris Mason wrote:

[ reiserfs corruption after a crash, 2.4.19pre7 ]

> 
> > Hmmm, not good at all.  Are these 3 systems IDE or scsi?  Do they run
> > additional patches on top of pre7?  What kernels < pre7 have you tried
> > that didn't show this problem?
> 
> All IDE. The kernel that didn't show this problem was 2.4.16 (plain). No 
> additional patches on 2.4.19-pre7. 

Please tell us everything about your IDE config.  Jens and I are already
trying to track down some odd reiserfs + ide problems on 2.4.19pre7, but
so far that was only with our barrier write patches applied.

-chris




RE: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread berthiaume_wayne

I'll add the write caching into the test just for info. Until there
is a way to guarantee the data is safe I'll have to go with no write caching
though. I should have all this testing done by the end of the week.

-----Original Message-----
From: Chris Mason [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 03, 2002 6:00 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: [reiserfs-list] fsync() Performance Issue


On Fri, 2002-05-03 at 16:35, [EMAIL PROTECTED] wrote:
>   Chris, I have some quick preliminary results for you. I have
> additional testing to perform and haven't run debugreiserfs() yet. If you
> have a preference for which tests to run debugreiserfs() let me know.
>   Base testing was done against 2.4.13 built on RH 7.1 using the
> test_writes.c code I forwarded to you. The system is a Tyan with single
> PIII, IDE Promise 20269, Maxtor 160GB drive - write cache disabled. All
> numbers are with fsync() and 1KB files. As I said, more testing, i.e.
> filesizes, need to be performed.

> 2.4.19-pre7 speedup, data logging, write barrier / no options
>   => 47.1ms/file

Hi Wayne, thanks for sending these along.

I expected a slight improvement over the 2.4.13 code even with the data
logging turned off.  I'm curious to see how it does with the IDE cache
turned on.  With scsi, I see 10-15% better without any options than an
unpatched kernel.

> 2.4.19-pre7 speedup, data logging, write barrier / data=journal
>   => 25.2ms/file
> 2.4.19-pre7 speedup, data logging, write barrier / data=journal,barrier=none
>   => 27.8ms/file

The barrier option doesn't make much difference because the write cache
is off.  With write cache on, the barrier code should allow you to be
faster than with the caching off, but without risking the data (Jens and
I are working on final fsync safety issues though).
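
As a rough reading of the three quoted figures (arithmetic only, no new measurements), a small sketch; the helper names are just for illustration.

```python
def speedup(before_ms, after_ms):
    """How many times faster `after_ms` is than `before_ms`."""
    return before_ms / after_ms


def reduction_percent(before_ms, after_ms):
    """Percentage of per-file time saved going from before to after."""
    return 100.0 * (before_ms - after_ms) / before_ms


# Per-file times quoted above, write cache disabled, 1KB files with fsync().
no_options = 47.1                # patched kernel, no mount options
data_journal = 25.2              # data=journal
data_journal_no_barrier = 27.8   # data=journal,barrier=none

journal_speedup = speedup(no_options, data_journal)
journal_saving = reduction_percent(no_options, data_journal)
```

That works out to roughly 1.87x, or about 46.5% less time per file, for data=journal over no options on this one test box.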

Hans, data=journal turns on the data journaling.  The data journaling
patches also include optimizations to write metadata back to disk in
bigger chunks for tiny transactions (the current method is to write one
transaction's worth back, when a transaction has 3 blocks, this is
pretty slow).

I've put these patches up on:

ftp.suse.com/pub/people/mason/patches/data-logging

>   One question is will these patches be going into the 2.4 tree and
> when?

The data logging patches are a huge change, but the good news is they
are based on the nesting patches that have been stable for a long time
in the quota code.  I'll probably want a month or more of heavy testing
before I think about submitting them.

-chris




Re: [reiserfs-list] Performance question

2002-05-06 Thread Oleg Drokin

Hello!

On Sun, May 05, 2002 at 06:43:45PM +0200, Philipp Gühring wrote:

> > *glob functions are implemented by various library functions, that do full
> > readdir scans at least once, I believe.
> I thought I heard about a syscall, that makes it possible to pass the glob to 
> the filesystem, so that the filesystem can optimize globbings as it likes, 
> and pass the result back to the application, but ok.

I do not think something like that exists in Linux. But if you
come up with man page from section 2...

> > > Or should I do 2 opendir-readdir loops, one to read over the first 39
> > > results, that I do not need, and the second one to get the results 40 to
> > > 49?
> > In fact I do not see why you need to do 2 opendir-readdir loops.
> > One loop should be enough.
> Yeah. Sure. My mistake. One opendir, and 2 readdir loops. The first one skips 
> over unneeded results and the second one serves the data.

No. Still I think you need only one loop anyway, like this:

DIR *dir = opendir(name);
while ((result = readdir(dir)) != NULL) {
        /* struct dirent carries the entry name in d_name */
        if (check_filename_criteria(result->d_name)) {
                add_to_list_of_files_to_process(result->d_name);
        }
}
closedir(dir);
for i in list_of_files_to_process {
        process_file(i);
}

So only one loop, and the second one does not count because it serves the
actual data.

> > > The problem here is that I have to readdir about 50.000 files (40.000 to
> > > get through the unneeded results, and 10.000 to get the 10 results I need)
> > > But on the other hand, I do not have to remember 100 files, from which I
> > > only need 10.
> > I am completely missing the idea on where these numbers are from. Can you
> > explain in more detail?
> I will try.
> I have a table with 100.000 files. A complete search would return, for example, 
> 100 files, which are spread across the whole directory.
> About every thousand files, there is one file, that matches the query.
> Since the client does not want to get 100 files at once, at first I return 
> only 10 results for the first page, and the user can navigate page-wise.
> So I built up the scenario where the user now wants to see results 40-49 
> from the query "001_*_1212_1", 
> which I assume as normal behaviour for my application.

Ah, I see what you mean. If you have a lot of resources, you can set up a session
and store all the search results for that session on the server side.
So when the second request comes in, you just read the search results from the session.
Also you kill the session after 5 minutes of inactivity on it or
so. Hm... This requires cookies to be enabled, though. ;)
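
Something along these lines, as a minimal sketch only: the class name, the in-memory dict and the injectable clock are assumptions for illustration, and how the session id reaches the server (cookie or otherwise) is left out.

```python
import time

SESSION_TTL = 5 * 60  # seconds of inactivity before a session is dropped


class SearchSessions:
    """Hypothetical in-memory cache of search results, keyed by session id."""

    def __init__(self, ttl=SESSION_TTL, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._sessions = {}  # session id -> (results, last access time)

    def store(self, session_id, results):
        """Remember the full result list of a finished search."""
        self._sessions[session_id] = (list(results), self.clock())

    def page(self, session_id, page, page_size=10):
        """Return one page of cached results, or None if the session expired."""
        self.expire()
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        results, _ = entry
        self._sessions[session_id] = (results, self.clock())  # refresh timer
        start = page * page_size
        return results[start:start + page_size]

    def expire(self):
        """Drop every session that has been idle for longer than the TTL."""
        now = self.clock()
        stale = [sid for sid, (_, seen) in self._sessions.items()
                 if now - seen > self.ttl]
        for sid in stale:
            del self._sessions[sid]
```

The first request runs the directory scan once and calls `store()`; later requests call `page(session_id, 4)` for results 40-49 and only rescan when they get None back.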

> > Readdir would require fewer iterations through 001/*, because the number of
> > entries will be only 100 as you described above.
> > You get all these 100 entries and then loop 100 times trying to open
> > 001/${next_name}/1212/1 and deciding whether you need this file or not.
> > (If it exists of course, or you might get -ENOENT and proceed to the next
> > directory).
> > Also deleting directories would be overkill.
> So the question is, how big that overkill is.

I mean that you do not need to delete directories, when they are empty.
You only need to create the directory structure once.
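
For the name1/name2/name3/id layout, the probing loop quoted above could be sketched like this. The function and parameter names are only illustrative; a missing path shows up in Python as `FileNotFoundError`, which is how ENOENT is reported.

```python
import os


def find_in_subdirs(root, name1, name3, file_id):
    """Return the paths root/name1/<name2>/name3/file_id that exist."""
    found = []
    top = os.path.join(root, name1)
    for name2 in os.listdir(top):          # one readdir pass over 001/*
        candidate = os.path.join(top, name2, name3, file_id)
        try:
            # open() fails with ENOENT when this name2 has no such entry
            with open(candidate, "rb"):
                found.append(candidate)
        except (FileNotFoundError, NotADirectoryError):
            continue                       # nothing here, try the next name2
    return found
```

`find_in_subdirs(root, "001", "1212", "1")` corresponds to the "001/*/1212/1" query. Empty directories left behind are simply skipped by the except branch, so they do no harm.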

> Is there perhaps a benchmark that tested it already?

No, I do not think so, but feel free to compose and run your own benchmark.

> > I think this might be faster in many circumstances.
> > Also what you've described looks very much like what squid does. And the squid
> > people went to the reiserfs-raw interface and are quite happy with it.
> I think the difference to squid is that they only need one result, not a part 
> of a search, with more than one result.

Hm. This is true.

Bye,
Oleg



Re: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread Hans Reiser

Chris Mason wrote:

>On Sat, 2002-05-04 at 10:59, Hans Reiser wrote:
>  
>
>>So how about if you revise fsync so that it always sends data blocks to 
>>the journal not to the main disk?
>>
>>
>
>This gets a little sticky.
>
>Once you log a block, it might be replayed after a crash.  So, you have
>to protect against corner cases like this:
>
>write(file)
>fsync(file) ; /* logs modified data blocks */
>write(file) ; /* write the same blocks without fsync */
>sync ; /* user expects new version of the blocks on disk */
>
>
>During replay, the logged data blocks overwrite the blocks sent to disk
>via sync().
>
>This isn't hard to correct for, every time a buffer is marked dirty, you
>check the journal hash tables to see if it is replayable, and if so you
>log it instead (the 2.2.x code did this due to tails).  This translates
>to increased CPU usage for every write.
>
Significantly increased CPU usage?

>
>I'd rather not put it back in because it adds yet another corner case to
>maintain for all time.  Most of the fsync/O_SYNC bound applications are
>just given their own partition anyway, so most users that need data
>logging need it for every write.
>
most users don't know enough to turn it on;-)

>
>-chris
>
>
>
>
>
>
>  
>






Re: [reiserfs-list] 2.4.19-pre7 / corruption on unwanted reboot

2002-05-06 Thread Dirk Mueller

On Mon, 06 May 2002, Chris Mason wrote:

> Please tell us everything about your IDE config.  Jens and I are already
> trying to track down some odd reiserfs + ide problems on 2.4.19pre7, but
> so far that was only with our barrier write patches applied.

There is not much in common. Two of them are VIA 686 southbridge (KT133A, 
KT333), one is something older, a Pentium chipset. 

DMA 100 and DMA 66. We all use those Maxtor 80GB EIDE disks. 


Dirk



Re: [reiserfs-list] 2.4.19-pre7 / corruption on unwanted reboot

2002-05-06 Thread Chris Mason

On Mon, 2002-05-06 at 09:59, Dirk Mueller wrote:
> On Mon, 06 May 2002, Chris Mason wrote:
> 
> > Please tell us everything about your IDE config.  Jens and I are already
> > trying to track down some odd reiserfs + ide problems on 2.4.19pre7, but
> > so far that was only with our barrier write patches applied.
> 
> There is not much in common. Two of them are VIA 686 southbridge (KT133A, 
> KT333), one is something older, a Pentium chipset. 
> 
> DMA 100 and DMA 66. We all use those Maxtor 80GB EIDE disks. 

Any suggestions on how I might reproduce locally?

-chris





Re: [reiserfs-list] Error Code 255 and "Permission Denied"

2002-05-06 Thread Daniel Christiansen

>>> Oleg Drokin <[EMAIL PROTECTED]> 05/06/02 07:21 AM >>>

Make sure to get latest reiserfsprogs (v3.x.1b) from namesys ftp site.

I did this and installed it.  But, of course, my rescue disk has the old
3.x.0j version not the new 3.x.1b version.  How can I use the new
version?

> Other indications of a problem:  I ran the dmesg program from /bin and
> got "Warning log replay starting on readonly filesystem" and lots of
> "i/o failure trying to find stat data" messages.

Your filesystem was corrupted by something. Do you have Windows on that
box, too?

I have windows on the first drive, which I rarely use, and a vfat
windows partition on the second drive so that I could transfer files.

I'm not sure, but I think my problems started after a power outage.


> "kernel: is_tree_node: node level 0 does not match to the expected one 1
> kernel: vs-5150: search_by_key: invalid format found in block 8272.
> Fsck?"

This message confirms damaged blocks theory.

> I would appreciate any suggestions as to how fix my problem.

Get latest reiserfsprogs package and run reiserfsck --rebuild-tree.

I presume I'm supposed to remount my / drive as read only.  I'm not sure
how to do that.  I'll try to find out.


Thanks for the help.

Dan


Bye,
Oleg




Re: [reiserfs-list] Error Code 255 and "Permission Denied"

2002-05-06 Thread Oleg Drokin

Hello!

On Mon, May 06, 2002 at 12:07:26PM -0400, Daniel Christiansen wrote:

> Make sure to get latest reiserfsprogs (v3.x.1b) from namesys ftp site.
> I did this and installed it.  But, of course, my rescue disk has the old
> 3.x.0j version not the new 3.x.1b version.  How can I use the new
> version?

Copy the new version to your rescue floppy.

> > Other indications of a problem:  I ran the dmesg program from /bin and
> > got "Warning log replay starting on readonly filesystem" and lots of
> > "i/o failure trying to find stat data" messages.
> Your filesystem was corrupted by something. Do you have Windows on that
> box, too?
> I have windows on the first drive, which I rarely use, and a vfat
> windows partition on the second drive so that I could transfer files.
> I'm not sure, but I think my problems started after a power outage.

Hm. Do you have write cache enabled on your harddrive? That may explain
your problems (and yes, most of drive manufacturers do enable write caching
by default).

> Get latest reiserfsprogs package and run reiserfsck --rebuild-tree.
> I presume I'm supposed to remount my / drive as read only.  I'm not sure
> how to do that.  I'll try to find out.

No, simple remounting won't help; you need to boot off some rescue media and run
reiserfsck on a completely unmounted partition.
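
One way to double-check from the rescue system that the partition is not mounted before starting reiserfsck, as a sketch only: it assumes /proc is available there, and the device name /dev/hdb1 is made up. The root filesystem may be listed under a different name than the partition, so a False result is a hint, not proof.

```python
def is_mounted(device, mounts_text):
    """Return True if `device` is a mount source in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # The first field of each line is the mounted device.
        if fields and fields[0] == device:
            return True
    return False


def read_proc_mounts(path="/proc/mounts"):
    """Read the kernel's list of current mounts."""
    with open(path, "r") as f:
        return f.read()
```

Usage would be `is_mounted("/dev/hdb1", read_proc_mounts())`; only go ahead with --rebuild-tree when it returns False.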

Bye,
Oleg



Re: [reiserfs-list] 2.4.19-pre7 / corruption on unwanted reboot

2002-05-06 Thread Dirk Mueller

On Mon, 06 May 2002, Chris Mason wrote:

> Any suggestions on how I might reproduce locally?

Not much. Maybe try a lot of open, unlinked files when pressing reset and 
then check the md5sums of all files.
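
A rough sketch of such a test, with made-up helper names: record the checksums of a tree before the reset, hold a number of open but unlinked files, press reset, and compare the checksums after reboot. The manifest has to be saved somewhere that survives the crash (another machine, for instance); that part is not shown.

```python
import hashlib
import os


def md5_manifest(root):
    """Map each regular file under `root` (relative path) to its MD5 digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            digest = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(path, root)] = digest.hexdigest()
    return manifest


def changed_files(before, after):
    """Return paths whose digest differs or that disappeared."""
    return sorted(p for p in before if after.get(p) != before[p])


def hold_unlinked_files(directory, count):
    """Create `count` files, unlink them, and return the still-open handles."""
    handles = []
    for i in range(count):
        path = os.path.join(directory, f"unlinked_{i}")
        f = open(path, "wb")
        f.write(b"x" * 4096)
        f.flush()
        os.unlink(path)   # the name is gone, the inode lives while f is open
        handles.append(f)
    return handles
```

Files that were not touched for hours before the reset should come back with identical digests, so any path reported by `changed_files()` would be a candidate for the garbage described at the start of this thread.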


Dirk



Re: [reiserfs-list] Error Code 255 and "Permission Denied"

2002-05-06 Thread Daniel Christiansen

Everything seems to work by using the 3.x.1b version of reiserfsck with
--rebuild-tree.  Thank you very much, Oleg, for taking the time to solve
my problem.

I don't know anything about the "write cache enabled" issue below.  Is
this something I have to change with a jumper, a bios setting, or a
software configuration?

Thanks again.

> > Other indications of a problem:  I ran the dmesg program from /bin and
> > got "Warning log replay starting on readonly filesystem" and lots of
> > "i/o failure trying to find stat data" messages.
> Your filesystem was corrupted by something. Do you have Windows on that
> box, too?
> I have windows on the first drive, which I rarely use, and a vfat
> windows partition on the second drive so that I could transfer files.
> I'm not sure, but I think my problems started after a power outage.

Hm. Do you have write cache enabled on your harddrive? That may explain
your problems (and yes, most of drive manufacturers do enable write
caching by default).








Re: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread Hans Reiser

Chris Mason wrote:

>On Sat, 2002-05-04 at 10:59, Hans Reiser wrote:
>  
>
>>So how about if you revise fsync so that it always sends data blocks to 
>>the journal not to the main disk?
>>
>>
>
>This gets a little sticky.
>
>Once you log a block, it might be replayed after a crash.  So, you have
>to protect against corner cases like this:
>
>write(file)
>fsync(file) ; /* logs modified data blocks */
>write(file) ; /* write the same blocks without fsync */
>sync ; /* user expects new version of the blocks on disk */
>
>
>During replay, the logged data blocks overwrite the blocks sent to disk
>via sync().
>
>This isn't hard to correct for, every time a buffer is marked dirty, you
>check the journal hash tables to see if it is replayable, and if so you
>log it instead (the 2.2.x code did this due to tails).  This translates
>to increased CPU usage for every write.
>
>I'd rather not put it back in because it adds yet another corner case to
>maintain for all time.  Most of the fsync/O_SYNC bound applications are
>just given their own partition anyway, so most users that need data
>logging need it for every write.
>
Does mozilla's mail user agent use fsync?  Should I give it its own 
partition?  I bet it is fsync bound;-)

Also, I don't think you can reasonably expect most persons to know that 
they should turn data logging on for high fsync performance, even if you 
document it.

Most persons using small fsyncs are using it because the person who 
wrote their application wrote it wrong.  What's more, many of the 
persons who wrote those applications cannot understand that they did it 
wrong even if you tell them (e.g. qmail author reportedly cannot 
understand, sendmail guys now understand but had Kirk McKusick on their 
staff and attending the meeting when I explained it to them so they are 
not very typical).  

In other words, handling stupidity is an important life skill, and we 
all need to excel at it.;-)

Tell me what your thoughts are on the following:

If you ask randomly selected ReiserFS users (not the reiserfs-list, but 
the ones who would never send you an email)  the following 
questions, what percentage will answer which choice?

The filesystem you are using is named:

a) the Performance Optimized SuSE FS

b) NTFS

c) FAT

d) ext2

e) ReiserFS

If you want to change reiserfs to use data journaling you must do which:

a) reinstall the reiserfs package using rpm

b) modify /etc/fs.conf

c) reinstall the operating system from scratch, and select different 
options during the install this time

d) reformat your reiserfs partition using mkreiserfs

e) none of the above

f) all of the above except e)


What do you think the chances are that you can convince Hubert that 
every SuSE Enterprise Edition user should be asked at install time if 
they are going to use fsync a lot on each partition, and to use a 
different fstab setting if yes?

I know that you are an experienced sysadmin who was good at it.  Your 
intuition tells you that most sysadmins are like the ones you were 
willing to hire into your group at the university.  They aren't.

Linux needs to be like a telephone.  You plug it in, push buttons, and 
talk.  It works well, but most folks don't know why.

A moderate number of programs are small fsync bound for the simple 
reason that it is simpler to write them that way.  We need to cover 
over their simplistic designs.

So, you have my sympathies Chris, because I believe you that it makes 
the code uglier and it won't be a joy to code and test.  I hope you also 
see that it should be done.

Hans




Re: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread Chris Mason

On Mon, 2002-05-06 at 17:21, Hans Reiser wrote:
>
> >I'd rather not put it back in because it adds yet another corner case to
> >maintain for all time.  Most of the fsync/O_SYNC bound applications are
> >just given their own partition anyway, so most users that need data
> >logging need it for every write.
> >
> Does mozilla's mail user agent use fsync?  Should I give it its own 
> partition?  I bet it is fsync bound;-)

[ I took Wayne off the cc list, he's probably not horribly interested ]

Perhaps, but I'll also bet the fsync performance hit doesn't affect the
performance of the system as a whole.  Remember that data=journal
doesn't make the fsyncs fast, it just makes them faster.

> 
> Most persons using small fsyncs are using it because the person who 
> wrote their application wrote it wrong.  What's more, many of the 
> persons who wrote those applications cannot understand that they did it 
> wrong even if you tell them (e.g. qmail author reportedly cannot 
> understand, sendmail guys now understand but had Kirk McKusick on their 
> staff and attending the meeting when I explained it to them so they are 
> not very typical).  
> 
> In other words, handling stupidity is an important life skill, and we 
> all need to excel at it.;-)

A real strength to linux is the application designers can talk directly
to their own personal bottlenecks.  Hopefully we reward those that hunt
us down and spend the time convincing us their applications are worth
tuning for.  They then proceed to beat the pants off their competition.

> 
> Tell me what your thoughts are on the following:
> 
> If you ask randomly selected ReiserFS users (not the reiserfs-list, but 
> the ones who would never send you an email)  the following 
> questions, what percentage will answer which choice?
> 
> The filesystem you are using is named:
> 
> a) the Performance Optimized SuSE FS
> 
> b) NTFS
> 
> c) FAT
> 
> d) ext2
> 
> e) ReiserFS

I believe the ones that know what a filesystem is will answer ReiserFS.
You might get a lot of ext2 answers, just because that's what a lot of
people think the linux filesystem is.

> 
> If you want to change reiserfs to use data journaling you must do which:
> 
> a) reinstall the reiserfs package using rpm
> 
> b) modify /etc/fs.conf
> 
> c) reinstall the operating system from scratch, and select different 
> options during the install this time
> 
> d) reformat your reiserfs partition using mkreiserfs
> 
> e) none of the above
> 
> f) all of the above except e)

These people won't be admins of systems big enough for the difference to
matter.  Data journaling is targeted at people with so much load they
would have to buy more hardware to make up for it.  The new option
lowers the price to performance ratio, which is exactly what we want to
do for sendmails, egeneras, lycos, etc.  If it takes my laptop 20ms to
deliver a mail message, cutting the time down to 10ms just won't matter.
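
The arithmetic behind the 20ms vs. 10ms remark can be sketched as follows.
The two latencies come from the paragraph above; the figure of 500 messages
per day for a laptop is an assumption chosen only for illustration.

```python
def messages_per_second(latency_ms):
    """Upper bound on serial deliveries per second at a given per-message latency."""
    return 1000.0 / latency_ms


def busy_fraction(messages_per_day, latency_ms):
    """Fraction of a day spent delivering mail at the given latency."""
    seconds_busy = messages_per_day * latency_ms / 1000.0
    return seconds_busy / 86400.0


if __name__ == "__main__":
    laptop_messages_per_day = 500   # assumed load, not a measured value
    for latency in (20, 10):
        rate = messages_per_second(latency)
        busy = busy_fraction(laptop_messages_per_day, latency)
        print(f"{latency} ms/message: at most {rate:.0f} msgs/s, "
              f"laptop busy {busy:.6%} of the day")
```

Halving the latency doubles the ceiling from 50 to 100 messages per second,
which matters to a loaded mail server.  At the assumed laptop load the disk
is busy for about 10 seconds a day instead of 5, which nobody will notice.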

> 
> 
> What do you think the chances are that you can convince Hubert that 
> every SuSE Enterprise Edition user should be asked at install time if 
> they are going to use fsync a lot on each partition, and to use a 
> different fstab setting if yes?

Very little.  I might tell them to buy the SuSE email server instead,
since that would have the settings done right.  data=journal is just a
small part of mail server tuning.
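
For readers wondering what "a different fstab setting" would look like, here
is a minimal sketch.  It assumes the data logging patch is switched on with a
mount option spelled data=journal, as the wording in this thread suggests;
the exact spelling the patch accepts is not confirmed here.  The device and
mount point in the usage lines are hypothetical.

```python
def add_mount_option(fstab_line, option):
    """Return fstab_line with `option` added to its options field (4th column)."""
    fields = fstab_line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 whitespace-separated fstab fields")
    options = fields[3].split(",")
    if option not in options:      # do not add the same option twice
        options.append(option)
    fields[3] = ",".join(options)
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical entry for a dedicated mail spool partition.
    line = "/dev/hda7 /var/spool/mail reiserfs defaults 0 2"
    print(add_mount_option(line, "data=journal"))
```

This only edits the text of one entry.  Whether the kernel accepts the
option depends on the data logging patch being applied.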

> 
> I know that you are an experienced sysadmin who was good at it.  Your 
> intuition tells you that most sysadmins are like the ones you were 
> willing to hire into your group at the university.  They aren't.
> 
> Linux needs to be like a telephone.  You plug it in, push buttons, and 
> talk.  It works well, but most folks don't know why.
> 

Exactly.  I think there are 3 classes of users at play here.

1) Those who don't understand and don't have enough load to notice.
2) Those who don't understand and do have enough load to notice.
3) Those who do understand and do have enough load to notice.

#2 will buy support from someone, and they should be able to configure
the thing right.

#3 will find the docs and do it right themselves.

> A moderate number of programs are small fsync bound for the simple 
> reason that it is simpler to write them that way.  We need to cover 
> over their simplistic designs.
> 
> So, you have my sympathies Chris, because I believe you that it makes 
> the code uglier and it won't be a joy to code and test.  I hope you also 
> see that it should be done.

Mostly, I feel this kind of tuning is a mistake right now.  The patch is
young and there are so many places left to tweak...I'm still at the
stage where much larger improvements are possible, and a better use of
coding time.  Plus, it's Monday and it's always more fun to debate than
give in on Mondays.

-chris





Re: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread Hans Reiser

Chris Mason wrote:

>On Mon, 2002-05-06 at 17:21, Hans Reiser wrote:
>  
>
>>>I'd rather not put it back in because it adds yet another corner case to
>>>maintain for all time.  Most of the fsync/O_SYNC bound applications are
>>>just given their own partition anyway, so most users that need data
>>>logging need it for every write.
>>>
>>>  
>>>
>>Does mozilla's mail user agent use fsync?  Should I give it its own 
>>partition?  I bet it is fsync bound;-)
>>
>>
>
>[ I took Wayne off the cc list, he's probably not horribly interested ]
>
>Perhaps, but I'll also bet the fsync performance hit doesn't affect the
>performance of the system as a whole.
>
I suspect that on my laptop, downloading emails is disk bound due to 
fsync().  I haven't measured it, but it "feels" that way.

>
>Mostly, I feel this kind of tuning is a mistake right now.  The patch is
>young and there are so many places left to tweak...I'm still at the
>stage where much larger improvements are possible, and a better use of
>coding time.  Plus, it's monday and it's always more fun to debate than
>give in on mondays.
>
>-chris
>
>
>
>
>  
>

Needing more time to finish analyzing what is going on and what fixes it 
best is always a good reason to defer things.

Hans




Re: [reiserfs-list] Error Code 255 and "Permission Denied"

2002-05-06 Thread Manuel Krause

On 05/06/2002 09:12 PM, Daniel Christiansen wrote:

> Everything seems to work by using the 3.x.1b version of reiserfsck with
> --rebuild-tree.  Thank you very much, Oleg, for taking the time to solve
> my problem.
> 
> I don't know anything about the "write cache enabled" issue below.  Is
> this something I have to change with a jumper, a bios setting, or a
> software configuration?
> 


Huh! Maybe you can jumper this on your MB, too?! Or maybe also set it in 
your BIOS?

Use "hdparm -i /dev/drive-whatever" to get this information from your 
disk drive.
If your drive has a cache, you'll find its size in kB after 
"BuffSize=". If write caching is already on, whether through your 
jumpers, your BIOS settings or the drive's defaults, you'll also see 
"WriteCache=enabled". If it is not on, you can enable it with 
"hdparm -W1 /dev/drive-whatever" and disable it again with the 
parameter "-W0" instead of "-W1". And, of course, have a look at the 
manpage -- this feature is marked "(DANGEROUS)" (just in case your 
hardware does not support it)! I have had hdparm v4.6 on here since 
January, but I just saw a v4.9 on
  ftp://sunsite.unc.edu/pub/Linux/system/hardware and on
  ftp://metalab.unc.edu/pub/Linux/system/hardware/.
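
A minimal sketch of picking the two fields mentioned above ("BuffSize=" and
"WriteCache=") out of "hdparm -i" output.  The SAMPLE text below is a
made-up illustration of the layout, not output captured from a real drive,
and the function name is invented for this example.

```python
import re

# Illustrative text only; real "hdparm -i" output has more fields and
# differs between drives and hdparm versions.
SAMPLE = """\
/dev/hda:

 Model=EXAMPLE DISK 40GB, FwRev=A1.00, SerialNo=0000000000
 BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
 AdvancedPM=no WriteCache=enabled
"""


def parse_hdparm_identify(text):
    """Pull BuffSize (in kB) and the WriteCache state out of `hdparm -i` style text."""
    buff = re.search(r"BuffSize=(\d+)kB", text)
    cache = re.search(r"WriteCache=(\w+)", text)
    return {
        "buff_size_kb": int(buff.group(1)) if buff else None,
        "write_cache": cache.group(1) if cache else None,
    }


if __name__ == "__main__":
    print(parse_hdparm_identify(SAMPLE))
```

A field that is missing from the text comes back as None, so a drive that
does not report its write cache state is easy to tell apart from one that
reports it as disabled.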

> Thanks again.
> 
> 
>>>Other indications of a problem:  I ran the dmesg program from /bin
>>>
> and
> 
>>>got "Warning log replay starting on readonly filesystem" and lots of
>>>"i/o failure trying to find stat data" messages.
>>>
>>Your filesystem was corrupted by something. Do you have Windows on
>>that box, too?
>>I have windows on the first drive, which I rarely use, and a vfat
>>windows partition on the second drive so that I could transfer files.
>>I'm not sure, but I think my problems started after a power outage.
>>
> 
> Hm. Do you have write cache enabled on your harddrive? That may explain
> your problems (and yes, most of drive manufacturers do enable write
> caching
> by default).
> 


Oleg, do you really think a dumb crashing Windows could be the reason?? 
What version do you run on your disks? I have a Win98 spread over the 
partitions of my 2 disks, but the only thing I get from a crashed 
(= powered-off) Win98 is many, many unusable files in Win98 -- not 
affecting my Linux system.
When I have a complete crash (= power-off) while VMware is running 
with Win98 inside (it's a dual boot system, and VMware accesses the real 
partitions from Linux), I get 2 to 5 truncates to complete on 
restart/mount of my reiserfs / and the "usual" vfat problems later 
(missing files, checks needed and so on).

Maybe it's the kernel <-reiserfs-sub-version-> version Daniel runs at 
the moment? Oleg?

Or is it somehow connected to Chris Mason's thread "2.4.19-pre7 / 
corruption on unwanted reboot"??? If Chris and Jens found bugs in the IDE 
interaction with ReiserFS they should really put out a patch soon... ;-)

Chris M.? Is that possibly related? Just a thought!


Best regards,

Manuel




[reiserfs-list] BTW: 2.4.19-patches-to-come?

2002-05-06 Thread Manuel Krause

Hi!

BTW, for 2.4.19-final it would be very nice to have...

1.) the deleted/truncated/completed files on mount at least
 printed in the kernel logs, at best with the real filename,
 as afterwards they are not retrievable.
 That's a security matter: anyone can trigger a crash
 by various methods (I know the admin should guard
 against this... but on my home system I'd like to have
 that info, too), and to get a file back from backups you
 have to know which one it was. Who knows that before a
 crash? Am I missing something?

2.) a disk/drive/partition distinction in reiserfs related
 messages -- Oleg, you promised this would become real, and best
 would be a real "patch"!

3.) a hint on how to turn data journaling on/off for "some"
 of our existing reiserfs partitions, if that exists at all
 for now, and why it could be needed in some cases.

4.) a hint why there is iicache code in the latest
 speedup-compound-patch (so that the latest iicache patch
 would not apply)


Best regards for your stable ReiserFS, at all,

even under "settings" with 2.4.19-pre7 +reiserfs.pending +latest 
reiserfs.compound-speedup +aa.vm-for-2.4.19-pre7 +akpm.read-latency-2 
+rml.preempt-kernel + rml.lock-break +some-more nice aa.patches... 
That's a valuably fast & interactive experience!


Manuel





Re: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread Manuel Krause

On 05/07/2002 12:57 AM, Chris Mason wrote:

> On Mon, 2002-05-06 at 17:21, Hans Reiser wrote:
> 
>>>I'd rather not put it back in because it adds yet another corner case to
>>>maintain for all time.  Most of the fsync/O_SYNC bound applications are
>>>just given their own partition anyway, so most users that need data
>>>logging need it for every write.
>>>
>>>
>>Does mozilla's mail user agent use fsync?  Should I give it its own 
>>partition?  I bet it is fsync bound;-)
>>
> 
> [ I took Wayne off the cc list, he's probably not horribly interested ]
> 
> Perhaps, but I'll also bet the fsync performance hit doesn't affect the
> performance of the system as a whole.  Remember that data=journal
> doesn't make the fsyncs fast, it just makes them faster.
> 
> 
>>Most persons using small fsyncs are using it because the person who 
>>wrote their application wrote it wrong.  What's more, many of the 
>>persons who wrote those applications cannot understand that they did it 
>>wrong even if you tell them (e.g. qmail author reportedly cannot 
>>understand, sendmail guys now understand but had Kirk McKusick on their 
>>staff and attending the meeting when I explained it to them so they are 
>>not very typical).  
>>
>>In other words, handling stupidity is an important life skill, and we 
>>all need to excel at it.;-)
>>
> 
> A real strength to linux is the application designers can talk directly
> to their own personal bottlenecks.  Hopefully we reward those that hunt
> us down and spend the time convincing us their applications are worth
> tuning for.  They then proceed to beat the pants off their competition.
> 
> 
>>Tell me what your thoughts are on the following:
>>
>>If you ask randomly selected ReiserFS users (not the reiserfs-list, but 
>>the ones who would never send you an email)  the following 
>>questions, what percentage will answer which choice?
>>
>>The filesystem you are using is named:
>>
>>a) the Performance Optimized SuSE FS
>>
>>b) NTFS
>>
>>c) FAT
>>
>>d) ext2
>>
>>e) ReiserFS
>>
> 
> I believe the ones that know what a filesystem is will answer ReiserFS.
> You might get a lot of ext2 answers, just because that's what a lot of
> people think the linux filesystem is.
> 
> 
>>If you want to change reiserfs to use data journaling you must do which:
>>
>>a) reinstall the reiserfs package using rpm
>>
>>b) modify /etc/fs.conf
>>
>>c) reinstall the operating system from scratch, and select different 
>>options during the install this time
>>
>>d) reformat your reiserfs partition using mkreiserfs
>>
>>e) none of the above
>>
>>f) all of the above except e)
>>
> 
> These people won't be admins of systems big enough for the difference to
> matter.  data journaling is targeted at people with so much load they
> would have to buy more hardware to make up for it.  The new option
> lowers the price to performance ratio, which is exactly what we want to
> do for sendmails, egeneras, lycos, etc.  If it takes my laptop 20ms to
> deliver a mail message, cutting the time down to 10ms just won't matter.
> 
> 
>>
>>What do you think the chances are that you can convince Hubert that 
>>every SuSE Enterprise Edition user should be asked at install time if 
>>they are going to use fsync a lot on each partition, and to use a 
>>different fstab setting if yes?
>>
> 
> Very little, I might tell them to buy the suse email server instead,
> since that would have the settings done right.  data=journal is just a
> small part of mail server tuning.
> 
> 
>>I know that you are an experienced sysadmin who was good at it.  Your 
>>intuition tells you that most sysadmins are like the ones you were 
>>willing to hire into your group at the university.  They aren't.
>>
>>Linux needs to be like a telephone.  You plug it in, push buttons, and 
>>talk.  It works well, but most folks don't know why.
>>
>>
> 
> Exactly.  I think there are 3 classes of users at play here.
> 
> 1) Those who don't understand and don't have enough load to notice.
> 2) Those who don't understand and do have enough load to notice.
> 3) Those who do understand and do have enough load to notice.
> 
> #2 will buy support from someone, and they should be able to configure
> the thing right.
> 
> #3 will find the docs and do it right themselves.
> 
> 
>>A moderate number of programs are small fsync bound for the simple 
>>reason that it is simpler to write them that way.  We need to cover 
>>over their simplistic designs.
>>
>>So, you have my sympathies Chris, because I believe you that it makes 
>>the code uglier and it won't be a joy to code and test.  I hope you also 
>>see that it should be done.
>>
> 
> Mostly, I feel this kind of tuning is a mistake right now.  The patch is
> young and there are so many places left to tweak...I'm still at the
> stage where much larger improvements are possible, and a better use of
> coding time.  Plus, it's monday and it's always more fun to debate than
> give in on mondays.
> 
> -chris
> 


Hi, Chris & Hans!

D

Re: [reiserfs-list] Error Code 255 and "Permission Denied"

2002-05-06 Thread Chris Mason

On Mon, 2002-05-06 at 20:54, Manuel Krause wrote:
> On 05/06/2002 09:12 PM, Daniel Christiansen wrote:
> 
> > Everything seems to work by using the 3.x.1b version of reiserfsck with
> > --rebuild-tree.  Thank you very much, Oleg, for taking the time to solve
> > my problem.
> > 
> > I don't know anything about the "write cache enabled" issue below.  Is
> > this something I have to change with a jumper, a bios setting, or a
> > software configuration?
> > 

Most new IDE drives have this on by default.  You can turn it off with
hdparm -W 0, which will make you more able to withstand power failures.

> 
> Or is it somekind of connected to Chris Masons thread "2.4.19-pre7 / 
> corruption on unwanted reboot"??? If Chris and Jens found bugs on IDE 
> interaction with ReiserFS they should really put out a patch soon... ;-)
> 
> Chris M.? Is that related eventually? Just a doubt!

I haven't been able to reproduce problems after a crash with pre7, but
Dirk is not often wrong when he reports a bug.  If anyone can
reliably reproduce I'd be grateful.

-chris





Re: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread Chris Mason

On Mon, 2002-05-06 at 21:17, Manuel Krause wrote:
> On 05/07/2002 12:57 AM, Chris Mason wrote:
>
> 
> Hi, Chris & Hans!
> 
> Don't think this somekind of destructive discussion would lead to 
> anything useful for now, can you post a diff for 
> 2.4.19-pre7+latest-related-pending +compound-patch-from-ftp?
> 
> I'll try it and report if that leads to more security and/or less 
> performance on my every day use with NS6 and so on if there is any.

The current data logging patches are at:

ftp.suse.com/pub/people/mason/patches/data-logging

They are against 2.4.19-pre7, and contain versions of the major (stable)
speedups.  The patch is pretty big, so I'm not likely to merge with the
namesys pending directories.  The namesys guys add things frequently,
and I think it would get confusing for people trying to figure out which
patches to apply.

The data logging stuff is beta code.  If you have a good test bed where
it's ok if things go wrong I can make you a special patch with the
pending stuff merged.

-chris