What is the best file system for Lucene?

2004-11-30 Thread Sanyi
Hi!

I'm testing Lucene 1.4.2 on two very different configs, but with the same index.
I'm very surprised by the results: both systems search at about the same speed,
but I'd expect (and I really need) Lucene to run a lot faster on my stronger config.

Config #1 (a notebook):
WinXP Pro, NTFS, 1.8GHz Pentium-M, 768MB memory, 7200RPM winchester

Config #2 (a desktop PC):
SuSE 9.1 Pro, reiserfs, 3.0GHz P4 HT (virtually two 3.0GHz P4s), 3GByte RAM,
15000RPM U320 SCSI winchester

You can see that the hardware of #2 is at least twice as fast as #1.
I'm looking for the reason, and for a way to take advantage of the better
hardware compared to the poor notebook.
Currently #2 doesn't dramatically outperform the notebook (#1).

The question is: what can be worse on #2 than on the poor notebook?

I can imagine only software problems.
Which are the software parts then?
1. The OS
Is SuSE 9.1 a LOT slower than WinXP Pro?
2. The file system
Is reiserfs a LOT slower than NTFS?

Regards,
Sanyi





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
> Interesting, what are your merge settings

Sorry, I didn't mention that I was talking about search performance.
I'm using the same, fully optimized index on both systems.
(I've generated both indexes with the same code from the same database, each on its own OS)

> which JDK are you using?

I'm using the same Sun JDK on both systems.
I've tried so far:
j2sdk1.4.2_04, _05 and _06.
I didn't notice speed differences between these update releases.
Do you know about significant speed differences between them I should notice?

> Have you tried with hyperthreading turned off on #2?

No, but I will try it if the problem isn't in the file system.
I hope that the reason for the slowness is reiserfs, because it is the easiest to change.

What file systems are you people using Lucene on? And what are your experiences?

Regards,
Sanyi







Re: What is the best file system for Lucene?

2004-11-30 Thread sg
> What file systems are you people using Lucene on? And what are your
> experiences?

http://www.apple.com/xsan/

Actually it is a beta version and has some small issues, but it is very fast
and easy to manage once you get it installed.
The installation itself is tricky, since it depends heavily on your network
setup and needs well-working DNS, routing, etc.
However, it is fast as the wind. :-)

HTH
Stefan




Re: What is the best file system for Lucene?

2004-11-30 Thread Pete Lewis
Hi Sanyi

Could you try XP on your desktop?  That would take some variables out.  The
problem is that you are comparing OSes, as well as filesystems, as well as
different hardware configs.

Also, unless you take your hyperthreading off, with just one index you are
searching with just one half of the CPU - so your desktop is actually using
a 1.5GHz CPU for the search.  So, taking account of this, it's not too
surprising that they are searching at comparable speeds.

HTH
Pete




Re: What is the best file system for Lucene?

2004-11-30 Thread John Haxby

Sanyi wrote:
> I'm testing Lucene 1.4.2 on two very different configs, but with the same index.
> I'm very surprised by the results: both systems search at about the same speed,
> but I'd expect (and I really need) Lucene to run a lot faster on my stronger config.
>
> Config #1 (a notebook):
> WinXP Pro, NTFS, 1.8GHz Pentium-M, 768MB memory, 7200RPM winchester
>
> Config #2 (a desktop PC):
> SuSE 9.1 Pro, reiserfs, 3.0GHz P4 HT (virtually two 3.0GHz P4s), 3GByte RAM,
> 15000RPM U320 SCSI winchester
>
> You can see that the hardware of #2 is at least twice as fast as #1.
> I'm looking for the reason, and for a way to take advantage of the better
> hardware compared to the poor notebook.
> Currently #2 doesn't dramatically outperform the notebook (#1).

How large is the index?   If it's less than a couple of GByte then it 
will be entirely in memory after you've done a few searches on the Linux 
box.  You can force it into memory by cat'ing all the index files to 
/dev/null a couple of times (cat * > /dev/null).   A 3GHz system should 
now perform dramatically faster than a 1.5GHz system no matter what the 
file system. (And it's still 3GHz whether or not hyperthreading is 
turned on -- hyperthreading simply makes use of some under-used silicon 
to give you somewhere between 1 and 2 CPUs.  In some pathological cases 
it can give you less than one CPU, but I don't think Lucene falls into 
that category.  And it's going to be a helluva lot faster than any 
Pentium M because it has a nice healthy cache.)
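
Where a shell is not handy, the cat trick can be approximated from Java. The sketch below reads every regular file in an index directory once and throws the bytes away, so that the OS has a chance to keep them in its page cache. The names IndexWarmer and readAll are made up for this illustration and are not part of Lucene, and the code assumes a current JDK rather than 1.4.2. Whether the data actually stays cached is up to the OS, and it can only help if the index fits into free memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: reads every regular file in a directory once and discards the bytes. */
final class IndexWarmer {

    /** Returns the total number of bytes read from the regular files directly inside indexDir. */
    static long readAll(Path indexDir) {
        byte[] buffer = new byte[1 << 16];
        long total = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip subdirectories and special files
                }
                try (InputStream in = Files.newInputStream(file)) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        total += n; // the data itself is ignored, only the read matters
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // The index directory is passed as the first argument; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        long start = System.nanoTime();
        long bytes = readAll(dir);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("read " + bytes + " bytes in " + millis + " ms");
    }
}
```

Running it twice in a row and comparing the two elapsed times gives a rough hint of whether the second pass was served from memory.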

However, I don't believe that the hardware, OS or file system have 
anything to do with it.   Normally if you're seeing similar performance 
on widely differing platforms you're seeing latency somewhere else.   
For example (and this is only an example) looking up a hostname in the 
DNS will take about the same time on almost any machine you can get hold of.

You don't say how you're measuring search performance and you don't say 
what you're seeing.   Also, what's the load on the system while you're 
running the tests?   gkrellm on Linux is very useful as an overall view 
-- are you CPU bound, are you seeing lots of disk traffic?   Is the 
system actually more-or-less idle?
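
To make the measurement question concrete, here is a minimal sketch of a timing harness that runs the same query several times after a few untimed warm-up runs and reports the median rather than a single run. SearchTimer, time and median are hypothetical names, and the Runnable is only a stand-in for whatever code performs the actual Lucene search, which is not shown here. It assumes a current JDK.

```java
import java.util.Arrays;

/** Hypothetical timing harness: warm-up runs first, then timed runs, then the median. */
final class SearchTimer {

    /** Runs the task `warmups` times untimed, then `runs` times timed; returns sorted nanoseconds. */
    static long[] time(Runnable task, int warmups, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.run(); // lets the JIT and the OS caches settle before measuring
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        return nanos;
    }

    /** Median of an already sorted, non-empty array. */
    static long median(long[] sorted) {
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace the body with the real search call.
        Runnable fakeSearch = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable");
            }
        };
        long[] nanos = time(fakeSearch, 5, 21);
        System.out.println("fastest: " + nanos[0] / 1_000 + " us");
        System.out.println("median:  " + median(nanos) / 1_000 + " us");
        System.out.println("slowest: " + nanos[nanos.length - 1] / 1_000 + " us");
    }
}
```

Reporting the fastest, median and slowest run side by side makes it easier to tell a cold first run (disk bound) from steady-state runs (more likely CPU bound).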

jch



Re: What is the best file system for Lucene?

2004-11-30 Thread Justin Swanhart
On Tue, 30 Nov 2004 12:07:46 -, Pete Lewis [EMAIL PROTECTED] wrote:
> Also, unless you take your hyperthreading off, with just one index you are
> searching with just one half of the CPU - so your desktop is actually using
> a 1.5GHz CPU for the search.  So, taking account of this it's not too
> surprising that they are searching at comparable speeds.
>
> HTH
> Pete

Actually, that isn't how hyperthreading works.  The second CPU in a
hyperthreaded system should only run threads when the main cpu is
waiting on another task, like a memory access.  The second, or sub CPU
is only a virtual processor.  There aren't really two chips on board. 
New multicore processors will actually have more than one processor 
in one chip.

Problems can arise when you are using a HT processor on an operating
system that doesn't know about HT technology.  The OS should only
schedule jobs to run on the sub CPU under very specific circumstances.
 This is one of the major reasons for the scheduler overhaul in Linux
2.6.  The default scheduler in 2.4 would assign threads to the sub CPU
that shouldn't have been, and those threads would suffer from resource
starvation.




Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
> Could you try XP on your desktop

Sure, but I'll only do that if I run out of ideas.

> so your desktop is actually using
> a 1.5GHz CPU for the search.

No, this is not true. It uses a 3.0GHz P4 then.
(HT means that you virtually have two 3.0GHz P4s)

So, it is still surprising to me.

Regards,
Sanyi







Re: AW: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
> The notebook is quite good, e.g. the Pentium-M might be faster than
> your Pentium 4. At least it has a similar speed, because of its better
> internal design. Never compare CPUs of different types by their
> frequency.

OK, this might be true, but:

All of my other tests where the CPU is involved run a LOT faster on the desktop PC with
the 3GHz P4.
Even other Java parts run a LOT faster (nearly twice as fast).
So we can't even say that the Java VM takes no advantage of the 3GHz P4 compared to the 1.8GHz
Pentium-M.
Everything is a LOT faster, except searching with Lucene (which is also faster, but only
slightly).

> Maybe your index is small enough to fit into the cache provided by the
> operating systems. So you wouldn't recognize any difference between your
> hard disks.

It is a 3GByte index and I always reboot between tests, so caching is not the issue.

> I don't think so. I'm using Windows 2000 Pro and SuSE 9.0 and
> (from my memory) Linux seems to be slightly faster, but I can't
> provide any benchmark now.

Are you using reiserfs with SuSE?

Regards,
Sanyi






Re: What is the best file system for Lucene?

2004-11-30 Thread Justin Swanhart
As a generalisation, SuSE itself is not a lot slower than Windows XP. 
I also very much doubt that filesystem is a factor.  If you want to
test w/out filesystem involvement, simply load your index into a
RAMDirectory instead of using FSDirectory.  That precludes filesystem
overhead in searches.

There are quite a number of factors involved that could be affecting
performance.

First off, 1.8GHz Pentium-M machines are supposed to run at about the
speed of a 2.4GHz machine.  The clock speeds on the mobile chips are
lower, but they tend to perform much better than rated.   I recommend
you take a general benchmark of both machines testing both disk speed
and CPU speed to get a baseline performance comparison.  I also
suggest turning off HT for your benchmarks and performance testing.

Secondly, while the second machine appears to be twice as fast, the
disk could actually perform slower on the Linux box, especially if the
notebook drive has a big (8M) cache like most 7200RPM ATA disk drives
do.  I imagine that if you hit the index with lots of simultaneous
searches, the Linux box would hold its own for much longer than
the XP box simply due to the random seek performance of the SCSI disk
combined with SCSI command queueing.
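
The simultaneous-search scenario can be tried out with a small load driver. The following is only a sketch under assumptions: ConcurrentLoad and run are hypothetical names, the Runnable stands in for one search against the shared index (not shown), and it relies on java.util.concurrent from a current JDK, which is not available on 1.4.2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical load driver: several threads issue the same query repeatedly. */
final class ConcurrentLoad {

    /** Runs `query` perThread times on each of `threads` threads; returns elapsed milliseconds. */
    static long run(Runnable query, int threads, int perThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<>();
        long start = System.nanoTime();
        try {
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(() -> {
                    for (int i = 0; i < perThread; i++) {
                        query.run();
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for each worker and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdownNow();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Placeholder query; replace with a real search against the shared index.
        Runnable fakeSearch = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable");
            }
        };
        for (int threads = 1; threads <= 8; threads *= 2) {
            long millis = run(fakeSearch, threads, 50);
            System.out.println(threads + " thread(s): " + millis + " ms for " + (threads * 50) + " queries");
        }
    }
}
```

Comparing how the elapsed time grows with the thread count on each machine would show whether the SCSI disk really holds up better under concurrent load.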

RAM speed is a factor too.  Is the P4 a Xeon processor?  The older HT
Xeons have a much slower bus than the newer P4-M processors.  Memory
speed will be affected accordingly.

I haven't heard of a hard disk referred to as a winchester disk in a
very long time :)

Once you have an idea of how the two machines actually compare
performance-wise, you can then judge how they perform index
operations.  Until then, all your measurements are subjective and you
don't gain much by comparing the two indexing processes.

Justin




Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
> How large is the index?   If it's less than a couple of GByte then it
> will be entirely in memory

It is 3GBytes and it will grow a lot.
I have to search from the HDD, which is very fast compared to the notebook's HDD.

Average seek time:
Notebook: 8-9ms
Desktop: 3.9ms

Data read:
Notebook: max. ~20MBytes/sec
Desktop: 60-80MBytes/sec

So, if the bottleneck is the HDD, the search has to be 2x-3x faster on the desktop system.
Unless reiserfs is a lot slower than NTFS.
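
The 2x-3x expectation can be checked with a back-of-the-envelope model: time is roughly the number of seeks times the seek time, plus the data volume divided by the transfer rate. In the sketch below only the drive figures come from the numbers above (8.5ms and 70MBytes/sec are taken as midpoints of the quoted ranges). The per-query seek count and data volume are invented for illustration, and DiskEstimate and estimateMillis are hypothetical names.

```java
/** Hypothetical cost model: random seeks plus sequential transfer. */
final class DiskEstimate {

    /** Estimated milliseconds for `seeks` random seeks plus `megabytes` of sequential reading. */
    static double estimateMillis(int seeks, double megabytes, double seekMillis, double mbPerSec) {
        return seeks * seekMillis + megabytes / mbPerSec * 1000.0;
    }

    public static void main(String[] args) {
        // Assumed query profile: not measured, chosen only to illustrate a seek-heavy search.
        int seeks = 200;
        double megabytes = 2.0;

        double notebook = estimateMillis(seeks, megabytes, 8.5, 20.0); // 8-9ms seek, ~20MBytes/sec
        double desktop = estimateMillis(seeks, megabytes, 3.9, 70.0);  // 3.9ms seek, 60-80MBytes/sec

        System.out.printf("notebook: %.0f ms, desktop: %.0f ms, ratio: %.2fx%n",
                notebook, desktop, notebook / desktop);
    }
}
```

For a seek-heavy profile like this one the ratio lands a little above 2x, close to the seek-time ratio. A transfer-heavy profile would move it toward the 3x-4x suggested by the throughput figures. Either way, a purely disk-bound search should be clearly faster on the desktop.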

> For example (and this is only an example) looking up a hostname in the
> DNS will take about the same time on almost any machine you can get hold of.

OK, but I have very simple and pure tests and everything is measured part by part.
...and every part speeds up a lot on the desktop system, except the Lucene search part.

> You don't say how you're measuring search performance and you don't say
> what you're seeing.

I call my Java program from the command line on both systems, like:
search hello
Then it searches for hello and collects the elapsed milliseconds between every call.
Then it displays the results. It is very simple.

> Also, what's the load on the system while you're
> running the tests?   gkrellm on Linux is very useful as an overall view
> -- are you CPU bound, are you seeing lots of disk traffic?   Is the
> system actually more-or-less idle?

Thanks for the hint. Since my search only fetches 30 hits, it completes too quickly to let me
monitor it in real time.
Anyway, if reiserfs proves to be fast enough, I'll look for other reasons and will run
longer tests for real-time monitoring.

Regards,
Sanyi






Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
> simply load your index into a
> RAMDirectory instead of using FSDirectory.

I have 3GByte RAM and my index is currently 3GByte. (It'll soon be about 4GByte.)
So, I have to find this out another way.

> First off, 1.8GHz Pentium-M machines are supposed to run at about the
> speed of a 2.4GHz machine.  The clock speeds on the mobile chips are
> lower, but they tend to perform much better than rated.   I recommend
> you take a general benchmark of both machines testing both disk speed
> and cpu speed to get a baseline performance comparison.

I think it is a good general benchmark that almost everything runs at least twice as fast on the
3.0GHz P4, except the Lucene search.

I can add one more interesting piece of info:
I have a MySQL table with ~20 million records.
When I run a DROP INDEX on that table, MySQL rebuilds the whole huge table into a temp file.
It completes in 30 minutes on both systems.
Again, it doesn't matter that the 15kRPM U320 HDD is 2x-3x as fast.
Very surprising again.
Hmm... reiserfs must be very, very slow, or I'm completely lost :)

> I also suggest turning off HT for your benchmarks and performance testing.

I'll try this later and I really hope it won't be the reason.

> Secondly, while the second machine appears to be twice as fast, the
> disk could actually perform slower on the Linux box, especially if the
> notebook drive has a big (8M) cache like most 7200RPM ata disk drives
> do.

Both drives have 8M cache.

> I imagine that if you hit the index with lots of simultaneous
> searches, that the Linux box would hold its own for much longer than
> the XP box simply due to the random seek performance of the scsi disk
> combined with scsi command queueing.

Are you saying that SCSI command queuing wastes more time than a 15kRPM 3.9ms HDD can gain over a
7.2kRPM 8-9ms HDD?
It sounds terrible and I hope it isn't true.

> RAM speed is a factor too.  Is the p4 a xeon processor?  The older HT
> xeons have a much slower bus than the newer p4-m processors.  Memory
> speed will be affected accordingly.

It is not a Xeon, just a P4 3.0GHz HT.

> I haven't heard of a hard disk referred to as a winchester disk in a
> very long time :)

;)

> Once you have an idea of how the two machines actually compare
> performance-wise, you can then judge how they perform index
> operations.

Lucene indexing completes in 13-15 hours on the desktop system, while it completes in about 29-33
hours on the notebook.

Now, combine that with the DROP INDEX tests completing in the same amount of time on both, and find
out why the search is only slightly faster :)

> Until then, all your measurements are subjective and you
> don't gain much by comparing the two indexing processes.

I'm worried about searching. Indexing is a lot faster on the desktop config.

Regards,
Sanyi







RE: What is the best file system for Lucene?

2004-11-30 Thread Armbrust, Daniel C.
You may want to give the IBM JVM a try - I've found it faster in some cases...

http://www-106.ibm.com/developerworks/java/jdk/linux140/


Dan 




RE: What is the best file system for Lucene?

2004-11-30 Thread Armbrust, Daniel C.
As I understand hyperthreading, this is not true: 

> Also, unless you take your hyperthreading off, with just one index you are
> searching with just one half of the CPU - so your desktop is actually using
> a 1.5GHz CPU for the search.

You still have the full speed of the processor available - the processor itself 
just keeps switching between different threads of execution.  Some people have 
noted that some (single threaded) applications will run 5-10% slower when 
hyperthreading is turned on - but that depends on the app.  It certainly won't 
be running at half speed.

Dan




Re: What is the best file system for Lucene?

2004-11-30 Thread Otis Gospodnetic
Hello,

> Lucene indexing completes in 13-15 hours on the desktop system while
> it completes in about 29-33
> hours on the notebook.
>
> Now, combine it with the DROP INDEX tests completing in the same
> amount of time on both and find
> out why is the search only slightly faster :)
>
> > Until then, all your measurements are subjective and you
> > don't gain much by comparing the two indexing processes.
>
> I'm worried about searching. Indexing is a lot faster on the desktop
> config.

This tells you that your problem is not the disk itself, and not the
filesystem.  The bottleneck is elsewhere.

Why not run your search under a profiler?  That will tell you where the
JVM is spending its time.  It may even be in some weird InetAddress
call, like another person already pointed out.

Otis





Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
Thanks to you all for the replies.
I was looking for someone with the same experience as mine, but it seems that I'll have to
test this myself.
I'll try out my ideas and the most interesting ideas from you guys.

Regards,
Sanyi


