Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-20 Thread Doron Cohen
Doron Cohen wrote:
> Doug Cutting wrote:
> > > Therefore, a "semi compound" segment file can be defined, that would be
> > > made of 4 files (instead of 1):
> > > - File 0: .fdx .tis .tvx
> > > - File 1: .fdt .tii .tvd
> > > - File 2: .frq .tvf
> > > - File 3: .fnm .prx .fN
> >
> > I think this is a promising direction.  Perhaps instead of adding a
> > third index format, we can significantly improve the non-compound format
> > without too much effort.  For example, simply writing all the norms into
> > a single file could have a large impact on total file handles and would
> > be a rather simple change.  We could start with that, then see if there
> > are further incremental improvements to be had.
>
> We can start with that - at least it would set the number of segment files
> to a fixed number - 11 - currently it depends on the number of fields with
> norms.

Okay, started with this step - see issue 756
http://issues.apache.org/jira/browse/LUCENE-756

>
> One advantage of keeping a plain non-compound format is educational /
> debugging - it is often helpful to actually see the files being created on
> disk. (Although, just concatenating all norms to a single file is simple
> enough in that regard.)


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-19 Thread Otis Gospodnetic
Some work on NIO-based FSDirectory has already been done.  Some performance 
info is included, too:
http://issues.apache.org/jira/browse/LUCENE-519
http://issues.apache.org/jira/browse/LUCENE-414

Otis


----- Original Message -----
From: Doug Cutting <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Sunday, December 17, 2006 2:31:42 PM
Subject: Re: potential indexing performance improvement for compound index - cut 
IO - have more files though

Doron Cohen wrote:
> Also, if nio proves to be faster in this scenario, it might make sense to
> keep current FSDirectory, and just add FSDirectoryNio implementation.

If nio isn't considerably slower for single-threaded applications, I'd 
vote to simply switch FSDirectory to use nio, simplifying the public API 
by reducing choices.  But if classic io is faster for single-threaded 
apps, and nio faster for multi-threaded, that would suggest adding a 
new, public, nio-based Directory implementation.

Doug




Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-18 Thread robert engels
I think the important issues are index size, stability and number of  
concurrent readers.


We achieved the best performance by using a pool of file descriptors  
to a segment so we could avoid the synchronization block, but this  
only worked for large, relatively unchanging segments.
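
The pooling idea can be sketched roughly as follows. This is only an illustration under assumptions, not the code we ran and not Lucene code: the class name HandlePool and its methods are hypothetical. Each reader borrows its own handle, so the seek and the read need no shared lock.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical pool of read-only handles to one segment file (illustration only). */
class HandlePool implements AutoCloseable {
    private final BlockingQueue<RandomAccessFile> handles;

    HandlePool(Path file, int size) {
        handles = new ArrayBlockingQueue<>(size);
        try {
            for (int i = 0; i < size; i++) {
                handles.add(new RandomAccessFile(file.toFile(), "r"));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads len bytes at offset using a handle that no other thread holds. */
    byte[] read(long offset, int len) {
        RandomAccessFile raf;
        try {
            raf = handles.take(); // blocks only while every handle is borrowed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        try {
            byte[] buf = new byte[len];
            raf.seek(offset); // safe without a lock: this handle is private for now
            raf.readFully(buf);
            return buf;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            handles.add(raf); // give the handle back to the pool
        }
    }

    @Override
    public void close() {
        for (RandomAccessFile raf : handles) {
            try {
                raf.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Demo helper: writes a small temp file and reads a slice back through the pool. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("segment", ".bin");
            try {
                Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
                try (HandlePool pool = new HandlePool(tmp, 2)) {
                    return new String(pool.read(3, 4), StandardCharsets.US_ASCII);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cost is one open descriptor per pooled handle, which is why this only pays off for large segments that stay around.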



On Dec 18, 2006, at 2:51 PM, Doug Cutting wrote:


robert engels wrote:
Using a shared FileChannel.pread actually performs a  
synchronization under Windows.


Sigh.  Still, it'd be no worse than current FSDirectory on Windows.

Doug




Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-18 Thread Doug Cutting

robert engels wrote:
Using a shared FileChannel.pread actually performs a synchronization 
under Windows.


Sigh.  Still, it'd be no worse than current FSDirectory on Windows.

Doug




Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-18 Thread robert engels

A word of caution here...

Using a shared FileChannel.pread actually performs a synchronization  
under Windows.


See JDK bug http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734

I submitted this, and it was verified using the supplied test case.


On Dec 17, 2006, at 1:31 PM, Doug Cutting wrote:


Doron Cohen wrote:
Also, if nio proves to be faster in this scenario, it might make sense to
keep current FSDirectory, and just add FSDirectoryNio implementation.


If nio isn't considerably slower for single-threaded applications,  
I'd vote to simply switch FSDirectory to use nio, simplifying the  
public API by reducing choices.  But if classic io is faster for  
single-threaded apps, and nio faster for multi-threaded, that would  
suggest adding a new, public, nio-based Directory implementation.


Doug






Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-17 Thread Doug Cutting

Doron Cohen wrote:

Also, if nio proves to be faster in this scenario, it might make sense to
keep current FSDirectory, and just add FSDirectoryNio implementation.


If nio isn't considerably slower for single-threaded applications, I'd 
vote to simply switch FSDirectory to use nio, simplifying the public API 
by reducing choices.  But if classic io is faster for single-threaded 
apps, and nio faster for multi-threaded, that would suggest adding a 
new, public, nio-based Directory implementation.


Doug




Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-16 Thread Doron Cohen
Doug Cutting wrote:
> > Therefore, a "semi compound" segment file can be defined, that would be
> > made of 4 files (instead of 1):
> > - File 0: .fdx .tis .tvx
> > - File 1: .fdt .tii .tvd
> > - File 2: .frq .tvf
> > - File 3: .fnm .prx .fN
>
> I think this is a promising direction.  Perhaps instead of adding a
> third index format, we can significantly improve the non-compound format
> without too much effort.  For example, simply writing all the norms into
> a single file could have a large impact on total file handles and would
> be a rather simple change.  We could start with that, then see if there
> are further incremental improvements to be had.

We can start with that - at least it would set the number of segment files
to a fixed number - 11 - currently it depends on the number of fields with
norms.

One advantage of keeping a plain non-compound format is educational /
debugging - it is often helpful to actually see the files being created on
disk. (Although, just concatenating all norms to a single file is simple
enough in that regard.)







Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-16 Thread Doron Cohen
Doug Cutting wrote:
> Doug Cutting wrote:
> > Yes.  On 32-bit systems with indexes larger than 1GB or so, memory
> > mapping is impractical, so synchronization is required around shared
> > file handles (using Java's classic i/o APIs, w/o pread).  The
> > non-compound format, with more files, has fewer synchronization
> > bottlenecks.  One could of course achieve the same improvements in other
> > ways, e.g., by pooling multiple IndexReaders per index, but in straight
> > A-to-B comparisons, folks see better throughput with non-compound
> > indexes for multi-threaded applications.
>
> On second thought, a good fix for this might be to simply convert
> FSDirectory to use nio's pread support, eliminating file handle
> synchronization even when mmap isn't used.

Comparing the two for a small index (100,000 docs of the Reuters
collection, index size 170MB) showed no evident search performance
advantage for non-compound. For 300 parallel searches that also traversed
the docs, compound was faster. But this is a small index, not in the 1GB
range, and search was fast anyhow.

I think it would make sense to first verify the advantage of nio over
classic io in this multi-reader scenario with a synthetic benchmark.

Also, if nio proves to be faster in this scenario, it might make sense to
keep the current FSDirectory, and just add an FSDirectoryNio implementation.





Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-16 Thread Doug Cutting

Doug Cutting wrote:
Yes.  On 32-bit systems with indexes larger than 1GB or so, memory 
mapping is impractical, so synchronization is required around shared 
file handles (using Java's classic i/o APIs, w/o pread).  The 
non-compound format, with more files, has fewer synchronization 
bottlenecks.  One could of course achieve the same improvements in other 
ways, e.g., by pooling multiple IndexReaders per index, but in straight 
A-to-B comparisons, folks see better throughput with non-compound 
indexes for multi-threaded applications.


On second thought, a good fix for this might be to simply convert 
FSDirectory to use nio's pread support, eliminating file handle 
synchronization even when mmap isn't used.
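
To make the idea concrete, here is a minimal sketch of a positional read with nio. It is not FSDirectory code; the class PositionalRead and its methods are hypothetical names used only for illustration. The one real API it relies on is FileChannel.read(ByteBuffer, long), which reads at an explicit offset and leaves the channel's own position alone, so threads sharing the channel need no seek-then-read lock in Java code.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper showing a pread-style read through a shared FileChannel. */
class PositionalRead {

    /** Reads len bytes starting at offset without moving the channel's position. */
    static byte[] readAt(FileChannel channel, long offset, int len) {
        ByteBuffer buf = ByteBuffer.allocate(len);
        try {
            while (buf.hasRemaining()) {
                // Positional read: the offset is passed explicitly on every call.
                int n = channel.read(buf, offset + buf.position());
                if (n < 0) {
                    throw new EOFException("read past end of file");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.array();
    }

    /** Demo helper: writes a small temp file and reads three bytes at offset 6. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("segment", ".bin");
            try {
                Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
                try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                    return new String(readAt(ch, 6, 3), StandardCharsets.US_ASCII);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether this actually removes contention depends on how the JDK implements the call on a given platform, so it would need measuring.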


Doug




Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-16 Thread Doug Cutting

Doug Cutting wrote:
> I'm not yet convinced that the costs of this mid-point justify its
> benefits.

That was too negative.  Let me try a more positive angle.

Doron Cohen wrote:

Therefore, a "semi compound" segment file can be defined, that would be
made of 4 files (instead of 1):
- File 0: .fdx .tis .tvx
- File 1: .fdt .tii .tvd
- File 2: .frq .tvf
- File 3: .fnm .prx .fN


I think this is a promising direction.  Perhaps instead of adding a 
third index format, we can significantly improve the non-compound format 
without too much effort.  For example, simply writing all the norms into 
a single file could have a large impact on total file handles and would 
be a rather simple change.  We could start with that, then see if there 
are further incremental improvements to be had.
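
As a rough sketch of the single norms file idea, assume one norm byte per document per field and store the fields back to back. The layout and the class name SingleNormsFile are hypothetical, chosen only for illustration, and real code would also need some header or versioning.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical layout: the norms of all fields concatenated into one file. */
class SingleNormsFile {

    /** Writes each field's norms (one byte per document) one field after another. */
    static byte[] write(byte[][] normsPerField) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] fieldNorms : normsPerField) {
            out.write(fieldNorms, 0, fieldNorms.length);
        }
        return out.toByteArray();
    }

    /** Field f's norms occupy bytes [f * maxDoc, (f + 1) * maxDoc) of the file. */
    static byte[] read(byte[] file, int field, int maxDoc) {
        return Arrays.copyOfRange(file, field * maxDoc, (field + 1) * maxDoc);
    }
}
```

With this layout a segment needs one file handle for norms no matter how many fields have them, and a field's norms can still be located with simple offset arithmetic.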


Doug




Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-16 Thread Doug Cutting

Marvin Humphrey wrote:
Out of curiosity, does the non-compound format yield any search-time 
benefits?


Yes.  On 32-bit systems with indexes larger than 1GB or so, memory 
mapping is impractical, so synchronization is required around shared 
file handles (using Java's classic i/o APIs, w/o pread).  The 
non-compound format, with more files, has fewer synchronization 
bottlenecks.  One could of course achieve the same improvements in other 
ways, e.g., by pooling multiple IndexReaders per index, but in straight 
A-to-B comparisons, folks see better throughput with non-compound 
indexes for multi-threaded applications.


Doug




Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-16 Thread Marvin Humphrey


On Dec 15, 2006, at 2:04 PM, Otis Gospodnetic wrote:

I think Doron is right on the money here.  I know one "customer"  
who'd be happy to trade its file descriptors for less IO -  
simpy.com.  It's exactly what Doron describes - a busy system with  
a LOT of indices.  File descriptors are kept under control by  
carefully closing IndexSearchers, plus I can always increase the  
max open-files limit.  What I can't easily increase is the disk  
IO.  Sure, I could go from CFS to the multi-file format, but it  
would be nice to have that third, middle ground choice.


Out of curiosity, does the non-compound format yield any search-time
benefits?  I would think that would be the case only if the system-level
file stream feeding the buffered IndexInput objects were
maintaining its own (unnecessary) buffers.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-16 Thread Doug Cutting

Otis Gospodnetic wrote:

I think Doron is right on the money here.  I know one "customer" who'd be happy 
to trade its file descriptors for less IO - simpy.com.  It's exactly what Doron describes 
- a busy system with a LOT of indices.  File descriptors are kept under control by 
carefully closing IndexSearchers, plus I can always increase the max open-files limit.  
What I can't easily increase is the disk IO.  Sure, I could go from CFS to the multi-file 
format, but it would be nice to have that third, middle ground choice.


The problem is that adding that middle ground isn't free: it will 
complicate the code and make it harder to maintain and evolve.  If you 
have good control over file handles, then non-compound format should 
work just fine, no?


I'm not yet convinced that the costs of this mid-point justify its 
benefits.  Perhaps the changes are simpler than I imagine.  Perhaps it 
can be done very simply and elegantly with little impact on the code. 
If so, then my concerns will be reduced.


Doug




Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-15 Thread Otis Gospodnetic
I think Doron is right on the money here.  I know one "customer" who'd be happy 
to trade its file descriptors for less IO - simpy.com.  It's exactly what Doron 
describes - a busy system with a LOT of indices.  File descriptors are kept 
under control by carefully closing IndexSearchers, plus I can always increase 
the max open-files limit.  What I can't easily increase is the disk IO.  Sure, 
I could go from CFS to the multi-file format, but it would be nice to have that 
third, middle ground choice.

Otis

----- Original Message -----
From: Doron Cohen <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Friday, December 15, 2006 2:55:41 PM
Subject: Re: potential indexing performance improvement for compound index - cut 
IO - have more files though

"Mike Klaas" <[EMAIL PROTECTED]> wrote:
>
> My main comment is that the benefits of this change can be achieved by
> using the non-compound index format.  For people that care about the
> difference in performance, it isn't difficult to configure your system
> to mitigate the problems of the non-compound format, and they probably
> have already done so.
>
> It would help the people who are file-descriptor conscious, but it
> also increases lucene's fd footprint by a factor of four.

That's right - people worried about indexing performance can easily apply
setUseCompoundFile(false).

My guess though is that most people just keep the default setting.

Large systems that maintain many indexes would be worried about the number
of file descriptors and would use the compound format. But it is not clear to
me what the preference would be in such systems - four times the file
descriptors, or twice the IO?  If such a third choice is supported
- "semi compound" - how many systems would {be able to / choose to} use
it? That probably depends on the specific system.

I verified the IO factor, by counting bytes read in
FSIndexInput.readInternal(byte[],int,int) and written in
FSIndexOutput.flushBuffer(byte[],int):

 round  vect   stor   cmpnd  runCnt  recsPerRun  rec/s  elapsedSec  write  read
     0  true   true   true        1          10  153.4      651.74  2 GB   1.9 GB
     1  true   true   false       1          10  169.5      589.82  1 GB   0.9 GB
     2  false  false  true        1          10  151.4      660.41  2 GB   1.9 GB
     3  false  false  false       1          10  168.0      595.37  1 GB   0.9 GB

Indeed, there is a factor of two for both read bytes and written bytes.






Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-15 Thread Doron Cohen
"Mike Klaas" <[EMAIL PROTECTED]> wrote:
>
> My main comment is that the benefits of this change can be achieved by
> using the non-compound index format.  For people that care about the
> difference in performance, it isn't difficult to configure your system
> to mitigate the problems of the non-compound format, and they probably
> have already done so.
>
> It would help the people who are file-descriptor conscious, but it
> also increases lucene's fd footprint by a factor of four.

That's right - people worried about indexing performance can easily apply
setUseCompoundFile(false).

My guess though is that most people just keep the default setting.

Large systems that maintain many indexes would be worried about the number
of file descriptors and would use the compound format. But it is not clear to
me what the preference would be in such systems - four times the file
descriptors, or twice the IO?  If such a third choice is supported
- "semi compound" - how many systems would {be able to / choose to} use
it? That probably depends on the specific system.

I verified the IO factor, by counting bytes read in
FSIndexInput.readInternal(byte[],int,int) and written in
FSIndexOutput.flushBuffer(byte[],int):

 round  vect   stor   cmpnd  runCnt  recsPerRun  rec/s  elapsedSec  write  read
     0  true   true   true        1          10  153.4      651.74  2 GB   1.9 GB
     1  true   true   false       1          10  169.5      589.82  1 GB   0.9 GB
     2  false  false  true        1          10  151.4      660.41  2 GB   1.9 GB
     3  false  false  false       1          10  168.0      595.37  1 GB   0.9 GB

Indeed, there is a factor of two for both read bytes and written bytes.






Re: potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-15 Thread Mike Klaas

On 12/14/06, Doron Cohen <[EMAIL PROTECTED]> wrote:


But anyhow, this is not a negligible difference, and for really large
indexes and busy systems, when the just-written non-compound segment is
not in the system caches, it might have more effect. Possibly, search
performance during indexing would be improved by less indexing IO. Also,
the delay for an addDocument call that triggers a merge should become smaller.

Thanks for your comments, also (but not only) on (1) and (3) above.


My main comment is that the benefits of this change can be achieved by
using the non-compound index format.  For people that care about the
difference in performance, it isn't difficult to configure your system
to mitigate the problems of the non-compound format, and they probably
have already done so.

It would help the people who are file-descriptor conscious, but it
also increases lucene's fd footprint by a factor of four.

-Mike




potential indexing performance improvement for compound index - cut IO - have more files though

2006-12-15 Thread Doron Cohen

Hi,

I would like to propose and get feedback on a potential indexing
performance improvement for the case where the compound file format is used
(this is the default).

In compound segment mode, each merge operation ends by writing a
compound file. To be more precise, the merge result is first written to the
directory as a non-compound segment, and then it is 'converted' into a
compound segment file. This conversion involves reading the entire
non-compound segment and writing it as a compound segment file. This means
that compound mode indexing does twice as much writing as
non-compound mode (and there is also the reading of the non-compound
segment).
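
The accounting behind "twice as much writing" can be sketched as below. This is a simplified model, not Lucene code: the class CompoundWriteCost is hypothetical, segment files are plain byte arrays, and any header or directory bytes of the compound file are ignored.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;

/** Hypothetical model of the two-step compound write (illustration only). */
class CompoundWriteCost {

    /** Returns {bytesWritten, bytesRead} for a merge that ends in a compound file. */
    static long[] twoStep(Map<String, byte[]> segmentFiles) {
        long written = 0;
        long read = 0;

        // Step 1: the merge writes every per-segment file separately.
        for (byte[] data : segmentFiles.values()) {
            written += data.length;
        }

        // Step 2: each file is read back and appended to the compound file.
        ByteArrayOutputStream compound = new ByteArrayOutputStream();
        for (byte[] data : segmentFiles.values()) {
            read += data.length;
            compound.write(data, 0, data.length);
        }
        written += compound.size();

        return new long[] {written, read};
    }
}
```

For a segment of N bytes this model gives 2N bytes written and N bytes read, against N written and none read when step 2 is skipped.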

The reason for this two-step process in writing compound segment files is
that per-segment files cannot be written sequentially, one by one - several
files are created together and written interleaved.

But I think that there is an intermediate state - between
one-compound-segment-file and non-compound-many-files.

To my understanding, at merge time, the following apply:
- .fnm - field infos - independent of other files.
- .fdx .fdt - store fields - interleaved with each other, independent of
other files.
- .tis .tii .frq .prx - dictionary and postings - interleaved with each
other, independent of other files.
- .tvx .tvd .tvf - term vectors - interleaved with each other, independent
of other files.
- .fN - norms - all these files written sequentially, independent of other
files.

Therefore, a "semi compound" segment file can be defined, that would be
made of 4 files (instead of 1):
- File 0: .fdx .tis .tvx
- File 1: .fdt .tii .tvd
- File 2: .frq .tvf
- File 3: .fnm .prx .fN

A merge should be able to write this segment representation at once, - no
need to read and write again.
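
The constraint behind this grouping can be checked mechanically: no two extensions that are written interleaved may land in the same physical file. The sketch below is illustrative only; the class SemiCompoundLayout is a hypothetical name, and it assumes that the groups themselves are written one after another, so members of different groups may share a file.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical checker for the proposed 4-file layout (illustration only). */
class SemiCompoundLayout {

    /** Extensions within one group are written interleaved with each other. */
    static final String[][] INTERLEAVED_GROUPS = {
        {".fnm"},
        {".fdx", ".fdt"},
        {".tis", ".tii", ".frq", ".prx"},
        {".tvx", ".tvd", ".tvf"},
        {".fN"}
    };

    /** The proposed assignment of extensions to physical files 0..3. */
    static final String[][] PROPOSED_FILES = {
        {".fdx", ".tis", ".tvx"},
        {".fdt", ".tii", ".tvd"},
        {".frq", ".tvf"},
        {".fnm", ".prx", ".fN"}
    };

    /** True if every extension is placed and no interleaved pair shares a file. */
    static boolean isValid(String[][] groups, String[][] files) {
        Map<String, Integer> fileOf = new HashMap<>();
        for (int f = 0; f < files.length; f++) {
            for (String ext : files[f]) {
                fileOf.put(ext, f);
            }
        }
        for (String[] group : groups) {
            Set<Integer> used = new HashSet<>();
            for (String ext : group) {
                Integer f = fileOf.get(ext);
                if (f == null || !used.add(f)) {
                    return false; // unplaced, or two interleaved files collide
                }
            }
        }
        return true;
    }
}
```

The largest interleaved group (.tis .tii .frq .prx) has four members, which is why fewer than 4 physical files cannot satisfy the constraint.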

A few questions:
(1) is this correct at all, or have I overlooked something?
(2) what performance gain would that buy?
(3) is it reasonable to have 4 files per segment compared to 1 file per
segment?

For (2), the indexing performance of non-compound is an upper bound. I
compared indexing speeds of compound and non-compound, using the Reuters
input set. Tried with stored+vectors, and without stored fields:

 round  vect   stor   cmpnd  runCnt  recsPerRun  rec/s  elapsedSec
     0  true   true   true        1       21578  150.2      143.69
     1  true   true   false       1       21578  178.9      120.58
     2  false  false  true        1       21578  164.7      131.03
     3  false  false  false       1       21578  184.3      117.07

This is a 19% speed-up with stored+vectors, and 12% speed-up with no stored
fields.
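
For reference, the percentages follow from the rec/s column. A small sketch of the arithmetic, with SpeedUp as a hypothetical helper name:

```java
/** Hypothetical helper: relative throughput gain of one run over another. */
class SpeedUp {

    /** Percentage gain in records per second of the new run over the base run. */
    static double percentGain(double baseRecsPerSec, double newRecsPerSec) {
        return (newRecsPerSec / baseRecsPerSec - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Rounds 0 and 1: stored fields and term vectors, compound vs. non-compound.
        System.out.printf("stored+vectors: %.1f%%%n", percentGain(150.2, 178.9));
        // Rounds 2 and 3: no stored fields, compound vs. non-compound.
        System.out.printf("no stored fields: %.1f%%%n", percentGain(164.7, 184.3));
    }
}
```

The two results round to 19% and 12% respectively.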

As a side comment, it says something about IO vs. CPU in Lucene indexing that
cutting (I think) half of the file output speeds things up by less than 20%.

But anyhow, this is not a negligible difference, and for really large
indexes and busy systems, when the just-written non-compound segment is
not in the system caches, it might have more effect. Possibly, search
performance during indexing would be improved by less indexing IO. Also,
the delay for an addDocument call that triggers a merge should become smaller.

Thanks for your comments, also (but not only) on (1) and (3) above.
Doron

