Re: Faking index merge by modifying segments file?

2005-11-02 Thread Paul Elschot
On Wednesday 02 November 2005 12:47, Otis Gospodnetic wrote:
> Hello,
> 
> --- Paul Elschot <[EMAIL PROTECTED]> wrote:
...
> 
> > It's possible to share segments between indexes when the file system
> > allows files to be present in multiple directories.
> 
> Oh, are you saying that I could just leave segments where they are and
> use something like symlinks to point to them from a new index?
> 
> e.g.
> A: 
> B: 
> C: 
>
>

Yes, see here (Doug on updating the technorati indexes):
http://www.opensubscriber.com/message/lucene-user@jakarta.apache.org/803308.html

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Faking index merge by modifying segments file?

2005-11-02 Thread Robert Engels
They only need to be sorted if segA and segB were combined, so in your case
this is not needed.

I am not sure that what you are describing is any different from how
MultiReader works, and it does not need to perform any file copying or
linking.

Just create the new index. Write the documents. And open all indexes using a
MultiReader? Maybe I am missing something, but I see that as a simple way of
doing what you are trying to do.





Re: Faking index merge by modifying segments file?

2005-11-02 Thread David Balmain
> This sounds like it should be possible, except for docId clashes - if
> index A had a document with Id 100 and index B also has a document with
> Id 100, after my index file copying, index C will end up having 2
> documents with Id 100, and that won't work.  So, documents in C would
> have to be renumbered (re-assigned Ids), as they get renumbered during
> optimization, but without rewriting all index files in index C.
>
> Does this sound right?
>

As Paul Elschot already mentioned, the document ids aren't stored in
the index. The document id is really just the position of the document
in the segment, and doc ids for the whole index are created dynamically by
the index reader. So no renumbering is necessary.
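
A minimal sketch of that offsetting, in plain Java rather than actual Lucene
code. The class DocNumberMapper and its method names are hypothetical; the
only thing taken from the thread is the idea that each segment numbers its
documents from zero and the reader adds a per-segment base:

```java
/** Toy model of how a multi-segment reader derives index-wide document numbers. */
final class DocNumberMapper {
    // starts[i] is the index-wide number of the first document of segment i;
    // the last entry holds the total number of documents.
    private final int[] starts;

    DocNumberMapper(int[] segmentDocCounts) {
        starts = new int[segmentDocCounts.length + 1];
        for (int i = 0; i < segmentDocCounts.length; i++) {
            starts[i + 1] = starts[i] + segmentDocCounts[i];
        }
    }

    int maxDoc() {
        return starts[starts.length - 1];
    }

    /** Index-wide number of document 'local' inside segment 'segment'. */
    int globalDoc(int segment, int local) {
        int size = starts[segment + 1] - starts[segment];
        if (local < 0 || local >= size) {
            throw new IllegalArgumentException("no document " + local + " in segment " + segment);
        }
        return starts[segment] + local;
    }

    /** Segment that holds index-wide document number n. */
    int segmentOf(int n) {
        if (n < 0 || n >= maxDoc()) {
            throw new IllegalArgumentException("no such document: " + n);
        }
        int segment = 0;
        // Skip every segment that ends at or before n (this also skips empty segments).
        while (starts[segment + 1] <= n) {
            segment++;
        }
        return segment;
    }

    /** Segment-local number of index-wide document number n. */
    int localDoc(int n) {
        return n - starts[segmentOf(n)];
    }
}
```

With two segments of 101 documents each, "document 100" of the first segment
keeps number 100 and "document 100" of the second becomes 201, so nothing on
disk has to be rewritten.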

> Also, I may not need to actually copy/move files around, if I just make
> use of sym/hard links.
>

Sure, as long as the other index isn't being updated any more. One
thing to note though. As well as making sure that none of the
filenames between the indexes clash, you'll have to make sure you
adjust the counter variable in SegmentInfos so that new files won't
clash with existing ones.
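
A rough sketch of picking a safe counter value. It assumes that segment names
have the form "_" followed by the counter in base 36, which is how I recall
IndexWriter generating them; treat that naming scheme, and the class
SegmentCounterSketch itself, as assumptions to check against your Lucene
version (names such as segA from the example earlier would not parse):

```java
import java.util.List;

final class SegmentCounterSketch {
    /** Parses the base-36 number after the leading underscore, e.g. "_a" gives 10. */
    static int counterOf(String segmentName) {
        if (!segmentName.startsWith("_")) {
            throw new IllegalArgumentException("unexpected segment name: " + segmentName);
        }
        return Integer.parseInt(segmentName.substring(1), Character.MAX_RADIX);
    }

    /** Smallest counter value that is larger than every existing segment's number. */
    static int nextSafeCounter(List<String> segmentNames) {
        int max = -1;
        for (String name : segmentNames) {
            max = Math.max(max, counterOf(name));
        }
        return max + 1;
    }

    /** Name a new segment would get for the given counter value. */
    static String nameFor(int counter) {
        return "_" + Integer.toString(counter, Character.MAX_RADIX);
    }
}
```

The combined index would store nextSafeCounter(...) over the segment names of
all source indexes, so that the next segment written cannot reuse a name.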

Regards,
Dave





Re: Faking index merge by modifying segments file?

2005-11-02 Thread Otis Gospodnetic
Hello,

--- Paul Elschot <[EMAIL PROTECTED]> wrote:

> On Tuesday 01 November 2005 08:51, Otis Gospodnetic wrote:
> > Hello,
> >
> > I spent most of today talking to some people about Lucene, and one of
> > them said how they would really like to have an "instantaneous index
> > merge", and how he is thinking he could achieve that by simply opening
> > segments file of one index, and adding segment names of the other
> > index/indices, plus adjusting the segment size (SegSize in
> > fileformats.html), thus creating a single (but unoptimized) index.
> >
> > Any reactions to that?
> >
> > I imagine this isn't quite that simple to implement, as one would have
> > to renumber all documents, in order to avoid having multiple documents
> > with the same document id.
> >
> > Can anyone think of any other problems with this approach, or perhaps
> > offer ideas for possible document renumbering?
>
> Document numbers within segments are determined dynamically in the
> index reader, so these should not be a problem. Each segment simply
> numbers its documents from zero.

Uh, and I always thought they were stored in the index.  Aren't they
stored in the .fdx and .fdt files?  And shouldn't they also be linked
from some place?  I see a mention of document numbers in the information
about the .frq file.

> Iirc the segment names determine the order
> of the segments for an index reader.
> 
> I think creating a new index by adding segments from an existing one
> should be fairly straightforward. Some care will be needed to avoid
> clashes in the segment names.

You mean ensuring that segment _x from index A doesn't clash with _x
from index B?  Segment names are written only in the segments file, I
believe, so I think if I detect that _x is already taken, I could
simply rename it to something (e.g. _foo) that hasn't been taken yet,
and remember to use that segment name when writing the segments file.
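
A small sketch of that bookkeeping. Everything here is hypothetical
(SegmentListMerge, Entry, the "_" + base-36 naming); it only models the list
of (segment name, document count) pairs that would go into the new segments
file, and the returned rename map would still have to be applied to the
files on disk, which carry the segment name in their file names:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SegmentListMerge {
    /** One entry of a segments file: segment name plus its document count. */
    static final class Entry {
        final String name;
        final int docCount;

        Entry(String name, int docCount) {
            this.name = name;
            this.docCount = docCount;
        }
    }

    /**
     * Appends the segments of index B to those of index A. A segment of B whose
     * name is already used by A gets a fresh name; 'renamed' records old -> new.
     * Assumes names are unique within each input list.
     */
    static List<Entry> merge(List<Entry> a, List<Entry> b, Map<String, String> renamed) {
        Set<String> namesInA = new HashSet<>();
        for (Entry e : a) {
            namesInA.add(e.name);
        }
        Set<String> taken = new HashSet<>(namesInA);
        for (Entry e : b) {
            taken.add(e.name);
        }

        List<Entry> merged = new ArrayList<>(a);
        int counter = 0;
        for (Entry e : b) {
            String name = e.name;
            if (namesInA.contains(name)) {
                // Try candidate names until one is unused in both indexes.
                do {
                    name = "_" + Integer.toString(counter++, Character.MAX_RADIX);
                } while (!taken.add(name));
                renamed.put(e.name, name);
            }
            merged.add(new Entry(name, e.docCount));
        }
        return merged;
    }

    /** Total document count of a segment list. */
    static int totalDocs(List<Entry> segments) {
        int total = 0;
        for (Entry e : segments) {
            total += e.docCount;
        }
        return total;
    }
}
```

Whether a rename is safe for every file type is a separate question; Kevin
Oliver's note elsewhere in this thread about compound files suggests it is
not.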

> Also what should happen with
> the index from which the segments are taken? Should the shared
> segments be copied between indexes?

I can simply destroy the original index once I've created a fake-merged
one.  I'm not sure what you mean by shared segments.  If I have
two indices, A and B, then each of them will have its own set of
segments with no segments in common.

> It's possible to share segments between indexes when the file system
> allows files to be present in multiple directories.

Oh, are you saying that I could just leave segments where they are and
use something like symlinks to point to them from a new index?

e.g.
A: 
B: 
C: 
   
   

?

Thanks,
Otis





RE: Faking index merge by modifying segments file?

2005-11-02 Thread Otis Gospodnetic
Hello,

--- Robert Engels <[EMAIL PROTECTED]> wrote:

> Problem is the terms need to be sorted in a single segment.

Are you referring to Term Dictionary (.tis and .tii files as described
at http://lucene.apache.org/java/docs/fileformats.html )?  If so, is
that really true?

I don't have an optimized Lucene multi-file index handy to look at, but
.tis and .tii files are "per segment" files, so wouldn't a set of .tis
and .tii files from multiple indices be equivalent to a set of .tis and
.tii files from multiple segments of a single index?

For example, if we have two indices, A and B, both optimized, we have:

A: segA.tis   (this may contain terms bar and foo)
   segA.tii
   ...
   segments   (this would list segA)

B: segB.tis   (this may contain terms piggy and bank)
   segB.tii
   ...
   segments   (this would list segB)

Wouldn't that be the same as a single index, say index C:

C: segA.tis   (this may contain terms bar and foo)
   segA.tii
   segB.tis   (this may contain terms piggy and bank)
   segB.tii
   ...
   segments   (this would list segments segA and segB)
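
To illustrate why per-segment sorting may be enough, here is a plain-Java
sketch (not Lucene code; TermMergeSketch is a made-up name) in which each
segment keeps its own sorted term list and a reader merges the lists on the
fly with a priority queue:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TermMergeSketch {
    /** Merges per-segment sorted term lists into one sorted list without duplicates. */
    static List<String> mergedTerms(List<List<String>> segmentTerms) {
        // Each queue entry is {segment index, position in that segment's term list},
        // ordered by the term it currently points at.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
            Comparator.comparing((int[] c) -> segmentTerms.get(c[0]).get(c[1])));
        for (int s = 0; s < segmentTerms.size(); s++) {
            if (!segmentTerms.get(s).isEmpty()) {
                queue.add(new int[] {s, 0});
            }
        }

        List<String> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] cursor = queue.poll();
            List<String> terms = segmentTerms.get(cursor[0]);
            String term = terms.get(cursor[1]);
            // The same term may occur in several segments; emit it once.
            if (out.isEmpty() || !out.get(out.size() - 1).equals(term)) {
                out.add(term);
            }
            if (cursor[1] + 1 < terms.size()) {
                queue.add(new int[] {cursor[0], cursor[1] + 1});
            }
        }
        return out;
    }
}
```

With segA holding (bar, foo) and segB holding (bank, piggy), the merged view
is bank, bar, foo, piggy even though neither segment's files were touched.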


That is really what I am talking about: take all index files of index A
and all index files of index B and stick them in a new index dir for
a new index C.  Then open segments files of index A and index B, pull
out segment names and other information from there, and write a new
segments file with that information in index dir for that new index C.

This sounds like it should be possible, except for docId clashes - if
index A had a document with Id 100 and index B also has a document with
Id 100, after my index file copying, index C will end up having 2
documents with Id 100, and that won't work.  So, documents in C would
have to be renumbered (re-assigned Ids), as they get renumbered during
optimization, but without rewriting all index files in index C.

Does this sound right?

Also, I may not need to actually copy/move files around, if I just make
use of sym/hard links.
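
The linking step could look roughly like this, using java.nio.file (which
needs a newer JDK than the ones current as of this thread). LinkIndexFiles is
a hypothetical helper; skipping only the file named "segments" is an
assumption, since the new index gets its own, and falling back to a copy is
just one possible choice when the file system has no hard links:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class LinkIndexFiles {
    /**
     * Hard-links every regular file of sourceIndexDir, except "segments",
     * into targetIndexDir and returns how many files were linked or copied.
     */
    static int linkSegmentFiles(Path sourceIndexDir, Path targetIndexDir) throws IOException {
        Files.createDirectories(targetIndexDir);
        int linked = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(sourceIndexDir)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                if (!Files.isRegularFile(file) || name.equals("segments")) {
                    continue;
                }
                Path link = targetIndexDir.resolve(name);
                try {
                    // Fails with FileAlreadyExistsException if the name clashes.
                    Files.createLink(link, file);
                } catch (UnsupportedOperationException e) {
                    // No hard-link support here: fall back to a plain copy.
                    Files.copy(file, link);
                }
                linked++;
            }
        }
        return linked;
    }
}
```

Calling it once per source index, with the same target directory, leaves the
originals in place; a file name clash between the sources surfaces as an
exception instead of silently overwriting a segment file.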

Thanks,
Otis







RE: Faking index merge by modifying segments file?

2005-11-01 Thread Robert Engels
The solution we came up with is (I think) a bit better, since it does
not require any copying of files.

Since MultiSegmentReader already does the segment/document # offsetting, and
a segment does not change after it is written, we created a reopen() method
that reopens an existing index (knowing which segments are the same, so those
do not need to be opened again).

For a general library, some sort of reference counting for an open segment
would need to be implemented. (Since our index management is behind another
layer this is not needed).
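
A toy model of the reference counting mentioned above, not the actual
implementation described in this message. ReopenSketch and SegmentHandle are
invented names, and a real version would wrap per-segment readers and need
synchronization:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReopenSketch {
    /** Stand-in for an expensive per-segment reader. */
    static final class SegmentHandle {
        final String name;
        int refCount;
        boolean closed;

        SegmentHandle(String name) {
            this.name = name;
        }
    }

    private final Map<String, SegmentHandle> open = new HashMap<>();
    int opens; // number of handles actually opened, for illustration only

    /** Returns handles for the given segments, reusing those that are already open. */
    List<SegmentHandle> acquire(List<String> segmentNames) {
        List<SegmentHandle> result = new ArrayList<>();
        for (String name : segmentNames) {
            SegmentHandle handle = open.get(name);
            if (handle == null) {
                handle = new SegmentHandle(name);
                open.put(name, handle);
                opens++;
            }
            handle.refCount++;
            result.add(handle);
        }
        return result;
    }

    /** Drops one reference per handle and closes handles nobody uses any more. */
    void release(List<SegmentHandle> handles) {
        for (SegmentHandle handle : handles) {
            if (--handle.refCount == 0) {
                handle.closed = true;
                open.remove(handle.name);
            }
        }
    }
}
```

A reopen is then acquire() on the new segment list followed by release() on
the old one: unchanged segments are shared, new ones are opened, and dropped
ones are closed once the last user lets go.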


RE: Faking index merge by modifying segments file?

2005-11-01 Thread Kevin Oliver
Hello Otis,

I worked on a similar issue a couple of months ago. I've included our
email conversation below.

Hopefully, your thread will prompt more interest from the mailing list. 

-Kevin



Sort of -- you can comment out both of them, but only within a very
controlled situation and with some hackery.

Here's what I had to do -- in pseudo code.

Create a new IndexWriter subclass, call it IndexWriter2, that gets its
segmentCounter initialized to the real (i.e. the actual pre-existing)
index's segmentCounter - 1. I suspect this part is not very robust, as I
don't completely understand why I needed to subtract 1 (it has something
to do with the temporary RAMDirectory that gets used before actually
getting written to disk).

Add the documents into our IndexWriter2 so that they get properly
written to a separate place on the file system and get the correct
"next" segment names that would appear in the real index.

Within the addIndexes() loop over the dirs, you move all the newly
created files from their current location over to the real index's file
directory. This part in particular feels very hacked.

Finally, instead of calling optimize() at the end of addIndexes(), you
rewrite the segments file so that it includes all these new segments. 
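
The move step might be sketched as follows. This is not the code described
above: StagedSegmentMover is a hypothetical helper, and it assumes that a
segment's files are named <segment>.<extension> and that the staging
directory's own "segments" file is left behind. It returns the segment names
that would then have to be appended to the real index's segments file:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class StagedSegmentMover {
    /** Moves all staged segment files into realIndexDir and returns the moved segment names. */
    static List<String> moveNewSegments(Path stagingDir, Path realIndexDir) throws IOException {
        // Collect first, so the directory is not modified while it is being listed.
        List<Path> toMove = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(stagingDir)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                if (Files.isRegularFile(file) && !name.equals("segments")) {
                    toMove.add(file);
                }
            }
        }
        // Check for clashes up front so a failure does not leave a half-moved segment.
        for (Path file : toMove) {
            if (Files.exists(realIndexDir.resolve(file.getFileName()))) {
                throw new IOException("name clash: " + file.getFileName());
            }
        }
        TreeSet<String> segmentNames = new TreeSet<>();
        for (Path file : toMove) {
            String name = file.getFileName().toString();
            Files.move(file, realIndexDir.resolve(name));
            int dot = name.indexOf('.');
            segmentNames.add(dot < 0 ? name : name.substring(0, dot));
        }
        return new ArrayList<>(segmentNames);
    }
}
```

The sketch deliberately stops before touching the real segments file, since
that is the fragile part and depends on the index format version in use.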


One other note is that if I wasn't using compound files, then I _think_
I could just rename all of the files when they get moved into the
real index's file directory. But, compound files create their internal
files using the segmentName that the segment was created with, thus
creating a mismatch when you rename it externally.

-Kevin


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 11, 2005 3:36 PM
To: java-dev@lucene.apache.org
Subject: Re: Avoiding segment merges during indexing

Kevin - are you saying that you can just comment out the 2 optimize()
calls and addIndexes(Directory[]) will keep working?  I don't recall
why there are optimize() calls again, but I know several people had
issues with it...

Otis

--- Kevin Oliver <[EMAIL PROTECTED]> wrote:

> This is a proposal that is in need of some insights.
>
> In an effort to speed up adding documents to an existing index, we are
> pursuing using IndexWriter.addIndexes(Directory[]). In theory this
> should work great -- you index your new documents into a new Directory,
> then add them into your existing directory, saving you the time spent
> merging segments that would be caused by the normal
> IndexWriter.addDocument(Document) calls during indexing.
>
> However, addIndexes() has the property that it calls optimize() both
> before and after adding the new directories. This wipes out the
> performance boost, and then some.
>
> So I found a way to work around this, but I don't like what I've had to
> do and I was wondering if anybody has any ideas on what could be done to
> make this more pleasant.
>
> It appears that by getting the new segment files into the existing
> directory, with the correct segment names, it will work without all of
> the optimize calls. Unfortunately, getting the segment names right and
> getting the files into the right location is a big ugly hack and is
> quite fragile.
>
> Is there a better way? I think maybe some explanation of why the 2
> optimizes are there would help my understanding. Is there a clean way of
> doing what I'm proposing? Is there some hidden catch I'm missing and
> I've been going down the wrong path?
>
> It seems to me this would be a great benefit to anyone who does indexing
> on existing indexes and wants it to be fast.
>
> Thanks,
> Kevin Oliver
> 







RE: Faking index merge by modifying segments file?

2005-11-01 Thread Robert Engels
Problem is the terms need to be sorted in a single segment.






Re: Faking index merge by modifying segments file?

2005-11-01 Thread Paul Elschot
On Tuesday 01 November 2005 08:51, Otis Gospodnetic wrote:
> Hello,
> 
> I spent most of today talking to some people about Lucene, and one of
> them said how they would really like to have an "instantaneous index
> merge", and how he is thinking he could achieve that by simply opening
> segments file of one index, and adding segment names of the other
> index/indices, plus adjusting the segment size (SegSize in
> fileformats.html), thus creating a single (but unoptimized) index.
> 
> Any reactions to that?
> 
> I imagine this isn't quite that simple to implement, as one would have
> to renumber all documents, in order to avoid having multiple documents
> with the same document id.
> 
> Can anyone think of any other problems with this approach, or perhaps
> offer ideas for possible document renumbering?

Document numbers within segments are determined dynamically in the
index reader, so these should not be a problem. Each segment simply numbers
its documents from zero. Iirc the segment names determine the order
of the segments for an index reader.

I think creating a new index by adding segments from an existing one should
be fairly straightforward. Some care will be needed to avoid
clashes in the segment names. Also what should happen with
the index from which the segments are taken? Should the shared segments
be copied between indexes?
It's possible to share segments between indexes when the file system allows
files to be present in multiple directories.

Regards,
Paul Elschot




