Re: Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

So I clustered this app:

I switched to clustering the RAMDirectory instead of the
IndexWriter, and it worked in my experiments. What I did was create a
new IndexWriter on document adds and a new IndexSearcher on document
queries.

What I want to know is: how non-standard is this approach?

Cheers,
Steve

On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> Interesting.
> I wonder, I have a notification mechanism at my disposal as well. I
> wonder if it could be worked out that, much like a mvc, an IndexReader
> could be notified when the underlying Directory has changed so that
> the reader can adjust itself?

Another little factor is that the IndexWriter must be closed before
the IndexReader is opened to see all the changes.

There is cost to opening and using a new IndexReader such as reading
the term index and the norms.  One would probably want to have some
sort of logic to limit how fast a new IndexReader was opened (which
diminishes the value of realtime updates to the underlying
IndexWriter).

It still should be doable though.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

> On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> > > My guess is that some segment of the world cares a lot about realtime
> > > coherent updates and some segment of the world needs blinding speed.
> > > Part of my research is to gather the expertise of this group on these
> > > issues.
> >
> > I hear ya...
> >
> > There is another part to the equation for Lucene though.
> > Coherent realtime updates to the IndexWriter/RamDirectory alone
> > doesn't get you all the way there since things are only readable
> > through an IndexReader that needs to be reopened to see changes.
> >
> >
> > -Yonik
> > http://incubator.apache.org/solr Solr, the open-source Lucene search server
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]

Re: Re: Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Oops, I made a change and didn't test it. Doh!
This should work better:

package org.apache.lucene.index;

/**
* Copyright 2004 The Apache Software Foundation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import java.io.IOException;
import java.util.Vector;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

final class SegmentInfos {

    /** The file format version, a negative number. */
    /* Works since counter, the old 1st entry, is always >= 0 */
    public static final int FORMAT = -1;

    public int counter = 0; // used to name new segments

    private Vector vector = new Vector();

    /**
     * counts how often the index has been changed by adding or deleting docs.
     * starting with the current time in milliseconds forces to create unique
     * version numbers.
     */
    private long version = System.currentTimeMillis();

    public final SegmentInfo info(int i) {
        return (SegmentInfo) vector.elementAt(i);
    }

    public final void read(Directory directory) throws IOException {

        IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
        try {
            int format = input.readInt();
            if (format < 0) { // file contains explicit format info
                // check that it is a format we can understand
                if (format < FORMAT)
                    throw new IOException("Unknown format version: " + format);
                version = input.readLong(); // read version
                counter = input.readInt(); // read counter
            } else { // file is in old format without explicit format info
                counter = format;
            }

            for (int i = input.readInt(); i > 0; i--) { // read segmentInfos
                SegmentInfo si = new SegmentInfo(input.readString(),
                        input.readInt(), directory);
                vector.addElement(si);
            }

            if (format >= 0) { // in old format the version number may be at
                               // the end of the file
                if (input.getFilePointer() >= input.length())
                    version = System.currentTimeMillis(); // old file format
                                                          // without version
                                                          // number
                else
                    version = input.readLong(); // read version
            }
        } finally {
            input.close();
        }
    }

    public final void write(Directory directory) throws IOException {
        IndexOutput output = directory.createOutput("segments.new");
        try {
            output.writeInt(FORMAT); // write FORMAT
            output.writeLong(++version); // every write changes the index
            output.writeInt(counter); // write counter
            output.writeInt(size()); // write infos
            for (int i = 0; i < size(); i++) {
                SegmentInfo si = info(i);
                output.writeString(si.name);
                output.writeInt(si.docCount);
            }
        } finally {
            output.close();
        }

        // install new segment info
        directory.renameFile("segments.new", IndexFileNames.SEGMENTS);
    }

    /**
     * version number when this SegmentInfos was generated.
     */
    public long getVersion() {
        return version;
    }

    /**
     * Current version number from segments file.
     */
    public static long readCurrentVersion(Directory directory)
            throws IOException {

        IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
        int format = 0;
        long version = 0;
        try {
            format = input.readInt();
            if

Re: Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

I don't know the list server's rules, but I figured I would just include
the text of the file I changed. If that is bad form, give me a heads-up
and I won't do it again.

Would this change break anything or bother anyone?

package org.apache.lucene.index;

/**
* Copyright 2004 The Apache Software Foundation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import java.io.IOException;
import java.util.Vector;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

final class SegmentInfos {

    /** The file format version, a negative number. */
    /* Works since counter, the old 1st entry, is always >= 0 */
    public static final int FORMAT = -1;

    public int counter = 0; // used to name new segments

    private Vector vector = new Vector();

    /**
     * counts how often the index has been changed by adding or deleting docs.
     * starting with the current time in milliseconds forces to create unique
     * version numbers.
     */
    private long version = System.currentTimeMillis();

    public final SegmentInfo info(int i) {
        return (SegmentInfo) vector.elementAt(i);
    }

    public final void read(Directory directory) throws IOException {

        IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
        try {
            int format = input.readInt();
            if (format < 0) { // file contains explicit format info
                // check that it is a format we can understand
                if (format < FORMAT)
                    throw new IOException("Unknown format version: " + format);
                version = input.readLong(); // read version
                counter = input.readInt(); // read counter
            } else { // file is in old format without explicit format info
                counter = format;
            }

            for (int i = input.readInt(); i > 0; i--) { // read segmentInfos
                SegmentInfo si = new SegmentInfo(input.readString(),
                        input.readInt(), directory);
                vector.addElement(si);
            }

            if (format >= 0) { // in old format the version number may be at
                               // the end of the file
                if (input.getFilePointer() >= input.length())
                    version = System.currentTimeMillis(); // old file format
                                                          // without version
                                                          // number
                else
                    version = input.readLong(); // read version
            }
        } finally {
            input.close();
        }
    }

    public final void write(Directory directory) throws IOException {
        IndexOutput output = directory.createOutput("segments.new");
        try {
            output.writeInt(FORMAT); // write FORMAT
            output.writeLong(++version); // every write changes the index
            output.writeInt(counter); // write counter
            output.writeInt(size()); // write infos
            for (int i = 0; i < size(); i++) {
                SegmentInfo si = info(i);
                output.writeString(si.name);
                output.writeInt(si.docCount);
            }
        } finally {
            output.close();
        }

        // install new segment info
        directory.renameFile("segments.new", IndexFileNames.SEGMENTS);
    }

    /**
     * version number when this SegmentInfos was generated.
     */
    public long getVersion() {
        return version;
    }

    /**
     * Current version number from segments file.
     */
    public static long readCurrentVersion(Directory directory)
            throws IOException {

        IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
        int

Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:

Interesting.
I wonder, I have a notification mechanism at my disposal as well. I
wonder if it could be worked out that, much like a mvc, an IndexReader
could be notified when the underlying Directory has changed so that
the reader can adjust itself?


Another little factor is that the IndexWriter must be closed before
the IndexReader is opened to see all the changes.

There is cost to opening and using a new IndexReader such as reading
the term index and the norms.  One would probably want to have some
sort of logic to limit how fast a new IndexReader was opened (which
diminishes the value of realtime updates to the underlying
IndexWriter).

It still should be doable though.
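
One way to picture the "limit how fast a new IndexReader is opened" logic is
the sketch below. It is only an illustration under assumptions:
ThrottledReopener, its opener callback and the injected clock are hypothetical
names, not Lucene API, and closing the superseded reader is left out.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of Lucene): caches an expensive-to-open
 * reader and reopens it at most once per minIntervalMillis.
 */
final class ThrottledReopener<R> {
    private final Supplier<R> opener;       // stands in for opening a new reader
    private final long minIntervalMillis;   // minimum time between two reopens
    private final LongSupplier clock;       // injected so the timing is testable
    private R current;
    private long lastOpenedAt;

    ThrottledReopener(Supplier<R> opener, long minIntervalMillis, LongSupplier clock) {
        this.opener = opener;
        this.minIntervalMillis = minIntervalMillis;
        this.clock = clock;
    }

    /**
     * Returns the cached reader. A new one is opened on first use, or when
     * the caller reports a change AND the minimum interval has elapsed.
     */
    synchronized R get(boolean indexChanged) {
        long now = clock.getAsLong();
        boolean intervalElapsed = now - lastOpenedAt >= minIntervalMillis;
        if (current == null || (indexChanged && intervalElapsed)) {
            current = opener.get();
            lastOpenedAt = now;
        }
        return current;
    }
}
```

The trade-off is the one described above: the longer the interval, the fewer
expensive reopens, and the staler the view that searches may see.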

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> > My guess is that some segment of the world cares a lot about realtime
> > coherent updates and some segment of the world needs blinding speed.
> > Part of my research is to gather the expertise of this group on these
> > issues.
>
> I hear ya...
>
> There is another part to the equation for Lucene though.
> Coherent realtime updates to the IndexWriter/RamDirectory alone
> doesn't get you all the way there since things are only readable
> through an IndexReader that needs to be reopened to see changes.
>
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Interesting.
I have a notification mechanism at my disposal as well. I wonder if it
could be worked out so that, much like in MVC, an IndexReader could be
notified when the underlying Directory has changed so that the reader
can adjust itself?
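
To make the idea concrete, here is a rough sketch of what such a notification
hook could look like. Everything in it is hypothetical: as far as this thread
shows, Lucene's Directory has no listener callback, so NotifyingDirectory,
DirectoryChangeListener and SelfAdjustingReader are toy stand-ins rather than
real Lucene classes.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical listener interface; an assumption, not a Lucene API. */
interface DirectoryChangeListener {
    void directoryChanged(long newVersion);
}

/** Toy stand-in for a directory that announces each committed change. */
final class NotifyingDirectory {
    private final List<DirectoryChangeListener> listeners = new CopyOnWriteArrayList<>();
    private long version;

    void addListener(DirectoryChangeListener listener) {
        listeners.add(listener);
    }

    /** Meant to be called once a writer has finished a batch of changes. */
    void commit() {
        long committed;
        synchronized (this) {
            committed = ++version;
        }
        for (DirectoryChangeListener listener : listeners) {
            listener.directoryChanged(committed);
        }
    }
}

/** Toy reader: a notification only marks it stale; it reopens lazily. */
final class SelfAdjustingReader implements DirectoryChangeListener {
    private volatile long latestVersion; // newest version announced so far
    private long openedVersion;          // version this reader currently reflects
    private int reopenCount;

    @Override
    public void directoryChanged(long newVersion) {
        latestVersion = newVersion;      // cheap: no reopening on the notifier's thread
    }

    /** Call before searching; returns true if the reader had to reopen. */
    synchronized boolean refreshIfStale() {
        if (openedVersion == latestVersion) {
            return false;
        }
        openedVersion = latestVersion;   // a real reader would reload its state here
        reopenCount++;
        return true;
    }

    synchronized int reopenCount() {
        return reopenCount;
    }
}
```

Deferring the reopen until the next search means several notifications in a
row collapse into a single reopen.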

Cheers,
Steve


On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> My guess is that some segment of the world cares a lot about realtime
> coherent updates and some segment of the world needs blinding speed.
> Part of my research is to gather the expertise of this group on these
> issues.

I hear ya...

There is another part to the equation for Lucene though.
Coherent realtime updates to the IndexWriter/RamDirectory alone
doesn't get you all the way there since things are only readable
through an IndexReader that needs to be reopened to see changes.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:

My guess is that some segment of the world cares a lot about realtime
coherent updates and some segment of the world needs blinding speed.
Part of my research is to gather the expertise of this group on these
issues.


I hear ya...

There is another part to the equation for Lucene though.
Coherent realtime updates to the IndexWriter/RAMDirectory alone
don't get you all the way there, since things are only readable
through an IndexReader that needs to be reopened to see changes.
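
The point-in-time behaviour can be illustrated with a toy model. This is not
Lucene code: ToyIndex and ToyReader are made-up classes, and the version
comparison only loosely mirrors the getVersion()/readCurrentVersion() pair in
the SegmentInfos source posted elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy index (hypothetical): a list of documents plus a version counter. */
final class ToyIndex {
    private final List<String> docs = new ArrayList<>();
    private long version;

    synchronized void add(String doc) {
        docs.add(doc);
        version++;                       // every write changes the version
    }

    synchronized long currentVersion() {
        return version;
    }

    /** Opens a reader over a copy of the documents as they are right now. */
    synchronized ToyReader openReader() {
        return new ToyReader(this, version, new ArrayList<>(docs));
    }
}

/** Point-in-time view: never sees documents added after it was opened. */
final class ToyReader {
    private final ToyIndex index;
    private final long openedVersion;
    private final List<String> snapshot;

    ToyReader(ToyIndex index, long openedVersion, List<String> snapshot) {
        this.index = index;
        this.openedVersion = openedVersion;
        this.snapshot = Collections.unmodifiableList(snapshot);
    }

    int numDocs() {
        return snapshot.size();
    }

    /** False once the index has been written to since this reader was opened. */
    boolean isCurrent() {
        return openedVersion == index.currentVersion();
    }
}
```

In this model a reader opened before an add keeps reporting the old document
count until a new reader is opened, however coherent the underlying storage is.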


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Good question. It may or may not be performant enough; only time (and
testing) will tell. My guess is that it will depend heavily on the
rate at which the data changes (or the read/write ratio).

Believe me, I'm not proposing that everyone go out and cluster Lucene
with Terracotta DSO. I'm really just playing, researching, learning.
I'm a firm believer in using the right tool for the right job and
would never claim that any product (especially one I wrote) is right
for everyone.

My guess is that some segment of the world cares a lot about realtime
coherent updates and some segment of the world needs blinding speed.
Part of my research is to gather the expertise of this group on these
issues.

Cheers,
Steve

On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

While automatically clustering java objects sure sounds cool, I have
to wonder what the performance ends up being.  Every small change to
the clustered objects is broadcast to all the nodes, correct?

Have you done any performance comparisons to see if this is a
practical approach for Lucene?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> Fair question.
>
> All I did, and all I need, was to take SegmentInfos and make it
> contain a Vector instead of subclassing Vector. It went from
> subclassing to aggregation. As far as I could tell from reading the
> code, it would make no difference to anyone and should have no
> performance impact (good or bad). It just allowed me to cluster the
> IndexWriter with a RAMDirectory.
>
> Maybe a little background would help. Our clustering product doesn't
> use java serialization and has no API. We just use a little config
> where one points us to what you want clustered and what java
> synchronization needs to be shared. One of the limitations that
> currently exists is that we don't support clustering subclasses of
> java collections.
>
> At this point I'm just experimenting to see if our product can cluster
> lucene in a useful/performant way. When my experimenting is complete,
> if everything is positive, I am going to write a blog on clustering
> lucene indexes but it would be awkward to do that if the people who
> run through the example have to change lucene code.
>
> Does this help?
>
> Cheers,
> Steve
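
For readers skimming the thread, the subclassing-to-aggregation change quoted
above boils down to the pattern below. It is a simplified sketch with
hypothetical class names and String elements, not the actual SegmentInfos
patch.

```java
import java.util.Vector;

/** Before: the holder IS a Vector, i.e. a subclass of a Java collection. */
final class InfosBySubclassing extends Vector<String> {
    String info(int i) {
        return elementAt(i);
    }
}

/** After: the holder HAS a Vector and forwards only what callers need. */
final class InfosByAggregation {
    private final Vector<String> vector = new Vector<>();

    String info(int i) {
        return vector.elementAt(i);
    }

    void addElement(String name) {
        vector.addElement(name);
    }

    int size() {
        return vector.size();
    }
}
```

Callers that only use info(), addElement() and size() cannot tell the two
apart; code that relied on the class being a Vector would need forwarding
methods added for whatever else it calls.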
