Re: Clustering IndexWriter?

2006-09-21 Thread adasal

Don't be coy, what's your company?
Adam

On 21/09/06, Steve Harris [EMAIL PROTECTED] wrote:


Warning, I'm a vendor dude but this isn't really a vendor message.

My IT guy had mentioned to me that a bunch of the open source products
we use (JIRA, JForum etc) have Lucene inside, and in the name of eating
our own dog food I tried to cluster IndexWriter (with a RAMDirectory)
using our (Terracotta) clustering technology.

Took me about a half hour to get the basics working from download
time. I was wondering, do people in the real world want to be able to
cluster this stuff? Is clustering the IndexWriter really all I need to do?

If it is interesting, how do I feed a small code change back into the
project? We don't yet support subclasses of collections, and
SegmentInfos subclasses Vector. I just turned it into aggregation
(that took 10 of the 30 minutes). We will support this in a future
release, so it isn't a huge deal, but I could get something out sooner
if the change was made.

Cheers,
Steve

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Clustering IndexWriter?

2006-09-21 Thread Vic Bancroft

adasal wrote:


Don't be coy, what's your company?


This URL is derivable from the text, with a little search engine help . . .
 http://www.terracottatech.com/terracotta_spring.shtml

more,
l8r,
v


--
The future is here. It's just not evenly distributed yet.
-- William Gibson, quoted by Whitfield Diffie





Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Yep, that's us. No secret, just didn't want to make my question a
billboard :-). Just needed a bit of info from the people who know
best.
Cheers,
steve




Re: Clustering IndexWriter?

2006-09-21 Thread Yonik Seeley

On 9/20/06, Steve Harris [EMAIL PROTECTED] wrote:

Is clustering the IndexWriter really all I need to do?


Hi Steve,
Could you explain the details of what clustering really means in this context?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Sure,

I'm fairly new to Lucene but what I was trying to do was make it so
that an index could be shared among multiple nodes. If an index is
updated in any way it would be updated across the cluster coherently.
In my first version I was really only taking advantage of the fact
that we detect fine grained changes and can extend synchronization
across the cluster but if I can prove to myself that this is actually
useful I'll go back and mark some of the synchronize blocks/methods as
read locks to improve concurrency and reduce instrumentation to only
what is needed.
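
The read-lock idea can be illustrated in plain Java. This is only a sketch under assumptions: it uses the JDK's ReentrantReadWriteLock rather than any clustering configuration, and SharedSegmentList is a hypothetical class, not part of Lucene. It shows why treating read-only methods as read locks improves concurrency: readers share the lock while writers stay exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical shared structure for illustration; not a Lucene class.
class SharedSegmentList {
    private final List<String> names = new ArrayList<String>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Writers take the exclusive lock, much as a synchronized method would.
    void add(String name) {
        lock.writeLock().lock();
        try {
            names.add(name);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Readers take the shared lock, so they do not block one another.
    int size() {
        lock.readLock().lock();
        try {
            return names.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With plain synchronized methods, two threads calling size() would serialize; here they can proceed together, and only add() excludes everyone else.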

If I'm going to be able to publish the config for what I'm doing I
would need to change that one class that I mentioned above because we
won't support subclasses of collections for a few more months.

I'm not a very good writer. Does any of that make sense?

Summary would be:
Goals:
Usefully cluster Lucene indexes across multiple nodes.

Questions:
Is this useful in the real world?
Would it be possible to get that one small thing changed?

Cheers,
Steve




Re: Re: Clustering IndexWriter?

2006-09-21 Thread Otis Gospodnetic
I don't fully follow, and I don't even have the "it's late!" excuse.  It sounds 
like you want to have the same index on multiple nodes in the cluster and when 
a data change occurs, you want to synchronously make the same change to all 
indices in your cluster.  Is that it?

Solr has a different approach.  There, only the master index is modified, while 
slave servers copy the master index periodically.

Otis




Re: Re: Clustering IndexWriter?

2006-09-21 Thread Chris Hostetter

: Questions:
: Is this useful in the real world
: Would it be  possible to get that one small thing changed.

I'm not really clear on what the small thing is that you are asking
about ... you mentioned SegmentInfos subclassing Vector, are you proposing
an alternative?  If you've got a patch that doesn't break existing
functionality or hurt performance, and that makes Lucene more usable in
some way, it would certainly be considered ... I'm just not really clear
on what change you're suggesting and how it helps make Lucene more usable
for you.





-Hoss





Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Fair question.

All I did/needed was to take SegmentInfos and, instead of subclassing
Vector, make it contain a Vector. It went from subclassing to aggregation.
As far as I could tell from reading the code, it would make no difference
to anyone and should have no performance impact (good or bad). It just
allowed me to cluster the IndexWriter with a RAMDirectory.
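
A minimal sketch of the subclassing-to-aggregation change, using hypothetical stand-in classes (SegmentEntry, InfosBySubclassing, InfosByAggregation) rather than the real Lucene sources. The forwarding methods shown are assumptions about which Vector calls a caller would need.

```java
import java.util.Vector;

// Hypothetical stand-in for a segment descriptor; only the name matters here.
class SegmentEntry {
    final String name;
    SegmentEntry(String name) { this.name = name; }
}

// Before: the list type IS a Vector, so every Vector method is inherited.
class InfosBySubclassing extends Vector<SegmentEntry> {
    SegmentEntry info(int i) { return elementAt(i); }
}

// After: the list type HAS a Vector and forwards only the calls it needs.
class InfosByAggregation {
    private final Vector<SegmentEntry> vector = new Vector<SegmentEntry>();

    SegmentEntry info(int i) { return vector.elementAt(i); }
    void addElement(SegmentEntry e) { vector.addElement(e); }
    int size() { return vector.size(); }
}
```

The trade-off is that any Vector method callers previously relied on (addElement, size, and so on) has to be forwarded explicitly by the aggregating class.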

Maybe a little background would help. Our clustering product doesn't
use Java serialization and has no API. We just use a little config
where you point us to what you want clustered and what Java
synchronization needs to be shared. One of the limitations that
currently exists is that we don't support clustering subclasses of
Java collections.

At this point I'm just experimenting to see if our product can cluster
lucene in a useful/performant way. When my experimenting is complete,
if everything is positive, I am going to write a blog on clustering
lucene indexes but it would be awkward to do that if the people who
run through the example have to change lucene code.

Does this help?

Cheers,
Steve




Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Good question. It may or may not be performant enough. Only time (and
testing) will tell. My guess is that it will depend heavily on the
rate at which the data changes (or the read/write ratio).

Believe me, I'm not proposing that everyone go out and cluster lucene
with terracotta dso. I'm really just playing, researching, learning.
I'm a firm believer in using the right tool for the right job and
would never claim that any product (especially one I wrote) is right
for everyone.

My guess is that some segment of the world cares a lot about realtime
coherent updates and some segment of the world needs blinding speed.
Part of my research is to gather the expertise of this group on these
issues.

Cheers,
Steve

On 9/21/06, Yonik Seeley [EMAIL PROTECTED] wrote:

While automatically clustering java objects sure sounds cool, I have
to wonder what the performance ends up being.  Every small change to
the clustered objects is broadcast to all the nodes, correct?

Have you done any performance comparisons to see if this is a
practical approach for Lucene?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server





Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Steve Harris [EMAIL PROTECTED] wrote:

My guess is that some segment of the world cares a lot about realtime
coherent updates and some segment of the world needs blinding speed.
Part of my research is to gather the expertise of this group on these
issues.


I hear ya...

There is another part to the equation for Lucene though.
Coherent realtime updates to the IndexWriter/RAMDirectory alone
don't get you all the way there, since things are only readable
through an IndexReader that needs to be reopened to see changes.
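
The point-in-time behaviour described here can be shown with a toy model. ToyIndex and ToyReader are hypothetical classes invented for illustration and are not Lucene APIs; the sketch only assumes that a reader sees the state of the index as of the moment it was opened.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: these are not Lucene classes.
class ToyIndex {
    private final List<String> docs = new ArrayList<String>();

    synchronized void add(String doc) { docs.add(doc); }

    // A reader receives a copy of the documents present at open time.
    synchronized ToyReader openReader() {
        return new ToyReader(new ArrayList<String>(docs));
    }
}

class ToyReader {
    private final List<String> snapshot;

    ToyReader(List<String> snapshot) {
        this.snapshot = Collections.unmodifiableList(snapshot);
    }

    // Counts only the documents visible in this reader's snapshot.
    int numDocs() { return snapshot.size(); }
}
```

A reader opened before an add keeps reporting the old document count; only a reader opened afterwards sees the new document, however coherently the underlying store is shared.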


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Interesting.
I have a notification mechanism at my disposal as well. I wonder if
it could be worked out that, much like in MVC, an IndexReader could be
notified when the underlying Directory has changed so that the reader
can adjust itself?
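
A rough sketch of that notification idea as a plain observer pattern. Everything here is hypothetical (StoreListener, ObservableStore, RefreshingView); it does not claim that Lucene's Directory or IndexReader offer such hooks, only how a store could tell its views that it has changed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface, named for illustration only.
interface StoreListener {
    void storeChanged(long version);
}

// Hypothetical store that notifies listeners after each committed change.
class ObservableStore {
    private final List<StoreListener> listeners = new ArrayList<StoreListener>();
    private long version = 0;

    void addListener(StoreListener listener) { listeners.add(listener); }

    void commit() {
        version++;
        for (StoreListener listener : listeners) {
            listener.storeChanged(version);
        }
    }
}

// Hypothetical view that records the latest version it was told about.
class RefreshingView implements StoreListener {
    long seenVersion = 0;
    int refreshCount = 0;

    public void storeChanged(long version) {
        seenVersion = version;
        refreshCount++; // a real view would refresh its state here
    }
}
```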

Cheers,
Steve





Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Steve Harris [EMAIL PROTECTED] wrote:

Interesting.
I wonder, I have a notification mechanism at my disposal as well. I
wonder if it could be worked out that, much like a mvc, an IndexReader
could be notified when the underlying Directory has changed so that
the reader can adjust itself?


Another little factor is that the IndexWriter must be closed before
the IndexReader is opened to see all the changes.

There is cost to opening and using a new IndexReader such as reading
the term index and the norms.  One would probably want to have some
sort of logic to limit how fast a new IndexReader was opened (which
diminishes the value of realtime updates to the underlying
IndexWriter).
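
One way to sketch the rate-limiting logic: a small helper that allows a reopen only if a minimum interval has passed since the last one. ReopenThrottle and its method names are hypothetical, and the caller is assumed to pass in the current time in milliseconds.

```java
// Hypothetical helper: decides whether enough time has passed to reopen a reader.
class ReopenThrottle {
    private final long minIntervalMillis;
    private long lastReopenMillis;
    private boolean opened = false;

    ReopenThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Returns true, and records the time, if a reopen is allowed at nowMillis.
    synchronized boolean tryReopen(long nowMillis) {
        if (opened && nowMillis - lastReopenMillis < minIntervalMillis) {
            return false; // too soon: keep serving from the current reader
        }
        opened = true;
        lastReopenMillis = nowMillis;
        return true;
    }
}
```

A caller would check tryReopen(System.currentTimeMillis()) when it learns of a change and keep the existing reader when the answer is false, trading freshness for lower reopen cost.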

It still should be doable though.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server





Re: Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

I don't know the list server's rules, but I figured I would just include
the text of the file I changed. If that is bad form, give me a heads up
and I won't do it again.

Would this change break anything or bother anyone?

package org.apache.lucene.index;

/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;
import java.util.Vector;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

final class SegmentInfos {

    /** The file format version, a negative number. */
    /* Works since counter, the old 1st entry, is always >= 0 */
    public static final int FORMAT = -1;

    public int counter = 0; // used to name new segments

    private Vector vector = new Vector();

    /**
     * counts how often the index has been changed by adding or deleting docs.
     * starting with the current time in milliseconds forces to create unique
     * version numbers.
     */
    private long version = System.currentTimeMillis();

    public final SegmentInfo info(int i) {
        return (SegmentInfo) vector.elementAt(i);
    }

    public final void read(Directory directory) throws IOException {

        IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
        try {
            int format = input.readInt();
            if (format < 0) { // file contains explicit format info
                // check that it is a format we can understand
                if (format < FORMAT)
                    throw new IOException("Unknown format version: " + format);
                version = input.readLong(); // read version
                counter = input.readInt(); // read counter
            } else { // file is in old format without explicit format info
                counter = format;
            }

            for (int i = input.readInt(); i > 0; i--) { // read segmentInfos
                SegmentInfo si = new SegmentInfo(input.readString(),
                        input.readInt(), directory);
                vector.addElement(si);
            }

            if (format >= 0) { // in old format the version number may be at
                               // the end of the file
                if (input.getFilePointer() >= input.length())
                    version = System.currentTimeMillis(); // old file format
                                                          // without version
                                                          // number
                else
                    version = input.readLong(); // read version
            }
        } finally {
            input.close();
        }
    }

    public final void write(Directory directory) throws IOException {
        IndexOutput output = directory.createOutput("segments.new");
        try {
            output.writeInt(FORMAT); // write FORMAT
            output.writeLong(++version); // every write changes the index
            output.writeInt(counter); // write counter
            output.writeInt(size()); // write infos
            for (int i = 0; i < size(); i++) {
                SegmentInfo si = info(i);
                output.writeString(si.name);
                output.writeInt(si.docCount);
            }
        } finally {
            output.close();
        }

        // install new segment info
        directory.renameFile("segments.new", IndexFileNames.SEGMENTS);
    }

    /**
     * version number when this SegmentInfos was generated.
     */
    public long getVersion() {
        return version;
    }

    /**
     * Current version number from segments file.
     */
    public static long readCurrentVersion(Directory directory)
            throws IOException {

        IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
        int format = 0;

Re: Re: Re: Re: Re: Re: Re: Clustering IndexWriter?

2006-09-21 Thread Steve Harris

Oops, I made a change and didn't test it. Doh,
This should work better:

package org.apache.lucene.index;

/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;
import java.util.Vector;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

final class SegmentInfos {

    /** The file format version, a negative number. */
    /* Works since counter, the old 1st entry, is always >= 0 */
    public static final int FORMAT = -1;

    public int counter = 0; // used to name new segments

    private Vector vector = new Vector();

    /**
     * counts how often the index has been changed by adding or deleting docs.
     * starting with the current time in milliseconds forces to create unique
     * version numbers.
     */
    private long version = System.currentTimeMillis();

    public final SegmentInfo info(int i) {
        return (SegmentInfo) vector.elementAt(i);
    }

    public final void read(Directory directory) throws IOException {

        IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
        try {
            int format = input.readInt();
            if (format < 0) { // file contains explicit format info
                // check that it is a format we can understand
                if (format < FORMAT)
                    throw new IOException("Unknown format version: " + format);
                version = input.readLong(); // read version
                counter = input.readInt(); // read counter
            } else { // file is in old format without explicit format info
                counter = format;
            }

            for (int i = input.readInt(); i > 0; i--) { // read segmentInfos
                SegmentInfo si = new SegmentInfo(input.readString(),
                        input.readInt(), directory);
                vector.addElement(si);
            }

            if (format >= 0) { // in old format the version number may be at
                               // the end of the file
                if (input.getFilePointer() >= input.length())
                    version = System.currentTimeMillis(); // old file format
                                                          // without version
                                                          // number
                else
                    version = input.readLong(); // read version
            }
        } finally {
            input.close();
        }
    }

    public final void write(Directory directory) throws IOException {
        IndexOutput output = directory.createOutput("segments.new");
        try {
            output.writeInt(FORMAT); // write FORMAT
            output.writeLong(++version); // every write changes the index
            output.writeInt(counter); // write counter
            output.writeInt(size()); // write infos
            for (int i = 0; i < size(); i++) {
                SegmentInfo si = info(i);
                output.writeString(si.name);
                output.writeInt(si.docCount);
            }
        } finally {
            output.close();
        }

        // install new segment info
        directory.renameFile("segments.new", IndexFileNames.SEGMENTS);
    }

    /**
     * version number when this SegmentInfos was generated.
     */
    public long getVersion() {
        return version;
    }

    /**
     * Current version number from segments file.
     */
    public static long readCurrentVersion(Directory directory)
            throws IOException {

        IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
        int format = 0;
        long version = 0;
        try {
            format = input.readInt();
            if (format < 0) {