Re: Re: Re: Re: Re: Re: Clustering IndexWriter?
So I clustered this app: I switched to clustering the RAMDirectory instead of the IndexWriter, and it worked in my experiments. What I did was create a new IndexWriter on document adds and a new IndexSearcher on document queries. What I want to know is: how non-standard is this approach?

Cheers,
Steve

On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> > Interesting.
> > I have a notification mechanism at my disposal as well. I
> > wonder if it could be worked out that, much like in MVC, an IndexReader
> > could be notified when the underlying Directory has changed so that
> > the reader can adjust itself?
>
> Another little factor is that the IndexWriter must be closed before
> the IndexReader is opened to see all the changes. There is a cost to
> opening and using a new IndexReader, such as reading the term index
> and the norms. One would probably want to have some sort of logic to
> limit how often a new IndexReader is opened (which diminishes the
> value of realtime updates to the underlying IndexWriter). It still
> should be doable though.
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Re: Re: Re: Re: Re: Re: Clustering IndexWriter?
Oops, I made a change and didn't test it. Doh! This should work better:

package org.apache.lucene.index;

/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;
import java.util.Vector;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

final class SegmentInfos {

  /** The file format version, a negative number. */
  /* Works since counter, the old 1st entry, is always >= 0 */
  public static final int FORMAT = -1;

  public int counter = 0; // used to name new segments

  private Vector vector = new Vector();

  /**
   * counts how often the index has been changed by adding or deleting docs.
   * starting with the current time in milliseconds forces to create unique
   * version numbers.
   */
  private long version = System.currentTimeMillis();

  public final SegmentInfo info(int i) {
    return (SegmentInfo) vector.elementAt(i);
  }

  public final void read(Directory directory) throws IOException {
    IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
    try {
      int format = input.readInt();
      if (format < 0) { // file contains explicit format info
        // check that it is a format we can understand
        if (format < FORMAT)
          throw new IOException("Unknown format version: " + format);
        version = input.readLong(); // read version
        counter = input.readInt(); // read counter
      } else { // file is in old format without explicit format info
        counter = format;
      }

      for (int i = input.readInt(); i > 0; i--) { // read segmentInfos
        SegmentInfo si = new SegmentInfo(input.readString(), input.readInt(), directory);
        vector.addElement(si);
      }

      if (format >= 0) {
        // in old format the version number may be at
        // the end of the file
        if (input.getFilePointer() >= input.length())
          version = System.currentTimeMillis(); // old file format without version number
        else
          version = input.readLong(); // read version
      }
    } finally {
      input.close();
    }
  }

  public final void write(Directory directory) throws IOException {
    IndexOutput output = directory.createOutput("segments.new");
    try {
      output.writeInt(FORMAT); // write FORMAT
      output.writeLong(++version); // every write changes the index
      output.writeInt(counter); // write counter
      output.writeInt(size()); // write infos
      for (int i = 0; i < size(); i++) {
        SegmentInfo si = info(i);
        output.writeString(si.name);
        output.writeInt(si.docCount);
      }
    } finally {
      output.close();
    }

    // install new segment info
    directory.renameFile("segments.new", IndexFileNames.SEGMENTS);
  }

  /**
   * version number when this SegmentInfos was generated.
   */
  public long getVersion() {
    return version;
  }

  /**
   * Current version number from segments file.
   */
  public static long readCurrentVersion(Directory directory) throws IOException {
    IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
    int format = 0;
    long version = 0;
    try {
      format = input.readInt();
      if
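For anyone skimming the class above, the part that matters for spotting a changed index is the start of the segments file: an int format marker (negative in the current layout), then a long version that write() bumps on every call, then the int counter. Here is a rough sketch of that layout using plain java.io streams. The class and method names are made up for illustration, the streams only stand in for IndexInput/IndexOutput, and the byte-level encoding is an assumption rather than something taken from Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the start of the segments file; not a Lucene class.
final class SegmentsHeaderSketch {
  static final int FORMAT = -1; // a negative first int marks the explicit-format layout

  // Writes format, version and counter in the same order as write() above.
  static byte[] writeHeader(long version, int counter) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(FORMAT);
      out.writeLong(version);
      out.writeInt(counter);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Reads just far enough to return the version.
  static long readVersion(byte[] header) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(header))) {
      int format = in.readInt();
      if (format >= 0) {
        // The old layout has no version up front; this sketch does not handle it.
        throw new IllegalStateException("old format: version is not in the header");
      }
      if (format < FORMAT) {
        throw new IllegalStateException("Unknown format version: " + format);
      }
      return in.readLong();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A caller could compare the value from readVersion with the version it saw when it last opened a reader to decide whether the index has moved on.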
Re: Re: Re: Re: Re: Re: Clustering IndexWriter?
I don't know the list server's rules, but I figured I would just include the text of the file I changed. If that is bad form, give me a heads up and I won't do it again. Would this change break anything or bother anyone?

package org.apache.lucene.index;

/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;
import java.util.Vector;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

final class SegmentInfos {

  /** The file format version, a negative number. */
  /* Works since counter, the old 1st entry, is always >= 0 */
  public static final int FORMAT = -1;

  public int counter = 0; // used to name new segments

  private Vector vector = new Vector();

  /**
   * counts how often the index has been changed by adding or deleting docs.
   * starting with the current time in milliseconds forces to create unique
   * version numbers.
   */
  private long version = System.currentTimeMillis();

  public final SegmentInfo info(int i) {
    return (SegmentInfo) vector.elementAt(i);
  }

  public final void read(Directory directory) throws IOException {
    IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
    try {
      int format = input.readInt();
      if (format < 0) { // file contains explicit format info
        // check that it is a format we can understand
        if (format < FORMAT)
          throw new IOException("Unknown format version: " + format);
        version = input.readLong(); // read version
        counter = input.readInt(); // read counter
      } else { // file is in old format without explicit format info
        counter = format;
      }

      for (int i = input.readInt(); i > 0; i--) { // read segmentInfos
        SegmentInfo si = new SegmentInfo(input.readString(), input.readInt(), directory);
        vector.addElement(si);
      }

      if (format >= 0) {
        // in old format the version number may be at
        // the end of the file
        if (input.getFilePointer() >= input.length())
          version = System.currentTimeMillis(); // old file format without version number
        else
          version = input.readLong(); // read version
      }
    } finally {
      input.close();
    }
  }

  public final void write(Directory directory) throws IOException {
    IndexOutput output = directory.createOutput("segments.new");
    try {
      output.writeInt(FORMAT); // write FORMAT
      output.writeLong(++version); // every write changes the index
      output.writeInt(counter); // write counter
      output.writeInt(size()); // write infos
      for (int i = 0; i < size(); i++) {
        SegmentInfo si = info(i);
        output.writeString(si.name);
        output.writeInt(si.docCount);
      }
    } finally {
      output.close();
    }

    // install new segment info
    directory.renameFile("segments.new", IndexFileNames.SEGMENTS);
  }

  /**
   * version number when this SegmentInfos was generated.
   */
  public long getVersion() {
    return version;
  }

  /**
   * Current version number from segments file.
   */
  public static long readCurrentVersion(Directory directory) throws IOException {
    IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
    int
Re: Re: Re: Re: Re: Clustering IndexWriter?
On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> Interesting.
> I have a notification mechanism at my disposal as well. I
> wonder if it could be worked out that, much like in MVC, an IndexReader
> could be notified when the underlying Directory has changed so that
> the reader can adjust itself?

Another little factor is that the IndexWriter must be closed before the IndexReader is opened to see all the changes. There is a cost to opening and using a new IndexReader, such as reading the term index and the norms. One would probably want to have some sort of logic to limit how often a new IndexReader is opened (which diminishes the value of realtime updates to the underlying IndexWriter). It still should be doable though.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
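The "limit how often a new IndexReader is opened" logic could look roughly like the sketch below. It is only an illustration: ThrottledReopener, its opener callback, and the injectable clock are hypothetical names, not Lucene API, and a real version would also need to close the reader it replaces.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper, not part of Lucene: caches a reader and reopens it
// at most once per minIntervalMillis, and only when a change was reported.
final class ThrottledReopener<R> {
  private final Supplier<R> opener;        // assumption: opens a fresh reader
  private final LongSupplier clockMillis;  // injectable clock, e.g. System::currentTimeMillis
  private final long minIntervalMillis;
  private R current;
  private long lastOpenedMillis;

  ThrottledReopener(Supplier<R> opener, LongSupplier clockMillis, long minIntervalMillis) {
    this.opener = opener;
    this.clockMillis = clockMillis;
    this.minIntervalMillis = minIntervalMillis;
    this.current = opener.get();
    this.lastOpenedMillis = clockMillis.getAsLong();
  }

  // Returns the cached reader unless the index changed and the interval has elapsed.
  synchronized R get(boolean indexChanged) {
    long now = clockMillis.getAsLong();
    if (indexChanged && now - lastOpenedMillis >= minIntervalMillis) {
      current = opener.get(); // closing the previous reader is left out of this sketch
      lastOpenedMillis = now;
    }
    return current;
  }
}
```

The trade-off is the one described above: the larger the interval, the fewer expensive reopens, and the staler the results a query may see.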
Re: Re: Re: Re: Re: Clustering IndexWriter?
Interesting. I have a notification mechanism at my disposal as well. I wonder if it could be worked out that, much like in MVC, an IndexReader could be notified when the underlying Directory has changed so that the reader can adjust itself?

Cheers,
Steve

On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> > My guess is that some segment of the world cares a lot about realtime
> > coherent updates and some segment of the world needs blinding speed.
> > Part of my research is to gather the expertise of this group on these
> > issues.
>
> I hear ya...
>
> There is another part to the equation for Lucene though.
> Coherent realtime updates to the IndexWriter/RAMDirectory alone
> don't get you all the way there, since things are only readable
> through an IndexReader that needs to be reopened to see changes.
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server
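A minimal sketch of that notification idea, in the observer style the MVC comparison suggests. Everything here is hypothetical: the listener interface, the observable directory wrapper, and the stale flag are invented for illustration and are not taken from Lucene.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical callback, invented for this sketch.
interface DirectoryListener {
  void directoryChanged(long newVersion);
}

// Hypothetical wrapper that tells its listeners whenever a change is committed.
final class ObservableDirectorySketch {
  private final List<DirectoryListener> listeners = new CopyOnWriteArrayList<>();
  private long version;

  synchronized long version() {
    return version;
  }

  void addListener(DirectoryListener listener) {
    listeners.add(listener);
  }

  // Stands in for "a writer finished and its changes are now in the directory".
  void commit() {
    long committed;
    synchronized (this) {
      committed = ++version;
    }
    for (DirectoryListener listener : listeners) {
      listener.directoryChanged(committed);
    }
  }
}

// Hypothetical reader that only records that it is out of date.
final class StaleTrackingReader implements DirectoryListener {
  private final long openedAtVersion;
  private volatile boolean stale;

  private StaleTrackingReader(long openedAtVersion) {
    this.openedAtVersion = openedAtVersion;
  }

  static StaleTrackingReader open(ObservableDirectorySketch directory) {
    StaleTrackingReader reader = new StaleTrackingReader(directory.version());
    directory.addListener(reader);
    return reader;
  }

  @Override
  public void directoryChanged(long newVersion) {
    stale = newVersion != openedAtVersion;
  }

  boolean isStale() {
    return stale;
  }
}
```

In this sketch the reader only marks itself stale; deciding when to actually replace it is left to the caller, and removing listeners for discarded readers is omitted.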
Re: Re: Re: Re: Clustering IndexWriter?
On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> My guess is that some segment of the world cares a lot about realtime
> coherent updates and some segment of the world needs blinding speed.
> Part of my research is to gather the expertise of this group on these
> issues.

I hear ya...

There is another part to the equation for Lucene though. Coherent realtime updates to the IndexWriter/RAMDirectory alone don't get you all the way there, since things are only readable through an IndexReader that needs to be reopened to see changes.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
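To make the "needs to be reopened to see changes" point concrete, here is a toy model in plain Java. It is not Lucene code: ToySnapshotIndex is a made-up class in which a "reader" is simply an immutable copy taken at open time, which is the behaviour being described.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: a "reader" here is an immutable copy taken at open time.
final class ToySnapshotIndex {
  private final List<String> docs = new ArrayList<>();

  // Stands in for a change that has fully reached the shared directory.
  synchronized void addAndCommit(String doc) {
    docs.add(doc);
  }

  // A reader sees the documents present when it was opened, and nothing later.
  synchronized List<String> openReader() {
    return Collections.unmodifiableList(new ArrayList<>(docs));
  }
}
```

A reader opened before an update keeps returning the old view even though the shared data has changed; only a reader opened afterwards sees the new document.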
Re: Re: Re: Re: Clustering IndexWriter?
Good question. It may or may not be performant enough; only time (and testing) will tell. My guess is that it will depend heavily on the rate at which the data changes (or the read/write ratio).

Believe me, I'm not proposing that everyone go out and cluster Lucene with Terracotta DSO. I'm really just playing, researching, learning. I'm a firm believer in using the right tool for the right job and would never claim that any product (especially one I wrote) is right for everyone.

My guess is that some segment of the world cares a lot about realtime coherent updates and some segment of the world needs blinding speed. Part of my research is to gather the expertise of this group on these issues.

Cheers,
Steve

On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> While automatically clustering Java objects sure sounds cool, I have to
> wonder what the performance ends up being. Every small change to the
> clustered objects is broadcast to all the nodes, correct? Have you
> done any performance comparisons to see if this is a practical
> approach for Lucene?
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server
>
> On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote:
> > Fair question.
> >
> > All I did (and all I need) was to take SegmentInfos and, instead of
> > subclassing Vector, make it contain a Vector. I went from subclassing
> > to aggregation. As far as I could tell from reading the code, it would
> > make no difference to anyone and should have no performance impact
> > (good or bad). It just allowed me to cluster the IndexWriter with a
> > RAMDirectory.
> >
> > Maybe a little background would help. Our clustering product doesn't
> > use Java serialization and has no API. We just use a little config
> > in which you point us to what you want clustered and which Java
> > synchronization needs to be shared. One of the limitations that
> > currently exists is that we don't support clustering subclasses of
> > Java collections.
> >
> > At this point I'm just experimenting to see if our product can cluster
> > Lucene in a useful/performant way. When my experimenting is complete,
> > if everything is positive, I am going to write a blog post on clustering
> > Lucene indexes, but it would be awkward to do that if the people who
> > run through the example have to change Lucene code.
> >
> > Does this help?
> >
> > Cheers,
> > Steve
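The subclassing-to-aggregation change described in the quoted message can be shown in miniature. This is an illustrative sketch with made-up class names that hold segment names as strings; the real SegmentInfos holds SegmentInfo objects, as in the posted file.

```java
import java.util.Vector;

// Before (illustrative): the class IS a Vector, so it is a collection subclass.
final class SegmentNamesBySubclassing extends Vector<String> {
  String info(int i) {
    return elementAt(i);
  }
}

// After (illustrative): the class HAS a Vector and exposes only what callers need.
final class SegmentNamesByAggregation {
  private final Vector<String> vector = new Vector<>();

  void addElement(String name) {
    vector.addElement(name);
  }

  String info(int i) {
    return vector.elementAt(i);
  }

  int size() {
    return vector.size();
  }
}
```

The cost of the second form is that every Vector method callers still rely on (size(), addElement(), and so on) has to be forwarded by hand, which is why the posted class calls its own size() rather than inheriting it.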