Re: Clustering IndexWriter?
Don't be coy, what's your company?

Adam

On 21/09/06, Steve Harris [EMAIL PROTECTED] wrote:

Warning, I'm a vendor dude but this isn't really a vendor message. My IT guy had mentioned to me that a bunch of the open source products we use (JIRA, JForum, etc.) have Lucene inside, and in the name of eating our own dog food I tried to cluster IndexWriter (with a RAMDirectory) using our (Terracotta) clustering technology. It took me about half an hour from download time to get the basics working.

I was wondering, do people in the real world want to be able to cluster this stuff? Is clustering the IndexWriter really all I need to do?

If it is interesting, how do I feed a small code change back into the project? We don't yet support subclasses of collections, and SegmentInfos subclasses Vector. I just turned it into aggregation (that took 10 of the 30 minutes). We will support this in a future release, so it isn't a huge deal, but I could get something out sooner if the change was made.

Cheers,
Steve

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustering IndexWriter?
adasal wrote:

Don't be coy, what's your company?

This URL is derivable from the text, with a little search engine help . . .

http://www.terracottatech.com/terracotta_spring.shtml

more,
l8r,
v

--
The future is here. It's just not evenly distributed yet.
-- William Gibson, quoted by Whitfield Diffie
Re: Re: Clustering IndexWriter?
Yep, that's us. No secret, just didn't want to make my question a billboard :-). Just needed a bit of info from the people who know best.

Cheers,
steve
Re: Clustering IndexWriter?
On 9/20/06, Steve Harris [EMAIL PROTECTED] wrote:

Is clustering the IndexWriter really all I need to do?

Hi Steve,

Could you explain the details of what clustering really means in this context?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
Re: Re: Clustering IndexWriter?
Sure. I'm fairly new to Lucene, but what I was trying to do was make it so that an index could be shared among multiple nodes. If an index is updated in any way, it would be updated across the cluster coherently.

In my first version I was really only taking advantage of the fact that we detect fine-grained changes and can extend synchronization across the cluster, but if I can prove to myself that this is actually useful I'll go back and mark some of the synchronized blocks/methods as read locks to improve concurrency and reduce instrumentation to only what is needed.

If I'm going to be able to publish the config for what I'm doing, I would need to change that one class that I mentioned above, because we won't support subclasses of collections for a few more months.

I'm not a very good writer. Does any of that make sense?

Summary would be:

Goals:
Usefully cluster Lucene indexes across multiple nodes.

Questions:
Is this useful in the real world?
Would it be possible to get that one small thing changed?

Cheers,
Steve
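As a rough single-JVM analogy for "marking synchronized blocks as read locks", here is a minimal sketch using the standard java.util.concurrent.locks API. It is not Terracotta configuration and not Lucene code; GuardedSegmentNames and its methods are hypothetical names invented for illustration. The idea is only that readers share a lock while writers take an exclusive one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical class for illustration only; not part of Lucene or Terracotta.
class GuardedSegmentNames {
    private final List<String> names = new ArrayList<String>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Writers take the exclusive write lock.
    void add(String name) {
        lock.writeLock().lock();
        try {
            names.add(name);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Readers take the shared read lock, so several can run at once.
    int size() {
        lock.readLock().lock();
        try {
            return names.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}

class GuardedSegmentNamesDemo {
    public static void main(String[] args) {
        GuardedSegmentNames names = new GuardedSegmentNames();
        names.add("_0");
        names.add("_1");
        System.out.println("segments: " + names.size());
    }
}
```

With a plain synchronized method, two threads calling size() would block each other; with the read lock they would not, which is the kind of concurrency gain the message is hoping for.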
Re: Re: Clustering IndexWriter?
I don't fully follow, and I don't even have the "it's late!" excuse. It sounds like you want to have the same index on multiple nodes in the cluster, and when a data change occurs, you want to synchronously make the same change to all indices in your cluster. Is that it?

Solr has a different approach. There, only the master index is modified, while slave servers copy the master index periodically.

Otis
Re: Re: Clustering IndexWriter?
: Questions:
: Is this useful in the real world
: Would it be possible to get that one small thing changed.

I'm not really clear on what the small thing is that you are asking about ... you mentioned SegmentInfos subclassing Vector, are you proposing an alternative?

If you've got a patch that doesn't break existing functionality or have a negative impact on performance and makes Lucene more usable in some way it would certainly be considered ... I'm just not really clear on what change you're suggesting and how it helps make Lucene more usable for you.

-Hoss
Re: Re: Re: Clustering IndexWriter?
Fair question. All I did/needed was take SegmentInfos and, instead of subclassing Vector, I made it contain a Vector. It went from subclassing to aggregation. As far as I could tell from reading the code, it would make no difference to anyone and should have no performance impact (good or bad). It just allowed me to cluster the IndexWriter with a RAMDirectory.

Maybe a little background would help. Our clustering product doesn't use Java serialization and has no API. We just use a little config where you point us to what you want clustered and what Java synchronization needs to be shared. One of the limitations that currently exists is that we don't support clustering subclasses of Java collections.

At this point I'm just experimenting to see if our product can cluster Lucene in a useful/performant way. When my experimenting is complete, if everything is positive, I am going to write a blog post on clustering Lucene indexes, but it would be awkward to do that if the people who run through the example have to change Lucene code.

Does this help?

Cheers,
Steve
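For anyone skimming the thread, the subclassing-to-aggregation change described above can be sketched roughly as follows. This is a simplified illustration, not the Lucene source: SegmentList and its String elements are hypothetical stand-ins for SegmentInfos and SegmentInfo.

```java
import java.util.Vector;

// Hypothetical stand-in for SegmentInfos; not the real Lucene class.
// Before the change the class would have been declared as
//   final class SegmentList extends Vector<String>
// and inherited every Vector method. After the change the Vector is a
// private field and only the operations that are needed are forwarded.
final class SegmentList {
    private final Vector<String> segments = new Vector<String>();

    void add(String name) {
        segments.addElement(name);
    }

    String info(int i) {
        return segments.elementAt(i);
    }

    int size() {
        return segments.size();
    }
}

class SegmentListDemo {
    public static void main(String[] args) {
        SegmentList list = new SegmentList();
        list.add("_0");
        list.add("_1");
        System.out.println(list.size() + " segments, first is " + list.info(0));
    }
}
```

One consequence of the aggregation form is that callers which relied on inherited Vector methods would need those methods forwarded explicitly, which is presumably what any patch would have to check.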
Re: Re: Re: Re: Clustering IndexWriter?
Good question. It may or may not be performant enough. Only time (and testing) will tell. My guess is that it will depend heavily on the rate at which the data changes (or the read/write ratio).

Believe me, I'm not proposing that everyone go out and cluster Lucene with Terracotta DSO. I'm really just playing, researching, learning. I'm a firm believer in using the right tool for the right job and would never claim that any product (especially one I wrote) is right for everyone.

My guess is that some segment of the world cares a lot about realtime coherent updates and some segment of the world needs blinding speed. Part of my research is to gather the expertise of this group on these issues.

Cheers,
Steve

On 9/21/06, Yonik Seeley [EMAIL PROTECTED] wrote:

While automatically clustering Java objects sure sounds cool, I have to wonder what the performance ends up being. Every small change to the clustered objects is broadcast to all the nodes, correct? Have you done any performance comparisons to see if this is a practical approach for Lucene?

-Yonik
Re: Re: Re: Re: Clustering IndexWriter?
On 9/21/06, Steve Harris [EMAIL PROTECTED] wrote:

My guess is that some segment of the world cares a lot about realtime coherent updates and some segment of the world needs blinding speed. Part of my research is to gather the expertise of this group on these issues.

I hear ya... There is another part to the equation for Lucene though. Coherent realtime updates to the IndexWriter/RAMDirectory alone don't get you all the way there, since things are only readable through an IndexReader that needs to be reopened to see changes.

-Yonik
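The point about readers can be illustrated with a small model. This is a sketch under stated assumptions, not Lucene API: VersionedStore and SnapshotReader are hypothetical classes that only mimic the idea that a reader sees the state captured when it was opened and has to be reopened to see later writes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Lucene API: a store whose version grows on every write.
class VersionedStore {
    private final List<String> docs = new ArrayList<String>();
    private long version = 0;

    synchronized void add(String doc) {
        docs.add(doc);
        version++;
    }

    synchronized long version() {
        return version;
    }

    synchronized List<String> copyDocs() {
        return new ArrayList<String>(docs);
    }
}

// A reader sees only the state captured at the moment it was opened.
class SnapshotReader {
    private final VersionedStore store;
    private final List<String> docs;
    private final long openedAtVersion;

    SnapshotReader(VersionedStore store) {
        this.store = store;
        synchronized (store) { // capture documents and version together
            this.docs = store.copyDocs();
            this.openedAtVersion = store.version();
        }
    }

    int numDocs() {
        return docs.size();
    }

    // False once the store has been written to after this reader was opened.
    boolean isCurrent() {
        return openedAtVersion == store.version();
    }
}

class SnapshotReaderDemo {
    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        store.add("doc-1");
        SnapshotReader reader = new SnapshotReader(store);
        store.add("doc-2"); // the existing reader does not see this write
        System.out.println("old reader docs: " + reader.numDocs());
        System.out.println("reopened reader docs: " + new SnapshotReader(store).numDocs());
    }
}
```

Even if every node sees the write immediately, a reader opened before the write keeps serving the old snapshot until something reopens it.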
Re: Re: Re: Re: Re: Clustering IndexWriter?
Interesting. I have a notification mechanism at my disposal as well. I wonder if it could be worked out that, much like MVC, an IndexReader could be notified when the underlying Directory has changed so that the reader can adjust itself?

Cheers,
Steve
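A minimal sketch of the notification idea, assuming a plain observer pattern. None of these types exist in Lucene or Terracotta: DirectoryListener, NotifyingStore and StaleFlagListener are hypothetical names, and the "reader" here only records that it has become stale.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface; Lucene's Directory has no such hook.
interface DirectoryListener {
    void directoryChanged(long newVersion);
}

// Hypothetical store that tells its listeners about every commit.
class NotifyingStore {
    private final List<DirectoryListener> listeners = new ArrayList<DirectoryListener>();
    private long version = 0;

    void addListener(DirectoryListener listener) {
        listeners.add(listener);
    }

    void commit() {
        version++;
        for (DirectoryListener listener : listeners) {
            listener.directoryChanged(version);
        }
    }
}

// Stands in for whatever owns the reader: it only notes that a reopen is due.
class StaleFlagListener implements DirectoryListener {
    private long seenVersion = 0;
    private boolean stale = false;

    public void directoryChanged(long newVersion) {
        seenVersion = newVersion;
        stale = true;
    }

    boolean isStale() {
        return stale;
    }

    long seenVersion() {
        return seenVersion;
    }

    void markReopened() {
        stale = false;
    }
}

class NotifyingStoreDemo {
    public static void main(String[] args) {
        NotifyingStore store = new NotifyingStore();
        StaleFlagListener listener = new StaleFlagListener();
        store.addListener(listener);
        store.commit();
        System.out.println("stale after commit: " + listener.isStale());
    }
}
```

The listener deliberately does not reopen anything inside the callback; it only sets a flag, so the owner can decide when the cost of a reopen is worth paying.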
Re: Re: Re: Re: Re: Clustering IndexWriter?
On 9/21/06, Steve Harris [EMAIL PROTECTED] wrote:

Interesting. I wonder, I have a notification mechanism at my disposal as well. I wonder if it could be worked out that, much like a mvc, an IndexReader could be notified when the underlying Directory has changed so that the reader can adjust itself?

Another little factor is that the IndexWriter must be closed before the IndexReader is opened to see all the changes.

There is a cost to opening and using a new IndexReader, such as reading the term index and the norms. One would probably want to have some sort of logic to limit how fast a new IndexReader was opened (which diminishes the value of realtime updates to the underlying IndexWriter).

It still should be doable though.

-Yonik
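The "logic to limit how fast a new IndexReader was opened" could look something like the following sketch. ReopenThrottle is a hypothetical helper, not part of Lucene, and the time is passed in explicitly so the behaviour is easy to check; real code would more likely pass System.currentTimeMillis().

```java
// Hypothetical helper, not Lucene API: permits a reopen at most once per interval.
class ReopenThrottle {
    private final long minIntervalMillis;
    private long lastReopenMillis = 0;
    private boolean everReopened = false;

    ReopenThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Returns true and records the time if a reopen is allowed at nowMillis.
    synchronized boolean tryReopen(long nowMillis) {
        if (everReopened && nowMillis - lastReopenMillis < minIntervalMillis) {
            return false; // too soon; keep serving from the existing reader
        }
        lastReopenMillis = nowMillis;
        everReopened = true;
        return true;
    }
}

class ReopenThrottleDemo {
    public static void main(String[] args) {
        ReopenThrottle throttle = new ReopenThrottle(1000);
        long now = System.currentTimeMillis();
        System.out.println("first attempt allowed: " + throttle.tryReopen(now));
        System.out.println("immediate retry allowed: " + throttle.tryReopen(now + 1));
    }
}
```

The trade-off is the one named in the message: the longer the interval, the cheaper the reopens are overall, and the staler search results can get between them.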
Re: Re: Re: Re: Re: Re: Clustering IndexWriter?
I don't know the list server's rules, but I figured I would just include the text of the file I changed. If that is bad form, give me a heads up and I won't do it again. Would this change break anything or bother anyone?

package org.apache.lucene.index;

/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;
import java.util.Vector;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

final class SegmentInfos {

  /** The file format version, a negative number. */
  /* Works since counter, the old 1st entry, is always >= 0 */
  public static final int FORMAT = -1;

  public int counter = 0; // used to name new segments

  private Vector vector = new Vector();

  /**
   * counts how often the index has been changed by adding or deleting docs.
   * starting with the current time in milliseconds forces to create unique
   * version numbers.
   */
  private long version = System.currentTimeMillis();

  public final SegmentInfo info(int i) {
    return (SegmentInfo) vector.elementAt(i);
  }

  public final void read(Directory directory) throws IOException {
    IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
    try {
      int format = input.readInt();
      if (format < 0) { // file contains explicit format info
        // check that it is a format we can understand
        if (format < FORMAT)
          throw new IOException("Unknown format version: " + format);
        version = input.readLong(); // read version
        counter = input.readInt(); // read counter
      } else { // file is in old format without explicit format info
        counter = format;
      }

      for (int i = input.readInt(); i > 0; i--) { // read segmentInfos
        SegmentInfo si = new SegmentInfo(input.readString(), input.readInt(), directory);
        vector.addElement(si);
      }

      if (format >= 0) { // in old format the version number may be at the end of the file
        if (input.getFilePointer() >= input.length())
          version = System.currentTimeMillis(); // old file format without version number
        else
          version = input.readLong(); // read version
      }
    } finally {
      input.close();
    }
  }

  public final void write(Directory directory) throws IOException {
    IndexOutput output = directory.createOutput("segments.new");
    try {
      output.writeInt(FORMAT); // write FORMAT
      output.writeLong(++version); // every write changes the index
      output.writeInt(counter); // write counter
      output.writeInt(size()); // write infos
      for (int i = 0; i < size(); i++) {
        SegmentInfo si = info(i);
        output.writeString(si.name);
        output.writeInt(si.docCount);
      }
    } finally {
      output.close();
    }

    // install new segment info
    directory.renameFile("segments.new", IndexFileNames.SEGMENTS);
  }

  /**
   * version number when this SegmentInfos was generated.
   */
  public long getVersion() {
    return version;
  }

  /**
   * Current version number from segments file.
   */
  public static long readCurrentVersion(Directory directory) throws IOException {
    IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
    int format = 0;
Re: Re: Re: Re: Re: Re: Re: Clustering IndexWriter?
Oops, I made a change and didn't test it. Doh. This should work better:

package org.apache.lucene.index;

/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;
import java.util.Vector;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

final class SegmentInfos {

  /** The file format version, a negative number. */
  /* Works since counter, the old 1st entry, is always >= 0 */
  public static final int FORMAT = -1;

  public int counter = 0; // used to name new segments

  private Vector vector = new Vector();

  /**
   * counts how often the index has been changed by adding or deleting docs.
   * starting with the current time in milliseconds forces to create unique
   * version numbers.
   */
  private long version = System.currentTimeMillis();

  public final SegmentInfo info(int i) {
    return (SegmentInfo) vector.elementAt(i);
  }

  public final void read(Directory directory) throws IOException {
    IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
    try {
      int format = input.readInt();
      if (format < 0) { // file contains explicit format info
        // check that it is a format we can understand
        if (format < FORMAT)
          throw new IOException("Unknown format version: " + format);
        version = input.readLong(); // read version
        counter = input.readInt(); // read counter
      } else { // file is in old format without explicit format info
        counter = format;
      }

      for (int i = input.readInt(); i > 0; i--) { // read segmentInfos
        SegmentInfo si = new SegmentInfo(input.readString(), input.readInt(), directory);
        vector.addElement(si);
      }

      if (format >= 0) { // in old format the version number may be at the end of the file
        if (input.getFilePointer() >= input.length())
          version = System.currentTimeMillis(); // old file format without version number
        else
          version = input.readLong(); // read version
      }
    } finally {
      input.close();
    }
  }

  public final void write(Directory directory) throws IOException {
    IndexOutput output = directory.createOutput("segments.new");
    try {
      output.writeInt(FORMAT); // write FORMAT
      output.writeLong(++version); // every write changes the index
      output.writeInt(counter); // write counter
      output.writeInt(size()); // write infos
      for (int i = 0; i < size(); i++) {
        SegmentInfo si = info(i);
        output.writeString(si.name);
        output.writeInt(si.docCount);
      }
    } finally {
      output.close();
    }

    // install new segment info
    directory.renameFile("segments.new", IndexFileNames.SEGMENTS);
  }

  /**
   * version number when this SegmentInfos was generated.
   */
  public long getVersion() {
    return version;
  }

  /**
   * Current version number from segments file.
   */
  public static long readCurrentVersion(Directory directory) throws IOException {
    IndexInput input = directory.openInput(IndexFileNames.SEGMENTS);
    int format = 0;
    long version = 0;
    try {
      format = input.readInt();
      if (format < 0) {