Re: Distributed Indexes, Searches and HDFS
On 9/21/06, Chris D <[EMAIL PROTECTED]> wrote: The cronjob/link solution which is quite clean, doesn't work well in a windows environment. While it's my favorite, no dice... Rats. There may be hope yet for that on Windows. Hard links work on Windows, but the only problem is that you can't rename/delete any links when the file is open. Michael McCandless is working on a patch that would eliminate all renames (and deletes can be handled by deferring them). http://www.nabble.com/Re%3A--Solr-Wiki--Update-of-%22TaskList%22-by-YonikSeeley-tf2081816.html#a5736265 http://www.nabble.com/-jira--Created%3A-%28LUCENE-665%29-temporary-file-access-denied-on-Windows-tf2167540.html#a6295771 -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Re: Re: Re: Re: Re: Clustering IndexWriter?
So I clustered this app: I switched to clustering the RAMDirectory instead of the IndexWriter, and it worked in my experiments. What I did was create a new IndexWriter on document adds and a new IndexSearcher on document queries. What I want to know is: how non-standard is this approach? Cheers, Steve On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote: > Interesting. > I wonder, I have a notification mechanism at my disposal as well. I > wonder if it could be worked out that, much like a mvc, an IndexReader > could be notified when the underlying Directory has changed so that > the reader can adjust itself? Another little factor is that the IndexWriter must be closed before the IndexReader is opened to see all the changes. There is cost to opening and using a new IndexReader such as reading the term index and the norms. One would probably want to have some sort of logic to limit how fast a new IndexReader was opened (which diminishes the value of realtime updates to the underlying IndexWriter). It still should be doable though. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server > On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: > > On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote: > > > My guess is that some segment of the world cares a lot about realtime > > > coherent updates and some segment of the world needs blinding speed. > > > Part of my research is to gather the expertise of this group on these > > > issues. > > > > I hear ya... > > > > There is another part to the equation for Lucene though. > > Coherent realtime updates to the IndexWriter/RamDirectory alone > > doesn't get you all the way there since things are only readable > > through an IndexReader that needs to be reopened to see changes. > > > > > > -Yonik > > http://incubator.apache.org/solr Solr, the open-source Lucene search server > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
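A minimal sketch of the pattern Steve describes above, opening a fresh IndexWriter for each add and a fresh IndexSearcher for each query over a shared RAMDirectory, might look like the following. The class and method names are invented for illustration and this is not part of Lucene or of the Terracotta integration; it only shows the shape of the approach on the plain Lucene 1.9/2.0 API.

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;

    import java.io.IOException;

    /** Shared (possibly clustered) directory; writer/searcher are created per operation. */
    public class SharedIndex {
        private final RAMDirectory dir = new RAMDirectory();
        private boolean created = false;

        /** Open a fresh writer for each add so the latest segments are always committed. */
        public synchronized void add(Document doc) throws IOException {
            IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), !created);
            created = true;
            writer.addDocument(doc);
            writer.close();               // flush so that new readers see the document
        }

        /** Open a fresh searcher for each query so it sees the latest committed state. */
        public synchronized int count(String field, String text) throws IOException {
            IndexSearcher searcher = new IndexSearcher(dir);
            try {
                Hits hits = searcher.search(new TermQuery(new Term(field, text)));
                return hits.length();
            } finally {
                searcher.close();
            }
        }
    }

Opening a writer and a searcher per operation keeps every node coherent but pays the full reader-open cost on every query, which is exactly the trade-off discussed in the rest of this thread.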
Re: Re: Re: Re: Re: Re: Re: Clustering IndexWriter?
Oops, I made a change and didn't test it. Doh, This should work better: package org.apache.lucene.index; /** * Copyright 2004 The Apache Software Foundation * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ import java.io.IOException; import java.util.Vector; import org.apache.lucene.store.Directory; import org.apache.lucene.store.IndexInput; import org.apache.lucene.store.IndexOutput; final class SegmentInfos { /** The file format version, a negative number. */ /* Works since counter, the old 1st entry, is always >= 0 */ public static final int FORMAT = -1; public int counter = 0; // used to name new segments private Vector vector = new Vector(); /** * counts how often the index has been changed by adding or deleting docs. * starting with the current time in milliseconds forces to create unique * version numbers. */ private long version = System.currentTimeMillis(); public final SegmentInfo info(int i) { return (SegmentInfo) vector.elementAt(i); } public final void read(Directory directory) throws IOException { IndexInput input = directory.openInput(IndexFileNames.SEGMENTS); try { int format = input.readInt(); if (format < 0) { // file contains explicit format info // check that it is a format we can understand if (format < FORMAT) throw new IOException("Unknown format version: " + format); version = input.readLong(); // read version counter = input.readInt(); // read counter } else { // file is in old format without explicit format info counter = format; } for (int i = input.readInt(); i > 0; i--) { // read segmentInfos SegmentInfo si = new SegmentInfo(input.readString(), input .readInt(), directory); vector.addElement(si); } if (format >= 0) { // in old format the version number may be at // the end of the file if (input.getFilePointer() >= input.length()) version = System.currentTimeMillis(); // old file format // without version // number else version = input.readLong(); // read version } } finally { input.close(); } } public final void write(Directory directory) throws IOException { IndexOutput output = directory.createOutput("segments.new"); try { output.writeInt(FORMAT); // write FORMAT output.writeLong(++version); // every write changes the index output.writeInt(counter); // write counter output.writeInt(size()); // write infos for (int i = 0; i < size(); i++) { SegmentInfo si = info(i); output.writeString(si.name); output.writeInt(si.docCount); } } finally { output.close(); } // install new segment info directory.renameFile("segments.new", IndexFileNames.SEGMENTS); } /** * version number when this SegmentInfos was generated. */ public long getVersion() { return version; } /** * Current version number from segments file. */ public static long readCurrentVersion(Directory directory) throws IOException { IndexInput input = directory.openInput(IndexFileNames.SEGMENTS); int format = 0; long version = 0; try { format = input.readInt(); if
Distributed Indexes, Searches and HDFS
Hi List, As a bit of an experiment I'm redoing some of our indexing and searching code to try to make it easier to manage and distribute. The system has to modify its indexes frequently, sometimes in huge batches, and the documents in the indexes are frequently modified (deleted, changed and re-added). Just for scale, we want the system to be capable of searching a terabyte or so of data. Currently we have a bunch of index machines indexing to a local file system; every hour or so they merge to a group of indexes stored on NFS or a similar common filesystem, and the search nodes retrieve the new indexes and search on those. The merge can take about as long as it took to originally index the files, since it has to re-index the "contents" field because that field isn't stored. After reading this thread: http://www.gossamer-threads.com/lists/lucene/java-user/13803#13803 there were several good suggestions, but I'm curious: is there a generally accepted best practice for distributing Lucene? The cronjob/link solution, which is quite clean, doesn't work well in a Windows environment. While it's my favorite, no dice... Rats. So I decided to experiment with a couple of different ideas, and I have some questions. 1) Indexing and Searching Directly from HDFS Indexing to HDFS is possible with a patch if we don't use CFS. While not ideal performance-wise, it's reliable, takes care of data redundancy and component failure, and means that I can have cheap small drives instead of a large expensive NAS. It's also quite simple to implement (see Nutch's indexer.FsDirectory for the Directory implementation). So I would have several indexes (e.g. 16) and the same number of indexers, and a searcher for each index (possibly in the same process) that searches each one directly from HDFS. One problem I'm having is an occasional file-not-found exception (probably locking related): org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename /index/_3.f0 at org.apache.hadoop.dfs.NameNode.open(NameNode.java:178) at sun.reflect.GeneratedMethodAccessor41.invoke (Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke( DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:585) at org.apache.hadoop.ipc.RPC$Server.call (RPC.java:332) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:468) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:245) It comes out of the Searcher when I try to do a search while things are being indexed. I'd be interested to know what exactly is happening when this exception is thrown; maybe I can design around it (do synchronization at the appropriate times, or similar). 2) Index Locally, Search in HDFS I haven't implemented this, but I was thinking something along the lines of merging every little while and having the searchers refresh after that's finished (a sketch of such a refresh follows this message). I still have a problem with the merge taking a fairly long time, and if a node fails we lose the documents stored locally in that index. 3) Index to HDFS, Search Locally The system indexes to HDFS and the searchers ask the indexers to pause while they retrieve the indexes from the store. The indexes are then searched locally and the indexers continue trucking along. This, in my head, seems to work alright, at least until the indexes get very large and copying them is prohibitive. (Is there a Java rsync?) I'll have to investigate how much of a performance hit indexing to the network actually is. If anyone has any numbers I would be interested in seeing them.
4) Map/Reduce I don't know a lot about this and haven't been able to find much on applying map/reduce to lucene indexing. Well, except for the Nutch source code, which is rather difficult to sort through for an overview. So if anyone has a snippet or a good overview I could look over I would be grateful. Even if you can just point at a critical part in Nutch that would also be quite helpful. 5) Anything else I would appreciate any insight anyone has on distributing indexes, either on list or off. Many Thanks, Chris PS. Sorry if this got double posted. Didn't seem to get through first time.
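One way to soften the refresh problem in options 2 and 3 above is to reopen a searcher only when the index version has actually changed. The sketch below is an assumption-laden illustration, not tested code: it presumes a Directory implementation over HDFS (such as Nutch's indexer.FsDirectory mentioned above, not shown here) and uses the stock IndexReader.getCurrentVersion() check from the Lucene API of that era.

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;

    import java.io.IOException;

    /** Hands out a searcher over a (possibly remote) Directory, reopening it
     *  only when the segments version has changed since the last open. */
    public class RefreshingSearcher {
        private final Directory dir;   // e.g. an HDFS-backed Directory (assumed, not shown)
        private IndexSearcher searcher;
        private long version = -1;

        public RefreshingSearcher(Directory dir) {
            this.dir = dir;
        }

        public synchronized IndexSearcher get() throws IOException {
            long current = IndexReader.getCurrentVersion(dir);
            if (searcher == null || current != version) {
                if (searcher != null) {
                    searcher.close();            // NOTE: unsafe if queries are still in flight
                }
                searcher = new IndexSearcher(IndexReader.open(dir));
                version = current;
            }
            return searcher;
        }
    }

This does not by itself explain or avoid the "Cannot open filename" error above, which looks more like a reader still holding a file that a concurrent merge has already removed, but it at least avoids reopening readers when nothing has changed.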
Re: Re: Re: Re: Re: Re: Clustering IndexWriter?
I'm don't know list servers rules but I figured I would just include the text of the file I changed. If that is bad form give me a heads up and I won't do it again. Would this change break anything or bother anyone? package org.apache.lucene.index; /** * Copyright 2004 The Apache Software Foundation * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ import java.io.IOException; import java.util.Vector; import org.apache.lucene.store.Directory; import org.apache.lucene.store.IndexInput; import org.apache.lucene.store.IndexOutput; final class SegmentInfos { /** The file format version, a negative number. */ /* Works since counter, the old 1st entry, is always >= 0 */ public static final int FORMAT = -1; public int counter = 0; // used to name new segments private Vector vector = new Vector(); /** * counts how often the index has been changed by adding or deleting docs. * starting with the current time in milliseconds forces to create unique * version numbers. */ private long version = System.currentTimeMillis(); public final SegmentInfo info(int i) { return (SegmentInfo) vector.elementAt(i); } public final void read(Directory directory) throws IOException { IndexInput input = directory.openInput(IndexFileNames.SEGMENTS); try { int format = input.readInt(); if (format < 0) { // file contains explicit format info // check that it is a format we can understand if (format < FORMAT) throw new IOException("Unknown format version: " + format); version = input.readLong(); // read version counter = input.readInt(); // read counter } else { // file is in old format without explicit format info counter = format; } for (int i = input.readInt(); i > 0; i--) { // read segmentInfos SegmentInfo si = new SegmentInfo(input.readString(), input .readInt(), directory); vector.addElement(si); } if (format >= 0) { // in old format the version number may be at // the end of the file if (input.getFilePointer() >= input.length()) version = System.currentTimeMillis(); // old file format // without version // number else version = input.readLong(); // read version } } finally { input.close(); } } public final void write(Directory directory) throws IOException { IndexOutput output = directory.createOutput("segments.new"); try { output.writeInt(FORMAT); // write FORMAT output.writeLong(++version); // every write changes the index output.writeInt(counter); // write counter output.writeInt(size()); // write infos for (int i = 0; i < size(); i++) { SegmentInfo si = info(i); output.writeString(si.name); output.writeInt(si.docCount); } } finally { output.close(); } // install new segment info directory.renameFile("segments.new", IndexFileNames.SEGMENTS); } /** * version number when this SegmentInfos was generated. */ public long getVersion() { return version; } /** * Current version number from segments file. */ public static long readCurrentVersion(Directory directory) throws IOException { IndexInput input = directory.openInput(IndexFileNames.SEGMENTS); int
Re: Re: Re: Re: Re: Clustering IndexWriter?
On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote: Interesting. I wonder, I have a notification mechanism at my disposal as well. I wonder if it could be worked out that, much like a mvc, an IndexReader could be notified when the underlying Directory has changed so that the reader can adjust itself? Another little factor is that the IndexWriter must be closed before the IndexReader is opened to see all the changes. There is cost to opening and using a new IndexReader such as reading the term index and the norms. One would probably want to have some sort of logic to limit how fast a new IndexReader was opened (which diminishes the value of realtime updates to the underlying IndexWriter). It still should be doable though. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: > On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote: > > My guess is that some segment of the world cares a lot about realtime > > coherent updates and some segment of the world needs blinding speed. > > Part of my research is to gather the expertise of this group on these > > issues. > > I hear ya... > > There is another part to the equation for Lucene though. > Coherent realtime updates to the IndexWriter/RamDirectory alone > doesn't get you all the way there since things are only readable > through an IndexReader that needs to be reopened to see changes. > > > -Yonik > http://incubator.apache.org/solr Solr, the open-source Lucene search server > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
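A rough illustration of the rate limiting Yonik mentions, reopening a reader at most once per interval no matter how often the writer commits, could look like the sketch below; the class name and interval handling are made up for the example and this is not an existing Lucene utility.

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;

    import java.io.IOException;

    /** Reopens an IndexReader at most once per minIntervalMs, since each reopen
     *  pays the cost of reloading the term index and the norms. */
    public class ThrottledReader {
        private final Directory dir;
        private final long minIntervalMs;
        private IndexReader reader;
        private long lastOpen = 0;

        public ThrottledReader(Directory dir, long minIntervalMs) {
            this.dir = dir;
            this.minIntervalMs = minIntervalMs;
        }

        public synchronized IndexReader get() throws IOException {
            long now = System.currentTimeMillis();
            if (reader == null || now - lastOpen >= minIntervalMs) {
                IndexReader old = reader;
                reader = IndexReader.open(dir);   // sees everything committed so far
                lastOpen = now;
                if (old != null) {
                    old.close();                  // assumes no searches still running against it
                }
            }
            return reader;
        }
    }

A real implementation would probably also compare IndexReader.getCurrentVersion() against the open reader's version, so a reopen is skipped entirely when nothing has changed.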
Re: help on Lock.obtain(lockWaitTimeout)
For obtain(timeout), to prevent waiting too long you could compute the maximum number of times that obtain() may be executed (assuming, as the current code does, that obtain() itself takes no time), then break once it has either been executed that many times or the time is up. I don't see how to prevent waiting for too short a time, though. Btw, I wonder what happens if a clock change due to a time sync occurs in the middle of the sleep - since sleep is implemented natively, this must be handled correctly by the underlying OS...? [EMAIL PROTECTED] wrote on 21/09/2006 13:05:06: > On 9/21/06, Michael McCandless <[EMAIL PROTECTED]> wrote: > > Anyway, my first reaction was to change this to use > > System.currentTimeMillis() to measure elapsed time, but then I > > remembered is a dangerous approach because whenever the clock on the > > machine is updated (eg by a time-sync NTP client) it would mess up > > this function, causing it to either take longer than was asked for (if > > clock is moved backwards) or, to timeout in [much] less time than was > > asked for (if clock was moved forwards). > > Um, wow... that's thorough design work! > > In this case, I don't think it's something to worry about though. > NTP corrections are likely to be very small, not on the scale of > lock-obtain timeouts. > If one can't obtain a lock, it's due to something else asynchronously > happening, and that throws a lot bigger time variation into the > equation anyway. > > > -Yonik > http://incubator.apache.org/solr Solr, the open-source Lucene search server > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Re: Re: Re: Re: Clustering IndexWriter?
Interesting. I wonder, I have a notification mechanism at my disposal as well. I wonder if it could be worked out that, much like a mvc, an IndexReader could be notified when the underlying Directory has changed so that the reader can adjust itself? Cheers, Steve On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote: > My guess is that some segment of the world cares a lot about realtime > coherent updates and some segment of the world needs blinding speed. > Part of my research is to gather the expertise of this group on these > issues. I hear ya... There is another part to the equation for Lucene though. Coherent realtime updates to the IndexWriter/RamDirectory alone doesn't get you all the way there since things are only readable through an IndexReader that needs to be reopened to see changes. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Re: Re: Re: Clustering IndexWriter?
On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote: My guess is that some segment of the world cares a lot about realtime coherent updates and some segment of the world needs blinding speed. Part of my research is to gather the expertise of this group on these issues. I hear ya... There is another part to the equation for Lucene though. Coherent realtime updates to the IndexWriter/RamDirectory alone doesn't get you all the way there since things are only readable through an IndexReader that needs to be reopened to see changes. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Re: Re: Re: Clustering IndexWriter?
Good question. May or may not be performant enough. Only time (and testing) will tell. My guess is that it will depend heavily on the rate in which the data changes (or read write ratio). Believe me, I'm not proposing that everyone go out and cluster lucene with terracotta dso. I'm really just playing, researching, learning. I'm a firm believer in using the right tool for the right job and would never claim that any product (especially one I wrote) is right for everyone. My guess is that some segment of the world cares a lot about realtime coherent updates and some segment of the world needs blinding speed. Part of my research is to gather the expertise of this group on these issues. Cheers, Steve On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: While automatically clustering java objects sure sounds cool, I have to wonder what the performance ends up being. Every small change to the clustered objects is broadcast to all the nodes, correct? Have you done any performance comparisons to see if this is a practical approach for Lucene? -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote: > Fair question. > > All I did/need was take SegmentInfos and instead of subclassing Vector > I made it contain a Vector. Went from subclassing to aggregation. As > far as I could > tell from reading the code it would make no difference to anyone and > should have no performance impact (good or bad). It just allowed me to > cluster the IndexWriter with a RAMDirectory. > > Maybe a little background would help. Our clustering product doesn't > use java serialization and has no API. We just use a little config > where one points us to what you want clustered and what java > synchronization needs to be shared. One of the limitations that > currently exists is that we don't support clustering subclasses of > java collections. > > At this point I'm just experimenting to see if our product can cluster > lucene in a useful/performant way. When my experimenting is complete, > if everything is positive, I am going to write a blog on clustering > lucene indexes but it would be awkward to do that if the people who > run through the example have to change lucene code. > > Does this help? > > Cheers, > Steve - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Re: Re: Clustering IndexWriter?
While automatically clustering java objects sure sounds cool, I have to wonder what the performance ends up being. Every small change to the clustered objects is broadcast to all the nodes, correct? Have you done any performance comparisons to see if this is a practical approach for Lucene? -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server On 9/21/06, Steve Harris <[EMAIL PROTECTED]> wrote: Fair question. All I did/need was take SegmentInfos and instead of subclassing Vector I made it contain a Vector. Went from subclassing to aggregation. As far as I could tell from reading the code it would make no difference to anyone and should have no performance impact (good or bad). It just allowed me to cluster the IndexWriter with a RAMDirectory. Maybe a little background would help. Our clustering product doesn't use java serialization and has no API. We just use a little config where one points us to what you want clustered and what java synchronization needs to be shared. One of the limitations that currently exists is that we don't support clustering subclasses of java collections. At this point I'm just experimenting to see if our product can cluster lucene in a useful/performant way. When my experimenting is complete, if everything is positive, I am going to write a blog on clustering lucene indexes but it would be awkward to do that if the people who run through the example have to change lucene code. Does this help? Cheers, Steve - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: help on Lock.obtain(lockWaitTimeout)
On 9/21/06, Michael McCandless <[EMAIL PROTECTED]> wrote: Anyway, my first reaction was to change this to use System.currentTimeMillis() to measure elapsed time, but then I remembered is a dangerous approach because whenever the clock on the machine is updated (eg by a time-sync NTP client) it would mess up this function, causing it to either take longer than was asked for (if clock is moved backwards) or, to timeout in [much] less time than was asked for (if clock was moved forwards). Um, wow... that's thorough design work! In this case, I don't think it's something to worry about though. NTP corrections are likely to be very small, not on the scale of lock-obtain timeouts. If one can't obtain a lock, it's due to something else asynchronously happening, and that throws a lot bigger time variation into the equation anyway. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Updated: (LUCENE-665) temporary file access denied on Windows
Thanks for the comments! Indeed, the first version I wrote followed the pattern you suggest (let's call it pattern_1 for this discussion). However, with pattern_1 I could not cover the case of a method that originally does not throw an exception. The problem is that in pattern_1 we have to catch the exception before deciding whether to wait or not, but if the decision is not to wait, the caught exception must be rethrown - which is not allowed by the original method signature. That's why I made waitAgain() (1) not wait on the first call, and (2) return true iff another call to waitAgain() is anticipated. This allows the same code pattern to be used for both kinds of methods: those that originally throw an exception and those that don't. I see this as an advantage. As for passing the exception in the waitAgain() args - this served two purposes: (1) debugging: it is a convenient single spot in the code to collect info on the exception that caused the retry, and also on the number of successive retries for the same original method call; (2) exception analysis - if one wants to analyze the exception's root-cause message when deciding whether to wait, this is a convenient location (although it would not allow deciding whether to retry - because even if waitMore returns false there would still be one additional try). So, when the debug code is commented out it is possible to not pass the exception, and also to make trialsSoFar a boolean - but for debug purposes I would rather leave them there; I believe there is no real harm done, particularly because this method is hardly ever called. Perhaps I should mention in the waitMore javadoc that these args are mainly for debugging? Chris Hostetter <[EMAIL PROTECTED]> wrote on 21/09/2006 12:10:56: > > The recurring pattern seems to be... > > ResultType methodName(ArgType args) throws ExceptionType { > int trialsSoFar = 0; > long maxTime = System.currentTimeMillis() + maxTotalDelay; > Exception error = null; > while (waitAgain(maxTime, trialsSoFar++, error)) { > try { > return super.methodName(args); > } catch (ExceptionType e) { > error = e; > } > } > return super.methodName(args); > } > > ...where the waitAgain method seems to take in more args then it really > needs (error is completley unused, and trialsSoFar is only need to know if > we are on the first trial) > > There may be a subltety i'm missing here, but it seems like this might be > more clearly (and susinctly) expressed with something like... > > ResultType methodName(ArgType args) throws ExceptionType { > long maxTime = System.currentTimeMillis() + maxTotalDelay; > while (true) { > try { > return super.methodName(args); > } catch (ExceptionType e) { > if (maxTime < System.currentTimeMillis()) throw e > } > wait(maxTime); > } > } > > ...where the wait method also get's simpler... > > static void wait(long maxTime) > long moreTime = maxTime - System.currentTimeMillis(); > long delay = Math.min(moreTime, intervalDelay); > try { > Thread.sleep(delay); > } catch (InterruptedException e1) { /* NOOP */ } > } > > ...but i haven't tried this, and as i said: there may be some subltety of > your approach that i'm missing. > > > > -Hoss > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Re: Re: Clustering IndexWriter?
Fair question. All I did/need was take SegmentInfos and instead of subclassing Vector I made it contain a Vector. Went from subclassing to aggregation. As far as I could tell from reading the code it would make no difference to anyone and should have no performance impact (good or bad). It just allowed me to cluster the IndexWriter with a RAMDirectory. Maybe a little background would help. Our clustering product doesn't use java serialization and has no API. We just use a little config where one points us to what you want clustered and what java synchronization needs to be shared. One of the limitations that currently exists is that we don't support clustering subclasses of java collections. At this point I'm just experimenting to see if our product can cluster lucene in a useful/performant way. When my experimenting is complete, if everything is positive, I am going to write a blog on clustering lucene indexes but it would be awkward to do that if the people who run through the example have to change lucene code. Does this help? Cheers, Steve On 9/21/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: : Questions: : Is this useful in the real world : Would it be possible to get that one small thing changed. I'm not really clear on what the "small thing" is that you are asking about ... you mentioned SegmentInfos subclassing Vector, are you proposing an alternative? If you've got a patch that doesn't break existing functionality or have a negative impact on performance and makes lucene more usable in some way it would certainly be considered ... i'm just not really clear on what change you're suggesting and how it helps make Lucene more usable for you. -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
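For readers who have not looked at SegmentInfos, the change being discussed is essentially the difference between subclassing a Vector and wrapping one. The toy classes below only illustrate that shape; they are not the actual patch and the names are invented.

    import java.util.Vector;

    // Before: the class *is* a Vector, which some instrumentation tools cannot handle.
    final class SegmentListSubclassing extends Vector {
    }

    // After: the class *has* a Vector and exposes only the operations it needs.
    final class SegmentListAggregating {
        private final Vector vector = new Vector();

        public Object info(int i) {
            return vector.elementAt(i);
        }

        public void add(Object info) {
            vector.addElement(info);
        }

        public int size() {
            return vector.size();
        }
    }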
help on Lock.obtain(lockWaitTimeout)
I'm working on a LockFactory that uses java.nio.* (OS native locks) for its locks. This should be a big help for people who keep finding their lock files left on disk due to abnormal shutdown, etc. (because the OS will free the locks, no matter what, "in theory"). I thought I was nearly done, but in testing the new LockFactory on an NFS server that didn't have locks properly configured (I think possibly a common situation) I found a problem with how Lock.obtain(lockWaitTimeout) works. That function precomputes how many times to try to obtain the lock (it just divides the lockWaitTimeout parameter by LOCK_POLL_INTERVAL) and then tries Lock.obtain() followed by a sleep of LOCK_POLL_INTERVAL, that many times, before timing out. The problem is, in the above test case: the call to Lock.obtain() can apparently take a long time (35 seconds, I assume some kind of underlying timeout contacting "lockd" from the NFS client) only to finally return "false". But the "try N times" approach makes the assumption that this call will take zero time. (In fact, as things stand now, when Lock.obtain() takes non-zero time, it causes the timeout to be longer than what was asked for; but likely this is typically a small amount?). Anyway, my first reaction was to change this to use System.currentTimeMillis() to measure elapsed time, but then I remembered that this is a dangerous approach because whenever the clock on the machine is updated (eg by a time-sync NTP client) it would mess up this function, causing it to either take longer than was asked for (if the clock is moved backwards) or to time out in [much] less time than was asked for (if the clock was moved forwards). I've hit such issues in the past and it's devilish. Timezone and daylight saving time don't matter because it's measuring GMT. So then what to do? What's the best way to change the function to "really" measure time? In Java 1.5 there is now a "nanoTime()" which is closer to what I need, but it's 1.5 (and we're still on 1.4), and apparently it can "fall back" to currentTimeMillis() on some platforms. In the past I've used a separate "clock" thread that just sleeps & increments a counter, but I don't really like the idea of spawning a whole new thread (Lucene doesn't launch its own threads now, except for ParallelMultiSearcher). Does anyone know of a good solution? Alternatively, since this is really a "misconfiguration" (ie the Lock.obtain() is never going to succeed), maybe we could try to obtain a random "test" lock on creation of the LockFactory, just to confirm that locking even "works" at all in the current environment, and then leave the current implementation of Lock.obtain() unchanged (when NFS locking is properly configured it seems to be fairly fast)? Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
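For reference, the core of an OS-native lock along the lines described here is java.nio's FileChannel.tryLock(). The stripped-down sketch below leaves out the polling/timeout logic under discussion and is not the actual patch; it only shows the primitive involved.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;

    /** Minimal native lock: the OS releases it automatically if the JVM dies. */
    class NativeLockSketch {
        private final File lockFile;
        private RandomAccessFile raf;
        private FileChannel channel;
        private FileLock lock;

        NativeLockSketch(File lockFile) {
            this.lockFile = lockFile;
        }

        /** Returns true if the lock was acquired; may block surprisingly long on a
         *  misconfigured NFS mount, which is exactly the problem described above. */
        synchronized boolean obtain() throws IOException {
            raf = new RandomAccessFile(lockFile, "rw");
            channel = raf.getChannel();
            lock = channel.tryLock();   // null if another process holds the lock
            return lock != null;
        }

        synchronized void release() throws IOException {
            if (lock != null) lock.release();
            if (channel != null) channel.close();
            if (raf != null) raf.close();
            lock = null;
        }
    }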
Re: Re: Clustering IndexWriter?
: Questions: : Is this useful in the real world : Would it be possible to get that one small thing changed. I'm not really clear on what the "small thing" is that you are asking about ... you mentioned SegmentInfos subclassing Vector, are you proposing an alternative? If you've got a patch that doesn't break existing functionality or have a negative impact on performance and makes lucene more usable in some way it would certainly be considered ... i'm just not really clear on what change you're suggesting and how it helps make Lucene more usable for you. -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Updated: (LUCENE-665) temporary file access denied on Windows
The recurring pattern seems to be... ResultType methodName(ArgType args) throws ExceptionType { int trialsSoFar = 0; long maxTime = System.currentTimeMillis() + maxTotalDelay; Exception error = null; while (waitAgain(maxTime, trialsSoFar++, error)) { try { return super.methodName(args); } catch (ExceptionType e) { error = e; } } return super.methodName(args); } ...where the waitAgain method seems to take in more args than it really needs (error is completely unused, and trialsSoFar is only needed to know if we are on the first trial) There may be a subtlety i'm missing here, but it seems like this might be more clearly (and succinctly) expressed with something like... ResultType methodName(ArgType args) throws ExceptionType { long maxTime = System.currentTimeMillis() + maxTotalDelay; while (true) { try { return super.methodName(args); } catch (ExceptionType e) { if (maxTime < System.currentTimeMillis()) throw e; } wait(maxTime); } } ...where the wait method also gets simpler... static void wait(long maxTime) { long moreTime = maxTime - System.currentTimeMillis(); long delay = Math.min(moreTime, intervalDelay); try { Thread.sleep(delay); } catch (InterruptedException e1) { /* NOOP */ } } ...but i haven't tried this, and as i said: there may be some subtlety of your approach that i'm missing. -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
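Here is the second pattern made concrete for a single method so that it actually compiles; the directory interface, method and constants are invented stand-ins (the real patch wraps many Directory methods), so treat this only as a sketch of the control flow.

    import java.io.IOException;

    /** Retry a single failing operation until a deadline, sleeping between tries. */
    class RetrySketch {
        private static final long MAX_TOTAL_DELAY = 1000;  // ms before giving up
        private static final long INTERVAL_DELAY = 50;     // ms between retries

        /** Illustrative stand-in for e.g. a Directory.renameFile(from, to) wrapper. */
        void renameWithRetry(HypotheticalDir dir, String from, String to) throws IOException {
            long maxTime = System.currentTimeMillis() + MAX_TOTAL_DELAY;
            while (true) {
                try {
                    dir.renameFile(from, to);
                    return;
                } catch (IOException e) {
                    if (maxTime < System.currentTimeMillis()) throw e;  // out of time: give up
                }
                sleepUntil(maxTime);
            }
        }

        private static void sleepUntil(long maxTime) {
            long moreTime = maxTime - System.currentTimeMillis();
            long delay = Math.min(moreTime, INTERVAL_DELAY);
            if (delay <= 0) return;
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) { /* ignore and retry */ }
        }

        /** Hypothetical interface used only to make the example self-contained. */
        interface HypotheticalDir {
            void renameFile(String from, String to) throws IOException;
        }
    }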
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436587 ] Otis Gospodnetic commented on LUCENE-675: - I still haven't gotten my employer to sign and fax the CCLA, so I'm stuck and can't contribute my search benchmark. I have a suggestion for a name for this - Lube, for Lucene Benchmark - contrib/lube. > Lucene benchmark: objective performance test for Lucene > --- > > Key: LUCENE-675 > URL: http://issues.apache.org/jira/browse/LUCENE-675 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Andrzej Bialecki > Attachments: LuceneBenchmark.java > > > We need an objective way to measure the performance of Lucene, both indexing > and querying, on a known corpus. This issue is intended to collect comments > and patches implementing a suite of such benchmarking tests. > Regarding the corpus: one of the widely used and freely available corpora is > the original Reuters collection, available from > http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz > or > http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. > I propose to use this corpus as a base for benchmarks. The benchmarking > suite could automatically retrieve it from known locations, and cache it > locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565: --- Attachment: NewIndexModifier.Sept21.patch This is to update the delete-support patch after the commit of the new merge policy. - Very few changes to IndexWriter. - The patch passes all tests. - A new test call TestNewIndexModifierDelete is added to show different scenarios when using delete methods in NewIndexModifier. > Supporting deleteDocuments in IndexWriter (Code and Performance Results > Provided) > - > > Key: LUCENE-565 > URL: http://issues.apache.org/jira/browse/LUCENE-565 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Ning Li > Attachments: IndexWriter.java, IndexWriter.July09.patch, > IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, > NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, > NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, > newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, > perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java > > > Today, applications have to open/close an IndexWriter and open/close an > IndexReader directly or indirectly (via IndexModifier) in order to handle a > mix of inserts and deletes. This performs well when inserts and deletes > come in fairly large batches. However, the performance can degrade > dramatically when inserts and deletes are interleaved in small batches. > This is because the ramDirectory is flushed to disk whenever an IndexWriter > is closed, causing a lot of small segments to be created on disk, which > eventually need to be merged. > We would like to propose a small API change to eliminate this problem. We > are aware that this kind change has come up in discusions before. See > http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049 > . The difference this time is that we have implemented the change and > tested its performance, as described below. > API Changes > --- > We propose adding a "deleteDocuments(Term term)" method to IndexWriter. > Using this method, inserts and deletes can be interleaved using the same > IndexWriter. > Note that, with this change it would be very easy to add another method to > IndexWriter for updating documents, allowing applications to avoid a > separate delete and insert to update a document. > Also note that this change can co-exist with the existing APIs for deleting > documents using an IndexReader. But if our proposal is accepted, we think > those APIs should probably be deprecated. > Coding Changes > -- > Coding changes are localized to IndexWriter. Internally, the new > deleteDocuments() method works by buffering the terms to be deleted. > Deletes are deferred until the ramDirectory is flushed to disk, either > because it becomes full or because the IndexWriter is closed. Using Java > synchronization, care is taken to ensure that an interleaved sequence of > inserts and deletes for the same document are properly serialized. > We have attached a modified version of IndexWriter in Release 1.9.1 with > these changes. Only a few hundred lines of coding changes are needed. All > changes are commented by "CHANGE". We have also attached a modified version > of an example from Chapter 2.2 of Lucene in Action. > Performance Results > --- > To test the performance our proposed changes, we ran some experiments using > the TREC WT 10G dataset. The experiments were run on a dual 2.4 Ghz Intel > Xeon server running Linux. The disk storage was configured as RAID0 array > with 5 drives. 
Before indexes were built, the input documents were parsed > to remove the HTML from them (i.e., only the text was indexed). This was > done to minimize the impact of parsing on performance. A simple > WhitespaceAnalyzer was used during index build. > We experimented with three workloads: > - Insert only. 1.6M documents were inserted and the final > index size was 2.3GB. > - Insert/delete (big batches). The same documents were > inserted, but 25% were deleted. 1000 documents were > deleted for every 4000 inserted. > - Insert/delete (small batches). In this case, 5 documents > were deleted for every 20 inserted. > current current new > Workload IndexWriter IndexModifier IndexWriter > --- > Insert only 116 min 119 min 116 min > Insert/delete (big batches) -- 135 min 125 min > Insert/delete (small batches) -- 338 min 134 min > As the experiments show, with the proposed changes, the performance > improved by 60% when inserts and deletes were interleaved in small batches.
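To make the proposed API concrete, interleaved inserts and deletes through a single writer would look roughly like the snippet below. Note that deleteDocuments(Term) is the method proposed in this issue, not something in a released Lucene at the time, and the index path, field names and batch sizes are invented for the example.

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    import java.io.IOException;

    public class InterleavedUpdates {
        public static void main(String[] args) throws IOException {
            // Proposed API: one writer handles both inserts and deletes, so the
            // RAM buffer is not flushed between small batches.
            IndexWriter writer = new IndexWriter("/tmp/index", new WhitespaceAnalyzer(), true);

            for (int i = 0; i < 100; i++) {
                Document doc = new Document();
                doc.add(new Field("id", "doc" + i, Field.Store.YES, Field.Index.UN_TOKENIZED));
                doc.add(new Field("contents", "some body text " + i,
                                  Field.Store.NO, Field.Index.TOKENIZED));
                writer.addDocument(doc);

                // Delete 5 documents for every 20 inserted, as in the small-batch workload.
                if (i % 20 == 19) {
                    for (int d = i - 4; d <= i; d++) {
                        writer.deleteDocuments(new Term("id", "doc" + d));  // proposed method
                    }
                }
            }
            writer.close();   // buffered deletes are applied when the RAM buffer is flushed
        }
    }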
Re: Re: Clustering IndexWriter?
I don't fully follow, and I don't even have the "it's late!" excuse. It sounds like you want to have the same index on multiple nodes in the cluster and when a data change occurs, you want to synchronously make the same change to all indices in your cluster. Is that it? Solr has a different approach. There, only the master index is modified, while slave servers copy the master index periodically. Otis - Original Message From: Steve Harris <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Thursday, September 21, 2006 11:18:43 AM Subject: Re: Re: Clustering IndexWriter? Sure, I'm fairly new to Lucene but what I was trying to do was make it so that an index could be shared among multiple nodes. If an index is updated in any way it would be updated across the cluster coherently. In my first version I was really only taking advantage of the fact that we detect fine grained changes and can extend synchronization across the cluster but if I can prove to myself that this is actually useful I'll go back and mark some of the synchronize blocks/methods as read locks to improve concurrency and reduce instrumentation to only what is needed. If I'm going to be able to publish the config for what I'm doing I would need to change that one class that I mentioned above becuase we won't support subclasses of collections for a few more months. I'm not a very good writer. Does any of that make sense? Summary would be: Goals, Usefully cluster luncene indexes for across multiple nodes. Questions: Is this useful in the real world Would it be possible to get that one small thing changed. Cheers, Steve On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: > On 9/20/06, Steve Harris <[EMAIL PROTECTED]> wrote: > > Is clustering the IndexWriter really all I need to do? > > Hi Steve, > Could you explain the details of what "clustering" really means in this > context? > > -Yonik > http://incubator.apache.org/solr Solr, the open-source Lucene search server > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Re: Clustering IndexWriter?
Sure, I'm fairly new to Lucene, but what I was trying to do was make it so that an index could be shared among multiple nodes. If an index is updated in any way it would be updated across the cluster coherently. In my first version I was really only taking advantage of the fact that we detect fine-grained changes and can extend synchronization across the cluster, but if I can prove to myself that this is actually useful I'll go back and mark some of the synchronized blocks/methods as read locks to improve concurrency and reduce instrumentation to only what is needed. If I'm going to be able to publish the config for what I'm doing I would need to change that one class that I mentioned above because we won't support subclasses of collections for a few more months. I'm not a very good writer. Does any of that make sense? Summary would be: Goal: usefully cluster Lucene indexes across multiple nodes. Questions: Is this useful in the real world? Would it be possible to get that one small thing changed? Cheers, Steve On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 9/20/06, Steve Harris <[EMAIL PROTECTED]> wrote: > Is clustering the IndexWriter really all I need to do? Hi Steve, Could you explain the details of what "clustering" really means in this context? -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustering IndexWriter?
On 9/20/06, Steve Harris <[EMAIL PROTECTED]> wrote: Is clustering the IndexWriter really all I need to do? Hi Steve, Could you explain the details of what "clustering" really means in this context? -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Re: Clustering IndexWriter?
Yep, that's us. No secret, just didn't want to make my question an billboard :-). Just needed a bit of info from the people who know best. Cheers, steve On 9/21/06, Vic Bancroft <[EMAIL PROTECTED]> wrote: adasal wrote: > Don't be coy, what's your comapany? This URL is derivable from the text, with a little search ening help . . . ** http://www.terracottatech.com/terracotta_spring.shtml more, l8r, v > On 21/09/06, Steve Harris <[EMAIL PROTECTED]> wrote: > >> >> Warning, I'm a vendor dude but this isn't really a vendor message. >> >> My IT guy had mentioned to me that a bunch of the open source products >> we use (JIRA, JForum etc) have Lucene inside and in the name of eating >> our own dog food >> I tried to cluster IndexWriter (with a RAMDirectory) using our >> (terracotta) clustering technology. >> >> Took me about a half hour to get the basics working from download >> time. I was wondering, do people in the real world want to be able to >> cluster this stuff? Is clustering the IndexWriter really all I need >> to do? >> >> If it is interesting, how do I feedback a small code change into the >> project. We don't yet support subclasses of collections and >> SegmentInfos subclasses Vector. I just turned it into aggregation >> (that took 10 of the 30 minutes). We will support this in a future >> release so it isn't a huge deal but I could get something out sooner >> if the change was made. >> >> Cheers, >> Steve >> >> - >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> > -- "The future is here. It's just not evenly distributed yet." -- William Gibson, quoted by Whitfield Diffie - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436519 ] Grant Ingersoll commented on LUCENE-675: Yeah, ANT can do this, I think. Take a look at the DB contrib package, it downloads. I think I can setup the necessary stuff in contrib, if people think that is a good idea. First contribution will be this file and then we can go from there. I think Otis has run some perf. stuff too, but I am not sure if it can be contributed. I think someone else has really studied query perf. so it would be cool if that was added too. > Lucene benchmark: objective performance test for Lucene > --- > > Key: LUCENE-675 > URL: http://issues.apache.org/jira/browse/LUCENE-675 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Andrzej Bialecki > Attachments: LuceneBenchmark.java > > > We need an objective way to measure the performance of Lucene, both indexing > and querying, on a known corpus. This issue is intended to collect comments > and patches implementing a suite of such benchmarking tests. > Regarding the corpus: one of the widely used and freely available corpora is > the original Reuters collection, available from > http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz > or > http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. > I propose to use this corpus as a base for benchmarks. The benchmarking > suite could automatically retrieve it from known locations, and cache it > locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436518 ] Andrzej Bialecki commented on LUCENE-675: -- The dependency on commons-compress could be avoided - I used this just to be able to unpack tar.gz files, we can use Ant for that. If you meant the dependency on the corpus - can't Ant download this too as a dependency? Re: Project Gutenberg - good point, this is a good source for multi-lingual documents. The "Europarl" collection is another, although a bit more hefty, so that could be suitable for running large-scale benchmarks, and texts from Project Gutenberg for running small-scale tests. > Lucene benchmark: objective performance test for Lucene > --- > > Key: LUCENE-675 > URL: http://issues.apache.org/jira/browse/LUCENE-675 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Andrzej Bialecki > Attachments: LuceneBenchmark.java > > > We need an objective way to measure the performance of Lucene, both indexing > and querying, on a known corpus. This issue is intended to collect comments > and patches implementing a suite of such benchmarking tests. > Regarding the corpus: one of the widely used and freely available corpora is > the original Reuters collection, available from > http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz > or > http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. > I propose to use this corpus as a base for benchmarks. The benchmarking > suite could automatically retrieve it from known locations, and cache it > locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436516 ] Grant Ingersoll commented on LUCENE-675: Since this has dependencies, do you think we should put it under contrib? I would be for a Performance directory and we could then organize it from there. Perhaps into packages for quantitative and qualitative performance. > Lucene benchmark: objective performance test for Lucene > --- > > Key: LUCENE-675 > URL: http://issues.apache.org/jira/browse/LUCENE-675 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Andrzej Bialecki > Attachments: LuceneBenchmark.java > > > We need an objective way to measure the performance of Lucene, both indexing > and querying, on a known corpus. This issue is intended to collect comments > and patches implementing a suite of such benchmarking tests. > Regarding the corpus: one of the widely used and freely available corpora is > the original Reuters collection, available from > http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz > or > http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. > I propose to use this corpus as a base for benchmarks. The benchmarking > suite could automatically retrieve it from known locations, and cache it > locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustering IndexWriter?
adasal wrote: Don't be coy, what's your comapany? This URL is derivable from the text, with a little search ening help . . . ** http://www.terracottatech.com/terracotta_spring.shtml more, l8r, v On 21/09/06, Steve Harris <[EMAIL PROTECTED]> wrote: Warning, I'm a vendor dude but this isn't really a vendor message. My IT guy had mentioned to me that a bunch of the open source products we use (JIRA, JForum etc) have Lucene inside and in the name of eating our own dog food I tried to cluster IndexWriter (with a RAMDirectory) using our (terracotta) clustering technology. Took me about a half hour to get the basics working from download time. I was wondering, do people in the real world want to be able to cluster this stuff? Is clustering the IndexWriter really all I need to do? If it is interesting, how do I feedback a small code change into the project. We don't yet support subclasses of collections and SegmentInfos subclasses Vector. I just turned it into aggregation (that took 10 of the 30 minutes). We will support this in a future release so it isn't a huge deal but I could get something out sooner if the change was made. Cheers, Steve - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- "The future is here. It's just not evenly distributed yet." -- William Gibson, quoted by Whitfield Diffie - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene
[ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436502 ] Karl Wettin commented on LUCENE-675: It is also interesting to know how much time is consumed to assemble an instance of Document from the storage. According to my own tests this is the major reason to why InstantiatedIndex is so much faster than a FS/RAMDirectory. I also presume it to be the bottleneck of any RDBMS-, RMI- or any other "proxy"-based storage. > Lucene benchmark: objective performance test for Lucene > --- > > Key: LUCENE-675 > URL: http://issues.apache.org/jira/browse/LUCENE-675 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Andrzej Bialecki > Attachments: LuceneBenchmark.java > > > We need an objective way to measure the performance of Lucene, both indexing > and querying, on a known corpus. This issue is intended to collect comments > and patches implementing a suite of such benchmarking tests. > Regarding the corpus: one of the widely used and freely available corpora is > the original Reuters collection, available from > http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz > or > http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. > I propose to use this corpus as a base for benchmarks. The benchmarking > suite could automatically retrieve it from known locations, and cache it > locally. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustering IndexWriter?
Don't be coy, what's your comapany? Adam On 21/09/06, Steve Harris <[EMAIL PROTECTED]> wrote: Warning, I'm a vendor dude but this isn't really a vendor message. My IT guy had mentioned to me that a bunch of the open source products we use (JIRA, JForum etc) have Lucene inside and in the name of eating our own dog food I tried to cluster IndexWriter (with a RAMDirectory) using our (terracotta) clustering technology. Took me about a half hour to get the basics working from download time. I was wondering, do people in the real world want to be able to cluster this stuff? Is clustering the IndexWriter really all I need to do? If it is interesting, how do I feedback a small code change into the project. We don't yet support subclasses of collections and SegmentInfos subclasses Vector. I just turned it into aggregation (that took 10 of the 30 minutes). We will support this in a future release so it isn't a huge deal but I could get something out sooner if the change was made. Cheers, Steve - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-443) ConjunctionScorer tune-up
[ http://issues.apache.org/jira/browse/LUCENE-443?page=comments#action_12436453 ] Paul Elschot commented on LUCENE-443: - I just overlooked the grant by Abdul to the ASF. > ConjunctionScorer tune-up > - > > Key: LUCENE-443 > URL: http://issues.apache.org/jira/browse/LUCENE-443 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 1.9 > Environment: Linux, Java 1.5, Large Index with 4 million items and > some heavily nested boolean queries >Reporter: Abdul Chaudhry > Attachments: Conjunction20060921.patch, ConjunctionScorer.java, > ConjunctionScorer.java > > > I just recently ran a load test on the latest code from lucene , which is > using a new BooleanScore and noticed the ConjunctionScorer was crunching > through objects , especially while sorting as part of the skipTo call. It > turns a linked list into an array, sorts the array, then converts the array > back to a linked list for further processing by the scoring engines below. > 'm not sure if anyone else is experiencing this as I have a very large index > (> 4 million items) and I am issuing some heavily nested queries > Anyway, I decide to change the link list into an array and use a first and > last marker to "simulate" a linked list. > This scaled much better during my load test as the java gargbage collector > was less - umm - virulent -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-443) ConjunctionScorer tune-up
[ http://issues.apache.org/jira/browse/LUCENE-443?page=all ] Paul Elschot updated LUCENE-443: Attachment: Conjunction20060921.patch IIRC the original performance problem was caused by creation of objects in the tight loop doing skipTo() on all the scorers. This patch is against current trunk and based on the earlier posted versions of ConjunctionScorer, which was based (by the first poster) on an existing ConjunctionScorer with an ASL notice, which is why I could grant the licence to the ASF. > ConjunctionScorer tune-up > - > > Key: LUCENE-443 > URL: http://issues.apache.org/jira/browse/LUCENE-443 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 1.9 > Environment: Linux, Java 1.5, Large Index with 4 million items and > some heavily nested boolean queries >Reporter: Abdul Chaudhry > Attachments: Conjunction20060921.patch, ConjunctionScorer.java, > ConjunctionScorer.java > > > I just recently ran a load test on the latest code from lucene , which is > using a new BooleanScore and noticed the ConjunctionScorer was crunching > through objects , especially while sorting as part of the skipTo call. It > turns a linked list into an array, sorts the array, then converts the array > back to a linked list for further processing by the scoring engines below. > 'm not sure if anyone else is experiencing this as I have a very large index > (> 4 million items) and I am issuing some heavily nested queries > Anyway, I decide to change the link list into an array and use a first and > last marker to "simulate" a linked list. > This scaled much better during my load test as the java gargbage collector > was less - umm - virulent -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
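The gist of the tune-up, sorting the scorers into one array and then advancing them in place instead of rebuilding lists on every skipTo(), can be shown with a toy like the one below. The Advanceable interface is invented for the example and is not the Lucene Scorer API; consult the attached patch for the real implementation.

    import java.util.Arrays;
    import java.util.Comparator;

    /** Toy illustration: keep conjunction clauses in one sorted array and advance
     *  them in place, so no per-call list/array objects are allocated. */
    class ConjunctionSketch {
        /** Invented stand-in for the relevant bit of a Lucene Scorer. */
        interface Advanceable {
            int doc();                       // current document id
            boolean skipTo(int target);      // advance to first doc >= target; false when exhausted
        }

        private final Advanceable[] clauses;
        private int first = 0;               // index of the clause that is furthest behind
        private int last;                    // index of the clause that is furthest ahead

        ConjunctionSketch(Advanceable[] clauses) {
            this.clauses = clauses;
            this.last = clauses.length - 1;
            // Sort once by current doc; afterwards only the indices move.
            Arrays.sort(this.clauses, new Comparator() {
                public int compare(Object a, Object b) {
                    return ((Advanceable) a).doc() - ((Advanceable) b).doc();
                }
            });
        }

        /** Advance all clauses to a common document, or return -1 when one runs out. */
        int nextMatch() {
            while (clauses[first].doc() < clauses[last].doc()) {
                if (!clauses[first].skipTo(clauses[last].doc())) {
                    return -1;                       // one clause is exhausted: no more matches
                }
                last = first;                        // this clause is now the furthest ahead
                first = (first + 1) % clauses.length;
            }
            return clauses[first].doc();             // all clauses agree on this doc
        }
    }

A real scorer would additionally advance one clause past the returned document before searching for the next match; the point here is only that the array and the two indices are allocated once.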