Hi Shai,
Thanks for the insight. I still disagree about the lock file placement. If your application needs such functionality you can always program your own lock factory. But even if we add support for external dirs, the creation of the lock file name and prefix should be done inside the lock factory (currently, the lock id - the MD5 - is created in the directory instance). This has pros for e.g. RAMDirectory, but is just unclean at the moment. Especially the special handling of the lock id when the directory path for the lock file differs from the index directory. About MD5: At least for Lucene 3.x we must stay with MD5, as changing the hash algorithm would change the lock file name, so an index locked with 3.0 would have a different lock file than an index locked with 3.x (for external lock directory). This would lead to problems for users upgrading several servers writing to shared indexes during the upgrades. You cannot be sure that the same index is always locked when you have 2 different lucene versions. This is just a warning, maybe it's not important, but its important. I would propose the following (for 3.x): Remove the static MessageDigest from FSDirectory and replace it by a lazy-init one. Because for the common case, the message digest is never used and should not be initialized. On first use, create one (which only happens if you specify a separate lock directory). Normal Lucene users and Solr will never hit this. We can stay with MD5. About availability: As nitpicked yesterday with Robert, we can still assume MD5 is always available for any JVM, because without MD5, tons of software applications will break (just think of sessions in web app servers). Harmony has it of course, Android. So we should simply use it (and please don't copy an MD5 impl to util!) and throw exception when not available. For this seldom case (normal user have their locks in index dir), this is fine. Everything else would break safe upgrade pathes. I think we should open an issue and discuss about the LockFactory simplyfications (the code is currently very broken as it relies on strange sophisticated backwards assumptions, just my opinion). Uwe ----- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen <http://www.thetaphi.de/> http://www.thetaphi.de eMail: [email protected] From: Shai Erera [mailto:[email protected]] Sent: Sunday, December 26, 2010 7:43 AM To: [email protected] Subject: Re: LuceneTestCase.threadCleanup incorrectly reports left running threads Thanks for the explanation Uwe. I got confused with that and the process name we add to the temp lock file. About what you say, we have a scenario where we use a FSLockFactory outside an IndexWriter scope, but still within a search application scope. We lock a directory so that several processes can be synced on it. Another use case is when you manage several indexes, that should be locked at once -- the lock file may reside external to all of them. So I would not want to see LockFactory suddenly assuming and enforcing the lock file to be created in the index directory, or rely on IndexWriter passing it something etc. I think the way LF works today, allowing you to place the lock file wherever you want gives more freedom to people and opens up the door for more use cases. I don't think though that we should rely on MD5. A simple hash function should be enough IMO. Shai On Sun, Dec 26, 2010 at 12:41 AM, Uwe Schindler <[email protected]> wrote: Hi Shai, the md5 hash generated has nothing to do with concurrency anymore (the concurrency thing was this NativeFSLock test method already removed). The thing is the following: In early lucene versions, the lock files were put into TEMP directory. Later the lock factories allowed, to put the lock files into arbitrary folders. For these both cases, the lock file name got an MD5 hash of the index directory appended/prepended. In later Lucene versions the default for lock files was changed to be the index folder. For backwards compatibility reasons, with 2.9 and 3.0 you still had the possibility to instantiate a LockFactory using a non-null path (using the ctor with a directory name). FSLockFactory was programmed to support both cases (null directory or explicit directory). When the lock directory is the same like the index directory, the lock file got no hash appended. For the rare case that somebody used a different folder (e.g. a temp directory), FSLockFactory was falling back to the "old" behavior of adding the hash to the lock file name. The magic for the md5 magic lock prefix is done if FSDirectory#setLockFactory(). It checks for lockFactory extends FSLockFactory and if yes then checks, that the LockFactories path name is the same like the FSDir's or null. In that case it sets the lock prefix to null. Otherwise the lock prefix is generated by calling the magic MD5 creating method (Directory#getLockId()). In my opinion, in 3.x we should deprecate the separate path for the lock file (Directory#getLockId()) and enforce the lockfile always to be placed in the index dir. LockFactory should not get a directory at all, but instead should get the index dir on locking. For FS locks it would place the write.lock file in the supplied folder and for other locks (like per-JVM locks for RAMDirs) it could e.g. lookup the index dir in some map or whatever. To place the lockfile somewhere else, you should be able to use FileSwitchDirectory (currently not possible). Most tests in Lucene use the default (null lock dir in LockFactory), but some tests for SimpleFSLockFactory & Co use the explicit directory names and therefore generate MD5 hashes to test the special behavior. For compatibility reasons we have to still use MD5 (to prevent different lock file names after Lucene upgrade when FSDir is locked by another JVM with older Lucene version). For 4.0 I would remove this stupidity and only allow lock files in index directory. I hope I explained this stuff so everybody understand it, its really a little bit confusing (how its implemented), but its "sophisticated backwards" (haha). I would like to get rid of it and then we have no digest code anymore. Uwe ----- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: [email protected] > -----Original Message----- > From: Shai Erera [mailto:[email protected]] > Sent: Saturday, December 25, 2010 3:04 PM > To: [email protected] > Subject: Re: LuceneTestCase.threadCleanup incorrectly reports left running > threads > > Actually, the MD5 thingy is an attempt to generate a unique temp lock ID, > IIRC. so this piece of code can disappear entirely now that the tests > concurrency is better. > > As for the other threads that are left running, I couldn't track down yet the > warning from the benchmark tests, but I'd love to get rid of those false > warnings. I thought the stack trace could at least tell us who spawned the > thread, but obviously it's not always clear. > > Shai > > On Saturday, December 25, 2010, Robert Muir <[email protected]> wrote: > > On Sat, Dec 25, 2010 at 4:04 AM, Uwe Schindler <[email protected]> > wrote: > >> Md5 is guaranteed to be there (like utf8 as charset). This is documented in > crypto Api, which algorithms are available for digest. > >> > > > > where is this documented? its not in the javadocs. > > > > anyway, we shouldn't be doing this: > > * this algorithm might not exist on J2ME etc (still java), you need to > > install an extra crypto add-on. > > * we shouldnt start up an expensive PKI infrastructure on mac os X, > > including spawning a new thread, just to hash a string. thats absurd. > > * we pay all these costs ... for md5! its not even a good hash! > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [email protected] For > > additional commands, e-mail: [email protected] > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] For additional > commands, e-mail: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
