Re: [jira] Updated: (LUCENE-1607) String.intern() faster alternative
The implementation of Interner doesn't look threadsafe to me. At the very least shouldn't 'pool' be marked volatile, otherwise the result of 'pool = newPool' is not guaranteed to be visible to other threads? [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.j ira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-1607: Attachment: intern.patch String.intern() faster alternative -- Key: LUCENE-1607 URL: https://issues.apache.org/jira/browse/LUCENE-1607 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot Fix For: 2.9 Attachments: intern.patch By using our own interned string pool on top of default, String.intern() can be greatly optimized. On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)' For java 5 and 4 speedup is lower, but still considerable. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [jira] Updated: (LUCENE-1607) String.intern() faster alternative
This implementation suffers from thread visibility problems too - changes to the array's values aren't guaranteed to be visible across threads. In addition to that, there's also a problem with hash collisions invalidating cache entries which could greatly degrade performance in several common use cases. For example, suppose we had a nested loop iterating docs and the doc's field names, interning the names as we went. If two fields (F1, F2) both hashed to the same array index the cache would never be hit since we'd be alternating between interning F1 and F2. Without benchmarking/testing it's hard to know how big a problem that would be in practice, but the thread visibility problem seems potentially serious. [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.j ira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated LUCENE-1607: - Attachment: LUCENE-1607.patch Here's a completely lockless and memory barrier free intern() cache. This default would be more back compatible since programs may rely on String instances being interned via String.intern(). It does not yet include corresponding Lucene code changes to use the StringInterner. Thoughts? String.intern() faster alternative -- Key: LUCENE-1607 URL: https://issues.apache.org/jira/browse/LUCENE-1607 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot Fix For: 2.9 Attachments: intern.patch, LUCENE-1607.patch By using our own interned string pool on top of default, String.intern() can be greatly optimized. On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)' For java 5 and 4 speedup is lower, but still considerable. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative
As far as I can see, both these implementations only suffer from threadsafety problems in that they don't guarantee visibility across threads, ie it's possible for threads to see stale data. I don't see any prospect of corruption or race conditions due to out-of-order execution. So the code should work fine if you can live with the consequences of stale data - in this case, the (remote) possibility of large performance differences between VMs. Personally I tend to avoid such fragile and hard to maintain code unless there's a very good reason for it. How about benchmarking with eg a ConcurrentHashMap instead? [ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.j ira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=127 00600#action_12700600 ] Mark Miller commented on LUCENE-1607: - bq. Earwin, I took a quick look at your implementation just now, but it doesn't look thread-safe. That was my first impression too, but I couldnt pin down the issue. The access will either be against the old pool, or it will be against the new pool, and the instance switch should be atomic? I figured it was a clever trick of some kind (though I did wonder about the cost of making the new hashmap every add). The HashMaps are read only right (once they can be accessed by multiple threads)? And they are popped in with an atomic variable assignment? String.intern() faster alternative -- Key: LUCENE-1607 URL: https://issues.apache.org/jira/browse/LUCENE-1607 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot Fix For: 2.9 Attachments: intern.patch, LUCENE-1607.patch By using our own interned string pool on top of default, String.intern() can be greatly optimized. On my setup (java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)' For java 5 and 4 speedup is lower, but still considerable. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative
How about benchmarking with eg a ConcurrentHashMap instead? Scratch that, I forgot about the 1.3/1.4 dependency... - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative
As soon as all possible fields are in the pool, we're essentially readonly. The problem is, there's no guarantee we will ever reach this point. For example suppose you have a server app that spawns a new thread per request. Each new thread might have to make all the .intern() calls again because they never see anything in the cache. Having said that, I agree that the code will still work correctly regardless, and this is a very unlikely scenario anyway - for most practical situations this is never going to be an issue. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative
Sorry I wasn't as clear as I could have been - I realise JEE servers use a threadpool for handling requests, I was thinking of many other applications in the real world I'm aware of that don't (be that good design or otherwise...). I'm not sure I follow you though when you say "it won't even do a write" on a cache miss, it copies the (potentially stale) HashMap in that situation does it not? I still think there are some *theoretical* dangers (eg a thread replacing a well populated cache with a stale/empty copy). Just consider me pessimistic after having been burnt in the past by similar 'impossibilities' ;-) Overall though I'm in agreement that there aren't any practical issues with this code and in fact it'll give a good performance boost over String.intern() even in the worst case anyway! I was only trying to raise the stale/visibility issue but it's clear you've already given that plenty of thought. On Sun, Apr 19, 2009 at 23:42, Chris Miller wrote: As soon as all possible fields are in the pool, we're essentially readonly. The problem is, there's no guarantee we will ever reach this point. For example suppose you have a server app that spawns a new thread per request. Each new thread might have to make all the .intern() calls again because they never see anything in the cache. No sane server app is actually spawning a new thread per request, it takes one from pool. A spawned thread inherits visibility of its parent thread (at the moment of spawning). Even if a new thread still sees stale pool, it'll be updated on first cache miss. It won't even do a write. So, no "all the intern calls again", such scenario doesn't exist even in theory. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative
You was. I just wanted to point out that in real apps you're not going to see stale data longer than for milliseconds Agreed, which is why this whole discussion is very theoretical anyway :) On cache miss, I re-retrieve pool reference after a lock (HashMap is no longer stale), re-read a string, and do the write only if the string is still not there. Ah I see what you're saying now, I had been overlooking the re-retrieval of the pool reference. Thanks for clarifying that for me. The truth is born in argument. I reread jmm docs to be sure, but I can't guarantee I understood them well. Seems like you have a pretty good understanding to me :) - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Greetings and questions about patches
Issues: 1> none of these methods is ever called. Note that Yonik's suggested patch for LUCENE-1607 contains the following code: + public SimpleStringInterner(int sz) { +cache = new String[BitUtil.nextHighestPowerOfTwo(sz)]; + } ...so the int flavour of nextHighestPowerOfTwo() might be in use shortly! :-) - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org