Re: [jira] Updated: (LUCENE-1607) String.intern() faster alternative

2009-04-19 Thread Chris Miller
The implementation of Interner doesn't look threadsafe to me. At the very 
least shouldn't 'pool' be marked volatile, otherwise the result of 'pool 
= newPool' is not guaranteed to be visible to other threads?



[
https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.j
ira.plugin.system.issuetabpanels:all-tabpanel ]

Earwin Burrfoot updated LUCENE-1607:

Attachment: intern.patch


String.intern() faster alternative
--
Key: LUCENE-1607
URL: https://issues.apache.org/jira/browse/LUCENE-1607
Project: Lucene - Java
Issue Type: Improvement
Reporter: Earwin Burrfoot
Fix For: 2.9
Attachments: intern.patch

By using our own interned string pool on top of default,
String.intern() can be greatly optimized.

On my setup (java 6) this alternative runs ~15.8x faster for already
interned strings, and ~2.2x faster for 'new String(interned)'

For java 5 and 4 speedup is lower, but still considerable.






-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Updated: (LUCENE-1607) String.intern() faster alternative

2009-04-19 Thread Chris Miller
This implementation suffers from thread visibility problems too - changes 
to the array's values aren't guaranteed to be visible across threads. In 
addition to that, there's also a problem with hash collisions invalidating 
cache entries which could greatly degrade performance in several common use 
cases. For example, suppose we had a nested loop iterating docs and the doc's 
field names, interning the names as we went. If two fields (F1, F2) both 
hashed to the same array index the cache would never be hit since we'd be 
alternating between interning F1 and F2. Without benchmarking/testing it's 
hard to know how big a problem that would be in practice, but the thread 
visibility problem seems potentially serious.




[
https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.j
ira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated LUCENE-1607:
-
Attachment: LUCENE-1607.patch

Here's a completely lockless and memory barrier free intern() cache.

This default would be more back compatible since programs may rely on
String instances being interned via String.intern().

It does not yet include corresponding Lucene code changes to use the
StringInterner.

Thoughts?


String.intern() faster alternative
--
Key: LUCENE-1607
URL: https://issues.apache.org/jira/browse/LUCENE-1607
Project: Lucene - Java
Issue Type: Improvement
Reporter: Earwin Burrfoot
Fix For: 2.9
Attachments: intern.patch, LUCENE-1607.patch

By using our own interned string pool on top of default,
String.intern() can be greatly optimized.

On my setup (java 6) this alternative runs ~15.8x faster for already
interned strings, and ~2.2x faster for 'new String(interned)'

For java 5 and 4 speedup is lower, but still considerable.






-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-04-19 Thread Chris Miller
As far as I can see, both these implementations only suffer from threadsafety 
problems in that they don't guarantee visibility across threads, ie it's 
possible for threads to see stale data. I don't see any prospect of corruption 
or race conditions due to out-of-order execution. So the code should work 
fine if you can live with the consequences of stale data - in this case, 
the (remote) possibility of large performance differences between VMs. Personally 
I tend to avoid such fragile and hard to maintain code unless there's a very 
good reason for it.


How about benchmarking with eg a ConcurrentHashMap instead?



[
https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.j
ira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=127
00600#action_12700600 ]

Mark Miller commented on LUCENE-1607:
-
bq. Earwin, I took a quick look at your implementation just now, but
it doesn't look thread-safe.

That was my first impression too, but I couldnt pin down the issue.
The access will either be against the old pool, or it will be against
the new pool, and the instance switch should be atomic? I figured it
was a clever trick of some kind (though I did wonder about the cost of
making the new hashmap every add). The HashMaps are read only right
(once they can be accessed by multiple threads)? And they are popped
in with an atomic variable assignment?


String.intern() faster alternative
--
Key: LUCENE-1607
URL: https://issues.apache.org/jira/browse/LUCENE-1607
Project: Lucene - Java
Issue Type: Improvement
Reporter: Earwin Burrfoot
Fix For: 2.9
Attachments: intern.patch, LUCENE-1607.patch

By using our own interned string pool on top of default,
String.intern() can be greatly optimized.

On my setup (java 6) this alternative runs ~15.8x faster for already
interned strings, and ~2.2x faster for 'new String(interned)'

For java 5 and 4 speedup is lower, but still considerable.






-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-04-19 Thread Chris Miller

How about benchmarking with eg a ConcurrentHashMap instead?


Scratch that, I forgot about the 1.3/1.4 dependency...




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-04-19 Thread Chris Miller

As soon as all possible fields are in the pool, we're essentially
readonly.


The problem is, there's no guarantee we will ever reach this point. For example 
suppose you have a server app that spawns a new thread per request. Each 
new thread might have to make all the .intern() calls again because they 
never see anything in the cache. Having said that, I agree that the code 
will still work correctly regardless, and this is a very unlikely scenario 
anyway - for most practical situations this is never going to be an issue.





-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-04-19 Thread Chris Miller
Sorry I wasn't as clear as I could have been - I realise JEE servers use 
a threadpool for handling requests, I was thinking of many other applications 
in the real world I'm aware of that don't (be that good design or otherwise...). 
I'm not sure I follow you though when you say "it won't even do a write" 
on a cache miss, it copies the (potentially stale) HashMap in that situation 
does it not?


I still think there are some *theoretical* dangers (eg a thread replacing 
a well populated cache with a stale/empty copy). Just consider me pessimistic 
after having been burnt in the past by similar 'impossibilities' ;-) Overall 
though I'm in agreement that there aren't any practical issues with this 
code and in fact it'll give a good performance boost over String.intern() 
even in the worst case anyway! I was only trying to raise the stale/visibility 
issue but it's clear you've already given that plenty of thought.




On Sun, Apr 19, 2009 at 23:42, Chris Miller 
wrote:


As soon as all possible fields are in the pool, we're essentially
readonly.


The problem is, there's no guarantee we will ever reach this point.
For example suppose you have a server app that spawns a new thread
per request. Each new thread might have to make all the .intern()
calls again because they never see anything in the cache.


No sane server app is actually spawning a new thread per request, it
takes one from pool. A spawned thread inherits visibility of its
parent thread (at the moment of spawning). Even if a new thread still
sees stale pool, it'll be updated on first cache miss. It won't even
do a write. So, no "all the intern calls again", such scenario doesn't
exist even in theory.






-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-04-19 Thread Chris Miller

You was. I just wanted to point out that in real apps you're not going
to see stale data longer than for milliseconds


Agreed, which is why this whole discussion is very theoretical anyway :)


On cache miss, I re-retrieve pool reference after a lock (HashMap is
no longer stale), re-read a string, and do the write only if the
string is still not there.


Ah I see what you're saying now, I had been overlooking the re-retrieval 
of the pool reference. Thanks for clarifying that for me.



The truth is born in argument. I reread jmm docs to be sure, but I
can't guarantee I understood them well.


Seems like you have a pretty good understanding to me :)




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Greetings and questions about patches

2009-04-22 Thread Chris Miller

Issues:
1> none of these methods is ever called.


Note that Yonik's suggested patch for LUCENE-1607 contains the following 
code:


+  public SimpleStringInterner(int sz) {
+cache = new String[BitUtil.nextHighestPowerOfTwo(sz)];
+  }

...so the int flavour of nextHighestPowerOfTwo() might be in use shortly! :-)





-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org