Hmm, code that depends on two #= Strings not being #==? That would be either a very special case or, more likely, a bug. In any case, I guess it won't be very common.
Mutability vs. immutability is a far more important issue, and IMHO unfixable unless we introduce a new class. Still, saving 2 MB is impressive.

On 30 May 2014, at 10:59, Clément Bera <bera.clem...@gmail.com> wrote:

> Hello,
>
> I like the idea, but it is not that simple.
>
> In some frameworks you may use different strings with the same value as
> markers that are not identical (==).
>
> Typically:
>
> Object>>#string1
>     ^ 'string'
>
> Object>>#string2
>     ^ 'string'
>
> Object>>#test
>     self assert: self string1 == self string1. "Answers true"
>     self assert: self string2 == self string2. "Answers true"
>     self assert: self string1 == self string2 "Answers false"
>
> Frameworks relying on that will not work any more.
>
> And this kind of bug is not easy to spot; it typically breaks identity
> collections in a non-deterministic fashion.
>
> Regards
>
> 2014-05-30 9:39 GMT+02:00 Philippe Marschall <philippe.marsch...@netcetera.ch>:
> Hi
>
> This is an idea I stole from somebody else. The assumption is that you have a
> lot of Strings in the image that are equal. We could therefore remove the
> duplicates and make all the objects refer to the same instance.
>
> However, it's not as simple as that. The main issue is that String has two
> responsibilities. The first is as an immutable value object. The second is as
> a mutable character buffer for building immutable value objects. We must not
> deduplicate the second kind. Unfortunately, it's not straightforward to
> figure out which kind a string is. The approach I took is looking at whether
> it contains any characters with code point 0. Another option would be to
> check whether any WriteStreams are referring to it.
>
> Also, since there are behavioral differences between String and Symbol
> besides #=, we must exclude Symbols (e.g. there is #'hello' and 'hello' in the
> heap and they compare #= true, but we must not make anybody who refers to
> 'hello' suddenly refer to #'hello').
>
> Anyway, here's the code; it saves about 2 MB in a fairly stock Pharo 3
> image. Sorry for the bad variable names.
>
> | b d m |
> b := Bag new.
> d := OrderedCollection new.
> m := Dictionary new.
> "count all string instances"
> String allSubInstancesDo: [ :s |
>     s isSymbol ifFalse: [
>         b add: s ] ].
> "find the ones that have no duplicates or are likely buffers"
> b doWithOccurrences: [ :s :i |
>     (i = 1 or: [ s anySatisfy: [ :c | c codePoint = 0 ] ]) ifTrue: [
>         d add: s -> i ] ].
> "remove the ones that have no duplicates or are likely buffers"
> d do: [ :a |
>     a value timesRepeat: [
>         b remove: a key ] ].
> "group the remaining strings with the strings they are equal to"
> String allSubInstancesDo: [ :s |
>     s isSymbol ifFalse: [
>         (b includes: s) ifTrue: [
>             | l |
>             l := m at: s ifAbsentPut: [ OrderedCollection new ].
>             l add: s ] ] ].
> "remove the duplicates by forwarding them to the first instance"
> m keysAndValuesDo: [ :k :v |
>     | f |
>     f := v at: 1.
>     2 to: v size do: [ :i |
>         (v at: i) becomeForward: f ] ]
>
> Cheers
> Philippe
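
The quoted message mentions a second way to spot buffer strings: checking whether any WriteStream refers to them. A minimal sketch of that idea follows. It is not part of the original script, and it assumes that #originalContents answers the stream's underlying collection rather than a copy, which should be checked in the image at hand. The variable name buffers is made up for illustration.

```smalltalk
"Sketch only: collect the strings currently used as WriteStream buffers,
 so that a deduplication pass could skip them.
 Assumption: #originalContents answers the stream's own collection, not a copy."
| buffers |
buffers := IdentitySet new.
WriteStream allSubInstancesDo: [ :stream |
    | target |
    target := stream originalContents.
    "streams over Arrays or other non-String collections are ignored"
    (target isString and: [ target isSymbol not ])
        ifTrue: [ buffers add: target ] ].
"a string would then count as a likely buffer when: buffers includes: aString"
```

An IdentitySet is used on purpose: two buffers with equal contents are still two separate buffers, so they have to be tracked by identity rather than by #=. Streams that have already been garbage collected leave no trace, so this can only complement the code point 0 check, not replace it.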