Re: [Pharo-dev] String deduplication

2014-05-30 Thread Marcus Denker

On 30 May 2014, at 09:39, Philippe Marschall  
wrote:

> Hi
> 
> This is an idea I stole from somebody else. The assumption is that you have a 
> lot of Strings in the image that are equal. We could therefore remove the 
> duplicates and make all the objects refer to the same instance.
> 
> However it's not a simple as that. The main issue is that String has two 
> responsibilities. The first is as an immutable value object. The second is as 
> a mutable character buffer for building immutable value objects. We must not 
> deduplicate the second kind. Unfortunately it's not straight forward to 
> figure out which kind a string is. The approach I took is looking at whether 
> it contains any 0 characters. An other option would be to check whether any 
> WirteStreams are referring to it.

One idea could be to have an explicit immutable string literal class (or to set 
the immutability for literals in the compiler when we have support for it for 
literals).
These then would be save to de-duplicate, would be interesting to now how large 
the percentage is of literals among all your de-douplicated strings.

When playing with first class references (that can in addition override 
behavior) we had the idea that one could use that for realising a general “copy 
on write” for any kind of object and even
whole object graphs… the idea would be to  search for object graphs that are 
the same and replace the all pointers with just pointer-objects that trap all 
writes to one singe
copy. 

In a second step one could combine that with “crystallising” unused subgraphs 
(that is, serialise and compress them in memory, with references to the graph 
on-demand and transparently 
decompress). Combined, this would save a lot of space.

But the devil is in the detail: how to find unused and equal subgraphs 
efficiently.

> Also, since there are behavioral differences between String and Symbol 
> besides #= we must exclude Symbols (eg. there is #'hello' and 'hello' in the 
> heap and they compare #= true but we must not make anybody who refers to 
> 'hello' suddenly refer to #'hello').
> 
> Anyway here's the code, this saves about 2 MB in a fairly stock Pharo 3 
> image. Sorry for the bad variable names.
> 
Nice!

Marcus


Re: [Pharo-dev] String deduplication

2014-05-30 Thread Clément Bera
Hello,

I like the idea but this is not as simple.

In some framework you may use different string with a same name as markers
that are not equals.

Typically:

Object>>#string1
^ 'string'

Object>>#string2
^ 'string'

Object>>#test
self assert: self string1 == self string1. "Answers true"
self assert: self string2 == self string2. "Answers true"
self assert: self string1 == self string2 "Answers false"

Frameworks relying on that will not work any more.

And this kind of bugs is not easy to spot, it typically crashes identity
collections in a non deterministic fashion.

Regards


2014-05-30 9:39 GMT+02:00 Philippe Marschall <
philippe.marsch...@netcetera.ch>:

> Hi
>
> This is an idea I stole from somebody else. The assumption is that you
> have a lot of Strings in the image that are equal. We could therefore
> remove the duplicates and make all the objects refer to the same instance.
>
> However it's not a simple as that. The main issue is that String has two
> responsibilities. The first is as an immutable value object. The second is
> as a mutable character buffer for building immutable value objects. We must
> not deduplicate the second kind. Unfortunately it's not straight forward to
> figure out which kind a string is. The approach I took is looking at
> whether it contains any 0 characters. An other option would be to check
> whether any WirteStreams are referring to it.
> Also, since there are behavioral differences between String and Symbol
> besides #= we must exclude Symbols (eg. there is #'hello' and 'hello' in
> the heap and they compare #= true but we must not make anybody who refers
> to 'hello' suddenly refer to #'hello').
>
> Anyway here's the code, this saves about 2 MB in a fairly stock Pharo 3
> image. Sorry for the bad variable names.
>
> | b d m |
> b := Bag new.
> d := OrderedCollection new.
> m := Dictionary new.
> "count all string instances"
> String allSubInstancesDo: [ :s |
> s isSymbol ifFalse: [
> b add: s ] ].
> "find the ones that have no duplicates or are likely buffers"
> b doWithOccurrences: [ :s :i |
> (i = 1 or: [ s anySatisfy: [ :c | c codePoint = 0 ] ]) ifTrue: [
> d add: s -> i ] ].
> "remove the ones that have no duplicates or are likely buffers"
> d do: [ :a |
> a value timesRepeat: [
> b remove: a key ]  ].
> "map all duplicate strings to their duplicates"
> String allSubInstancesDo: [ :s |
> s isSymbol ifFalse: [
> (b includes: s) ifTrue: [
> | l |
> l := m at: s ifAbsentPut: [ OrderedCollection new ].
> l add: s  ] ].
> "remove the duplicates"
> m keysAndValues do [ :k :v |
> | f |
> f := v at: 1.
> 2 to: v size do: [ :i |
> (v at: i) becomeForward: f ] ]
>
> Cheers
> Philippe
>
>
>


Re: [Pharo-dev] String deduplication

2014-05-30 Thread Sven Van Caekenberghe
Hmm, code that depends on 2 #= Strings to be not #== ?
That would either be a very special case but more likely a bug.
In any case, it won't be very common I guess.

The mutability/immutability is a way more important issue, unfixable unless we 
introduce a new class IMHO.

But saving 2MB is impressive.

On 30 May 2014, at 10:59, Clément Bera  wrote:

> Hello,
> 
> I like the idea but this is not as simple.
> 
> In some framework you may use different string with a same name as markers 
> that are not equals.
> 
> Typically:
> 
> Object>>#string1
> ^ 'string'
> 
> Object>>#string2
> ^ 'string'
> 
> Object>>#test
> self assert: self string1 == self string1. "Answers true"
> self assert: self string2 == self string2. "Answers true"
> self assert: self string1 == self string2 "Answers false"
> 
> Frameworks relying on that will not work any more.
> 
> And this kind of bugs is not easy to spot, it typically crashes identity 
> collections in a non deterministic fashion.
> 
> Regards
> 
> 
> 2014-05-30 9:39 GMT+02:00 Philippe Marschall 
> :
> Hi
> 
> This is an idea I stole from somebody else. The assumption is that you have a 
> lot of Strings in the image that are equal. We could therefore remove the 
> duplicates and make all the objects refer to the same instance.
> 
> However it's not a simple as that. The main issue is that String has two 
> responsibilities. The first is as an immutable value object. The second is as 
> a mutable character buffer for building immutable value objects. We must not 
> deduplicate the second kind. Unfortunately it's not straight forward to 
> figure out which kind a string is. The approach I took is looking at whether 
> it contains any 0 characters. An other option would be to check whether any 
> WirteStreams are referring to it.
> Also, since there are behavioral differences between String and Symbol 
> besides #= we must exclude Symbols (eg. there is #'hello' and 'hello' in the 
> heap and they compare #= true but we must not make anybody who refers to 
> 'hello' suddenly refer to #'hello').
> 
> Anyway here's the code, this saves about 2 MB in a fairly stock Pharo 3 
> image. Sorry for the bad variable names.
> 
> | b d m |
> b := Bag new.
> d := OrderedCollection new.
> m := Dictionary new.
> "count all string instances"
> String allSubInstancesDo: [ :s |
> s isSymbol ifFalse: [
> b add: s ] ].
> "find the ones that have no duplicates or are likely buffers"
> b doWithOccurrences: [ :s :i |
> (i = 1 or: [ s anySatisfy: [ :c | c codePoint = 0 ] ]) ifTrue: [
> d add: s -> i ] ].
> "remove the ones that have no duplicates or are likely buffers"
> d do: [ :a |
> a value timesRepeat: [
> b remove: a key ]  ].
> "map all duplicate strings to their duplicates"
> String allSubInstancesDo: [ :s |
> s isSymbol ifFalse: [
> (b includes: s) ifTrue: [
> | l |
> l := m at: s ifAbsentPut: [ OrderedCollection new ].
> l add: s  ] ].
> "remove the duplicates"
> m keysAndValues do [ :k :v |
> | f |
> f := v at: 1.
> 2 to: v size do: [ :i |
> (v at: i) becomeForward: f ] ]
> 
> Cheers
> Philippe
> 
> 
> 




Re: [Pharo-dev] String deduplication

2014-05-30 Thread Marcus Denker

On 30 May 2014, at 10:59, Clément Bera  wrote:

> Hello,
> 
> I like the idea but this is not as simple.
> 
> In some framework you may use different string with a same name as markers 
> that are not equals.
> 
> Typically:
> 
> Object>>#string1
> ^ 'string'
> 
> Object>>#string2
> ^ 'string'
> 
> Object>>#test
> self assert: self string1 == self string1. "Answers true"
> self assert: self string2 == self string2. "Answers true"
> self assert: self string1 == self string2 "Answers false"
> 
> Frameworks relying on that will not work any more.
> 
> And this kind of bugs is not easy to spot, it typically crashes identity 
> collections in a non deterministic fashion.
> 

With an indirection (a kind of reference) that

-> points to the string
-> forwards everything, but does a copy on write on state change
-> implements == to return false


it would work. Of course you have then the same amount of objects(+1), but they 
would be all very 
small, thus leading to saving for large objects and especially when applied to 
subgraphs.

Marcus


> Regards
> 
> 
> 2014-05-30 9:39 GMT+02:00 Philippe Marschall 
> :
> Hi
> 
> This is an idea I stole from somebody else. The assumption is that you have a 
> lot of Strings in the image that are equal. We could therefore remove the 
> duplicates and make all the objects refer to the same instance.
> 
> However it's not a simple as that. The main issue is that String has two 
> responsibilities. The first is as an immutable value object. The second is as 
> a mutable character buffer for building immutable value objects. We must not 
> deduplicate the second kind. Unfortunately it's not straight forward to 
> figure out which kind a string is. The approach I took is looking at whether 
> it contains any 0 characters. An other option would be to check whether any 
> WirteStreams are referring to it.
> Also, since there are behavioral differences between String and Symbol 
> besides #= we must exclude Symbols (eg. there is #'hello' and 'hello' in the 
> heap and they compare #= true but we must not make anybody who refers to 
> 'hello' suddenly refer to #'hello').
> 
> Anyway here's the code, this saves about 2 MB in a fairly stock Pharo 3 
> image. Sorry for the bad variable names.
> 
> | b d m |
> b := Bag new.
> d := OrderedCollection new.
> m := Dictionary new.
> "count all string instances"
> String allSubInstancesDo: [ :s |
> s isSymbol ifFalse: [
> b add: s ] ].
> "find the ones that have no duplicates or are likely buffers"
> b doWithOccurrences: [ :s :i |
> (i = 1 or: [ s anySatisfy: [ :c | c codePoint = 0 ] ]) ifTrue: [
> d add: s -> i ] ].
> "remove the ones that have no duplicates or are likely buffers"
> d do: [ :a |
> a value timesRepeat: [
> b remove: a key ]  ].
> "map all duplicate strings to their duplicates"
> String allSubInstancesDo: [ :s |
> s isSymbol ifFalse: [
> (b includes: s) ifTrue: [
> | l |
> l := m at: s ifAbsentPut: [ OrderedCollection new ].
> l add: s  ] ].
> "remove the duplicates"
> m keysAndValues do [ :k :v |
> | f |
> f := v at: 1.
> 2 to: v size do: [ :i |
> (v at: i) becomeForward: f ] ]
> 
> Cheers
> Philippe
> 
> 
> 



Re: [Pharo-dev] String deduplication

2014-05-30 Thread Nicolas Cellier
I'm curious to see which method is really using mutable strings.
Of course, while constructing and write streaming on it, but then it's like
a temporary storage area and we don't have to care.
We're speaking of a place where a String would be modified to retain some
state...
Of course, since we can't exclude this possibility theoretically, the
proposed changed is unsafe.
But practically...


2014-05-30 13:46 GMT+02:00 Marcus Denker :

>
> On 30 May 2014, at 10:59, Clément Bera  wrote:
>
> Hello,
>
> I like the idea but this is not as simple.
>
> In some framework you may use different string with a same name as markers
> that are not equals.
>
> Typically:
>
> Object>>#string1
> ^ 'string'
>
> Object>>#string2
> ^ 'string'
>
> Object>>#test
> self assert: self string1 == self string1. "Answers true"
> self assert: self string2 == self string2. "Answers true"
> self assert: self string1 == self string2 "Answers false"
>
> Frameworks relying on that will not work any more.
>
> And this kind of bugs is not easy to spot, it typically crashes identity
> collections in a non deterministic fashion.
>
>
> With an indirection (a kind of reference) that
>
> -> points to the string
> -> forwards everything, but does a copy on write on state change
> -> implements == to return false
>
>
> it would work. Of course you have then the same amount of objects(+1), but
> they would be all very
> small, thus leading to saving for large objects and especially when
> applied to subgraphs.
>
> Marcus
>
>
> Regards
>
>
> 2014-05-30 9:39 GMT+02:00 Philippe Marschall <
> philippe.marsch...@netcetera.ch>:
>
>> Hi
>>
>> This is an idea I stole from somebody else. The assumption is that you
>> have a lot of Strings in the image that are equal. We could therefore
>> remove the duplicates and make all the objects refer to the same instance.
>>
>> However it's not a simple as that. The main issue is that String has two
>> responsibilities. The first is as an immutable value object. The second is
>> as a mutable character buffer for building immutable value objects. We must
>> not deduplicate the second kind. Unfortunately it's not straight forward to
>> figure out which kind a string is. The approach I took is looking at
>> whether it contains any 0 characters. An other option would be to check
>> whether any WirteStreams are referring to it.
>> Also, since there are behavioral differences between String and Symbol
>> besides #= we must exclude Symbols (eg. there is #'hello' and 'hello' in
>> the heap and they compare #= true but we must not make anybody who refers
>> to 'hello' suddenly refer to #'hello').
>>
>> Anyway here's the code, this saves about 2 MB in a fairly stock Pharo 3
>> image. Sorry for the bad variable names.
>>
>> | b d m |
>> b := Bag new.
>> d := OrderedCollection new.
>> m := Dictionary new.
>> "count all string instances"
>> String allSubInstancesDo: [ :s |
>> s isSymbol ifFalse: [
>> b add: s ] ].
>> "find the ones that have no duplicates or are likely buffers"
>> b doWithOccurrences: [ :s :i |
>> (i = 1 or: [ s anySatisfy: [ :c | c codePoint = 0 ] ]) ifTrue: [
>> d add: s -> i ] ].
>> "remove the ones that have no duplicates or are likely buffers"
>> d do: [ :a |
>> a value timesRepeat: [
>> b remove: a key ]  ].
>> "map all duplicate strings to their duplicates"
>> String allSubInstancesDo: [ :s |
>> s isSymbol ifFalse: [
>> (b includes: s) ifTrue: [
>> | l |
>> l := m at: s ifAbsentPut: [ OrderedCollection new ].
>> l add: s  ] ].
>> "remove the duplicates"
>> m keysAndValues do [ :k :v |
>> | f |
>> f := v at: 1.
>> 2 to: v size do: [ :i |
>> (v at: i) becomeForward: f ] ]
>>
>> Cheers
>> Philippe
>>
>>
>>
>
>


Re: [Pharo-dev] String deduplication

2014-05-30 Thread Chris Muller
I hope you're only considering this one-time for your Pharo release
image, and not something that will _continue_ to operating on an
on-going basis.  Attempting to do that below app-layer would be way
too intrusive for, for example, a Magma application..A String read
from one DB session belongs to that session.  If it were canonicalized
and suddenly belonged to two sessions, then mutating it would end up
committing that change in both sessions.  Bad.

Strings aren't the only type to share.  What about Dates?  And
literal-Array's?  Timezones and Durations?  This is an
app-responsibility, not a system one.  A one-time compress prior to
release is okay, but not automatically behind the scenes..


On Fri, May 30, 2014 at 2:39 AM, Philippe Marschall
 wrote:
> Hi
>
> This is an idea I stole from somebody else. The assumption is that you have
> a lot of Strings in the image that are equal. We could therefore remove the
> duplicates and make all the objects refer to the same instance.
>
> However it's not a simple as that. The main issue is that String has two
> responsibilities. The first is as an immutable value object. The second is
> as a mutable character buffer for building immutable value objects. We must
> not deduplicate the second kind. Unfortunately it's not straight forward to
> figure out which kind a string is. The approach I took is looking at whether
> it contains any 0 characters. An other option would be to check whether any
> WirteStreams are referring to it.
> Also, since there are behavioral differences between String and Symbol
> besides #= we must exclude Symbols (eg. there is #'hello' and 'hello' in the
> heap and they compare #= true but we must not make anybody who refers to
> 'hello' suddenly refer to #'hello').
>
> Anyway here's the code, this saves about 2 MB in a fairly stock Pharo 3
> image. Sorry for the bad variable names.
>
> | b d m |
> b := Bag new.
> d := OrderedCollection new.
> m := Dictionary new.
> "count all string instances"
> String allSubInstancesDo: [ :s |
> s isSymbol ifFalse: [
> b add: s ] ].
> "find the ones that have no duplicates or are likely buffers"
> b doWithOccurrences: [ :s :i |
> (i = 1 or: [ s anySatisfy: [ :c | c codePoint = 0 ] ]) ifTrue: [
> d add: s -> i ] ].
> "remove the ones that have no duplicates or are likely buffers"
> d do: [ :a |
> a value timesRepeat: [
> b remove: a key ]  ].
> "map all duplicate strings to their duplicates"
> String allSubInstancesDo: [ :s |
> s isSymbol ifFalse: [
> (b includes: s) ifTrue: [
> | l |
> l := m at: s ifAbsentPut: [ OrderedCollection new ].
> l add: s  ] ].
> "remove the duplicates"
> m keysAndValues do [ :k :v |
> | f |
> f := v at: 1.
> 2 to: v size do: [ :i |
> (v at: i) becomeForward: f ] ]
>
> Cheers
> Philippe
>
>



Re: [Pharo-dev] String deduplication

2014-06-02 Thread Philippe Marschall

On 30.05.14 17:12, Chris Muller wrote:

I hope you're only considering this one-time for your Pharo release
image, and not something that will _continue_ to operating on an
on-going basis.


Of course. Besides all the problems it's also way too slow.

Cheers
Philippe





Re: [Pharo-dev] String deduplication

2014-06-02 Thread Christophe Demarey

Le 30 mai 2014 à 09:39, Philippe Marschall a écrit :

> Hi
> 
> This is an idea I stole from somebody else. The assumption is that you have a 
> lot of Strings in the image that are equal. We could therefore remove the 
> duplicates and make all the objects refer to the same instance.


I worked on a String optimisation for a Java virtual machine dedicated to small 
hardwares.
A string was represented by an array of bytes (UTF8 or ASCII encoding), a start 
position, and the number of characters of the string.
It allows to reuse the internal byte array for different strings but the String 
object was different for each String.
With this approach, you are able to save a lot of memory (but with some CPU 
overhead) and you don't have a problems because you have different String 
objects for each String.

ex: b1: 'Hello World! It's a sunny day'

'Hello World! It's a sunny day' : start = 0, count = 29, value b1
'Hello' : start = 0, count = 5, value b1
'Hello World!' : start = 0, count = 12, value b1
'World': start = 6, count = 5, value b1


I don't see how it may be applied to Smalltalk and it makes sense.

Christophe.



smime.p7s
Description: S/MIME cryptographic signature


Re: [Pharo-dev] String deduplication

2014-06-02 Thread Chris Muller
This is a normal pattern in application-development.  For example,
Todd Blanchard's HTML validating parser grabs a huge chunk of HTML,
and all the parts are simply referenced by position within that big
String.

On Mon, Jun 2, 2014 at 8:26 AM, Christophe Demarey
 wrote:
>
> Le 30 mai 2014 à 09:39, Philippe Marschall a écrit :
>
>> Hi
>>
>> This is an idea I stole from somebody else. The assumption is that you have 
>> a lot of Strings in the image that are equal. We could therefore remove the 
>> duplicates and make all the objects refer to the same instance.
>
>
> I worked on a String optimisation for a Java virtual machine dedicated to 
> small hardwares.
> A string was represented by an array of bytes (UTF8 or ASCII encoding), a 
> start position, and the number of characters of the string.
> It allows to reuse the internal byte array for different strings but the 
> String object was different for each String.
> With this approach, you are able to save a lot of memory (but with some CPU 
> overhead) and you don't have a problems because you have different String 
> objects for each String.
>
> ex: b1: 'Hello World! It's a sunny day'
>
> 'Hello World! It's a sunny day' : start = 0, count = 29, value b1
> 'Hello' : start = 0, count = 5, value b1
> 'Hello World!' : start = 0, count = 12, value b1
> 'World': start = 6, count = 5, value b1
>
>
> I don't see how it may be applied to Smalltalk and it makes sense.
>
> Christophe.
>