> On 18 Feb 2022, at 21:25, Guillermo Polito <guillermopol...@gmail.com> wrote:
>
> Thanks Sven, great stuff :)

Thanks! This allows you to easily play/explore/experiment with certain ideas.

In the past we discussed the option of bringing source code inside the image. What if we applied compression?

The total size of all Object methods is about 100k:

    Object allMethods sum: [ :each | each sourceCode size ]. "104633"

We can compress each individual method as an LZ4 block and see what that gives us.

    LZ4Compressor new in: [ :compressor |
      Object allMethods sum: [ :each | | compressed |
        compressed := compressor compressBlock: each sourceCode utf8Encoded.
        compressed size ] ]. "81584"

    (104633/81584) reciprocal asFloat. "0.7797157684478129"

That is about 22% smaller. This is not a very good result, but that is to be expected: methods are small, so there is often not much to compress.

If we concatenate all source code and feed that as one big chunk to the compressor, we get much better results.

    (LZ4Compressor new compress: (String streamContents: [ :out |
      Object allMethods do: [ :each | out nextPutAll: each sourceCode ] ]) utf8Encoded) size. "53544"

    (104633/53544) reciprocal asFloat. "0.5117314805080615"

Now we get an almost 50% reduction in size. But methods are independent, so that is not an option.

What if we used a dictionary, a predefined set of words/substrings that are common in source code?

I found a list of the 500 most common English words. Let's add some common selectors and globals. To find them, I counted the symbol literals (selectors) and the global bindings referenced in all methods, keeping those that occur more than 100 times.

    IdentityBag new in: [ :bag |
      SystemNavigation default allMethods do: [ :each |
        each literals select: [ :x | x isSymbol ] thenDo: [ :x | bag add: x ] ].
      bag sortedCounts select: [ :x | x key > 100 ] ].

    IdentityBag new in: [ :bag |
      SystemNavigation default allMethods do: [ :each |
        each literals select: [ :x | x isVariableBinding ] thenDo: [ :x | bag add: x key ] ].
      bag sortedCounts select: [ :x | x key > 100 ] ].

The smallest possible match in LZ4 is 4 bytes (3 letters and a space), so only words longer than 2 letters are kept:

    words := Character space join:
      (((FileLocator desktop / 'en-500.csv' readStreamDo: [ :in |
          (NeoCSVReader on: in) addIgnoredField; addField; upToEnd ])
        collect: #first)
        select: [ :each | each size > 2 ]).

That gives 473 words. Next are 137 selectors:

    selectors := ' ifTrue: assert: class assert:equals: ifTrue:ifFalse: ifNil: ifFalse: yourself name and: ifNotNil: traitComposition add: first deny: includes: isEmpty asString nextPutAll: with: isNil collect: initialize subclassResponsibility to:do: selector localMethodDict should:raise: theme notNil printString on:do: streamContents: at:ifAbsent: copy contents error: last model default ifNil:ifNotNil: organization skipOrReturnWith:ifSkippable: current parserExceptions nonEmpty select: asSymbol name: readStream includesKey: basicNew title: empty reject: whileTrue: keys space class: extent: close anySatisfy: parse:documentURI: isLocalSelector: traitSource second position print: whileFalse: asArray format: printOn: selectors isKindOf: copyFrom:to: color: shouldnt:raise: width height max: named: signal hasProperty: anyOne text detect:ifNone: label: ensure: ifEmpty: extent text: entity addAll: negated includesLocalSelector: addSelector:withMethod: traitDefining:ifNone: hash addSelector:on: asInteger min: translated iconNamed: method arguments position: withIndexDo: perform: methods delete url: occurrencesOf: selector: hResizing: with:with: pass notEmpty flag: values removeKey: fromString: classNamed: reset changed removeKey:ifAbsent: width: announce: repository: signal: setUp addLast: session uniqueInstance assert:description: asOrderedCollection compiledMethod assert:gives: '.

Finally, 73 globals:

    globals := ' String OrderedCollection Array Smalltalk Color Error Character Dictionary TraitChange ByteArray Object UIManager DateAndTime Set Form Time ZTimestamp Date RBParser World Protocol Duration Processor SAXHandler Display MetaLink HelpTopic ReflectivityExamples IdentitySet OCOpalExamples GLMTabulator Float SpecLayout UUID WAMimeType SystemAnnouncer STON XMLDOMParser ZnMimeType Transcript ZnEntity ExternalType CompiledMethod GRPlatform Semaphore FileSystem ReadWriteStream ZnClient WriteStream Delay CmdContextMenuActivation ZnResponse FileLocator IdentityDictionary Morph MCSnapshot ReflectiveMethod XMLValidationException Integer MCMethodDefinition Path ClyClassScope MCClassDefinition RBCondition MCVersionInfo MCVersion SmallInteger Cursor TraitedClass GoferVersionReference SortedCollection MCOrganizationDefinition XMLWellFormednessException '.

    dictionary := (globals , words , selectors) utf8Encoded.

This dictionary is less than 5K.

    (LZ4Compressor new dictionary: dictionary) in: [ :compressor |
      Object allMethods sum: [ :each | | compressed |
        compressed := compressor compressBlock: each sourceCode utf8Encoded.
        compressed size ] ]. "69146"

    (104633/69146) reciprocal asFloat. "0.6608431374422983"

Now we get a reduction in size of about 34%, which is better.

I am sure that with a more carefully tuned dictionary the compression rate could be improved by a couple of percent. There are also tools that can compute an optimal dictionary from a given input set.

Sorry for the long post, I hope at least someone found this interesting.

Sven

> Sent from my Huawei phone
>
>
> -------- Original message --------
> From: Sven Van Caekenberghe <s...@stfx.eu>
> Date: Fri, 18 Feb 2022 at 21:13
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: [Pharo-users] [ANN] Pharo LZ4 Tools
> Hi,
>
> Pharo LZ4 Tools (https://github.com/svenvc/pharo-lz4-tools) is an
> implementation of LZ4 compression and decompression in pure Pharo.
>
> LZ4 is a lossless compression algorithm that is focused on speed. It belongs
> to the LZ77 family of byte-oriented compression schemes.
>
> - https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)
> - https://lz4.github.io/lz4/
> - https://github.com/lz4/lz4
>
> Both the frame format
> (https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md) as well as the
> block format (https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md)
> are implemented. Dictionary based compression/decompression is available too.
> The XXHash32 algorithm is also implemented.
>
> Of course this implementation is not as fast as highly optimised native
> implementations, but it works quite well and is readable/understandable, if
> you like this kind of stuff. It can be useful to interact with other systems
> using LZ4.
>
> Sven
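
The per-method measurements in the reply at the top of this thread (with and without a dictionary) can be repeated in one playground snippet. What follows is only a minimal sketch, not part of the original post. It assumes that `dictionary` is already bound in the same playground, built as described in the reply, and it uses only the `LZ4Compressor` messages that appear there (`dictionary:` and `compressBlock:`). The names `totalCompressed`, `original`, `plain`, `withDictionary` and `reduction` are hypothetical, chosen for illustration. Unlike the reply, the baseline here is counted in UTF-8 bytes rather than characters, so the numbers may differ slightly.

```smalltalk
"Sketch: compare per-method LZ4 block compression with and without a dictionary.
 Assumes the variable 'dictionary' (a ByteArray) was defined earlier in this playground."
| totalCompressed original plain withDictionary reduction |

"Hypothetical helper block: total size in bytes of all Object methods,
 each compressed as an independent LZ4 block by the given compressor."
totalCompressed := [ :compressor |
  Object allMethods sum: [ :each |
    (compressor compressBlock: each sourceCode utf8Encoded) size ] ].

"Uncompressed baseline, measured in UTF-8 bytes."
original := Object allMethods sum: [ :each | each sourceCode utf8Encoded size ].

plain := totalCompressed value: LZ4Compressor new.
withDictionary := totalCompressed value: (LZ4Compressor new dictionary: dictionary).

"Reduction in percent: (1 - compressed/original) * 100."
reduction := [ :compressed | (1 - (compressed / original)) * 100.0 ].

{ reduction value: plain. reduction value: withDictionary }
```

Inspecting the resulting array shows both reductions side by side, which makes it easy to try a different dictionary and see how the second number changes.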