> On Jan 9, 2022, at 6:15 PM, burak serdar <bser...@computer.org> wrote:
> 
> I think the question is "why do you want to intern strings". The solution can 
> be different for different use cases.

Indeed. This is a given! Such tricks should only be tried if the benefits
outweigh the cost, as per profiling expected use cases for your program.

[Aside:
Though it is unfortunate that most (all?) programming languages
do not make it easy to try out alternative data structures. As an
example, try replacing processing an array of structs with a struct
of arrays, with each struct member stored in a different array!
Row oriented vs Column oriented processing. 
]

> I work with large Json files where keys are repeated, and the keys are long 
> strings (URLs) so using a map[string]string to intern the keys makes a lot of 
> sense. For serialization, the same idea is used to compress large data using 
> string tables: a []string holds all unique strings, and references to those 
> strings are represented using uint32 indexes. For the example where you 
> intern words, you might even consider a single []byte containing all the 
> words concatenated, with a custom string representation containing uint32 
> offset and length. This may or may not be faster than the map implementation, 
> but it will occupy less space, but with a single large object in memory.
> 
> On Sun, Jan 9, 2022 at 5:35 PM Bakul Shah <ba...@iitbombay.org 
> <mailto:ba...@iitbombay.org>> wrote:
> The string header will be 16 bytes on 64bit word size machines.
> If most of the words are much shorter, interning won’t buy you
> much. For applications where you *know* all the words are short,
> and the total string space won’t exceed 4GB, you can try other
> alternatives. For instance if the max length of any word is 16 bytes,
> create 16 string tables. Then the ‘interned string’ type is a 4 byte
> index, with the bottom 4 bits indicating the length as well as which
> table to index! A table for N byte words stores one copy of each
> unique word and in consecutive slots on N bytes. Given that on modern
> machines even though memory can be plentiful, memory access is slow
> & computing is fast, this can win. You will also need a possibly unsafe
> String func. that doesn’t allocate string data space. You can even allow
> an overflow table if longer words are not common. And you can allow
> many more unique words by extending the interned string type to 8
> bytes.
> 
>> On Jan 9, 2022, at 3:07 PM, burak serdar <bser...@computer.org 
>> <mailto:bser...@computer.org>> wrote:
>> 
>> 
>> Note that a with a map[string]string, the code:
>> 
>> m[s]=s
>> 
>> The contents of the string s are not duplicated, only the string header s 
>> is. 
>> 
>> On Sun, Jan 9, 2022 at 3:52 PM jlfo...@berkeley.edu 
>> <mailto:jlfo...@berkeley.edu> <jlforr...@berkeley.edu 
>> <mailto:jlforr...@berkeley.edu>> wrote:
>> I'm aware of Artem Krylysov's idea for string interning published on
>> https://artem.krylysov.com/blog/2018/12/12/string-interning-in-go/ 
>> <https://artem.krylysov.com/blog/2018/12/12/string-interning-in-go/>
>> 
>> If I understand it correctly, each string is stored twice in a map, once as
>> a key and once as a value. That means that words that only appear once in
>> his example of string interning the words in the novel 1984 are 100% 
>> inefficient
>> in terms of storage. Words that appear twice are neutral. The advantage of 
>> his
>> approach only appears for words that appear three or more times.
>> 
>> I'm wondering if there's a map-like data structure that would store a string
>> as the key, and the address of the key as the value. I'm aware that a 
>> standard
>> Go map can't be used for this because its components might be moved around
>> while a program is running so taking the address of the key would be 
>> dangerous,
>> which is why it isn't allowed.
>> 
>> This came up because I profiled a run of an app I'm working on that
>> processed 12518 strings, of which 9810 appeared 2 or fewer times. Krylysov's
>> approach would only be marginally useful for this app.
>> 
>> The data structure I'm looking for would always be at least neutral, and 
>> would start
>> showing a benefit when a word appears twice.
>> 
>> Does anybody know of such a data structure?
>> 
>> Cordially,
>> Jon Forrest
>> 
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "golang-nuts" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to golang-nuts+unsubscr...@googlegroups.com 
>> <mailto:golang-nuts+unsubscr...@googlegroups.com>.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/golang-nuts/36684584-11eb-4bc0-a436-4906d523c8ban%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/golang-nuts/36684584-11eb-4bc0-a436-4906d523c8ban%40googlegroups.com?utm_medium=email&utm_source=footer>.
>> 
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "golang-nuts" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to golang-nuts+unsubscr...@googlegroups.com 
>> <mailto:golang-nuts+unsubscr...@googlegroups.com>.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/golang-nuts/CAMV2RqodPp7kye6%3D5PH0NHha%2BwQC4mqfotipoutm4aeL1EK0Fw%40mail.gmail.com
>>  
>> <https://groups.google.com/d/msgid/golang-nuts/CAMV2RqodPp7kye6%3D5PH0NHha%2BwQC4mqfotipoutm4aeL1EK0Fw%40mail.gmail.com?utm_medium=email&utm_source=footer>.

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/11138851-5C06-4412-BA96-1533E2A5B5A6%40iitbombay.org.

Reply via email to