String interning (seen on Github) : Is it good? has anybody been using it?

cblake Thu, 30 Jul 2020 08:00:29 -0700

This is sort of a first step along the lines I was suggesting: 
    
    
    import tables
    
    type
      Strings = object
        strings*: string
        str2ix*: Table[string, int32]
    
    proc put*(s: var Strings, x: string) =
      let n = s.strings.len.int32         # offset where x would start if new
      if s.str2ix.mgetOrPut(x, n) == n:   # x was not interned before
        s.strings.add x & "\0"
    
    proc c_strlen(s: cstring): csize_t {.
            importc: "strlen", header: "<string.h>" .}
    proc str*(s: var Strings, i: int32): string =
      let n = s.strings[i].addr.c_strlen.int  # bytes up to the terminating NUL
      result.setLen n           # this does the zero byte for us
      if n > 0:                 # result[0] is out of bounds for an empty word
        copyMem result[0].addr, s.strings[i].addr, n
    
    when isMainModule:
      var s: Strings
      s.put "two"; s.put "does"; s.put "use"; s.put "two"
      echo s.str2ix
      echo s.str(9)
    
    


Then 
    
    
    $ nim r intern.nim
    {"use": 9, "does": 4, "two": 0}
    use
    
    

Of course, you get sparse indices (0, 4, 9), which are byte offsets into 
`strings`, rather than a dense (0, 1, 2) word numbering this way, but 
@Serge's question did not specify which was wanted. Sparse indices are 
already "fast like integers" and so may be all you really need. If you need 
dense numbers, you could keep more metadata, such as another `seq` that maps 
word number to offset (which a customized table could also use to avoid 
storing `string` keys, etc.).
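
A minimal sketch of that dense variant, assuming one extra `seq` of offsets. The names `DenseStrings`, `offsets`, and `str2id` are made up for illustration, and this version still keeps `string` keys in the table rather than using a customized one:

```nim
import tables

type
  DenseStrings = object
    strings: string               # all words, each terminated by '\0'
    offsets: seq[int32]           # dense id -> byte offset into `strings`
    str2id: Table[string, int32]  # word -> dense id

proc put(s: var DenseStrings, x: string): int32 =
  ## Interns `x` and returns its dense id (0, 1, 2, ...).
  let id = s.offsets.len.int32    # id that `x` gets if it is new
  result = s.str2id.mgetOrPut(x, id)
  if result == id:                # first time we see `x`
    s.offsets.add s.strings.len.int32
    s.strings.add x
    s.strings.add '\0'

proc str(s: DenseStrings, id: int32): string =
  ## Returns the word stored under a dense id.
  let start = s.offsets[id].int
  var stop = start
  while s.strings[stop] != '\0': inc stop   # scan to the terminator
  result = s.strings[start ..< stop]

when isMainModule:
  var s: DenseStrings
  for w in ["two", "does", "use", "two"]:
    discard s.put w
  echo s.str(2)
```

Here "two", "does", and "use" get ids 0, 1, and 2, at the cost of 4 more bytes per distinct word for the offset entry.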

There are many succinct representations like tries, Nim's own `critbits`, .., 
but I doubt they are much smaller (they may even be bigger), and they are 
likely slower in practice than the simple idea above. They are also often 
substantially more complex to code.
