Neil Mitchell wrote:
2) The storage for String seems to be raw strings, which is nice.
Would I get a substantial speedup by moving to bytestrings instead of
strings? If I hashed the strings and stored common ones in a hash
table is it likely to be a big win?

Bytestrings should help. The big wins in this application are likely to be cache issues, though the improved memory/GC overhead is nice too.


If you have many identical strings then you will save lots by memoizing your strings into Integers, and then serializing that memo table and the integerized version of your data structure. The amount of savings decreases as the number of duplications decrease, though since you don't need the memo table itself you should be able to serialize it in a way that doesn't have much overhead.

A refinement on memoization for when you have many partial matches but few total matches, is to chunk the strings into a series of partial matches, which are then integerized. The trick is deciding on how big to make your chunks (which is to say, optimizing the reuse ratio). If you want an industrial solution like this, it may be worth looking into Haskell bindings for SRILM or IRSTLM.


Depending on how rough your type signature was, you may also consider using bytestring-trie to replace the Map String portion. For the keys of your map this will give the bytestring optimization as well as prefix sharing. You could also replace the [String] with Trie Pos where Pos gives the position in the list, or Trie() if order is irrelevant. And you could replace [(String,Int)] with Trie (Map Pos Int) or similar depending on whether position is important (thus Trie (Set Int)) or whether you can avoid conflicting Ints for the same String (thus Trie Int). Without knowing more I'm not sure how well

    Trie [Either (String, Trie Pos) (Trie (Map Pos Int))]

will match your use case, but it's worth considering. The one thing of note is that Trie uses patricia trees like IntMap does rather than using balanced trees like Map does. Again, I'm not sure whether the differences will matter for your use case.

--
Live well,
~wren
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to