[GitHub] incubator-joshua issue #48: Fixed crashing when using Trie based KenLM model...
Github user kpu commented on the issue: https://github.com/apache/incubator-joshua/pull/48 See my comments on https://github.com/kpu/kenlm/pull/66 . This makes sense now and explains the behaviour, but I still think there is a non-segfault bug related to backoffs. Another, IMHO cleaner, solution is to use separate Chart objects for each LM for each sentence. How reasonable is that to accomplish on the Joshua side? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-joshua issue #48: Fixed crashing when using Trie based KenLM model...
Github user KellenSunderland commented on the issue: https://github.com/apache/incubator-joshua/pull/48 Hey Matt, exactly right for the summary. I am actually not sure how often collisions happen; it may be worth measuring. Collisions that cause a crash in this scenario happen once in about 100k-250k translation requests. This is a less likely scenario than a plain state collision, though, because you have to get a collision that gives you an out-of-range word id as a unigram. We've tested turning off state sharing between KenLMs, and indeed that also solves the crashing issue. I don't think I know the details of KenLM and Joshua well enough to properly judge whether this would be a better solution than the one I provided. If someone with more knowledge than I provides a new PR, I'll happily +1 it. One downside to just turning off state sharing is that we will still get collisions; we just won't get crashing. I think if we have collisions (even with a single model) we usually get an incorrect result (not a crash). There are some other implementation approaches that could also be considered to fix the issue (I'd like to hear what Kenneth thinks would be the best approach). One idea would be to define a standard hash function for the State struct, and then we could use the State itself as a key for a normal unordered_map. Then we wouldn't need this multimap.
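The unordered_map idea mentioned above could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the KenLM or Joshua code: the `State` struct here is a simplified, hypothetical stand-in (context word ids plus a length), and `StateHash`, `StateTable`, `MakeState`, and `Intern` are names invented for this example. Because `std::unordered_map` compares keys with full equality after hashing, two different states that happen to hash to the same value still get separate entries, so a lookup cannot hand back another state's data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <unordered_map>

// Hypothetical, simplified stand-in for an n-gram LM state.
// The real KenLM State is not reproduced here.
constexpr std::size_t kMaxContext = 4;

struct State {
  std::uint32_t words[kMaxContext] = {0, 0, 0, 0};  // context word ids
  unsigned char length = 0;                         // number of valid ids

  // Full equality over the valid part of the context.
  bool operator==(const State &other) const {
    if (length != other.length) return false;
    for (unsigned char i = 0; i < length; ++i) {
      if (words[i] != other.words[i]) return false;
    }
    return true;
  }
};

// Hash functor (assumed name). Mixes each valid word id and the length
// with FNV-1a-style xor-and-multiply steps.
struct StateHash {
  std::size_t operator()(const State &s) const {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char i = 0; i < s.length; ++i) {
      h ^= s.words[i];
      h *= 1099511628211ULL;
    }
    h ^= s.length;
    h *= 1099511628211ULL;
    return static_cast<std::size_t>(h);
  }
};

// The State itself is the key, so hash collisions are resolved by operator==.
using StateTable = std::unordered_map<State, long, StateHash>;

// Builds a State from up to kMaxContext word ids (extra ids are ignored).
State MakeState(std::initializer_list<std::uint32_t> ids) {
  State s;
  for (std::uint32_t id : ids) {
    if (s.length == kMaxContext) break;
    s.words[s.length++] = id;
  }
  return s;
}

// Returns the id already assigned to `s`, or assigns the next free id.
long Intern(StateTable &table, const State &s) {
  auto result = table.emplace(s, static_cast<long>(table.size()));
  return result.first->second;
}
```

Calling `Intern` twice with equal states returns the same id, while distinct states always receive distinct ids regardless of their hash values. Whether this fits the way Joshua passes state handles across JNI is a separate question that this sketch does not address.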
[GitHub] incubator-joshua issue #48: Fixed crashing when using Trie based KenLM model...
Github user mjpost commented on the issue: https://github.com/apache/incubator-joshua/pull/48 Holy smokes, thanks for tracking this down. So if I understand correctly, this only occurs under the following circumstances:
- decoding with multiple KenLM language models
- built with different vocabularies (the usual case)
- a hash collision occurs and returns a state containing an ID that is invalid in the calling KenLM

Do you have any idea how often hash collisions actually occur? I wonder if turning off sharing of KenLM states across LMs would also have worked, with little to no effect on performance.