[ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224619#comment-13224619 ]
Robert Muir commented on CODEC-132:
-----------------------------------

Thomas: I haven't tested your patch with Lucene/Solr, but I'm +1 on the premise.

The random testing we do may seem absurd, and yes, in a way it's totally unrealistic. On the other hand, if someone is indexing or crawling data, the detection of file type, character set, or whatever is often really just a heuristic: it's ultimately impossible to prevent the indexing of some binary file (e.g. a misdetected character set, or simply a video file). This is part of why we do the testing we do.

Thanks again everyone for digging in and reviewing.

> BeiderMorseEncoder OOM issues
> -----------------------------
>
>                 Key: CODEC-132
>                 URL: https://issues.apache.org/jira/browse/CODEC-132
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.6
>            Reporter: Robert Muir
>         Attachments: CODEC-132.patch, CODEC-132_test.patch
>
>
> In Lucene/Solr, we integrated this encoder into the latest release.
> Our tests use a variety of random strings, and we have recent Jenkins failures
> from some input streams (of length <= 10), using huge amounts of memory
> (e.g. > 64MB), resulting in OOM.
> I've created a test case (length is 30 here) that will OOM with -Xmx256M.
> I haven't dug into this much as to what's causing it, but I suspect there
> might be a bug revolving around certain punctuation characters: we didn't
> see this happening until we beefed up our random string generation to start
> producing "html-like" strings.