> for TokenStreams, they are simple, accessed by only one thread: you
> can also do a proper LRU via LinkedHashMap which is maybe even less
> code.
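(For reference, the "proper LRU via LinkedHashMap" really is little code; here is a minimal sketch, with illustrative names of my own and String values standing in for whatever the cache actually holds:)

import java.util.LinkedHashMap;
import java.util.Map;

// Single-threaded LRU cache: a LinkedHashMap in access order evicts the
// least-recently-used entry once the size cap is exceeded. No
// synchronization, matching the one-thread-per-TokenStream assumption.
class LruStemCache extends LinkedHashMap<String, String> {
  private final int maxSize;

  LruStemCache(int maxSize) {
    super(16, 0.75f, true); // accessOrder = true: get() refreshes recency
    this.maxSize = maxSize;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
    return size() > maxSize;
  }
}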
Perhaps surprisingly, the naive approach of just discarding the hash map worked quite well on real data (for me); the difference between LRU (or anything else) and that naive strategy was fairly small (a sketch of what I mean by "naive" is in the PS below). I've updated that branch to include LRU as well; here are the results.

Naive clear on hashmap:

Cache size 0: Stemming en: average 1934.5714285714287, all times = [1952, 1941, 1922, 1932, 1938, 1931, 1926]
Cache size 1000: Stemming en: average 990.2857142857143, all times = [981, 984, 985, 995, 999, 997, 991]
Cache size 2000: Stemming en: average 879.7142857142857, all times = [878, 875, 874, 882, 880, 885, 884]
Cache size 4000: Stemming en: average 774.2857142857143, all times = [771, 770, 773, 774, 775, 777, 780]
Cache size 8000: Stemming en: average 659.0, all times = [652, 658, 657, 673, 672, 655, 646]
Cache size 0: Spellchecking en: average 2422.285714285714, all times = [2421, 2418, 2437, 2445, 2422, 2406, 2407]
Cache size 1000: Spellchecking en: average 1264.857142857143, all times = [1259, 1251, 1255, 1254, 1274, 1274, 1287]
Cache size 2000: Spellchecking en: average 1172.2857142857142, all times = [1189, 1185, 1185, 1157, 1159, 1169, 1162]
Cache size 4000: Spellchecking en: average 1058.0, all times = [1052, 1050, 1063, 1070, 1059, 1057, 1055]
Cache size 8000: Spellchecking en: average 937.0, all times = [932, 942, 943, 927, 925, 935, 955]

LRU on LinkedHashMap:

Cache size 0: Stemming en: average 1960.142857142857, all times = [1955, 1954, 1966, 1974, 1960, 1957, 1955]
Cache size 1000: Stemming en: average 928.8571428571429, all times = [921, 929, 953, 929, 925, 924, 921]
Cache size 2000: Stemming en: average 817.8571428571429, all times = [818, 818, 818, 820, 819, 815, 817]
Cache size 4000: Stemming en: average 706.0, all times = [705, 705, 708, 705, 706, 706, 707]
Cache size 8000: Stemming en: average 583.0, all times = [583, 584, 582, 584, 582, 585, 581]
Cache size 0: Spellchecking en: average 2452.714285714286, all times = [2496, 2477, 2436, 2435, 2441, 2439, 2445]
Cache size 1000: Spellchecking en: average 1203.7142857142858, all times = [1205, 1204, 1203, 1201, 1202, 1205, 1206]
Cache size 2000: Spellchecking en: average 1108.0, all times = [1110, 1107, 1105, 1106, 1106, 1111, 1111]
Cache size 4000: Spellchecking en: average 1004.0, all times = [999, 995, 993, 996, 997, 1019, 1029]
Cache size 8000: Spellchecking en: average 885.1428571428571, all times = [902, 904, 885, 872, 876, 877, 880]

A cache-all for comparison (a lower bound... of sorts):

Cache size 0: Stemming en: average 36.142857142857146, all times = [53, 36, 35, 32, 32, 33, 32]
Cache size 1000: Stemming en: average 33.285714285714285, all times = [32, 31, 32, 35, 34, 35, 34]
Cache size 0: Spellchecking en: average 31.285714285714285, all times = [33, 32, 36, 30, 29, 30, 29]
Cache size 1000: Spellchecking en: average 31.571428571428573, all times = [31, 31, 32, 31, 31, 35, 30]

> but putting caches around tokenstream/stemmer has implications, if the
> user has huge numbers of threads and especially high churn (like solr
> with its dynamic threadpool with max of 10000, my earlier mail).

Right. I'm not saying it's for everyone, just throwing out an idea that worked really well in practice.

> another alternative would be to centralize it around hunspell
> "Dictionary", but then it needs to be thread-safe and so on.

I'd leave that to application layers higher up, to be honest, where the usage context is better known.

D.
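PS: for completeness, the "naive clear" strategy benchmarked above amounts to something like the following (a sketch only; class and method names are mine, and String keys/values stand in for the real stemmer types):

import java.util.HashMap;
import java.util.Map;

// Bounded cache with no eviction policy at all: once the map fills up,
// throw everything away and let it refill from subsequent lookups.
class NaiveStemCache {
  private final int maxSize;
  private final Map<String, String> map = new HashMap<>();

  NaiveStemCache(int maxSize) {
    this.maxSize = maxSize;
  }

  String get(String word) {
    return map.get(word);
  }

  void put(String word, String stem) {
    if (map.size() >= maxSize) {
      map.clear(); // no recency tracking: just drop everything and refill
    }
    map.put(word, stem);
  }
}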
