With regard to my blog post:
It seems the problem is that the distribution for lengthNorm() starts at 1 and moves down from there. Returning a constant 1.0f would work, but then HUGE documents would not be normalized and so would distort the results.
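To make the "starts at 1 and moves down" point concrete, here is a minimal sketch. It assumes the stock lengthNorm is 1/sqrt(numTokens), which is my understanding of what DefaultSimilarity does. The class and method names are made up for illustration, and nothing here depends on Lucene itself:

```java
// Hypothetical standalone sketch; not part of Lucene.
class DefaultNormSketch {

    // Assumption: the default norm is 1/sqrt(numTokens), so a one-token
    // field gets 1.0 and every longer field gets something smaller.
    static float defaultLengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        int[] lengths = {1, 4, 25, 100, 10000};
        for (int n : lengths) {
            // Same "tokens - norm" layout as the rest of this mail.
            System.out.println(n + " - " + defaultLengthNorm(n));
        }
    }
}
```

Under that assumption the shortest possible field always gets the highest norm, which is the behaviour I am trying to get away from.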
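What would you think of using this implementation for lengthNorm? It starts the distribution low, climbs to about 0.86 when 50 terms are in the document, jumps to 1.0 at 51 terms, then asymptotically moves to zero from there on out based on sqrt:

    public float lengthNorm( String fieldName, int numTokens ) {

        int THRESHOLD = 50;
        int nt = numTokens;

        // At or below the threshold: shift by one so a single token
        // does not produce 1 - 1/sqrt(1) = 0.
        if ( numTokens <= THRESHOLD )
            ++nt;

        // Above the threshold: count only the tokens past it.
        if ( numTokens > THRESHOLD )
            nt -= THRESHOLD;

        float v = (float)(1.0 / Math.sqrt(nt));

        // Below the threshold the curve is inverted so it rises with length.
        if ( numTokens <= THRESHOLD )
            v = 1 - v;

        return v;
    }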
For example, values from 1 to 149 would yield (I'd graph this out but I'm too lazy):
1 - 0.29289323   2 - 0.42264974   3 - 0.5   4 - 0.5527864   5 - 0.5917517
6 - 0.6220355   7 - 0.6464466   8 - 0.6666666   9 - 0.6837722   10 - 0.69848865
11 - 0.7113249   12 - 0.72264993   13 - 0.73273873   14 - 0.74180114   15 - 0.75
16 - 0.7574644   17 - 0.7642977   18 - 0.7705843   19 - 0.7763932   20 - 0.7817821
21 - 0.7867993   22 - 0.7914856   23 - 0.79587585   24 - 0.8   25 - 0.80388385
26 - 0.8075499   27 - 0.81101775   28 - 0.81430465   29 - 0.81742585   30 - 0.8203947
31 - 0.8232233   32 - 0.82592237   33 - 0.8285014   34 - 0.83096915   35 - 0.8333333
36 - 0.83560103   37 - 0.83777857   38 - 0.8398719   39 - 0.8418861   40 - 0.84382623
41 - 0.8456966   42 - 0.8475014   43 - 0.84924436   44 - 0.8509288   45 - 0.852558
46 - 0.85413504   47 - 0.85566247   48 - 0.85714287   49 - 0.8585786   50 - 0.859972
51 - 1.0   52 - 0.70710677   53 - 0.57735026   54 - 0.5   55 - 0.4472136
56 - 0.4082483   57 - 0.37796447   58 - 0.35355338   59 - 0.33333334   60 - 0.31622776
61 - 0.30151135   62 - 0.28867513   63 - 0.2773501   64 - 0.26726124   65 - 0.2581989
66 - 0.25   67 - 0.24253562   68 - 0.23570226   69 - 0.22941573   70 - 0.2236068
71 - 0.2182179   72 - 0.21320072   73 - 0.2085144   74 - 0.20412415   75 - 0.2
76 - 0.19611613   77 - 0.19245009   78 - 0.18898223   79 - 0.18569534   80 - 0.18257418
81 - 0.1796053   82 - 0.17677669   83 - 0.17407766   84 - 0.17149858   85 - 0.16903085
86 - 0.16666667   87 - 0.16439898   88 - 0.16222142   89 - 0.16012815   90 - 0.15811388
91 - 0.15617377   92 - 0.15430336   93 - 0.15249857   94 - 0.15075567   95 - 0.1490712
96 - 0.14744195   97 - 0.145865   98 - 0.14433756   99 - 0.14285715   100 - 0.14142136
101 - 0.14002801   102 - 0.13867505   103 - 0.13736056   104 - 0.13608277   105 - 0.13483997
106 - 0.13363062   107 - 0.13245323   108 - 0.13130644   109 - 0.13018891   110 - 0.12909944
111 - 0.12803689   112 - 0.12700012   113 - 0.12598816   114 - 0.125   115 - 0.12403473
116 - 0.12309149   117 - 0.12216944   118 - 0.12126781   119 - 0.120385855   120 - 0.11952286
121 - 0.11867817   122 - 0.11785113   123 - 0.11704115   124 - 0.11624764   125 - 0.11547005
126 - 0.114707865   127 - 0.11396058   128 - 0.1132277   129 - 0.11250879   130 - 0.1118034
131 - 0.11111111   132 - 0.11043153   133 - 0.10976426   134 - 0.10910895   135 - 0.10846523
136 - 0.107832775   137 - 0.107211255   138 - 0.10660036   139 - 0.10599979   140 - 0.10540926
141 - 0.104828484   142 - 0.1042572   143 - 0.10369517   144 - 0.10314213   145 - 0.10259783
146 - 0.10206208   147 - 0.10153462   148 - 0.101015255   149 - 0.10050378
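Since I didn't graph it, a small harness along these lines should regenerate the numbers. It is only a sketch. LengthNormTable and the static one-argument helper are names I made up. The arithmetic is the same as in the proposal, just lifted out of the Similarity subclass (and without the unused fieldName) so it runs without Lucene:

```java
// Hypothetical standalone harness; names are illustrative only.
class LengthNormTable {

    static final int THRESHOLD = 50;

    // Same arithmetic as the proposed lengthNorm, written as two branches.
    static float lengthNorm(int numTokens) {
        if (numTokens <= THRESHOLD) {
            // Rising part: 1 - 1/sqrt(numTokens + 1)
            return 1 - (float) (1.0 / Math.sqrt(numTokens + 1));
        }
        // Falling part: 1/sqrt(numTokens - THRESHOLD)
        return (float) (1.0 / Math.sqrt(numTokens - THRESHOLD));
    }

    public static void main(String[] args) {
        // Print "tokens - norm" for the same range as the list above.
        for (int i = 1; i < 150; ++i) {
            System.out.println(i + " - " + lengthNorm(i));
        }
    }
}
```

The spot to look at is the step between 50 and 51 tokens, where the value goes from 0.859972 straight to 1.0 before it starts falling.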
--
Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412