Hello,

I am using LDA to build a topic model over 30k articles. However, I have a problem getting p(Topic|Document) for topics that have a relatively low prior.

For example, for one article there are basically two relevant topics, with p(10|article)=0.09802209698050128 and p(111|article)=0.9001826471638066.

Topic 10:

{medikamente:0.004022688319831924,krebs:0.00358368764510196,patient:0.003446116146728106,therapie:0.0033063920383982624,ebola:0.003246694015027121,krankheiten:0.003118233813442163,medizin:0.0030363609305258774,medizinische:0.002735692894516862,sierra:0.0025488043196802953,gehirn:0.0024702977723619,erkrankung:0.0024664127821501956,virus:0.0024636302428390705,weltgesundheitsorganisation:0.0024240035374940316,leone:0.0023821163154330817,medikament:0.0022636661241652403,symptome:0.0022605621780134753,erkrankungen:0.002236741916796797,ärzten:0.002201719708384397,infektion:0.0022005269799190478,diagnose:0.0021200205447816866}

Topic 111:

{cia:0.0026345359875991226,foltermethoden:0.0014149174673260135,verhörmethoden:0.00140497301590793,waterboarding:0.0011580597910307684,folter:0.0011198728249932808,folterbericht:9.750488561609758E-4,ciafolter:9.63781735282257E-4,jauch:9.625685166431471E-4,methoden:9.52553863964292E-4,ussenats:9.467352925179054E-4,terrorverdächtigen:8.956828892077374E-4,folterpraktiken:8.547966922851059E-4,geheimgefängnissen:8.509344535068084E-4,senatsbericht:7.938900091637415E-4,usgeheimdienstes:7.686889634541992E-4,ciafolterbericht:7.593868675732018E-4,schlafentzug:7.423809643767796E-4,morales:6.574631769728719E-4,foltern:6.504933854409662E-4,geheimdienstausschuss:6.395705008858842E-4}

The article is basically about allergies, so I would expect topic 10 to get a high probability. I built the model again and found that the topic equivalent to topic 10 gets a similar probability. However, there is always another topic that gets a higher probability, and semantically this other topic changes from run to run.

E.g. another solution:

{virus:0.00411216419791256,medikamente:0.0038598661793248137,krebs:0.0035675996400685007,therapie:0.0033905540843555413,ebola:0.003375728280833271,patient:0.0032500773732781823,krankheiten:0.0032139112354069776,weltgesundheitsorganisation:0.0029118597430727276,medizin:0.0026998889092036088,infiziert:0.00267748391624766,infektion:0.002632653711889378,symptome:0.002575108706339337,gehirn:0.002461862973567568,erkrankungen:0.0024427886542873287,sierra:0.0024414884403193737,epidemie:0.002422322695382149,erkrankt:0.0023925116648543065,leone:0.0023567022504187318,zellen:0.002332477878623263,ärzten:0.002322240317943797}

Topic 116, with p(116|article)=0.916783697151615:

{fifa:0.008559554977900587,blatter:0.00701348265423905,junior:0.00412718299308423,fußballwm:0.003291121046656695,malanda:0.003113887497889251,katar:0.002563569641333144,afrikacup:0.002391858092898771,fifapräsident:0.002391121904074705,sepp:0.0023618750644608037,ghana:0.0023146047834810383,uefa:0.0021511645824989754,dfb:0.0020546025044629094,platini:0.0019234871158839229,südafrika:0.0019143805256084199,zwanziger:0.0018779777661522455,garcia:0.0018383670116093194,äquatorialguinea:0.0018332527223905645,weltverband:0.001825326740045217,elfenbeinküste:0.0017401333783730218,autounfall:0.0017144740032972864}
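
To check that the medical topic really is the same topic in both runs, the top words can be compared. This is only a minimal sketch in Python, assuming the top-20 word lists above are copied by hand with the weights dropped; `jaccard` is a helper name I made up, and the CIA and FIFA lists are cut to their first five words:

```python
# Top words copied from the topic listings above (weights dropped).
run1_medical = {
    "medikamente", "krebs", "patient", "therapie", "ebola", "krankheiten",
    "medizin", "medizinische", "sierra", "gehirn", "erkrankung", "virus",
    "weltgesundheitsorganisation", "leone", "medikament", "symptome",
    "erkrankungen", "ärzten", "infektion", "diagnose",
}
run2_medical = {
    "virus", "medikamente", "krebs", "therapie", "ebola", "patient",
    "krankheiten", "weltgesundheitsorganisation", "medizin", "infiziert",
    "infektion", "symptome", "gehirn", "erkrankungen", "sierra", "epidemie",
    "erkrankt", "leone", "zellen", "ärzten",
}
# Dominant topic of each run, first five words only (an arbitrary cut).
run1_cia = {"cia", "foltermethoden", "verhörmethoden", "waterboarding", "folter"}
run2_fifa = {"fifa", "blatter", "junior", "fußballwm", "malanda"}


def jaccard(a, b):
    """Share of words two top-word sets have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print("medical (run 1) vs medical (run 2):", jaccard(run1_medical, run2_medical))
print("cia (run 1) vs fifa (run 2):", jaccard(run1_cia, run2_fifa))
```

A high overlap for the medical pair and none for the other pair would match what I see: the medical topic is reproducible across runs, while the topic that gets the highest probability is not.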

So my questions are:

What can I do to get a more stable topic assignment?

What can I do to make it correct?

Might this be related to different topic priors?


Best,

Max
