Markus,
Please see Example 4 in the user docs
http://madlib.apache.org/docs/latest/group__grp__lda.html#examples
which provides helper functions for inspecting the learned model.
-- The topic description by top-k words
DROP TABLE IF EXISTS my_topic_desc;
SELECT madlib.lda_get_topic_desc('my_model',
                                 'my_training_vocabulary',
                                 'my_topic_desc',
                                 15);
SELECT * FROM my_topic_desc ORDER BY topicid, prob DESC;
produces:
 topicid | wordid |        prob        | word
---------+--------+--------------------+-------------------
       1 |     69 |  0.181900726392252 | of
       1 |     52 | 0.0608353510895884 | is
       1 |     65 | 0.0608353510895884 | models
       1 |     30 | 0.0305690072639225 | corpora
       1 |      1 | 0.0305690072639225 | 1960s
       1 |     57 | 0.0305690072639225 | latent
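
On the docid-wordid-topicid pairing from your earlier message (quoted below), the logic of your query can be sketched outside the database. The Python below is only an illustration, not MADlib code: pair_tokens is a hypothetical helper, and it assumes that topic_assignment holds one topic per token, in the order obtained by repeating each entry of words by its count, which is what your query's output suggests. The arrays are copied from your docid = 6 row.

```python
from collections import Counter

# Row for docid = 6, copied from the lda_train output quoted in this thread.
words = [7386, 42021, 7705, 105334, 18083, 89364, 31073, 28934, 56286,
         61921, 59142, 33364, 79035, 37792, 91823, 30422, 94672, 62107,
         94673, 62080, 101046, 4379, 26503, 61105, 19193, 28929]
counts = [1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          5, 1, 1, 1, 1]
topic_count = [0, 1, 13, 0, 0, 0, 0, 5, 0, 0, 2, 2, 0, 0, 0, 4, 6, 0, 0, 0]
topic_assignment = [2, 16, 16, 11, 15, 2, 2, 2, 2, 15, 15, 2, 2, 16, 2, 16,
                    10, 10, 2, 16, 2, 1, 15, 16, 7, 7, 7, 7, 7, 11, 2, 2, 2]


def pair_tokens(words, counts, topic_assignment):
    """Hypothetical helper: return (wordid, one-based topicid) per token.

    Assumes topic_assignment is ordered like the token list obtained by
    repeating each word id by its count.
    """
    # Expand the (words, counts) pairs into one entry per token.
    tokens = [w for w, c in zip(words, counts) for _ in range(c)]
    if len(tokens) != len(topic_assignment):
        raise ValueError("token count and topic_assignment length differ")
    # Shift the zero-based topic ids to one-based ids.
    return [(w, t + 1) for w, t in zip(tokens, topic_assignment)]


pairs = pair_tokens(words, counts, topic_assignment)

# Sanity check: the per-topic histogram of topic_assignment matches
# topic_count when topic_count is read with zero-based positions.
histogram = Counter(topic_assignment)
histogram_matches = all(topic_count[t] == n for t, n in histogram.items())

for wordid, topicid in pairs:
    print(wordid, topicid)
```

With this row, wordid 28934 comes out once with topic 3 and once with topic 16 (one-based), so under that assumption a repeated word can carry different topics within the same document.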
Please let us know whether this is of use or whether you are looking for something else.
Frank
On Fri, Aug 11, 2017 at 6:45 AM, Markus Paaso <[email protected]>
wrote:
> Hi,
>
> I found a working but quite awkward way to form docid-wordid-topicid
> pairs with a single SQL query:
>
> SELECT docid,
>        unnest((counts::text || ':' || words::text)::madlib.svec::float[]) AS wordid,
>        unnest(topic_assignment) + 1 AS topicid
> FROM lda_output
> WHERE docid = 6;
>
> Output:
>
> docid | wordid | topicid
> -------+--------+---------
> 6 | 7386 | 3
> 6 | 42021 | 17
> 6 | 42021 | 17
> 6 | 7705 | 12
> 6 | 105334 | 16
> 6 | 18083 | 3
> 6 | 89364 | 3
> 6 | 31073 | 3
> 6 | 28934 | 3
> 6 | 28934 | 16
> 6 | 56286 | 16
> 6 | 61921 | 3
> 6 | 61921 | 3
> 6 | 59142 | 17
> 6 | 33364 | 3
> 6 | 79035 | 17
> 6 | 37792 | 11
> 6 | 91823 | 11
> 6 | 30422 | 3
> 6 | 94672 | 17
> 6 | 62107 | 3
> 6 | 94673 | 2
> 6 | 62080 | 16
> 6 | 101046 | 17
> 6 | 4379 | 8
> 6 | 4379 | 8
> 6 | 4379 | 8
> 6 | 4379 | 8
> 6 | 4379 | 8
> 6 | 26503 | 12
> 6 | 61105 | 3
> 6 | 19193 | 3
> 6 | 28929 | 3
>
>
> Is there any simpler way to do that?
>
>
> Regards,
> Markus Paaso
>
>
>
> 2017-08-11 15:23 GMT+03:00 Markus Paaso <[email protected]>:
>
>> Hi,
>>
>> I am having some problems reading the LDA output.
>>
>>
>> Please see this row of madlib.lda_train output:
>>
>> docid            | 6
>> wordcount        | 33
>> words            | {7386,42021,7705,105334,18083,89364,31073,28934,56286,61921,59142,33364,79035,37792,91823,30422,94672,62107,94673,62080,101046,4379,26503,61105,19193,28929}
>> counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,5,1,1,1,1}
>> topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
>> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2}
>>
>>
>> It's hard to tell which word id each topic id is assigned to when the
>> *words* array has a different length than the *topic_assignment* array.
>> It would be nice if the *words* array were the same length as the
>> *topic_assignment* array.
>>
>> 1. What kind of SQL query would give a result with wordid-topicid pairs?
>> I tried to match them by hand but failed for wordid 28934. I wonder whether a
>> repeated wordid can have different topic assignments in the same document?
>>
>> wordid | topicid
>> ----------------
>> 7386 | 2
>> 42021 | 16
>> 7705 | 11
>> 105334 | 15
>> 18083 | 2
>> 89364 | 2
>> 31073 | 2
>> 28934 | 2 OR 15 ?
>> 56286 | 15
>> 61921 | 2
>> 59142 | 16
>> 33364 | 2
>> 79035 | 16
>> 37792 | 10
>> 91823 | 10
>> 30422 | 2
>> 94672 | 16
>> 62107 | 2
>> 94673 | 1
>> 62080 | 15
>> 101046 | 16
>> 4379 | 7
>> 26503 | 11
>> 61105 | 2
>> 19193 | 2
>> 28929 | 2
>>
>>
>> 2. Why does *topic_assignment* use zero-based indexing while the other
>> results use one-based indexing?
>>
>>
>>
>> Regards,
>> Markus Paaso
>>
>
>
>
> --
> Markus Paaso
> Tel: +358504067849
>