On Sep 26, 2017, at 4:41 PM, Thomale, Jason <jason.thom...@unt.edu> wrote:

>>>> Does anybody here know how to access a Python compressed sparse row format 
>>>> (CSR) object? [1]
>>>> 
>>>> [1] CSR - http://bit.ly/2fPj42V
>>> 
>>> Do you have a link to the code you're using?
>> 
>> Yes, thank you. See —> 
>> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py  —ELM
> 
> I'm not familiar with the APIs in question, but--if I'm looking at this 
> right, your CSR matrix (tfidf) looks like it would have columns corresponding 
> with topics and rows corresponding with documents. If that's the case, you 
> could maybe do something like this:
> 
>   1. Use tfidf.getcol() to get the column corresponding
>      to your chosen topic. Looks like that should give you a
>      single-column sparse matrix of all document scores for
>      that topic.
> 
>   2. Convert that to a dense array of scores using
>      .toarray(), flatten it with .ravel(), and then make a
>      list with .tolist(). (I think?)
> 
>   3. Use a list comprehension and "enumerate" to generate
>      explicit doc IDs based on each document's position in
>      the list, creating a list of 2-element lists or tuples,
>      (doc_id, score). While you're at it, you could filter
>      the list comprehension to give you only the documents
>      with scores that are greater than 0, or some other
>      threshold.
> 
>   4. Pass the results through the built-in "sorted"
>      function to sort your list of tuples based on score.
> 

> >>> topic = 9497
> >>> score_thresh = 0
> >>> topic_scores = tfidf.getcol(topic).toarray().ravel().tolist()
> >>> docs_and_scores = [(doc_id, score) for doc_id, score in
> ...                    enumerate(topic_scores) if score > score_thresh]
> >>> most_relevant_docs = sorted(docs_and_scores, key=lambda x: x[1],
> ...                             reverse=True)
> 
> The resulting "most_relevant_docs" variable should be a list of tuples that 
> looks something like this (for example):
> [(102, 0.9), (33, 0.875), (365, 0.874), ...]
> 
> Not sure if that's helpful...? There's probably a more numpy/scipy way of 
> doing the above using actual numpy array methods (especially the 4th line).
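
For what it's worth, here is a rough sketch of what the "more numpy/scipy
way" might look like. It is only an illustration under assumptions: the tiny
4x3 matrix stands in for the real tfidf object, and the function name
most_relevant_docs and its parameters are made up for this example, not taken
from topic-model.py. It filters with a boolean mask and orders with
numpy.argsort instead of a list comprehension plus sorted().

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for the real tfidf matrix:
# 4 documents (rows) x 3 columns.
tfidf = csr_matrix(np.array([
    [0.0, 0.9, 0.0],
    [0.2, 0.0, 0.5],
    [0.0, 0.4, 0.0],
    [0.7, 0.8, 0.0],
]))


def most_relevant_docs(matrix, column, score_thresh=0.0):
    """Return (doc_id, score) tuples for one column, highest score first."""
    # Dense 1-D array holding every document's score for the chosen column.
    scores = matrix.getcol(column).toarray().ravel()
    # Row indices (doc IDs) whose score exceeds the threshold.
    doc_ids = np.flatnonzero(scores > score_thresh)
    # Reorder those doc IDs by descending score.
    order = np.argsort(-scores[doc_ids], kind="stable")
    ranked = doc_ids[order]
    return [(int(i), float(scores[i])) for i in ranked]


print(most_relevant_docs(tfidf, 1))
```

Doc IDs here are just row positions in the matrix, same as with enumerate, so
they would still need to be mapped back to whatever identifiers the documents
really have.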


Jason, this is REALLY close, and I have begun to include it at the very end of 
my code. Thank you! More later. code4lib++  —Eric Morgan
