I wrote a draft coclustering script: https://gist.github.com/larsmans/6753565
Instead of going through the files currently in the tree, it parses the
complete history, so it also finds the developers of files that have since
disappeared.
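The gist itself isn't reproduced here, but the idea of mining authorship from the full history rather than the current tree can be sketched roughly like this (this is not the gist's code; the `A:` author-marker format and the helper names are made up for illustration):

```python
# Sketch: collect, for every path that ever existed in a git repo, the set
# of authors who touched it -- including files that were later deleted.
# We ask `git log --name-only` to print an author marker line per commit
# followed by the paths that commit touched.
import subprocess
from collections import defaultdict

def parse_log(log_text):
    """Map file path -> set of authors, given output of
    `git log --name-only --pretty=format:A:%an`."""
    authors_by_file = defaultdict(set)
    author = None
    for line in log_text.splitlines():
        if line.startswith("A:"):
            author = line[2:]          # new commit: remember its author
        elif line.strip() and author is not None:
            authors_by_file[line.strip()].add(author)
    return authors_by_file

def file_authors(repo="."):
    """Run git log over the complete history of `repo` and parse it."""
    out = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:A:%an"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    return parse_log(out)
```

Because the walk covers every commit, a file that was deleted years ago still shows up as a key in the result, with the authors who worked on it while it existed.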
-
Hi,
I'd start by going over the contributing tips in the documentation:
http://scikit-learn.org/stable/developers/
There are several suggestions in there of where you might get started. In
particular, if you have a good understanding of machine learning methods
and concepts, improving the documentation is a good place to start.
I read in a tutorial that I can download many Wikipedia articles and use them
with LSA. If I use 500k random articles from Wikipedia plus the 2 documents
whose similarity I want to score, will I get results comparable to TF-IDF? Or
do the extra documents have to be related to my 2 docs?
2013/9/29 Tasos Ventouris :
Thank you for your answer. I checked it with many documents, both totally
different and very similar ones. You can see an example of the text I used here:
https://dl.dropboxusercontent.com/u/37124455/documents.txt
Another script I wrote, using only tf-idf, reports 69% similarity on those
documents.
2013/9/29 Tasos Ventouris :
I am trying to create a script to compute the similarity between only two
documents. I wrote this code, but if I use just two docs in the data set, the
result is a 2x2 matrix of [[1, 0], [0, 1]]. With more than 2 documents the
results look almost correct. Any suggestions?
def lsa(doc1, doc2): datas
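One plausible cause of the [[1, 0], [0, 1]] result: with plain (unsmoothed) idf = log(N/df) and only N = 2 documents, every term that appears in *both* documents gets idf = log(2/2) = 0, so only the terms unique to each document survive, the two vectors become orthogonal, and their cosine similarity is exactly 0. Adding background documents restores meaningful df statistics, which is also why the Wikipedia-corpus suggestion helps. A pure-Python sketch (not the poster's script; the example documents and background corpus are made up):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Plain tf-idf with unsmoothed idf = log(N / df)."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))            # document frequency per term
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

doc1 = "machine learning with python"
doc2 = "statistical learning with r"

# Corpus of only the two documents: every shared term has df == N,
# so its idf is log(1) = 0 and the vectors end up orthogonal.
u, v = tfidf_vectors([doc1, doc2])
print(cosine(u, v))  # → 0.0

# With background documents the shared terms get non-zero idf again.
background = ["deep learning", "python tips", "r graphics", "learning to rank"]
vecs = tfidf_vectors([doc1, doc2] + background)
print(cosine(vecs[0], vecs[1]))  # positive now
```

The same effect carries through an LSA step built on top of such tf-idf weights, since the SVD only sees whatever weight matrix it is given.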