I will check it, but I am not sure I have the right knowledge to implement it. Is there a ready-to-use implementation somewhere?

Btw, why do you think splitting and clustering won't work? Has anybody tried this? I am not sure it will be successful, but I also have no argument for why it could not lead to a meaningful result. If I split a doc per sentence it might not give good results, but if I use larger pieces, e.g. paragraphs, it might give some topics (sets of keywords).
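To make the idea concrete, here is a minimal sketch of what I mean by splitting and clustering. It is plain Python rather than Mahout, and everything in it is an assumption for illustration only: the function names, the tiny stopword list, raw term counts as vectors, and a small k-means with cosine similarity and deterministic seeding. It only shows the shape of the approach, not whether the resulting topics are any good.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "all"}

def split_paragraphs(text):
    """Split on blank lines; each paragraph becomes one 'sub-document'."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def term_counts(paragraph):
    """Raw term-frequency vector for one paragraph."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    return Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Sum of member vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def cluster_paragraphs(vectors, k, iterations=10):
    """Small k-means variant using cosine similarity."""
    # Deterministic seeding: start with the first paragraph, then keep adding
    # the paragraph that is least similar to the seeds chosen so far.
    seeds = [vectors[0]]
    while len(seeds) < k:
        seeds.append(min(vectors, key=lambda v: max(cosine(v, s) for s in seeds)))
    centroids = seeds
    assignment = []
    for _ in range(iterations):
        # Assign every paragraph to its most similar centroid.
        assignment = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        # Recompute centroids; an empty cluster keeps its previous centroid.
        updated = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            updated.append(centroid(members) if members else centroids[c])
        centroids = updated
    return assignment, centroids

def extract_topics(text, k=2, top_n=3):
    """Return one keyword list per cluster of paragraphs."""
    vectors = [term_counts(p) for p in split_paragraphs(text)]
    if not vectors:
        return []
    k = min(k, len(vectors))
    _, centroids = cluster_paragraphs(vectors, k)
    return [[term for term, _ in c.most_common(top_n)] for c in centroids]
```

Calling `extract_topics(document_text, k=5)` would give up to five keyword lists, one per paragraph cluster. With per-sentence splitting the vectors would be much sparser, which is why I suspect paragraphs are the better unit.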
On Fri, Apr 30, 2010 at 8:24 PM, Grant Ingersoll <[email protected]> wrote:

>
> On Apr 30, 2010, at 1:15 PM, Robin Anil wrote:
>
> > On Fri, Apr 30, 2010 at 10:40 PM, Bogdan Vatkov <[email protected]> wrote:
> >
> >> Hi Grant,
> >>
> >> You are probably right.
> >> What I wanted is to use my mahout setup to extract topics from a single
> >> document.
> >> So, maybe in popular terms I am trying to do topic extraction via document
> >> clustering.
> >> Does it make sense to try to split a doc into sub docs so that I leverage
> >> the clustering algorithm and thus find topic which appear key ones for the
> >> document?
> >>
> > Have you heard of LDA (Its in Mahout). Or are you trying to do something
> > different for topic extraction ?
>
> That's more across docs. You might also have a look at TextRank, which is
> a graph based approach to keyword/topic extraction that is nice to implement
> (one of these days, I'll do it in Mahout)

--
Best regards,
Bogdan
