Yes. Splitting by paragraph should work fine (been there, done that). Splitting by sentence works well if you do something like SVD to smooth over the fact that you have few words per sentence.
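
For plain text, both kinds of splitting can be sketched in a few lines of Python. This is only a rough illustration, not a finished tool: it assumes paragraphs are separated by blank lines and that a sentence ends with ., ! or ? followed by whitespace and a capital letter. The function names and the small abbreviation list are made up for the example and would need tuning for a real corpus.

```python
import re

# Hypothetical, deliberately tiny list of abbreviations that should not end a sentence.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc.", "vs."}


def split_paragraphs(text):
    """Split plain text into paragraphs on runs of blank lines."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def split_sentences(paragraph):
    """Naive splitter: break after . ! or ? when followed by whitespace and a capital."""
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)
    sentences = []
    for piece in pieces:
        # Special case: rejoin if the previous piece ended in a known abbreviation.
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


if __name__ == "__main__":
    sample = "First paragraph. It has two sentences.\n\nSecond one, by Dr. Smith. Done!\n"
    for paragraph in split_paragraphs(sample):
        print(split_sentences(paragraph))
```

Each paragraph or sentence that comes out can then be treated as its own small document for clustering.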
Splitting by paragraph is pretty easy, but corpus specific. For plain text, try looking for blank lines. For HTML, make a list of breaking markup and insert split points wherever you find those. For other formats you will need to put on your thinking cap.

Sentence splitting is easy to do 90% correctly but hard to do better than 99%, especially in some domains. For your purposes, 90% is probably fine. Start with the simplest possible case and add a few special cases and you will be set. There may be usable software to be found on the net, but your needs are very modest.

Good luck! Let us know how it goes.

On Fri, Apr 30, 2010 at 10:32 AM, Bogdan Vatkov <[email protected]> wrote:

> Btw, why do you think splitting and clustering won't work? Has anybody
> tried this?
> I am not sure it will be successful but I also do not have the arguments
> that it should not lead to a meaningful result.
> If I split a doc per sentence it might not get good results but if I use
> larger pieces, e.g. paragraphs it might give some topics (sets of
> keywords).
> Anyone tried something like this?
>
