Thanks Ted! That was what I needed!

On Fri, Apr 30, 2010 at 10:21 PM, Ted Dunning <[email protected]> wrote:
> Yes. Splitting by paragraph should work fine (been there, done that).
>
> Splitting by sentence works well if you do something like SVD to smooth
> over the fact that you have few words per sentence.
>
> Splitting by paragraph is pretty easy, but corpus specific. For plain text,
> try looking for blank lines. For HTML, make a list of breaking markup and
> insert split points wherever you find those. For other formats you will
> need to put on your thinking cap.
>
> Sentence splitting is easy to do 90% correctly, hard to do better than 99%,
> especially in some domains. For your purposes, 90% is probably fine. Start
> with the simplest possible case and add a few special cases and you will be
> set. There may be usable software to be found on the net, but your needs
> are very modest.
>
> Good luck!
>
> Let us know how it goes.
>
> On Fri, Apr 30, 2010 at 10:32 AM, Bogdan Vatkov <[email protected]> wrote:
>
> > Btw, why do you think splitting and clustering won't work? Has anybody
> > tried this?
> > I am not sure it will be successful but I also do not have the arguments
> > that it should not lead to a meaningful result.
> > If I split a doc per sentence it might not get good results but if I use
> > larger pieces, e.g. paragraphs, it might give some topics (sets of
> > keywords).
> > Anyone tried something like this?

--
Best regards,
Bogdan
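
For anyone finding this thread in the archives, here is a minimal sketch of the plain-text case described above: paragraphs are split on blank lines, and sentences with the "simplest possible case plus a few special cases" approach. The function names and the abbreviation list are hypothetical, chosen only for illustration, and the sentence rule (end punctuation followed by whitespace and an ASCII capital letter) is an assumption that will miss cases in many corpora.

```python
import re

# Hypothetical special cases; extend this set for your own corpus.
ABBREVIATIONS = {"e.g.", "i.e.", "mr.", "mrs.", "dr.", "etc."}


def split_paragraphs(text):
    """Split plain text into paragraphs at one or more blank lines."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def split_sentences(paragraph):
    """Naive sentence splitter.

    Breaks after '.', '!' or '?' when followed by whitespace and an
    uppercase ASCII letter, unless the word ending at that point is a
    known abbreviation.
    """
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", paragraph):
        end = match.end()
        last_word = paragraph[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # special case: do not split after an abbreviation
        sentences.append(paragraph[start:end].strip())
        start = end
    tail = paragraph[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "First para. It has two sentences.\n\nDr. Smith arrived. He left!"
    for para in split_paragraphs(sample):
        print(split_sentences(para))
```

Each paragraph or sentence returned this way can then be treated as its own small document for clustering. HTML and other formats would need their own split rules, as noted in the quoted reply.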
