I am working on a project to analyze web documents, and there are two main components I am researching. First, I want to measure how related one document is to another. That comparison should be based on term frequency, possibly combined with a lexicon/word-pool lookup, something along those lines. One approach I have found is the Rocchio method.
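As a sketch of the term-frequency comparison (this assumes plain cosine similarity over raw term counts, which is the same vector-space model Rocchio builds on; the tokenizer and any stop-word/stemming step are simplifications):

```python
from collections import Counter
import math
import re

def term_freq(text):
    # Lowercase and split on letter runs; a real system would also
    # remove stop words and stem, e.g. via a lexicon lookup.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    # Cosine of the angle between two term-frequency vectors:
    # 1.0 = identical term distribution, 0.0 = no shared terms.
    fa, fb = term_freq(a), term_freq(b)
    dot = sum(fa[t] * fb[t] for t in fa if t in fb)
    na = math.sqrt(sum(v * v for v in fa.values()))
    nb = math.sqrt(sum(v * v for v in fb.values()))
    return dot / (na * nb) if na and nb else 0.0

print(cosine_similarity("the cat sat on the mat",
                        "a cat on a mat"))  # ≈ 0.4
```

In practice you would weight terms by TF-IDF rather than raw counts so that common words don't dominate the score.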
The second part is categorization, and I am having trouble working out how to do it. I want to build the categories dynamically. For example, if I have 1 billion documents, I want to parse them and come up with N categories (say 1000), each holding a subdivision of links. I was thinking I could seed the categories from the DESCRIPTION/KEYWORDS meta information and then use some Bayesian analysis to assign more links to each category. Does that sound workable?
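A minimal sketch of the Bayesian step, assuming a multinomial naive Bayes classifier whose categories are seeded from (hypothetical) meta-keyword text; the category names and training strings below are made up for illustration:

```python
from collections import Counter, defaultdict
import math
import re

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words counts."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> term counts
        self.doc_counts = Counter()              # category -> doc count
        self.vocab = set()

    def train(self, category, text):
        toks = tokens(text)
        self.word_counts[category].update(toks)
        self.doc_counts[category] += 1
        self.vocab.update(toks)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for cat in self.doc_counts:
            # log prior + log likelihood with Laplace (add-one) smoothing
            score = math.log(self.doc_counts[cat] / total_docs)
            counts = self.word_counts[cat]
            denom = sum(counts.values()) + len(self.vocab)
            for t in tokens(text):
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best, best_score = cat, score
        return best

nb = NaiveBayes()
# Seed categories from DESCRIPTION/KEYWORDS meta text (invented examples):
nb.train("sports", "football match score goal team")
nb.train("finance", "stock market shares trading price")
print(nb.classify("the team scored a late goal"))  # → sports
```

The seeding step gives each category an initial word distribution; every new document you classify (and accept) can then be fed back through `train` so the categories sharpen as the crawl proceeds. At a billion documents you would shard the counts, but the scoring math stays the same.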