I am working on a project to analyze web documents, and there are two main components I am researching. First, I want to measure how related one document is to another. That comparison should be based on term frequency, possibly combined with a lexicon/word-pool lookup, something along those lines. One approach I have found is the Rocchio method.
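As a sketch of the term-frequency comparison (this assumes plain cosine similarity over raw term counts, which is the same vector-space model Rocchio builds on; the tokenizer and any stop-word/stemming step are simplifications):

```python
from collections import Counter
import math
import re

def term_freq(text):
    # Lowercase and split on letter runs; a real system would also
    # remove stop words and stem, e.g. via a lexicon lookup.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    # Cosine of the angle between two term-frequency vectors:
    # 1.0 = identical term distribution, 0.0 = no shared terms.
    fa, fb = term_freq(a), term_freq(b)
    dot = sum(fa[t] * fb[t] for t in fa if t in fb)
    na = math.sqrt(sum(v * v for v in fa.values()))
    nb = math.sqrt(sum(v * v for v in fb.values()))
    return dot / (na * nb) if na and nb else 0.0

print(cosine_similarity("the cat sat on the mat",
                        "a cat on a mat"))  # ≈ 0.4
```

In practice you would weight terms by TF-IDF rather than raw counts so that common words don't dominate the score.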
The second part is categorization, and I am having trouble working out how to do it. I want to build the categories dynamically. For example, if I have 1 billion documents, I want to parse them and come up with N categories (say 1000), each holding a subdivision of links. I was thinking I could seed the categories from the DESCRIPTION/KEYWORDS meta information and then use some Bayesian analysis to assign more links to each category. Does that sound workable?
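A minimal sketch of the Bayesian step, assuming a multinomial naive Bayes classifier whose categories are seeded from (hypothetical) meta-keyword text; the category names and training strings below are made up for illustration:

```python
from collections import Counter, defaultdict
import math
import re

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words counts."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> term counts
        self.doc_counts = Counter()              # category -> doc count
        self.vocab = set()

    def train(self, category, text):
        toks = tokens(text)
        self.word_counts[category].update(toks)
        self.doc_counts[category] += 1
        self.vocab.update(toks)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for cat in self.doc_counts:
            # log prior + log likelihood with Laplace (add-one) smoothing
            score = math.log(self.doc_counts[cat] / total_docs)
            counts = self.word_counts[cat]
            denom = sum(counts.values()) + len(self.vocab)
            for t in tokens(text):
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best, best_score = cat, score
        return best

nb = NaiveBayes()
# Seed categories from DESCRIPTION/KEYWORDS meta text (invented examples):
nb.train("sports", "football match score goal team")
nb.train("finance", "stock market shares trading price")
print(nb.classify("the team scored a late goal"))  # → sports
```

The seeding step gives each category an initial word distribution; every new document you classify (and accept) can then be fed back through `train` so the categories sharpen as the crawl proceeds. At a billion documents you would shard the counts, but the scoring math stays the same.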