Since we have such a strange plugin structure (DI? IoC?), and so many utility
classes with a single UNIX shell script to run everything...


1. Separate concerns. Clearly.
- Crawl
- Parse
- Generate URL List
- Crawl
- ...
(The WebDB interfaces should be clearer, so that we can use databases, etc.)
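
A rough sketch of what such a separation could look like. Every name here
(PageStore, Fetcher, LinkParser, UrlListGenerator, runCycle) is made up for
illustration and is not an existing Nutch type; the point is only that each
step sits behind its own small interface, and that the storage contract does
not care whether a map, a file, or a database is behind it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Every name here is hypothetical; none of these are existing Nutch types.
class CrawlCycleSketch {

    /** Storage contract: a map, flat files, or a database could sit behind it. */
    interface PageStore {
        void addUrl(String url);
        void putPage(String url, String content);
        Set<String> knownUrls();
        Set<String> fetchedUrls();
    }

    /** Crawl step: fetch the raw content of one URL. */
    interface Fetcher { String fetch(String url); }

    /** Parse step: extract outgoing links from raw content. */
    interface LinkParser { List<String> outlinks(String content); }

    /** Generate step: decide which URLs to fetch in the next cycle. */
    interface UrlListGenerator { List<String> generate(PageStore store); }

    /** Simplest possible store, kept entirely in memory. */
    static class InMemoryPageStore implements PageStore {
        private final Set<String> known = new LinkedHashSet<>();
        private final Map<String, String> pages = new LinkedHashMap<>();
        public void addUrl(String url) { known.add(url); }
        public void putPage(String url, String content) { known.add(url); pages.put(url, content); }
        public Set<String> knownUrls() { return Collections.unmodifiableSet(known); }
        public Set<String> fetchedUrls() { return Collections.unmodifiableSet(pages.keySet()); }
    }

    /** Generator that selects every known URL that has not been fetched yet. */
    static final UrlListGenerator UNFETCHED = store -> {
        List<String> todo = new ArrayList<>(store.knownUrls());
        todo.removeAll(store.fetchedUrls());
        return todo;
    };

    /** One cycle: generate URL list, crawl, parse, update the store. Returns pages fetched. */
    static int runCycle(PageStore store, UrlListGenerator gen, Fetcher fetcher, LinkParser parser) {
        List<String> todo = gen.generate(store);
        for (String url : todo) {
            String content = fetcher.fetch(url);
            store.putPage(url, content);
            for (String out : parser.outlinks(content)) {
                store.addUrl(out);
            }
        }
        return todo.size();
    }

    public static void main(String[] args) {
        // A fake three-page "web": each page's content is a space-separated list of links.
        Map<String, String> fakeWeb = new HashMap<>();
        fakeWeb.put("a", "b c");
        fakeWeb.put("c", "a");

        PageStore store = new InMemoryPageStore();
        store.addUrl("a");
        Fetcher fetcher = url -> fakeWeb.getOrDefault(url, "");
        LinkParser parser = text -> text.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(text.split(" "));

        int fetched;
        while ((fetched = runCycle(store, UNFETCHED, fetcher, parser)) > 0) {
            System.out.println("fetched " + fetched + " page(s)");
        }
    }
}
```

With this shape, swapping InMemoryPageStore for a database-backed store would
touch only the one class, and the shell script would just be a loop around
runCycle.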

1a. Data Mining (finding new language constructs)


2. Automate Classification
- Anchor text is the true subject of a page
- Page contains anchors
- Anchor text is the class of the referenced pages
Sample: the page "Computer Hardware" has a link with anchor text "Network
Cards", so the page that link references belongs to the class "Network Cards".
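
To make the idea concrete, here is a minimal sketch that groups referenced
pages by the anchor text of the links pointing at them. AnchorLink and
classify are hypothetical names (this is not the 'Link' class in trunk), the
example.com URLs are placeholders, and lower-casing the anchor text is my own
assumption about how labels would be normalized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch; AnchorLink is not the 'Link' class from trunk.
class AnchorClassifierSketch {

    /** One hyperlink: the page it appears on, the page it points to, its anchor text. */
    static final class AnchorLink {
        final String fromUrl;
        final String toUrl;
        final String anchorText;

        AnchorLink(String fromUrl, String toUrl, String anchorText) {
            this.fromUrl = fromUrl;
            this.toUrl = toUrl;
            this.anchorText = anchorText;
        }
    }

    /** Maps each normalized anchor text (the class) to the set of pages referenced with it. */
    static Map<String, Set<String>> classify(List<AnchorLink> links) {
        Map<String, Set<String>> classes = new TreeMap<>();
        for (AnchorLink link : links) {
            // Assumption: trimming and lower-casing is enough normalization for a sketch.
            String label = link.anchorText.trim().toLowerCase(Locale.ROOT);
            if (label.isEmpty()) {
                continue; // links without anchor text (e.g. images) carry no class
            }
            classes.computeIfAbsent(label, k -> new TreeSet<>()).add(link.toUrl);
        }
        return classes;
    }

    public static void main(String[] args) {
        List<AnchorLink> links = new ArrayList<>();
        links.add(new AnchorLink("http://example.com/hardware",
                "http://example.com/nics", "Network Cards"));
        links.add(new AnchorLink("http://example.com/shop",
                "http://example.com/nic-reviews", "network cards"));
        links.add(new AnchorLink("http://example.com/hardware",
                "http://example.com/cpus", "Processors"));

        for (Map.Entry<String, Set<String>> e : classify(links).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

A page referenced under several different anchor texts simply ends up in
several classes; whether that is desirable is an open question.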


3. Data Mining (???)
- String Tokenization
- Sentence
- Human Language
- AJAX, Red Rouge, Opteron, Break Barrel, Caviar, The Jacobian Conjecture,
... - different language constructs for different sites?
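
For the tokenization levels above, a minimal sketch using the JDK's
java.text.BreakIterator, which is locale-aware, so the same code can be pointed
at different human languages. TokenizerSketch and its methods are hypothetical
names, and treating only pieces that start with a letter or digit as words is
my own simplification.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: sentence and word tokenization with the JDK BreakIterator.
class TokenizerSketch {

    /** Splits text into trimmed sentences for the given locale. */
    static List<String> sentences(String text, Locale locale) {
        return split(BreakIterator.getSentenceInstance(locale), text, false);
    }

    /** Splits text into word tokens, dropping whitespace and punctuation. */
    static List<String> words(String text, Locale locale) {
        return split(BreakIterator.getWordInstance(locale), text, true);
    }

    private static List<String> split(BreakIterator it, String text, boolean wordsOnly) {
        List<String> out = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (piece.isEmpty()) {
                continue; // whitespace between tokens
            }
            // Simplification: a word is any piece that starts with a letter or digit.
            if (wordsOnly && !Character.isLetterOrDigit(piece.codePointAt(0))) {
                continue;
            }
            out.add(piece);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "AJAX is popular. Opteron is a CPU.";
        for (String sentence : sentences(text, Locale.ENGLISH)) {
            System.out.println(sentence + " -> " + words(sentence, Locale.ENGLISH));
        }
    }
}
```

This only covers the mechanical part. Deciding that "Break Barrel" or "The
Jacobian Conjecture" is a single construct would need something on top of it,
probably different per site.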

Nothing "Agile".

Much has changed in trunk, such as 'Link' and 'WebDB'; it simplifies...

Thanks



_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
