Hey Chad,

On 7-dec-2006, at 18:52, chad savage wrote:
We would like to organize information into a hierarchical category system. It's all general web content (HTML from the web). Yes, there are a number of references to varying techniques on the net (scientific papers, theoretical, practical, mind-boggling). My problem is determining the best method and, of course, implementing it with my limited Nutch/Java abilities. I may have to outsource most of this. Not to mention the many formats for ontologies: OWL, RDF, DAML, and some others I am sure I'm missing.

Unfortunately, letting a machine organize information is not a trivial problem, so if you have no previous experience with it, you might easily be overwhelmed by all the theories and file formats. Fortunately, though, you might not need to use such a technique at all, because often there are other ways to classify text, for example simple metadata:

We would like to be able to crawl the web and categorize the pages into buckets. We currently have a number of separate configs for Nutch, all crawling different subsets of our websites with multiple indexes, as a start for being able to search separate categories. The goal is to have one crawl that can scan all of the websites, index the content into these predetermined buckets, and keep them in one master index.

When you say "our websites", do you mean websites you maintain yourself? In that case it could be trivial, depending on your content management system, to add some extra information to each page about which 'bucket' it should be placed in.

Otherwise, since you apparently already have some configurations separating the different categories, it might be possible to translate those into a plugin that hooks in as an HtmlParseFilter and attaches some metadata to your parsed content.

On the wiki you'll find an example using similar techniques. See http://wiki.apache.org/nutch/WritingPluginExample
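
To give a rough idea of the "translate your configurations into metadata" step, here is a minimal sketch in plain Java. It does not use any Nutch classes: BucketClassifier, its methods, the example URLs and the bucket names are all hypothetical, and I am assuming your existing configs separate the categories by URL prefix. In a real plugin, something like bucketFor() would be called from your HtmlParseFilter implementation and the result stored in the parse metadata, so it can be indexed as a field later.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Nutch): maps a page URL to one of the
// predetermined buckets, based on URL prefixes taken from the crawl configs.
class BucketClassifier {
    // Assumed fallback for pages that match none of the rules.
    static final String DEFAULT_BUCKET = "uncategorized";

    // Insertion-ordered, so more specific prefixes can be registered first.
    private final Map<String, String> prefixToBucket =
        new LinkedHashMap<String, String>();

    // Registers a rule; returns this so rules can be chained.
    BucketClassifier addRule(String urlPrefix, String bucket) {
        prefixToBucket.put(urlPrefix, bucket);
        return this;
    }

    // Returns the bucket of the first rule whose prefix matches the URL.
    String bucketFor(String url) {
        for (Map.Entry<String, String> rule : prefixToBucket.entrySet()) {
            if (url.startsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        return DEFAULT_BUCKET;
    }

    public static void main(String[] args) {
        // Example rules only; replace them with the prefixes from your configs.
        BucketClassifier classifier = new BucketClassifier()
            .addRule("http://www.example.com/products/", "products")
            .addRule("http://www.example.com/", "general");

        // Prints "products"
        System.out.println(
            classifier.bucketFor("http://www.example.com/products/widget.html"));
        // Prints "uncategorized"
        System.out.println(classifier.bucketFor("http://other.example.org/"));
    }
}
```

Because the first matching rule wins, the order in which you add the rules matters: put the most specific prefixes first. If your categories are not separated by URL at all, this approach will not work and you would have to look at the page content instead.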

--
Regards,

Eelco Lempsink

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general