hi:
I'm not very experienced with Nutch, but I think Nutch should classify the
pages into different buckets each time after fetching them. Then you can
search them and display them.
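A minimal sketch of that idea, assuming nothing about Nutch's own APIs: once a page's text has been fetched, assign it to one of a set of predefined buckets. The categories, training snippets, and the hand-rolled naive Bayes classifier below are all illustrative, not anything Nutch ships with.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split on non-alphabetic characters.
    cleaned = ''.join(c if c.isalpha() else ' ' for c in text.lower())
    return cleaned.split()

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over a fixed
    set of predefined categories (the 'buckets')."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of docs
        self.vocab = set()

    def train(self, text, category):
        tokens = tokenize(text)
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float('-inf')
        for cat in self.doc_counts:
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[cat] / total_docs)
            denom = sum(self.word_counts[cat].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[cat][tok] + 1) / denom)
            if score > best_score:
                best, best_score = cat, score
        return best

nb = NaiveBayes()
nb.train("football match score goal team", "sports")
nb.train("stock market shares price trading", "finance")
print(nb.classify("the team won the match with a late goal"))  # sports
```

In a real Nutch setup the classification step would sit in an indexing plugin, so each page's bucket is stored as an index field and can be used as a search filter later.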

On 12/8/06, Shay Lawless <[EMAIL PROTECTED]> wrote:

Hi Chad,

I use a focused web crawler called the metacombine project
(http://www.metacombine.org/) to classify the content retrieved during a
web crawl. It combines the Heritrix web crawler from the Internet Archive
with the Rainbow text classifier from CMU. I'm not sure whether you can use
it to crawl for multiple categories at once; that might take a bit of
alteration, as I use it to crawl for one specific topic or category at a
time. Have a look at the website. If it sounds like something that might
work for you, give me a shout back.

Thanks

Shay

On 07/12/06, chad savage <[EMAIL PROTECTED]> wrote:
>
> Hey Eelco,
>
> We would like to organize information into a hierarchical category
> system.  It's all general web content (HTML from the web).
> Yes, there are a number of references to varying techniques on the net
> (scientific papers, theoretical, practical, mind-boggling). My problem
> is determining the best method, and of course implementing it with my
> limited Nutch/Java abilities.  We may have to outsource most of this.
> Not to mention the many formats for ontologies: OWL, RDF, DAML, and
> some others I'm sure I'm missing.
>
> We would like to be able to crawl the web and categorize the pages into
> buckets.  We currently have a number of separate configs for nutch all
> crawling different subsets of our web sites with multiple indexes as a
> start for being able to search separate categories.  The goal is to have
> one crawl that can scan all of the websites and index the content into
> these predetermined buckets and keep them in one master index.
>
> If there are any groups out there that handle this I would be more than
> happy to discuss techniques and possible outsourcing.
>
> Chad
>
>
> Eelco Lempsink wrote:
> > On 5-dec-2006, at 7:01, chad savage wrote:
> >> I'm doing some research on how to classify documents into pre-defined
> >> categories.
> >
> > On the basis of...?  The most appropriate technique depends on
> > the type of documents and the type of categories. For instance, are
> > the documents structured (e.g. all XML using a common definition) or
> > unstructured (HTML from the web)?  Are you looking to place
> > documents in a large hierarchical category system, or is it a simple
> > binary decision (e.g. 'spam' or 'no spam')?
> >
> > If you know what you want and what it's called, it should be
> > relatively easy to find information and scientific papers about it.
> >
> > --Regards,
> >
> > Eelco Lempsink
> >
>




--
www.babatu.com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
