Hi,

In general terms, the CC plugin looks for the "CC:license" on the web pages it crawls. You can see an example at http://creativecommons.org/: at the end of the page there is a CC logo and some copyright text. If you view source, you will see the HTML for that part of the page. Whenever the Nutch crawler finds such a page, it indexes the page; otherwise it discards the page and moves on to the next one. In essence, this HTML snippet could be anything, e.g. a specific piece of text, a group of text and so on.
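To make the idea concrete, here is a minimal sketch of that kind of check. It is not the plugin's actual code: the class name, the method names and the regular expression are all made up for illustration, and it assumes the license shows up as a link to creativecommons.org/licenses/ in the page HTML.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: detects a CC license link in raw HTML.
class CcLicenseSketch {

    // Assumption: the license appears as an href pointing at creativecommons.org/licenses/...
    private static final Pattern LICENSE_HREF = Pattern.compile(
        "href\\s*=\\s*[\"'](https?://creativecommons\\.org/licenses/[^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the first license URL found in the page, if any.
    static Optional<String> findLicenseUrl(String html) {
        Matcher m = LICENSE_HREF.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Mirrors the "index it, otherwise skip it" decision described above.
    static boolean shouldIndex(String html) {
        return findLicenseUrl(html).isPresent();
    }

    public static void main(String[] args) {
        String licensed = "<p><a rel=\"license\" "
            + "href=\"http://creativecommons.org/licenses/by/2.5/\">Some rights reserved</a></p>";
        String plain = "<p>All rights reserved.</p>";

        System.out.println(findLicenseUrl(licensed).orElse("none")); // prints the license URL
        System.out.println(shouldIndex(plain));                      // prints false
    }
}
```

Swapping the pattern for another one is how you would match "specific text" or any other snippet instead of a license link.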
Whenever the CC plugin finds a CC page, it also adds some CC-specific fields to the Lucene index so they can be used in queries etc. I think all of the above, i.e. the CCparser, CCindexer and CCquery filters, are under the CC plugin directory.

Cheers

On 1/9/07, Justin Hartman <[EMAIL PROTECTED]> wrote:
> On 1/9/07, Zaheed Haque <[EMAIL PROTECTED]> wrote:
> > there is a creative commons plugin in nutch src/plugin/creativecommons ..
> > which
> > does somewhat similar things could be good starting point.
>
> Sorry to change the subject on this one but what exactly does the
> creativecommons plugin do and how would you use it? I've been very
> interested in this plugin but it's not altogether documented that well
> (I don't think).
> --
> Regards
> Justin Hartman
> PGP Key ID: 102CC123

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
