Hi:

In general terms, the CC plugin looks for a "CC:license" marker on the web
pages it crawls. You can see one at http://creativecommons.org/: at the
bottom of the page there is a CC logo and some copyright text. If you view
the page source you will see the HTML for that part of the page. Whenever
the Nutch crawler finds a page containing such a snippet, it indexes the
page; otherwise it discards the page and moves on to the next one. In
essence, this HTML snippet could be anything, e.g. a specific piece of
text, a group of terms and so on.
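
To make the idea concrete, here is a rough sketch in Python (not the plugin's actual Java code) of the "keep the page only if it carries a CC snippet" check. It assumes the snippet is a link to a URL under creativecommons.org/licenses/; the names CC_LICENSE_MARKER, LicenseLinkFinder, find_cc_license and should_index are made up for illustration.

```python
from html.parser import HTMLParser
from typing import Optional

# Assumption: a page counts as CC-licensed if it links to a URL
# containing this marker. The real plugin may look for something else.
CC_LICENSE_MARKER = "creativecommons.org/licenses/"


class LicenseLinkFinder(HTMLParser):
    """Remembers the first <a href=...> that points at a CC license."""

    def __init__(self) -> None:
        super().__init__()
        self.license_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Ignore non-links, and stop looking once a license was found.
        if tag != "a" or self.license_url is not None:
            return
        href = dict(attrs).get("href") or ""
        if CC_LICENSE_MARKER in href:
            self.license_url = href


def find_cc_license(html: str) -> Optional[str]:
    """Return the license URL found in the page, or None."""
    finder = LicenseLinkFinder()
    finder.feed(html)
    return finder.license_url


def should_index(html: str) -> bool:
    """Keep the page only if it carries a CC license link."""
    return find_cc_license(html) is not None
```

Swapping the marker for a specific piece of text or a group of terms would give the "could be anything" variant mentioned above.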

Whenever the CC plugin finds a CC page, it also adds some CC-specific
fields to the Lucene index so that they can be queried. I think all of the
above, i.e. the CCparser, CCindexer and CCquery filters, live under the CC
plugin directory.
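
As a rough illustration of what "CC-specific fields" could look like, here is a small Python sketch (again, not the plugin's code). The field names cc_license_url and cc_features, the helper names, and the in-memory list standing in for the Lucene index are all assumptions made for the example.

```python
from typing import Dict, List
from urllib.parse import urlparse


def cc_fields(license_url: str) -> Dict[str, object]:
    """Derive hypothetical index fields from a CC license URL."""
    # e.g. "/licenses/by-nc-sa/2.5/" -> ["licenses", "by-nc-sa", "2.5"]
    parts = [p for p in urlparse(license_url).path.split("/") if p]
    code = parts[1] if len(parts) > 1 and parts[0] == "licenses" else ""
    return {
        "cc_license_url": license_url,
        "cc_features": code.split("-") if code else [],
    }


def search_by_feature(index: List[Dict[str, object]], feature: str) -> List[str]:
    """Return the URLs of documents whose CC fields include the feature."""
    return [doc["url"] for doc in index if feature in doc.get("cc_features", [])]


# A plain list of dicts stands in for the real index here.
index = [
    {"url": "http://example.org/a",
     **cc_fields("http://creativecommons.org/licenses/by-nc-sa/2.5/")},
    {"url": "http://example.org/b",
     **cc_fields("http://creativecommons.org/licenses/by/2.5/")},
]
```

With this toy index, search_by_feature(index, "nc") would return only the first page, which is the kind of query the extra fields make possible.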

Cheers

On 1/9/07, Justin Hartman <[EMAIL PROTECTED]> wrote:
> On 1/9/07, Zaheed Haque <[EMAIL PROTECTED]> wrote:
> > there is a creative commons plugin in nutch src/plugin/creativecommons ..
> > which does somewhat similar things could be good starting point.
>
> Sorry to change the subject on this one but what exactly does the
> creativecommons plugin do and how would you use it? I've been very
> interested in this plugin but it's not altogether documented that well
> (I don't think).
> --
> Regards
> Justin Hartman
> PGP Key ID: 102CC123
>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
