Thanks Mingfai and Thorsten for your answers, and sorry for the looooong reaction time (I was a little "stack overflow").
This helps me understand Droids better. I have already done something like what I tried to describe to you, in Lenya. While doing that I ran into these problems:

- HTML pages that use framesets (the content of the frames is not retrieved)
- the encoding of the page
- malformed and faulty HTML (<p> with no </p>, etc.)
- pages that return 301 Moved Permanently

When Droids gives me a page, is it 100% clean (X)HTML? ;)

And another thing:

- how are Flash, images, .zip files, ... handled: just as a link, or as a download?
- how is JavaScript handled?
- and forms?

Thanks

On Tue, 14 Jul 2009 17:11:35 +0800, Mingfai <[email protected]> wrote:
> hi,
>
>
>> So let's go :
>> >
>> > I would like to pass to droids an xml like (just an example) :
>> > <article>
>> > <droids:url>http://example.com/test.html</droids:url>
>>
>> In droids crawling the url is the entrance point of the processing. What
>> happens then is highly configurable and currently Ming Fai has suggested
>> some changes for the future. I will describe the possibilities that
>> droids currently offers for the presented use case.
>>
>> As said, we start with the queue where you inject the starting urls.
>> Then this queue will call a worker (which basically is the part of the
>> code where the real work is done). This worker may call a linkExtractor
>> and/or a Parser to extract links and any other information about the
>> incoming page.
>
>
> I think most crawlers (incl. Droids and any of my suggested changes) work in
> more or less the same way. We always have URLs as seeds that are put in a
> queue/list (TaskQueue in Droids), a main component to control multi-threading
> and execution (TaskMaster), components to fetch/retrieve the URL as an
> inputstream/entity (Worker and Protocol), components to parse/process the
> inputstream/entity (Parser), and components to extract outlinks (LinkExtractor)
> and put them back into the main queue/list (Worker). Droids also has URLFilter,
> which accepts/rejects outlinks, TaskValidator to intercept at
> add-to-queue time (it works similarly to URLFilter for crawling, maybe you
> could ignore this), and DelayTimer to slow down the fetching. The above refers
> to the current Droids implementation. I think it covers most of the main
> concepts.
>
> regards,
> mingfai
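
To check that I follow the flow described in the quoted text (seed URLs go into a queue, a worker fetches each one, a link extractor pulls out the outlinks, a URL filter accepts or rejects them, and accepted links go back into the queue, with an optional delay between fetches), here is a rough sketch of it. It is written in Python only for brevity and is not the Droids API: `crawl`, `LinkExtractor`, `fetch`, `url_filter` and the in-memory `PAGES` dictionary are all hypothetical names of mine that only mirror the roles named in the thread, and fetching is faked so that no network access is needed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
import time


class LinkExtractor(HTMLParser):
    """Hypothetical link extractor: collects absolute URLs from <a href>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def crawl(seeds, fetch, url_filter, delay_seconds=0.0, max_pages=100):
    """Single-threaded crawl loop; returns the URLs fetched, in order.

    fetch(url) must return the page text, or None if it cannot be retrieved.
    url_filter(url) must return True for outlinks that may be queued.
    """
    queue = deque(seeds)   # plays the role of the task queue
    seen = set(seeds)      # avoids queueing the same URL twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # the "worker" step
        if html is None:
            continue
        visited.append(url)
        extractor = LinkExtractor(url)
        extractor.feed(html)  # html.parser tolerates unclosed tags
        for link in extractor.links:
            if link not in seen and url_filter(link):
                seen.add(link)
                queue.append(link)
        if delay_seconds:
            time.sleep(delay_seconds)  # crude stand-in for a delay timer
    return visited


# Fake "web" for the example; the first page has an unclosed <p> on purpose.
PAGES = {
    "http://example.com/test.html":
        '<p>unclosed <a href="/a.html">A</a> <a href="http://other.example/x">X</a>',
    "http://example.com/a.html": '<a href="test.html">back</a>',
}

visited = crawl(
    seeds=["http://example.com/test.html"],
    fetch=PAGES.get,
    url_filter=lambda u: u.startswith("http://example.com/"),
)
print(visited)
```

In this sketch the filter keeps the crawl on example.com, so the link to other.example is never queued, and the back-link to test.html is skipped because it was already seen. Frames, redirects, encodings and JavaScript are deliberately left out; those are exactly the points I am asking about above.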
