Re: Need 1 :

Florent André Tue, 11 Aug 2009 18:52:50 -0700

Thanks Mingfai and Thorsten for your answers, and sorry for the looooong
reaction time (I was a little "stack overflowed").


This helps me understand Droids better.

I have already done something like what I tried to describe to you, in Lenya.
While doing it I ran into these problems:
- HTML pages that use framesets (the content of the frames is not retrieved)
- the encoding of the page
- malformed and faulty HTML (<p> with no </p>, etc.)
- pages that return 301 Moved Permanently

When Droids gives me a page, is it 100% clean (X)HTML? ;)

And another thing:
- how does it deal with Flash, images, .zip files, ...: does it just keep the link, or download the file?
- how does it deal with JavaScript?
- and with forms?

Thanks

On Tue, 14 Jul 2009 17:11:35 +0800, Mingfai <[email protected]> wrote:
> hi,
> 
> 
>> So let's go :
>> >
>> > I would like to pass to droids an xml like (just an example) :
>> > <article>
>> >   <droids:url>http://example.com/test.html</droids:url>
>>
>> In droids crawling the url is the entrance point of the processing. What
>> happens then is highly configurable and currently Ming Fai has suggested
>> some changes for the future. I will describe the possibilities that
>> droids currently offers for the presented use case.
>>
>> As I said, we start with the queue, where you inject the starting urls.
>> Then this queue will call a worker (which basically is the part of the
>> code where the real work is done). This worker may call a linkExtractor
>> and/or a Parser to extract links and any other information about the
>> incoming page.
> 
> 
> 
> I think most crawlers (incl. Droids and any of my suggested changes) work in
> more or less the same way. We always have URLs as seeds that are put in a
> queue/list (TaskQueue in Droids), a main component to control multi-threading
> and execution (TaskMaster), components to fetch/retrieve the URL as an
> inputstream/entity (Worker and Protocol), components to parse/process the
> inputstream/entity (Parser), and components to extract outlinks (LinkExtractor)
> and put them back into the main queue/list (Worker). Droids also has URLFilter,
> which accepts/rejects outlinks, TaskValidator to intercept at add-to-queue
> time (it works similarly to URLFilter for crawling, so maybe you could ignore
> it), and DelayTimer to slow down the fetching. The above refers
> to the current Droids implementation. I think it covers most of the main
> concepts.
> 
> regards,
> mingfai
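
The flow described above (seed URLs in a queue, fetch, extract outlinks,
filter them, put them back in the queue, wait a little) can be sketched roughly
as follows. This is only a minimal illustration in plain Java, not the Droids
API: the class CrawlSketch, the methods extractLinks and crawl, the in-memory
"pages" map that stands in for the fetch step, and the regex-based link
extraction are all assumptions made up for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a crawl loop; none of these names come from Droids.
public class CrawlSketch {

    // Naive stand-in for a link extractor: collects href="..." values.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    public static List<String> extractLinks(String body) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(body);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // "pages" replaces the real fetch step (URL -> body), so no network is needed.
    // "urlFilter" accepts or rejects outlinks before they are queued.
    // "delayMillis" is the pause after each processed page.
    public static List<String> crawl(String seed, Map<String, String> pages,
                                     Predicate<String> urlFilter, long delayMillis) {
        Queue<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();      // avoids queueing a URL twice
        List<String> visited = new ArrayList<>();
        queue.add(seed);
        seen.add(seed);

        while (!queue.isEmpty()) {
            String url = queue.poll();
            String body = pages.get(url);        // "fetch"
            if (body == null) {
                continue;                        // nothing retrievable for this URL
            }
            visited.add(url);
            for (String link : extractLinks(body)) {
                // Set.add returns false when the link was already seen.
                if (urlFilter.test(link) && seen.add(link)) {
                    queue.add(link);
                }
            }
            try {
                Thread.sleep(delayMillis);       // slow down the fetching
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, String> pages = Map.of(
            "http://example.com/test.html",
            "<a href=\"http://example.com/a.html\">a</a> <a href=\"http://other.org/x.html\">x</a>",
            "http://example.com/a.html",
            "<a href=\"http://example.com/test.html\">back</a>",
            "http://other.org/x.html",
            "no links here");
        List<String> visited = crawl("http://example.com/test.html", pages,
                url -> url.startsWith("http://example.com/"), 0L);
        System.out.println(visited);
    }
}
```

In this toy run the filter only accepts URLs under http://example.com/, so the
other.org link is never queued, and the link back to the seed is skipped
because it was already seen. A real crawler would also have to deal with
relative links, redirects, encodings and broken markup, which this sketch
ignores.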
