Re: some proposed ideas for Droids

Ryan McKinley Mon, 22 Jun 2009 14:23:47 -0700

For background, I am not (yet) using droids for web crawling -- rather, I use it to manage a bunch of jobs that keep external processes running. It is easy to equate droids with crawling, but I think that is one of many functions (though obviously the most generally relevant)



On Jun 22, 2009, at 11:40 AM, Mingfai wrote:

hi,

I have some proposed ideas for discussion. Some of them are design
principles or concepts, and some are more concrete points about the design
of specific items. The points are as follows:

* - indicates items that I consider major changes

  1. Use of Java Package
     - do not use an org.apache.droids.api package. Just put the API
     interfaces in the individual packages.

+0  (sure... I have no strong opinion)

     - do not use *.impl unless there are 3 or more classes. (I don't
     really have a strong opinion at this point, I just don't like to have
     some impl package with just one or two classes)

+0

  2. General coding practice/standard
     - use protected instead of private by default. This will allow users
     to extend and replace any of our classes more easily.
     - do not use final for LinkTask/Link. I have a use case where I want
     to extend Link/LinkTask as a JPA Entity Bean. JPA doesn't allow a
     class without a default constructor, i.e. we can't make the URL field
     final. For the classes that are not expected to be subclassed by
     users, it's ok to use final.

     - Use Java standard interfaces rather than introducing our own
     interfaces unless there is obvious added value, e.g. replace
     TaskQueue with java.util.Queue<Task>

+1
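
A minimal sketch of the java.util.Queue<Task> suggestion above. The Task interface, SimpleTask and QueueSketch are hypothetical stand-ins invented for this illustration; the real Droids Task may look different.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for the Droids Task interface; the real one may differ.
interface Task {
    String getId();
}

// Hypothetical task implementation, used only for this illustration.
class SimpleTask implements Task {
    private final String id;

    SimpleTask(String id) {
        this.id = id;
    }

    public String getId() {
        return id;
    }
}

class QueueSketch {
    // Any java.util.Queue implementation can stand in for a custom TaskQueue.
    static Queue<Task> newQueue() {
        return new ConcurrentLinkedQueue<Task>();
    }

    // Drains the queue in FIFO order and returns the ids that were seen.
    static String drain(Queue<Task> queue) {
        StringBuilder seen = new StringBuilder();
        Task task;
        while ((task = queue.poll()) != null) {
            seen.append(task.getId()).append(' ');
        }
        return seen.toString().trim();
    }
}
```

Code written against Queue<Task> does not care which implementation is behind it, so a bounded or priority-ordered queue could be swapped in without touching the callers.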

     - *Use Spring for droids-core
        - i.e. allow droids-core to depend on Spring

        - a droid needs a whole object graph to work. As a framework, we
        want different components to be configurable. It's better to rely
        on an IoC framework to manage the dependencies and configuration.
        Spring is the most popular IoC framework. It will also make testing
        much easier, and users need not change code (only xml) if they
        want to change certain behavior of our core classes.

        - there are some special benefits of using Spring, e.g. it supports
        annotations and autowiring. As an example, I could define a field
        like "@Autowired Collection<Filter> filters", and when I add new
        filters, I create 2 classes like "@Component public class
        XXXFilter", and at runtime the filters field will be injected with
        a collection of the 2 classes. It makes development and
        configuration really simple. (and there are also ways to change the
        autowiring behavior)

        - Spring facilitates the use of http remoting. And it is easy to
        replace an implementation class to do remoting or other
        interception.

Does the *core* really need access to the whole object graph? I totally agree that most specific implementations will need broader access.

I think droids power will come from its flexibility / simplicity. Ideally the *core* will have as few dependencies as possible.

I agree that sub-project/package that focuses on web crawling could depend on spring.

  3. Specific concept/component. Some of the points in this section
  are just my comments on the concept, not proposals for action.
     - Droids/Crawler
        - Our top level concept. We only use Droid but not Crawler. I use
        the generic term Crawler in this message.

Ya -- the term "droid" was intentionally chosen so that it represents the larger concept of a robot doing something. Crawling is one instance of what it may do. (again, likely the most broadly used)

     - Link, LinkTask, Task

        - Task is a valid concept. A task is the unit of work done by the
        worker. A link refers to the link only. I have no objection to
        this concept. But in the implementation, it seems there is not
        much need to implement the Task concept. (naming an interface
        "nextTask" is ok but there seems to be no need for a class or
        interface called "Task")

        - a crawler works with links, and we don't normally have non-link
        related tasks, which go beyond the scope of a crawler framework.

        - See the Link-centric design bullet for more info about Link

     - Fetcher

        - we do not use the concept of Fetcher now. I suppose it is because
        Droids is designed to do more than web crawling and a non-web
        resource is not "fetched"? In Droids 0.01, Protocol basically
        represents the Fetcher (or at least, Protocol+Worker)

+1 -- I think a Fetcher concept is a good idea. It should also be independent of the task interface. When thinking about the fetcher, it may be good to consider VFS (http://commons.apache.org/vfs/) as a first class implementation.

        - I strongly think we should use the term and concept of Fetcher
        because it is common terminology in crawlers. Using common terms
        and language makes our design more intuitive.
     - Parser, Handler, Processor, Extractor etc. These are terms that
     share very similar meanings. No matter how we use them, we need to
     give each a strict definition, e.g. in a class level JavaDoc comment.
     My suggestions:
        - Parser - the component that processes the raw fetched Entity.
        Output data is subject to implementation. One Entity will be
        parsed by one parser only.

        - Extractor - the component that extracts links from an entity.
        Multiple extractors could be used for a parser. The primary
        function is to extract links. Users may also use it to do other
        extraction or operations, e.g. to store data in the Link, or just
        consume the parsed data. An extractor depends directly on a
        parser. (we can't easily define a contract between Parser and
        Extractor, so let's not attempt this.)
        Extractor is a new concept. It is split from Parser and differs in
        that each link shall be parsed once, and multiple extractors may
        perform extraction or custom operations against the parsed data. I
        think I mentioned this in another email before. Say when we use
        NekoHtmlParser, we want to parse just once, and maybe extractor1
        is for extracting outlinks, and extractor2 is for custom processing
        (such as indexing), and both are based on the same parsed data.
        - Processor - too vague. Do not use this concept, and we are not
        using it anyway.

        - Handler - for an event based Parser, it may use a SAX
        DefaultHandler. To avoid confusion, let's not use handler in
        other contexts.
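
The Parser/Extractor split described above could be sketched roughly as follows. All names here (WordParser, Extractor, OutlinkExtractor, IndexTermExtractor) are hypothetical and only illustrate "parse once, extract many times"; whitespace tokenizing stands in for a real parser such as NekoHtmlParser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical Parser: parses raw content once and keeps the result.
class WordParser {
    private List<String> tokens = new ArrayList<String>();
    int parseCount = 0;

    void parse(String raw) {
        parseCount++;
        tokens = Arrays.asList(raw.trim().split("\\s+"));
    }

    List<String> getTokens() {
        return tokens;
    }
}

// Hypothetical Extractor: reads already parsed data and never re-parses.
// It depends directly on the concrete parser, as the proposal suggests.
interface Extractor {
    List<String> extract(WordParser parsed);
}

// extractor1: collects outlinks (here: tokens starting with "http://").
class OutlinkExtractor implements Extractor {
    public List<String> extract(WordParser parsed) {
        List<String> links = new ArrayList<String>();
        for (String token : parsed.getTokens()) {
            if (token.startsWith("http://")) {
                links.add(token);
            }
        }
        return links;
    }
}

// extractor2: custom processing (here: the remaining tokens, e.g. for indexing).
class IndexTermExtractor implements Extractor {
    public List<String> extract(WordParser parsed) {
        List<String> terms = new ArrayList<String>();
        for (String token : parsed.getTokens()) {
            if (!token.startsWith("http://")) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

Both extractors read the same parsed tokens, so the parsing cost is paid once per link regardless of how many extractors are registered.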


agree.

"Parser" and "Extractor" make sense,

"Processor" and "Handler" are not clear to me. I know they each have functions that can be reused by each other, but the general terms get confusing.


     - Entity
        - I don't have any strong proposal and this section is just to
        brainstorm some ideas. It would be good to clarify what we want to
        achieve in providing an Entity hierarchy, given that the HC
        project actually provides all the Entity classes already. Our
        entity is kind of a wrapper with a buffer (and HC also has a
        buffered entity). I guess we don't want to depend on HttpEntity
        from HC directly as Droids may touch entities beyond HttpEntity,
        e.g. File (but HC also has FileEntity...)

        - Entity is the contract between Fetcher/Protocol and Parser. For
        Entity, it's unlikely that users need to subclass it. If they need
        to subclass it, they also need to implement a Fetcher/Protocol +
        Parser. The value of sub-classing Entity is not significant. I
        suggest we just not design it for subclassing.

        - Currently, we have a hierarchy of ManagedContentEntity,
        ContentEntity, FileContentEntity, HttpContentEntity. A file parser
        and a http parser can't easily use a common Entity interface, and I
        suppose the parser implementation has to cast the entity. To me,
        "ManagedContentEntity" doesn't carry much more meaning than
        "Entity". FileEntity and HttpEntity do differ to me in concept,
        but I don't see how they could be related in implementation.

        - My initial thought about the contract of parse was whether it
        could just take an InputStream. Later I found it is necessary to
        have a concept of Entity that holds information like content/mime
        type, encoding/charset, size/length etc. But different kinds of
        entity may have different attributes and it's not easy to define a
        common contract. One of the ideas in my mind is to use a single
        final Entity class that extends HashMap.

        - For HttpEntity, I do prefer to have a way to retrieve the
        original HC HttpEntity object (but it is unlikely we want to
        expose that in any interface). Notice that the wrapping makes it a
        bit more complex to construct instances in unit tests.

I will defer to others on the Entity discussion... I am not really familiar with the concepts

     - Worker, Task, TaskMaster

        - Make worker implement Runnable and Future (and not
        RunnableFuture, for JDK5 compatibility), and we use run() as its
        main interface. So it could be used as a thread more easily.


sure
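
A rough sketch of a worker that implements Runnable and Future directly, as suggested above. SketchWorker and its placeholder "work" are hypothetical; cancellation is deliberately left unsupported to keep the example short.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical worker: run() does the work, the Future methods expose the result.
class SketchWorker implements Runnable, Future<String> {
    private final String url;
    private final CountDownLatch done = new CountDownLatch(1);
    private volatile String result;
    private volatile Throwable failure;

    SketchWorker(String url) {
        this.url = url;
    }

    public void run() {
        try {
            // Placeholder for the real fetch/parse/extract work.
            result = "processed " + url;
        } catch (RuntimeException e) {
            failure = e;
        } finally {
            done.countDown();
        }
    }

    // Cancellation is not supported in this sketch.
    public boolean cancel(boolean mayInterruptIfRunning) {
        return false;
    }

    public boolean isCancelled() {
        return false;
    }

    public boolean isDone() {
        return done.getCount() == 0;
    }

    public String get() throws InterruptedException, ExecutionException {
        done.await();
        return report();
    }

    public String get(long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        if (!done.await(timeout, unit)) {
            throw new TimeoutException();
        }
        return report();
    }

    // Returns the result, or wraps a failure from run() as Future.get() requires.
    private String report() throws ExecutionException {
        if (failure != null) {
            throw new ExecutionException(failure);
        }
        return result;
    }
}
```

Such a worker can be handed to new Thread(worker).start() or to an Executor, and the caller can later block on worker.get() for the outcome.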

        - I suggest removing the concept of Task and TaskMaster. A
        Droid/Crawler could do most of the work of the TaskMaster. These
        concepts are also easily confused with Thread, ThreadFactory and
        Executor, which creates many similar concepts.

maybe -- right now, the Droid interface just handles initialization and callbacks from the TaskMaster. It seems like that is a substantially different concept than keeping a bunch of processes running tasks.

        - if we keep TaskMaster, I suggest making it implement
        ExecutorService, so that we depend on the Java util/concurrency
        API rather than a TaskMaster interface.


seems good.

     - Queue
        - I suggest removing the TaskQueue interface and using
        Queue<? extends Link> as the standard signature.

+1

  4. Link-centric design
     - Link, which extends HashMap, will act as the main arbitrary data
     container, and a vehicle that stores attributes and data throughout
     the whole lifecycle of fetching, parsing, and extracting.

I don't have any strong opinion here.... but I would rather see an API where we can rely on method calls than putting stuff into a Map -- perhaps years of dealing with request.getAttribute() has turned me sour on this model.

     - if we take it to the extreme, all data can be stored in the Link
     and all interfaces could just use a single Link argument, e.g.
     parse(Link), extract(Link). For sure that is not a good idea. So I
     make every interface include the Link argument as well as the key
     component. I found the extreme usage is good in a remote web service
     api, but not good in a Java API.
     - All components are to be generic as <? extends Link>, so users may
     use another Link implementation for the whole Droid operation. An
     example is a WeightedLink.

     - For a Link, only the URL is mandatory. An ID is needed for
     implementing an in-memory set/hashtable to reject duplicated Links
     quickly. I suggest making Link a class so people could create a link
     with new Link("http://www.apache.org") easily, just like creating a
     URL or URI.
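
One possible shape of the link-centric idea, as a sketch only. Link, WeightedLink and SeenLinks are hypothetical; using the URL itself as the ID is an assumption made here for brevity, and getContentType() shows how a typed method could sit on top of the Map for those who prefer method calls.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

// Hypothetical Link: a non-final class that carries arbitrary attributes.
class Link extends HashMap<String, Object> {
    private static final long serialVersionUID = 1L;

    // Not final, so that subclasses (e.g. a JPA entity) remain possible.
    protected String url;

    Link(String url) {
        if (url == null) {
            throw new IllegalArgumentException("url is mandatory");
        }
        this.url = url;
    }

    String getUrl() {
        return url;
    }

    // Assumption: the URL itself serves as the ID used to detect duplicates.
    String getId() {
        return url;
    }

    // A typed accessor layered over the generic Map storage.
    String getContentType() {
        return (String) get("contentType");
    }
}

// Example of an alternative Link implementation, as mentioned in the proposal.
class WeightedLink extends Link {
    private static final long serialVersionUID = 1L;
    private final double weight;

    WeightedLink(String url, double weight) {
        super(url);
        this.weight = weight;
    }

    double getWeight() {
        return weight;
    }
}

// In-memory set of IDs that rejects links which were already seen.
class SeenLinks {
    private final Set<String> ids = new HashSet<String>();

    // Returns true the first time an ID is offered, false for duplicates.
    boolean offer(Link link) {
        return ids.add(link.getId());
    }
}
```

Duplicate detection goes through getId() rather than equals(), because a HashMap subclass compares by its map contents and two links to the same URL may carry different attributes.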

  5. Non-thread safe interface and fluent API
     - as an example, instead of "Parse parse()", I suggest the parser
     store the parsed data inside itself, and we provide a reset() method
     to clear the data for re-use. This design has pros and cons.
     - one of the main pros is that we could simplify the model by
     omitting a Parse class that is mainly for holding arbitrary data.
     We also can't easily define the return type of an interface. Take
     Fetcher as an example: a fetcher typically contains a Request and a
     Response object. Should we have fetch() return a FetchedData that
     has request, response, and entity? It's just a bit complex.
     - I hope no one is against fluent APIs :-). With a fluent api, it's
     like "public Fetcher fetch()". I don't always use a fluent api, only
     when it is good and the api calls may be chained, e.g.
     parser.parse().getDate()
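
A small sketch of the stateful, fluent style described in this section. DateHeaderParser is hypothetical, and passing the raw input to parse() as a String is an assumption; the proposal does not say how the parser receives its input.

```java
// Hypothetical stateful parser: parse() stores its result and returns this.
class DateHeaderParser {
    // Parsed state lives inside the parser, so an instance is not thread safe.
    private String date;

    // Returns this so that calls can be chained.
    DateHeaderParser parse(String rawHeaders) {
        for (String line : rawHeaders.split("\n")) {
            if (line.startsWith("Date:")) {
                date = line.substring("Date:".length()).trim();
            }
        }
        return this;
    }

    String getDate() {
        return date;
    }

    // Clears the stored data so that the instance can be re-used.
    DateHeaderParser reset() {
        date = null;
        return this;
    }
}
```

Usage then reads as parser.parse(headers).getDate(), and a worker that re-uses one parser per thread would call reset() between links.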
  6. Factory and LinkMatcher design
     - Workers, fetchers and parsers are provided by users as a Factory.
     - For FetcherFactory and ParserFactory, new instances are created
     with a newXXX(Link link)
     - So, depending on the Link, the Factory will provide different
     components, e.g. for a http link, it's a HttpFetcher, for a File,
     it's a file fetcher (not impl.); for a parser, it consults the
     content type.
     - Every component implements a LinkMatcher interface that checks if
     a Fetcher/Parser/Extractor supports a particular link. This is
     primarily for automatic component registration without the need to
     explicitly provide a mapping upfront. E.g. there might be a PNG
     parser that checks the "contentType" attribute of the Link. The
     parser implements the matches() method, so we don't need to
     maintain a mapping hashmap between contentType and parser. The link
     matching may be complex, so it's hard to use a mapping hashmap
     anyway. Together with the filter framework, any attribute could be
     prepared by a filter first, so the factory could always rely on the
     matcher interface to find the correct parser/fetcher.


no real opinion -- everything sounds reasonable.
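
The LinkMatcher idea could look roughly like this. Every type below is a hypothetical stand-in (including the minimal Link), and for brevity the factory hands back the registered instance, whereas a real factory would probably create a fresh component per link.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

// Minimal stand-in for the proposed Link (URL plus arbitrary attributes).
class Link extends HashMap<String, Object> {
    private static final long serialVersionUID = 1L;
    final String url;

    Link(String url) {
        this.url = url;
    }
}

// Proposed idea: every component can say whether it supports a link.
interface LinkMatcher {
    boolean matches(Link link);
}

// Hypothetical parser contract for this sketch.
interface Parser extends LinkMatcher {
    String name();
}

class PngParser implements Parser {
    public boolean matches(Link link) {
        return "image/png".equals(link.get("contentType"));
    }

    public String name() {
        return "png";
    }
}

class HtmlParser implements Parser {
    // Matching can be more complex than a map lookup, e.g. a prefix check.
    public boolean matches(Link link) {
        Object type = link.get("contentType");
        return type != null && type.toString().startsWith("text/html");
    }

    public String name() {
        return "html";
    }
}

// The factory keeps no contentType-to-parser map; it just asks each parser.
class ParserFactory {
    private final List<Parser> parsers = new ArrayList<Parser>();

    ParserFactory register(Parser parser) {
        parsers.add(parser);
        return this;
    }

    // Returns the first registered parser that matches, or null if none does.
    Parser newParser(Link link) {
        for (Parser parser : parsers) {
            if (parser.matches(link)) {
                return parser;
            }
        }
        return null;
    }
}
```

Adding support for a new content type then means registering one more component, with no central mapping table to update.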

  7. *Filter Framework
     - This is a significant new concept. I propose to have a filter
     framework that works as a chain for intercepting the work of every
     main component. There is a main lifecycle filter that is named
     Filter, and also individual component filters such as FetchFilter
     and ParseFilter. The lifecycle filter is called by a Worker. Some
     workers may not support it, e.g. my WebServiceWorker that calls a GAE
     service does the whole batch of fetch->parse->extract in one go, so
     there is no local filter. A normal worker shall call every filter
     after every operation. If a filter returns null, the worker stops
     processing the link.

     - Filter

        - When we have a confirmed lifecycle, e.g. poll a link from the
        queue -> fetch entity -> parse entity -> extract outlinks, then we
        have a filter that allows interception between every stage, e.g.
        public Link polled(Link link)
        public Fetcher fetched(Link link, Fetcher fetcher)

        - any filter may influence the flow by changing the component
        object like Fetcher/Parser, or it may return null and the
        Worker/TaskMaster shall stop the process for that link.
        - e.g. duplicated Link handling could be done as a Filter. A
        singleton NoRepeatFilter stores a Set of Link IDs, and when any
        link is extracted, it is checked against the set and duplicated
        links will be removed.
        - It offers a lot of potential, such as providing runtime
        statistics.

     - Component filter
        - e.g. FetchFilter, public void preFetch(Link, Fetcher),
        postFetch(Link, Fetcher)

        - a component filter is expected to alter fetching behavior, e.g.
        in preFetch for a http fetcher, the http request shall be
        available already, and the preFetch could cast the fetcher to the
        concrete class, and modify the content of the HttpRequest before
        it is executed, e.g. to append a http header / cookie depending
        on any attribute in the Link.

        - The global / lifecycle filters do filtering after every
        component operation. They are designed for a different purpose.


sounds good
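
A minimal sketch of the lifecycle filter chain, limited to the polled(Link) hook named in the proposal. The Link stand-in, FilterChain and CountingFilter are hypothetical, and the duplicate check is applied at the polled stage here for brevity, while the proposal describes it for extracted links.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal stand-in for the proposed Link; the URL doubles as the ID here.
class Link {
    final String url;

    Link(String url) {
        this.url = url;
    }

    String getId() {
        return url;
    }
}

// Lifecycle filter with a single hook; returning null stops the link.
interface Filter {
    Link polled(Link link);
}

// Rejects any link whose ID has been seen before.
class NoRepeatFilter implements Filter {
    private final Set<String> seen = new HashSet<String>();

    public synchronized Link polled(Link link) {
        return seen.add(link.getId()) ? link : null;
    }
}

// Runtime statistics: counts the links that reach this filter.
class CountingFilter implements Filter {
    private int count;

    public synchronized Link polled(Link link) {
        count++;
        return link;
    }

    synchronized int getCount() {
        return count;
    }
}

class FilterChain {
    private final List<Filter> filters = new ArrayList<Filter>();

    FilterChain add(Filter filter) {
        filters.add(filter);
        return this;
    }

    // Runs the filters in order and stops as soon as one returns null.
    Link polled(Link link) {
        Link current = link;
        for (Filter filter : filters) {
            current = filter.polled(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

A worker would call chain.polled(link) right after taking a link from the queue and skip the link when the result is null; filters placed later in the chain only see links that earlier filters let through.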

  8. Removed concepts based on the above proposal
     - LinkTask, Task - just keep Link
     - TaskMaster - with some refactoring to assign responsibility to
     Droids and Worker, the TaskMaster doesn't do too many things, and I
     suggest removing this concept.
     - TaskQueue - just use Queue<Link>
     - TaskQueueWithHistory - this is eliminated by the filter framework.
     See the next section.
     - TaskValidator - eliminated by the filter framework / implemented
     as a Filter
     - URL Filter - could be implemented as a Filter
     - Parse(Parse) - merged into Parser. I think "Parse" is a vague
     concept and we would rather have a Map returned than have a Parse
     class

ya.


any comment?

In general sounds good. As for fleshing out large changes like this -- I think we should discuss it a bit more to make sure everyone is on the same page. Then it probably makes sense to start a personal sandbox:

http://svn.apache.org/repos/asf/incubator/droids/sandbox/mingfa

where we can see some things in action and then look at migrating things together.



ryan
