On Mar 31, 2009, at 12:38 PM, Robin Howlett wrote:

Hello,

I've only really taken an introductory look at Droids and ran through the samples. I think I'll be using Droids for an upcoming project. I have a couple of questions first:

I ran both the SimpleRuntime example and the Cli example through a site I wish to parse. Droids seems to keep an index of the links in the page to parse and those parsed already - where is that list? In memory? Is it the queue? How big can that queue grow to?

The Simple Queue included in Droids is just an in-memory ConcurrentHashMap.


The site I will be crawling will be around 500,000 pages - is this a number that could be supported? Can the index be persisted using a DB instead of being stored in memory?


Yes, the interface is easy to implement with a DB backend:
http://svn.apache.org/repos/asf/incubator/droids/trunk/droids-core/src/main/java/org/apache/droids/api/TaskQueue.java

When I use Droids, a DB-backed queue is what I use, but mine has become too domain-specific to contribute back in any useful form. We should look into adding something to the core that persists the queue somewhere: SQL, ehcache, whatever.
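
To give a rough idea of what persisting the queue involves, here is a minimal sketch. It is not the Droids TaskQueue interface and not the DB-backed code mentioned above: JournaledFrontier and its methods (offer, poll, markDone, size) are hypothetical names, and an append-only journal file stands in for the database so the example runs with the JDK alone. It assumes URLs contain no line breaks.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical class, not part of Droids: a crawl frontier that journals
// every change to a file so that a restart can rebuild its state.
final class JournaledFrontier implements AutoCloseable {
    // URLs waiting to be crawled, in insertion order.
    private final LinkedHashSet<String> pending = new LinkedHashSet<>();
    // Every URL ever offered, so the same page is not queued twice.
    private final Set<String> seen = new HashSet<>();
    private final BufferedWriter journal;

    JournaledFrontier(Path file) throws IOException {
        // Replay the journal: "ADD url" queues a URL, "DONE url" retires it.
        if (Files.exists(file)) {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                if (line.startsWith("ADD ")) {
                    String url = line.substring(4);
                    if (seen.add(url)) {
                        pending.add(url);
                    }
                } else if (line.startsWith("DONE ")) {
                    pending.remove(line.substring(5));
                }
            }
        }
        journal = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Queues a URL unless it was offered before; returns true if it was new.
    boolean offer(String url) throws IOException {
        if (!seen.add(url)) {
            return false;
        }
        pending.add(url);
        append("ADD " + url);
        return true;
    }

    // Hands out the oldest pending URL, or null if none is left. Nothing is
    // journaled here, so a URL that was handed out but never marked done
    // becomes pending again after a restart.
    String poll() {
        Iterator<String> it = pending.iterator();
        if (!it.hasNext()) {
            return null;
        }
        String url = it.next();
        it.remove();
        return url;
    }

    // Records that a URL has been fully processed.
    void markDone(String url) throws IOException {
        pending.remove(url);
        append("DONE " + url);
    }

    int size() {
        return pending.size();
    }

    private void append(String line) throws IOException {
        journal.write(line);
        journal.newLine();
        journal.flush(); // pushes the line to the OS; does not force it to disk
    }

    @Override
    public void close() throws IOException {
        journal.close();
    }
}
```

A real DB backend would replace the journal with a table keyed by URL plus a status column. Both sets here still live in memory, so for a very large crawl the seen-URL check would need to move into the store as well.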


Some of the links to content I wish to crawl/parse/index are JavaScript pop ups - therefore I wish to alter the url for the crawler to use; this should be no problem right?


That should not be a problem. If you can find the URLs in the parse data, you can add them to the Queue.
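
As a rough illustration of the rewriting step, the sketch below pulls a target out of a javascript: link and resolves it against the page URL. PopupLinks and resolve are hypothetical names, not Droids API, and the code assumes the pop-up target is the first quoted argument of the JavaScript call, which will not hold for every site.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Droids.
final class PopupLinks {
    // First single- or double-quoted string in the link,
    // e.g. the 'detail.html?id=42' in javascript:popup('detail.html?id=42', 600, 400)
    private static final Pattern QUOTED_ARG = Pattern.compile("['\"]([^'\"]+)['\"]");

    private PopupLinks() {
    }

    // Returns a crawlable absolute URI for the link, or empty if none can be derived.
    static Optional<URI> resolve(URI page, String href) {
        String target = href.trim();
        if (target.regionMatches(true, 0, "javascript:", 0, "javascript:".length())) {
            Matcher m = QUOTED_ARG.matcher(target);
            if (!m.find()) {
                return Optional.empty(); // e.g. javascript:void(0)
            }
            target = m.group(1).trim();
        }
        try {
            // Relative targets are resolved against the page they were found on.
            return Optional.of(page.resolve(target));
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // not a syntactically valid URI
        }
    }
}
```

Each URI that comes back non-empty can then be wrapped in whatever task type the queue expects and added to it. Ordinary links pass through unchanged apart from being made absolute.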

ryan
