I have a rough idea of what you're getting at. Yes, DFS is usually
implemented with recursion, and any algorithm that keeps its search
path on the call stack risks stack overflow if the path gets deep
enough (an explicit stack in ordinary memory sidesteps that particular
problem). I had never thought of crawling the Web as an application of
DFS before, but it is, I have to say.

To me, if you cap the depth at 10, you are essentially creating a new
graph-traversal problem that combines DFS and BFS. Once you reach
level 10, you stop going deeper: you store the current node somewhere
(and hash it), then backtrack to another candidate node and keep
searching. Since you cannot color web pages the way you would mark
nodes in an in-memory graph, you may reach the same node twice from
different paths, so you need some way to check whether you have
visited it before. If not, go deeper (not exceeding level 10, of
course).
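
Here is a minimal sketch of that depth-limited DFS in Python, assuming
a hypothetical get_links(url) that fetches a page and returns its
outgoing links; visited plays the role of coloring and frontier
collects the level-10 nodes for later:

    MAX_DEPTH = 10

    def depth_limited_dfs(url, depth, visited, frontier, get_links):
        """Explore from `url`, never recursing past MAX_DEPTH."""
        if url in visited:          # reached before via some other path?
            return
        if depth == MAX_DEPTH:      # too deep: park it for later, backtrack
            frontier.append(url)
            return
        visited.add(url)            # "color" the node
        for link in get_links(url):
            depth_limited_dfs(link, depth + 1, visited, frontier, get_links)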

At some point the subgraph within depth 10 of your starting node is
fully explored, and all the level-10 frontier nodes still need to be
explored further. You can put these nodes in a queue, say (this is why
I say the algorithm has a BFS flavor), then pick a node from the
queue, treat it as a new starting node, and run the depth-limited DFS
again. Keep doing this until, in theory, you have crawled the whole
Web.
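
Continuing the sketch above (same hypothetical get_links), the outer
loop pulls new roots from the queue until it runs dry:

    from collections import deque

    def crawl(seed_url, get_links):
        """Run depth-limited DFS rounds, seeding each round from the queue."""
        visited = set()
        frontier = deque([seed_url])   # the queue that gives the BFS flavor
        while frontier:
            root = frontier.popleft()
            depth_limited_dfs(root, 0, visited, frontier, get_links)
        return visited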

That is for a single machine. For multiple machines, you need two
kinds of clusters: one cluster for crawling the Web and another for
hashing the explored nodes. The second cluster can be smaller than the
first, since you do not need many machines for the hashing, but each
machine in it may need more memory than an individual machine in the
crawling cluster. Each time a crawler machine reaches a node at level
10, it stops going deeper from that node, puts it in a public pool
(which can be a queue), and backtracks. Once a machine fully explores
its subgraph, it pulls a new node from the pool and runs the DFS
rooted at that node again. All machines in the crawling cluster do the
same thing, and they all ask the hashing cluster "has this node been
explored or not?"

That is the organization I can think of for crawling the Web with
DFS + BFS. The hash function that answers "has this web page already
been explored by some machine?" is critical. Of course, URL plus
explored time can serve as the hash key. A URL has several path
levels, and you could assign each level to a sub-cluster within the
hashing cluster. Treating the URL as a string and the explored time as
a timestamp, the hash function can get quite complex. I am no expert
in hash functions, but I feel this part is quite important for any
sophisticated web crawler system.
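
Just to illustrate the idea (the shard count and key format here are
made up, not a real design), the host and first path level of the URL
could pick the sub-cluster, and URL + explored time could form the key:

    import hashlib
    import time
    from urllib.parse import urlparse

    NUM_SHARDS = 16

    def shard_for(url):
        """Pick a sub-cluster from the host and first path level of the URL."""
        parsed = urlparse(url)
        first_level = parsed.netloc + "/" + parsed.path.strip("/").split("/")[0]
        digest = hashlib.md5(first_level.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    def make_key(url):
        """URL plus explored-time timestamp as the hash key."""
        return f"{url}|{int(time.time())}"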

That's all I have.

On Sep 29, 3:51 am, Summercool <[EMAIL PROTECTED]> wrote:
> I just wonder if for DFS, we don't really need an array to act as a
> Stack (LIFO), but we can just use recursion.
>
> Is it true that the main concern is stack overflow?  (if too deep
> level).  If we use the array as a stack, then the overflow problem is
> gone.  Now, however, if our program is to spider the web, and we set
> the levelMax to 10, then the stack overflow problem is really not
> there, as the deepest call stack level is 10.
>
> Is it true that for BFS, you always have to use an array as a Queue
> (FIFO)?  Using an array seems a bit like a "global variable".  Is it
> possible at all to do BFS without using a queue and just use recursion?

