By the way, if you implement DFS non-recursively, that means you
maintain the stack yourself. You can then allocate the stack in heap
memory rather than on the call stack, which lets the DFS go much
deeper. But for tasks like crawling the Web, even that will exhaust
your memory sooner or later. Also, I said nothing above about the
hashing part; the visited set may not fit in memory either.
Fortunately, there are well-known data structures that reside on disk
and still support fast membership queries, such as the B-tree, I
guess. Refer to an Information Retrieval, Database, or data structures
textbook to find the one that suits you.
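
A minimal sketch of that explicit-stack DFS in Python (get_links() is
a hypothetical helper that fetches a page and returns the URLs it
links to; in a real crawler the visited set would be disk-backed, as
above):

    def dfs_iterative(start_url, get_links):
        # the stack is a plain list on the heap, so depth is bounded
        # by available memory, not by the call-stack size
        visited = set()
        stack = [start_url]
        while stack:
            url = stack.pop()
            if url in visited:
                continue              # already reached via another path
            visited.add(url)
            for link in get_links(url):
                if link not in visited:
                    stack.append(link)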

To me, it is never easy to build a spider that can crawl the
(nontrivial) Web.

On Oct 1, 12:44 pm, Sticker <[EMAIL PROTECTED]> wrote:
> I have a rough idea of what you are concerned about. Yes, DFS is
> usually implemented with recursion, and any algorithm that stores its
> search paths on the call stack risks stack overflow. I had never
> thought of crawling the Web as an application of DFS, but I have to
> say it is one.
>
> To me, if you set a maximum stack depth of 10, you are in effect
> creating a new graph-traversal problem that combines DFS and BFS. Once
> you reach level 10, you stop going deeper: you store the current node
> somewhere (and hash it), then backtrack to another candidate node and
> keep searching. You may reach the same node twice, and since you
> cannot color the nodes of the Web, you need some way of knowing
> whether you have visited a node before (via some other path). If not,
> you go deeper (never past level 10, of course).
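>
> A minimal sketch of that depth-limited DFS (Python; get_links() is a
> hypothetical helper that fetches a page and returns its out-links,
> and frontier collects the level-10 nodes for the queue described
> below):
>
>     def dfs_limited(url, depth, visited, frontier, get_links, max_depth=10):
>         if url in visited:
>             return                 # reached before via another path
>         if depth == max_depth:
>             frontier.append(url)   # park it; it becomes a new root later
>             return                 # left unvisited so the restart works
>         visited.add(url)
>         for link in get_links(url):
>             dfs_limited(link, depth + 1, visited, frontier, get_links, max_depth)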
>
> Eventually the subgraph reachable from your starting node within 10
> levels is fully explored, and all the level-10 leaf nodes still need
> to be explored further. You can put these nodes in a queue, say (this
> is why I say the algorithm has a BFS flavor), then pick a node from
> the queue, treat it as a new starting node, and run the DFS again.
> Keep doing this and, in theory, you crawl the whole Web.
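>
> The outer loop tying the two together might look like this (same
> assumptions as the sketch above):
>
>     from collections import deque
>
>     def crawl(start_url, get_links):
>         visited = set()
>         frontier = deque([start_url])   # level-10 nodes waiting as roots
>         while frontier:
>             root = frontier.popleft()   # FIFO: this is the BFS flavor
>             dfs_limited(root, 0, visited, frontier, get_links)
>         return visited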
>
> That is for a single machine. For multiple machines, you need two
> types of clusters: one cluster for crawling the Web and another for
> hashing the explored nodes. Of course the second cluster can be
> smaller than the first, since you do not need many machines for the
> hashing, but each of its machines may need more memory than an
> individual machine in the first cluster. Each time a crawler machine
> reaches a node at level 10, it stops going deeper from that node,
> puts the node in a shared pool (which can be a queue), and backtracks.
> Once a machine has fully explored its subgraph, it visits the pool,
> takes a new node, and runs the DFS rooted at that node. All machines
> in the crawler cluster do the same thing, and they query the hashing
> cluster to ask "has this node been explored or not?"
>
> This is the organization I can think of for crawling the Web with
> DFS + BFS. The hash function that answers the question "has this web
> page already been explored by some machine?" is critical. Of course,
> URL plus exploration time can serve as the hash key. A URL has
> several levels, and you can assign each level to a subcluster within
> the hashing cluster. Treating the URL as a string and the exploration
> time as a timestamp, the hash function can get quite complex. I am no
> expert in hash functions, but I feel this part is quite important for
> any sophisticated web crawler system.
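>
> Just to make the per-level sharding concrete, one naive way to route
> a URL to a subcluster by its host and first path level might be
> (purely illustrative; md5 is used only as a hash that stays stable
> across machines):
>
>     import hashlib
>     from urllib.parse import urlparse
>
>     def pick_shard(url, num_shards):
>         # shard by host + first path segment, so pages from one site
>         # tend to land on the same subcluster
>         parts = urlparse(url)
>         segments = [s for s in parts.path.split("/") if s]
>         key = parts.netloc + "/" + (segments[0] if segments else "")
>         digest = hashlib.md5(key.encode()).digest()
>         return int.from_bytes(digest[:4], "big") % num_shards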
>
> That's all I can think of.
>
> On Sep 29, 3:51 am, Summercool <[EMAIL PROTECTED]> wrote:
>
> > I just wonder whether, for DFS, we really need an array to act as a
> > stack (LIFO), or whether we can just use recursion.
>
> > Is it true that the main concern is stack overflow (if the level
> > gets too deep)?  If we use an array as the stack, the overflow
> > problem is gone.  On the other hand, if our program is to spider the
> > web and we set levelMax to 10, then stack overflow is not really an
> > issue either, since the deepest call-stack level is 10.
>
> > Is it true that for BFS you always have to use an array as a queue
> > (FIFO)?  Using an array seems a bit like a "global variable".  Is it
> > possible at all to do BFS without a queue, using just recursion?
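
As for Summercool's last question: you can hide the queue inside the
recursion by passing it along as a parameter, so nothing is "global",
but some FIFO structure still seems unavoidable for true BFS order. A
sketch (note that Python does not optimize tail calls, so this
recursion is itself depth-limited in practice):

    from collections import deque

    def bfs(start_url, get_links):
        def step(queue, visited):
            if not queue:
                return
            url = queue.popleft()       # FIFO order makes it breadth-first
            for link in get_links(url):
                if link not in visited:
                    visited.add(link)
                    queue.append(link)
            step(queue, visited)        # recursion replaces the while-loop
        step(deque([start_url]), {start_url})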

