Inspectre,

Here is a slight modification that may suit your needs: 

        https://play.golang.org/p/dppJOkPcvG

Change summary:

1. Added a terse option (‘-t’) so I could see the URLs and errors only.
2. Changed the URL queue to a channel so that…
3. …the main loop (lines 139–148) would be clearer (matches Ian’s suggestion)
4. Added a concurrency limit so as not to overwhelm websites or run into 
local open-file limits (a sketch follows this list)
5. Answered the “how do you know when to stop” question in one of several 
logical ways.
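
For #4, here is a minimal sketch of one way to bound concurrency with a 
buffered channel used as a counting semaphore. The names (sem, fetch) and 
the limit of 16 are illustrative; the playground code may organize this 
differently:

    package main

    import (
        "log"
        "net/http"
        "sync"
    )

    // sem is a buffered channel used as a counting semaphore; its capacity
    // bounds how many fetches run at once.
    var sem = make(chan struct{}, 16)

    func fetch(url string) {
        sem <- struct{}{}        // take a slot; blocks while 16 fetches are in flight
        defer func() { <-sem }() // give the slot back when this fetch finishes

        resp, err := http.Get(url)
        if err != nil {
            log.Println(url, err)
            return
        }
        defer resp.Body.Close()
        // ... read resp.Body, extract links, route them onward ...
    }

    func main() {
        var wg sync.WaitGroup
        for _, u := range []string{"https://example.com/", "https://example.org/"} {
            wg.Add(1)
            go func(u string) {
                defer wg.Done()
                fetch(u)
            }(u)
        }
        wg.Wait()
    }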

My answer for #5 was to consider that the processing was done when the initial 
root URLs had been processed—that is, when each root's dataRouter call had 
finished. Each of these, in turn, is only done when its next-level dataRouter 
call is finished. So that sets up the structure that is necessary, and my 
choice of implementation was to split it this way (a minimal sketch follows 
the list):

5a. Create a global URLsInProcess wait group
5b. Increment it as each new URL is added in addURL
5c. Decrement it only after each URL is completely processed in dataRouter
5d. Wait for it to reach zero in main
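
A minimal sketch of that structure, using the addURL, dataRouter, and 
URLsInProcess names from above (the urls channel, its buffer size, and the 
root list are illustrative):

    package main

    import (
        "fmt"
        "sync"
    )

    // URLsInProcess counts URLs that have been queued but not yet fully
    // processed (5a). main waits on it to know the crawl is finished.
    var URLsInProcess sync.WaitGroup

    var urls = make(chan string, 100) // the URL queue as a channel

    // addURL queues a URL and records the commitment to process it (5b).
    func addURL(u string) {
        URLsInProcess.Add(1)
        urls <- u
    }

    // dataRouter processes one URL; the deferred Done (5c) runs however
    // dataRouter returns, so every queued URL is eventually accounted for.
    func dataRouter(u string) {
        defer URLsInProcess.Done()
        fmt.Println("processing", u)
        // ... fetch u, then addURL() any links found in it ...
    }

    func main() {
        for _, root := range []string{"https://example.com/"} {
            addURL(root)
        }
        go func() {
            for u := range urls {
                go dataRouter(u)
            }
        }()
        URLsInProcess.Wait() // 5d: every Add has been matched by a Done
    }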

This structure means that the program will wait as long as necessary for each 
URL to complete. Each is a commitment. The “now completed” logic of (5c) is in 
a defer so it runs in error cases. However, this makes the program dependent 
on every attempt completing somehow; if a Get() never times out against a 
server that has died, the program will hang. Not a likely problem, but not as 
robust as you would want in production.
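
If that matters, one hedge (not in the playground code) is to fetch through 
an http.Client that has its own Timeout rather than using the default client, 
so a fetch against a dead server still ends and the deferred Done still runs. 
A sketch, with an arbitrary 30-second value:

    package main

    import (
        "log"
        "net/http"
        "time"
    )

    // client bounds every request with an overall deadline, so a server
    // that dies mid-response cannot stall a fetch forever.
    var client = &http.Client{Timeout: 30 * time.Second}

    func get(url string) {
        resp, err := client.Get(url)
        if err != nil {
            log.Println(url, err) // includes timeout errors
            return
        }
        defer resp.Body.Close()
        // ... read resp.Body as before ...
    }

    func main() { get("https://example.com/") }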

Another weakness to consider is that the throttling here (16 concurrent 
connections) really has two parts. The first is how many make sense on your 
computer based on OS and process limits as well as numbers of CPUs. The second 
is how many connections a remote server will tolerate before it decides you are 
a DoS threat. If you were to implement multiple roots, or follow links 
off-site, then it would make sense to separate these two limits—expansive local 
resources could be split across multiple destination servers to keep the 
per-remote-server load at a friendly limit.*
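
One way to separate the two limits, sketched here with illustrative names and 
numbers, is a global semaphore sized for the local machine plus a small 
per-host semaphore looked up by hostname:

    package main

    import (
        "net/url"
        "sync"
    )

    var (
        global = make(chan struct{}, 64) // local OS/CPU/file-descriptor budget

        mu      sync.Mutex
        perHost = map[string]chan struct{}{} // hostname -> semaphore of size 4
    )

    // hostSlot returns (creating it if needed) the semaphore for a URL's host.
    func hostSlot(rawURL string) chan struct{} {
        host := ""
        if u, err := url.Parse(rawURL); err == nil {
            host = u.Host
        }
        mu.Lock()
        defer mu.Unlock()
        s, ok := perHost[host]
        if !ok {
            s = make(chan struct{}, 4)
            perHost[host] = s
        }
        return s
    }

    // acquire takes both slots before a fetch and returns a release function.
    func acquire(rawURL string) func() {
        h := hostSlot(rawURL)
        global <- struct{}{}
        h <- struct{}{}
        return func() { <-h; <-global }
    }

    func main() {
        release := acquire("https://example.com/page")
        // ... fetch the page here ...
        release()
    }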

Finally, I did not change the logic of the program to use a worker pool. It 
still spawns a goroutine for every URL visited. That is not bad, but neither is 
it necessary. Making this change would have made changes 1–5 less clear, and 
the goal here is clarity.
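
For contrast, a worker-pool version (again, not what the linked code does) 
would start a fixed number of goroutines that drain the URL channel, roughly 
like this:

    package main

    import "fmt"

    func main() {
        urls := make(chan string, 100)
        done := make(chan struct{})

        for i := 0; i < 16; i++ { // a fixed pool of 16 workers
            go func() {
                for u := range urls {
                    fmt.Println("worker fetched", u)
                }
                done <- struct{}{}
            }()
        }

        for _, u := range []string{"https://example.com/a", "https://example.com/b"} {
            urls <- u
        }
        close(urls)

        for i := 0; i < 16; i++ { // wait for every worker to drain and exit
            <-done
        }
    }

The tradeoff is that once workers themselves discover new URLs, deciding when 
to close the channel becomes the same "when to stop" question answered in #5.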

Hope this helps,
Michael

Michael Jones
michael.jo...@gmail.com

* Which is important at Google, for example, since nobody would want 20k+ 
Google servers to simultaneously crawl their website, no matter how eager they 
are to be included in the index.
