Inspectre,

Here is a slight modification that may suit your needs:
https://play.golang.org/p/dppJOkPcvG

Change summary:

1. Added a terse option ('-t') so I could see the URLs and errors only.
2. Changed the URL queue to a channel so that...
3. ...the main loop (lines 139-148) would be clearer (matches Ian's suggestion).
4. Added a concurrency limit so as not to overwhelm websites or local open-file limits.
5. Answered the "how do you know when to stop" question in one of several logical ways.

My answer for #5 was to consider the processing done when the initial root URLs had been processed; that is, when the dataRouter call for each had finished. Each of these, in turn, is only done when its next-level dataRouter calls have finished. That sets up the structure that is necessary, and my choice of implementation was to split it this way:

5a. Create a global URLsInProcess wait group.
5b. Increment it as each new URL is added in addURL.
5c. Decrement it only after each URL is completely processed in dataRouter.
5d. Wait for it to reach zero in main.

This structure means that the program will wait as long as necessary for each URL to complete. Each is a commitment. The "now completed" logic of (5c) is in a defer so it runs even in error cases. However, this makes the program dependent on every attempt completing somehow: if the Get() should fail to time out against a server that dies, the program will hang. Not a likely problem, but not as robust as you would want in production.

Another weakness to consider is that the throttling here (16 concurrent connections) really has two parts. The first is how many connections make sense on your computer, based on OS and process limits as well as the number of CPUs. The second is how many connections a remote server will tolerate before it decides you are a DoS threat.
If you were to implement multiple roots, or follow links off-site, then it would make sense to separate these two limits: expansive local resources could be split across multiple destination servers to keep the per-remote-server load at a friendly limit.*

Finally, I did not change the logic of the program to use a worker pool. It still spawns a goroutine for every URL visited. That is not bad, but neither is it necessary. Making this change would have made changes 1-5 less clear, and the goal here is clarity.

Hope this helps,
Michael

Michael Jones
michael.jo...@gmail.com

* Which is important at Google, for example, since nobody would want 20k+ Google servers to simultaneously crawl their website, no matter how eager they are to be included in the index.