Re: [Bug-wget] Behaviour of spanning to accepted domains
Hello Tim,

Thank you so much for your recommendation of valgrind/kcachegrind. I know that there is a significant amount of variable overhead, so this is exactly what I was looking for. I've only ever used valgrind for memory-related analysis; I was intending to use it to see the impact of a large hashtable on memory.

While I believe you regarding the choice of options, from what I can tell the wget documentation differs, stating: "The ‘-D’ option allows you to specify the domains that will be followed, thus limiting the recursion only to the hosts that belong to these domains. Obviously, this makes sense only in conjunction with ‘-H’." With the example: "wget -rH -Dserver.com http://www.server.com/"

Well, the way I have my hosts list set up, I end up "canceling out" the two options, since only the original domain is accepted; but that is deliberate, because I only want to see how long it takes wget to realize that, not actually perform any spanning.

Thanks,
Evan

On 2015-06-03 02:18, Tim Ruehsen wrote:

Hi Evan,

wget -rH -D $(cat trash/text.txt) williamstallings.com is not what you want. Leave out the -H, else host-spanning is ON and -D will be ignored.

I bring this up for one of my two questions. Can someone recommend a better method of performance testing?

What you want to know is how many CPU cycles wget needs to perform a defined task (if you compare, make sure exactly the same files are downloaded). The measurement of the real time used depends on many time-variant side effects, and thus two runs of wget are hardly comparable.

Use valgrind --tool=callgrind wget ...

You can use kcachegrind to display/analyse which part of wget took how many CPU cycles.

Regards, Tim

On Tuesday 02 June 2015 17:21:48 ekrell wrote:

Greetings,

I recently used wget in such a way that the result disagreed with my understanding of what should have happened. This came about during a small programming exercise I am currently working on: I am attempting to see if a large number of domains (from the '-D' option) would be processed more quickly by using the hashtable included in hash.c. While comparing the speed of my hashed implementation of host checking against an unmodified version of wget, the standard wget did not seem to respect my list of accepted domains.

For the hash table version, I did the following: in recur.c, I initialize a hashtable with all of the accepted domains from opt.domain. Ignoring (for the moment) the increased memory usage, I assumed that this would surely be faster than the current method of checking the URL's host. However, when performing the check inside host.c's accept_domain function, I realized that I would need to parse u->host to get just the domain component. This involves some overhead that may make hashing not worth it. Also, throughout this exercise I have assumed that if it would provide any significant improvement, it would most likely have been done before my decision to try it out. Nonetheless, I've enjoyed playing around with it.

My first couple of tests were against my own website, using a list of over 5000 domains. Both wget and wget-modified downloaded the same files, and at roughly the same speed. My website is so small that I wanted something larger, but not so large that it would take more than a few minutes. I know going around and mirroring random sites is perhaps not recommended behaviour (without a delay), but it worked. I bring this up for one of my two questions: can someone recommend a better method of performance testing?
Having found my target website, I went ahead and ran the two wget versions, one after the other. When mine came out almost twice as fast, I suspected that something was amiss. Sure enough, wget had downloaded much more content... and spanned to many more domains.

This is the command I ran for each: /src/wget -rH -D $(cat trash/text.txt) williamstallings.com

Excusing the useless use of cat, text.txt contains the massive comma-separated list of domains. Each of those domains is a randomly generated numeric value, except for the final one: williamstallings.com. Previously, whenever I ran this test against a (smaller) website, both versions of wget would only recursively download from the single "real" domain in the list. However, this time (and I did it twice to make sure) the original wget went on to download from over 20 other domains.

I would appreciate it if someone could explain what is going on here. Seeing as this behaviour exists with the version I obtained from git://git.savannah.gnu.org/wget.git as well as wget from the package manager, I am not proclaiming "found a bug!". I imagine that I just misunderstand what should have taken place, since I expected to only have the single directory from williamstallings.com.

Thanks,
Evan Krell
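For context on the check being discussed: as far as I understand it, the filtering that -D implies is a suffix match of each URL's host against every accepted domain, which is also why an exact-match hash lookup would first need the host reduced to just its domain component. A minimal sketch of that idea follows; this is not wget's actual accept_domain code, and the function name and boundary handling are illustrative only.

/* Illustrative suffix-style domain check, in the spirit of what the -D
   list implies.  NOT wget's actual host.c code; host_matches_domains()
   is a hypothetical helper.  */

#include <stdbool.h>
#include <string.h>
#include <strings.h>   /* strcasecmp */

static bool
host_matches_domains (const char *host, const char *const *domains)
{
  size_t host_len = strlen (host);

  for (size_t i = 0; domains[i] != NULL; i++)
    {
      size_t dom_len = strlen (domains[i]);
      if (dom_len > host_len)
        continue;

      /* Accept when the host ends with a listed domain, e.g.
         "www.server.com" matches -Dserver.com.  A complete check would
         also verify the match starts on a label ('.') boundary.  */
      if (strcasecmp (host + (host_len - dom_len), domains[i]) == 0)
        return true;
    }
  return false;
}

With a huge -D list, every candidate URL pays for one comparison per listed domain, which is the linear cost the hashtable experiment above is trying to avoid.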
[Bug-wget] Behaviour of spanning to accepted domains
Greetings,

I recently used wget in such a way that the result disagreed with my understanding of what should have happened. This came about during a small programming exercise I am currently working on: I am attempting to see if a large number of domains (from the '-D' option) would be processed more quickly by using the hashtable included in hash.c. While comparing the speed of my hashed implementation of host checking against an unmodified version of wget, the standard wget did not seem to respect my list of accepted domains.

For the hash table version, I did the following: in recur.c, I initialize a hashtable with all of the accepted domains from opt.domain. Ignoring (for the moment) the increased memory usage, I assumed that this would surely be faster than the current method of checking the URL's host. However, when performing the check inside host.c's accept_domain function, I realized that I would need to parse u->host to get just the domain component. This involves some overhead that may make hashing not worth it. Also, throughout this exercise I have assumed that if it would provide any significant improvement, it would most likely have been done before my decision to try it out. Nonetheless, I've enjoyed playing around with it.

My first couple of tests were against my own website, using a list of over 5000 domains. Both wget and wget-modified downloaded the same files, and at roughly the same speed. My website is so small that I wanted something larger, but not so large that it would take more than a few minutes. I know going around and mirroring random sites is perhaps not recommended behaviour (without a delay), but it worked. I bring this up for one of my two questions: can someone recommend a better method of performance testing?

Having found my target website, I went ahead and ran the two wget versions, one after the other. When mine came out almost twice as fast, I suspected that something was amiss. Sure enough, wget had downloaded much more content... and spanned to many more domains.

This is the command I ran for each: /src/wget -rH -D $(cat trash/text.txt) williamstallings.com

Excusing the useless use of cat, text.txt contains the massive comma-separated list of domains. Each of those domains is a randomly generated numeric value, except for the final one: williamstallings.com. Previously, whenever I ran this test against a (smaller) website, both versions of wget would only recursively download from the single "real" domain in the list. However, this time (and I did it twice to make sure) the original wget went on to download from over 20 other domains.

I would appreciate it if someone could explain what is going on here. Seeing as this behaviour exists with the version I obtained from git://git.savannah.gnu.org/wget.git as well as wget from the package manager, I am not proclaiming "found a bug!". I imagine that I just misunderstand what should have taken place, since I expected to only have the single directory from williamstallings.com.

Thanks,
Evan Krell
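To make the hashtable idea described above concrete, here is a rough sketch of the kind of change meant, assuming the string-set helpers that hash.c provides (make_string_hash_table, string_set_add, string_set_contains); the wrapper functions and the host-to-domain reduction step are hypothetical, not existing wget code.

/* Rough sketch of the modification described above: build a set of the
   accepted domains once, then test membership instead of scanning the
   whole -D list.  Assumes wget's hash.c string-set helpers; the
   functions domain_set_init(), domain_set_accepts() and
   derive_registered_domain() are hypothetical.  */

#include <stdbool.h>
#include "hash.h"   /* wget's hashtable implementation */

/* Hypothetical helper: reduce e.g. "www.server.com" to "server.com".
   Exact-match lookups need this step, as noted above.  */
const char *derive_registered_domain (const char *host);

static struct hash_table *accepted_domains;

static void
domain_set_init (char **domains)   /* e.g. the parsed -D list */
{
  accepted_domains = make_string_hash_table (0);
  for (int i = 0; domains && domains[i]; i++)
    string_set_add (accepted_domains, domains[i]);
}

static bool
domain_set_accepts (const char *host)
{
  return string_set_contains (accepted_domains,
                              derive_registered_domain (host));
}

The trade-off is exactly the one mentioned above: the lookup itself is cheap, but deriving the domain component from u->host adds per-URL work and the table adds memory, so whether it wins depends on how large the -D list really is.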
Re: [Bug-wget] [PATCH] Trust on first use
If nobody has complaints about this change, we should add some documentation (possibly a test?), and rewrite the commit message to list the changes using a ChangeLog style.

Giuseppe

If this gets accepted, I would love the opportunity to write the corresponding test, unless it would be preferred that the patch author handle it.

Krell
[Bug-wget] Another GSoC Student
Hello all,

I just found out about Google Summer of Code three days ago. I have always wanted to contribute to open source, but have thus far only contributed to very minor projects. In particular, I want to eventually be able to contribute to the Linux kernel. Over the semester, I have been studying OS design in class and Linux in particular in my spare time. However, it is such a massive entity that I think it makes more sense to start with something more approachable. I am excited about GSoC, as it looks to be a very good introduction to getting involved with open source.

Wget is something that I use all the time, and I knew I should look into this when I saw that it was a GSoC participating project. While I have quite a bit of programming experience from personal and university projects, it is quite another thing to code for a significant and established program. I feel that this would be an important next step in my development as a programmer. I believe that I am not as experienced as some applicants, but, encouraged by the "Am I Good Enough?" page, I decided that it can't hurt to get involved. Regardless of acceptance for GSoC, I intend to work on patches and contribute in general.

I spent the majority of yesterday reading and writing code in wget. I've compiled it, run the tests, and worked on patches. I wrote two minor patches, but found afterward that they had already been submitted by other applicants. Also, in order to really understand how wget works, I have been implementing gopher support. There may not be any demand for such a feature, but by playing with it I have been learning how the existing protocols are implemented, how to code in a way that makes sense with the existing infrastructure, how to write test cases, etc. Additionally, I just like gopher sites. I know that I need to come up with a project idea soon, though; the testing FTP server intrigues me.

Thanks for joining up with GSoC,
Evan Krell (GillSans on irc)