Re: [Bug-wget] Behaviour of spanning to accepted domains

2015-06-03 Thread Tim Ruehsen
Hi Evan,

wget -rH -D $(cat trash/text.txt) williamstallings.com

is not what you want. Leave out the -H; otherwise host-spanning is ON and -D
will be ignored.

> I bring this up for one of my two questions. Can someone recommend a better
> method of performance testing?

What you want to know is how many CPU cycles wget needs to perform a defined
task (if you compare, make sure exactly the same files are downloaded).
Measuring real (wall-clock) time depends on many time-variant side effects,
so two runs of wget are hardly comparable.

Use valgrind --tool=callgrind wget ...
You can use kcachegrind to display/analyse how many CPU cycles each part of
wget took.

Regards, Tim

On Tuesday 02 June 2015 17:21:48 ekrell wrote:
> Greetings,
> 
> I recently used wget in such a way that the result disagreed with my
> understanding of what should have happened. This came about during a
> small programming exercise I am currently working on; I am attempting to
> see if a large number of domains (from the '-D' option) would be processed
> more quickly by using the hashtable included in hash.c. While comparing
> the speed of my hashed implementation of host checking against an
> unmodified version of wget, the standard wget did not seem to respect my
> list of accepted domains.
> 
> For the hash table version, I did the following:
> In recur.c, I initialize a hashtable with all of the accepted domains from
> opt.domain. Ignoring (for the moment) the increased memory usage, I assumed
> that this would surely be faster than the current method of checking the
> URL's host.
> However, when performing the check inside host.c's accept_domain
> function, I realized that I would need to parse u->host to get just the
> domain component. This involves some overhead that may make hashing not
> worth it. Also, throughout this exercise I have assumed that if it
> provided any significant improvement, it would most likely have been
> done before my decision to try it out. Nonetheless, I've enjoyed
> playing around with it.
> 
> My first couple of tests were against my own website, using a list of over
> 5000 domains. Both wget and wget-modified downloaded the same files, and
> at roughly the same speed. My website is so small that I wanted
> something larger, but not so large that it would take more than a few
> minutes. I know going around and mirroring random sites is perhaps not
> recommended behaviour (without a delay), but it worked.
> 
> I bring this up for one of my two questions. Can someone recommend a better
> method of performance testing?
> 
> Having found my target website, I went ahead and ran the two wget
> versions, one after the other. When mine turned out to be almost twice as
> fast, I knew something was amiss. Sure enough, the unmodified wget had
> downloaded much more content... and spanned to many more domains.
> 
> This is the command I ran for each:
> 
> /src/wget -rH -D $(cat trash/text.txt) williamstallings.com
> 
> Excusing the useless use of cat, text.txt contains the massive
> comma-separated list of domains.
> Each of those domains is a randomly generated numeric value, except for
> the final one: williamstallings.com
> 
> Previously, whenever I ran this test against a (smaller) website, both
> versions of wget would recursively download only from the single "real"
> domain in the list. However, this time (and I did it twice to make sure)
> the original wget went on to download from over 20 other domains.
> 
> I would appreciate it if someone could explain what is going on here.
> Seeing as this behaviour exists with the version I obtained from
> git://git.savannah.gnu.org/wget.git as well as wget from the package
> manager, I am not proclaiming "found a bug!". I imagine that I just
> misunderstand what should have taken place, since I expected to only
> have the single directory from williamstallings.com
> 
> Thanks,
> Evan Krell
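
For reference, a minimal sketch of the hash-based check described above,
written against the hash_table API that wget's src/hash.c already provides
(make_nocase_string_hash_table, hash_table_put, hash_table_contains). It is
not a patch from this thread: the "domains" vector is assumed to be the usual
NULL-terminated list built from -D, and the suffix-probing loop is only an
illustration of the extra parsing of u->host that an exact-match lookup needs
in order to keep -D's "host ends with an accepted domain" semantics.

/* Sketch only, not code from the wget tree.  */

#include <stdbool.h>
#include <string.h>

#include "wget.h"
#include "hash.h"

static struct hash_table *accepted_domains;

/* Fill the table once, e.g. early in recur.c's retrieve_tree ().  */
static void
init_accepted_domains (char **domains)
{
  char **d;

  accepted_domains = make_nocase_string_hash_table (0);
  for (d = domains; d && *d; d++)
    hash_table_put (accepted_domains, *d, *d);
}

/* Hash lookups are exact, so probe every trailing label sequence of
   the host: "www.example.com", then "example.com", then "com".  */
static bool
host_in_accepted_domains (const char *host)
{
  const char *p = host;

  for (;;)
    {
      if (hash_table_contains (accepted_domains, p))
        return true;
      p = strchr (p, '.');
      if (!p)
        return false;
      ++p;                      /* skip the dot, try the next suffix */
    }
}

Whether this beats the existing linear scan in host.c's accept_domain then
comes down to a few hash probes per URL versus one pass over a
several-thousand-entry domain list, which is exactly the kind of difference
callgrind makes visible.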




Re: [Bug-wget] Behaviour of spanning to accepted domains

2015-06-03 Thread ekrell

Hello Tim,

Thank you so much for your recommendation of valgrind/kcachegrind. I 
know that there is a significant amount of variable overhead, so this is 
exactly what I was looking for. I've only ever used valgrind for 
memory-related analysis; I was intending to use it to see the impact of 
a large hashtable on memory.


While I believe you regarding the choice of options, from what I can 
tell the wget documentation differs, stating:


"The ‘-D’ option allows you to specify the domains that will be 
followed, thus limiting the recursion only to the hosts that belong to 
these domains. Obviously, this makes sense only in conjunction with 
‘-H’."


With the example: "wget -rH -Dserver.com http://www.server.com/"

Well, the way I have my hosts list set up, I end up "canceling out" the
two options, since only the original domain is accepted. But that is
deliberate: I only want to see how long it takes wget to realize that,
not to actually perform spanning.


Thanks,
Evan



On 2015-06-03 02:18, Tim Ruehsen wrote:

> Hi Evan,
>
> wget -rH -D $(cat trash/text.txt) williamstallings.com
>
> is not what you want. Leave out the -H; otherwise host-spanning is ON and -D
> will be ignored.
>
> > I bring this up for one of my two questions. Can someone recommend a better
> > method of performance testing?
>
> What you want to know is how many CPU cycles wget needs to perform a defined
> task (if you compare, make sure exactly the same files are downloaded).
> Measuring real (wall-clock) time depends on many time-variant side effects,
> so two runs of wget are hardly comparable.
>
> Use valgrind --tool=callgrind wget ...
> You can use kcachegrind to display/analyse how many CPU cycles each part of
> wget took.
>
> Regards, Tim


Re: [Bug-wget] Behaviour of spanning to accepted domains

2015-06-03 Thread Tim Ruehsen
Hi Evan,

> While I believe you regarding the choice of options, from what I can
> tell the wget documentation differs, stating:
> 
> "The ‘-D’ option allows you to specify the domains that will be
> followed, thus limiting the recursion only to the hosts that belong to
> these domains. Obviously, this makes sense only in conjunction with
> ‘-H’."

This has already been fixed to:

"Set domains to be followed.  domain-list is a comma-separated list of 
domains.  Note that it does not turn on -H."

Regards, Tim

