[Bug-wget] request for help with wget (crawling search results of a website)

2013-11-03 Thread Altug Tekin
Dear mailing list members,

According to the website http://www.gnu.org/software/wget/ it is OK to
send help requests to this mailing list. I have the following problem:

I am trying to crawl the search results of a news website using *wget*.

The name of the website is *www.voanews.com*.

After typing in my *search keyword* and clicking search on the website, it
proceeds to the results. Then I can specify a *"to"* and a *"from"* date and
hit search again.

After this the URL becomes:

http://www.voanews.com/search/?st=article&k=mykeyword&df=10%2F01%2F2013&dt=09%2F20%2F2013&ob=dt#article

and the actual content of the results is what I want to download.

To achieve this I created the following wget-command:

wget --reject=js,txt,gif,jpeg,jpg \
     --accept=html \
     --user-agent=My-Browser \
     --recursive --level=2 \
     www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt#article

Unfortunately, the crawler doesn't download the search results. It only
follows the upper link bar, which contains the "Home, USA, Africa, Asia, ..."
links, and saves the articles they link to.

*It seems like the crawler doesn't check the search result links at all.*

*What am I doing wrong, and how can I modify the wget command to download
only the search result links (and, of course, the pages they link to)?*

Thank you for any help...


Re: [Bug-wget] request for help with wget (crawling search results of a website)

2013-11-03 Thread Dagobert Michelsen
Hi,

On 03.11.2013 at 09:13, Altug Tekin wrote:
> I am trying to crawl the search results of a news website using *wget*.
> 
> The name of the website is *www.voanews.com*.
> 
> After typing in my *search keyword* and clicking search on the website, it
> proceeds to the results. Then I can specify a *"to"* and a *"from"* date and
> hit search again.
> 
> After this the URL becomes:
> 
> http://www.voanews.com/search/?st=article&k=mykeyword&df=10%2F01%2F2013&dt=09%2F20%2F2013&ob=dt#article
> 
> and the actual content of the results is what I want to download.
> 
> To achieve this I created the following wget-command:
> 
> wget --reject=js,txt,gif,jpeg,jpg \
>      --accept=html \
>      --user-agent=My-Browser \
>      --recursive --level=2 \
>      www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt#article
> 
> Unfortunately, the crawler doesn't download the search results. It only
> follows the upper link bar, which contains the "Home, USA, Africa, Asia, ..."
> links, and saves the articles they link to.
> 
> *It seems like the crawler doesn't check the search result links at all.*
> 
> *What am I doing wrong, and how can I modify the wget command to download
> only the search result links (and, of course, the pages they link to)?*


You need to inspect the URLs of the results and make sure to
download only those. Maybe --no-parent is enough.
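For example (an untested sketch, reusing the keyword and dates from the
original post): quoting the URL stops the shell from splitting it at the
"&" characters, and --no-parent keeps the recursion from climbing above
the starting directory:

  wget --recursive --level=2 \
       --no-parent \
       --user-agent=My-Browser \
       'http://www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt'

Note that --no-parent confines wget to URLs at or below /search/, so if
the article pages live elsewhere on the site you may need
--include-directories (or a similar filter) instead.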


Best regards

  -- Dago


-- 
"You don't become great by trying to be great, you become great by wanting to 
do something,
and then doing it so hard that you become great in the process." - xkcd #896





Re: [Bug-wget] request for help with wget (crawling search results of a website)

2013-11-03 Thread Tony Lewis
Altug Tekin wrote:

> To achieve this I created the following wget-command:
>
> wget --reject=js,txt,gif,jpeg,jpg \
>      --accept=html \
>      --user-agent=My-Browser \
>      --recursive --level=2 \
>      www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt#article

You need to quote the URL since it contains characters that are interpreted
by your command shell. (Most likely nothing after the first "&" was sent to
the web server.)

I think you might run into problems with --accept since the URL does not end
with ".html", so you might need to delete that argument to get the results
you want.
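
Something along these lines might work (an untested sketch: your original
options minus --accept, with the URL quoted and the "#article" fragment
dropped, since fragments are never sent to the server anyway):

  wget --reject=js,txt,gif,jpeg,jpg \
       --user-agent=My-Browser \
       --recursive --level=2 \
       'http://www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt'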

Tony