Hi,

On 03.11.2013 at 09:13, Altug Tekin <altugteki...@gmail.com> wrote:
> I am trying to crawl the search results of a news website using *wget*.
>
> The name of the website is *www.voanews.com <http://www.voanews.com>*.
>
> After typing in my *search keyword* and clicking search on the website, it
> proceeds to the results. Then I can specify a *"to" and a "from"-date* and
> hit search again.
>
> After this the URL becomes:
>
> http://www.voanews.com/search/?st=article&k=mykeyword&df=10%2F01%2F2013&dt=09%2F20%2F2013&ob=dt#article
>
> and the actual content of the results is what I want to download.
>
> To achieve this I created the following wget-command:
>
> wget --reject=js,txt,gif,jpeg,jpg \
>      --accept=html \
>      --user-agent=My-Browser \
>      --recursive --level=2 \
>      www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt#article
>
> Unfortunately, the crawler doesn't download the search results. It only
> gets into the upper link bar, which contains the "Home,USA,Africa,Asia,..."
> links and saves the articles they link to.
>
> *It seems like the crawler doesn't check the search result links at all*.
>
> *What am I doing wrong and how can I modify the wget command to download
> the results search list links (and of course the sites they link to) only?*

You need to inspect the URLs of the search results and make sure wget only
downloads those. Maybe --no-parent is enough, since it keeps wget from
ascending above the directory it started in.

Best regards

  -- Dago

-- 
"You don't become great by trying to be great, you become great by wanting
to do something, and then doing it so hard that you become great in the
process." - xkcd #896
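
P.S. A rough sketch of what "only download these" could look like, in case --no-parent does not do it. Two things here are assumptions and not taken from the site: the /content/ pattern is a made-up placeholder for whatever the result links really look like in the page source, and --accept-regex needs wget 1.14 or later. The function name crawl_results is just for illustration. Note that the URL is quoted: unquoted, the shell treats every "&" as "run in background", so wget only ever sees the part up to "?st=article". The "#article" fragment is left out because it is never sent to the server anyway.

```shell
#!/bin/sh
# Quoted start URL, so the shell passes the whole query string to wget.
SEARCH_URL='http://www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt'

# ASSUMPTION: placeholder pattern for the result links. Look at the HTML of
# the results page and replace this with what the links actually contain.
RESULT_REGEX='/content/'

# Fetch the results page, then follow only links whose URL matches the pattern.
# --no-parent is deliberately not combined with the pattern here: it would
# only help if the result links live below /search/.
crawl_results() {
    wget --recursive --level=1 \
         --accept-regex="$RESULT_REGEX" \
         --user-agent=My-Browser \
         "$SEARCH_URL"
}
```

Nothing is downloaded until you call crawl_results at the end of the script. If the results are spread over several pages, the pattern would also have to match the paging links, and --level would have to go up accordingly.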