On 11/20/2023 2:33 PM, wes wrote:
I imagine the intention of the robots.txt file (in this case set to disallow
all "automated" requests) is to reduce web crawler traffic.
What's ironic is that the worst offenders already ignore it.
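For context, a blanket disallow is only two lines of robots.txt. I haven't
checked what this particular site serves, but the generic form is:

User-agent: *
Disallow: /

and honoring it is entirely voluntary on the client's part.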
-wes
On Mon, Nov 20, 2023 at 2:32 PM American Citizen <[email protected]>
wrote:
I am making a good-faith effort to contact the site administrators. What
is ironic is that anyone can use the Save Page command in any standard
browser and get the file that way without asking at all.
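(The single-request equivalent with wget would be just

%wget https://vafri.is/quake/

which fetches the page HTML the same way the browser's initial request
does, with no recursion and no robots.txt handling involved.)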
On 11/20/23 13:58, American Citizen wrote:
At the risk of being blocked by the Skalfti website, I found that the
following wget command grabs one and only one file:
%wget -r -A 'index.js' -e robots=off -O index.js https://vafri.is/quake/
Notice that I had to give the file a name using the -O option, and it
is stored in the current working directory.
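One caveat I noticed in the wget manual: combining -r with -O makes wget
concatenate everything it accepts into that single file, which only works
out here because the accept list matches just index.js. A variant I have
not tried, which should keep the file's own name in the current directory,
would be

%wget -r -l 1 -nd -A 'index.js' -e robots=off https://vafri.is/quake/

where -nd suppresses the host/directory tree and -l 1 limits the recursion
depth to one level.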
I read that using the option -e robots=off is considered rude. Is that
generally so?
Thanks for bearing with me on this question, as this is the very first
time I have used wget to grab one specific file without knowing exactly
where in the website's directory tree it is located.
Randall
Maybe y'all already know this, but one tip is to use the "Network" tab
in your browser developer tools and monitor the requests as the page
loads. You should be able to see index.js being loaded by the browser.
Then, you can right-click it and choose "Copy as cURL (POSIX)" (confirmed
in Firefox, but I think Chrome has something similar).
A curl command will be copied to your clipboard that downloads the file
with the same headers and user agent your browser sent for the original
request.
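The copied command will look roughly like the following; I'm guessing at
the path and headers here, since the Network tab will show you the real
ones for your session:

%curl 'https://vafri.is/quake/index.js' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/119.0' -H 'Accept: */*' -H 'Referer: https://vafri.is/quake/'

Note that the copied command prints to stdout, so append -o index.js (or
-O to keep the remote filename) if you want it saved to a file.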
https://everything.curl.dev/usingcurl/copyas
Cheers,
John