On 11/20/2023 2:33 PM, wes wrote:
I imagine the intention of the robots file (in this case set to disallow
all "automated" requests) is to reduce web crawler traffic.

what's ironic is that the worst offenders already ignore it.

-wes

On Mon, Nov 20, 2023 at 2:32 PM American Citizen <[email protected]>
wrote:

I am making a good-faith effort to contact the site administrators. What
is ironic is that anyone can use the Save Page command in the standard
browser tools and get the file that way without asking at all.

On 11/20/23 13:58, American Citizen wrote:
At the risk of being blocked by the Skalfti website, I found that the
following wget command grabs one and only one file:

% wget -r -A 'index.js' -e robots=off -O index.js https://vafri.is/quake/

Notice that I had to give the file a name using the -O option, and it
is stored in the current working directory.
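
For what it's worth, the wget manual cautions against combining -O with
-r, since everything retrieved gets written into the single output file.
Once you know the full URL, a plain non-recursive fetch needs neither
the recursion flags nor the robots override. A sketch, with the caveat
that the path below is only a guess (the real location of index.js on
that server is exactly what's unknown here):

% wget -O index.js 'https://vafri.is/quake/index.js'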

I read that using the option -e robots=off is considered rude. Is
that generally so?

Thanks for bearing with me on this question, as this is the very first
time I have used wget to grab one specific file without knowing
exactly where in the directory tree of the website it is located.

Randall

Maybe y'all already know this, but one tip is to use the "Network" tab in your browser's developer tools and monitor the requests as the page loads. You should be able to see index.js being loaded by the browser. Then you can right-click it and choose "Copy as cURL (POSIX)" (confirmed on Firefox, but I think Chrome has something similar).

This copies a curl command to your clipboard that will download the file with the same headers and user agent your browser sent for the original request.

https://everything.curl.dev/usingcurl/copyas
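
For illustration, the command that lands on the clipboard usually looks
something like the sketch below. The header values are invented and the
URL is only a guess at where index.js lives; yours will carry whatever
your own browser session actually sent. The trailing -o is my addition,
since the copied command writes to stdout:

curl 'https://vafri.is/quake/index.js' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/119.0' \
  -H 'Accept: */*' \
  -H 'Referer: https://vafri.is/quake/' \
  -o index.js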

Cheers,
John
