On 11/20/2023 2:33 PM, wes wrote:
I imagine the intention of the robots.txt file (in this case set to disallow
all "automated" requests) is to reduce web crawler traffic.
What's ironic is that the worst offenders already ignore it.
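For context, a blanket disallow is only two lines of robots.txt. I haven't
checked what this particular site serves, but the generic form is:

User-agent: *
Disallow: /

and honoring it is entirely voluntary on the client's part.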
-wes
On Mon, Nov 20, 2023 at 2:32 PM American Citizen <[email protected]>
wrote:
I am making a good-faith effort to contact the site administrators. What
is ironic is that anyone can use the Save Page command in any standard
browser and get the file that way without asking at all.
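(The single-request equivalent with wget would be just

%wget https://vafri.is/quake/

which fetches the page HTML the same way the browser's initial request
does, with no recursion and no robots.txt handling involved.)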
On 11/20/23 13:58, American Citizen wrote:
At the risk of being blocked by the Skalfti website, I found that the
following wget command grabs one and only one file:
%wget -r -A 'index.js' -e robots=off -O index.js https://vafri.is/quake/
Notice that I had to give the file a name using the -O option, and it
is stored in the current working directory.
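One caveat I noticed in the wget manual: combining -r with -O makes wget
concatenate everything it accepts into that single file, which only works
out here because the accept list matches just index.js. A variant I have
not tried, which should keep the file's own name in the current directory,
would be

%wget -r -l 1 -nd -A 'index.js' -e robots=off https://vafri.is/quake/

where -nd suppresses the host/directory tree and -l 1 limits the recursion
depth to one level.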
I read that using the option -e robots=off is considered rude. Is that
generally so?
Thanks for bearing with me on this question, as this is the very first
time I have used wget to grab one specific file without knowing exactly
where in the website's directory tree it is located.
Randall
Maybe y'all already know this, but one tip is to use the "Network" tab
in your browser developer tools and monitor the requests as the page
loads. You should be able to see index.js being loaded by the browser.
Then, you can right-click it and choose "Copy as cURL (POSIX)" (confirmed
in Firefox, but I think Chrome has something similar).
A curl command will be copied to your clipboard that downloads the file
with the same headers and user agent your browser sent for the original
request.
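The copied command will look roughly like the following; I'm guessing at
the path and headers here, since the Network tab will show you the real
ones for your session:

%curl 'https://vafri.is/quake/index.js' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/119.0' -H 'Accept: */*' -H 'Referer: https://vafri.is/quake/'

Note that the copied command prints to stdout, so append -o index.js (or
-O to keep the remote filename) if you want it saved to a file.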
https://everything.curl.dev/usingcurl/copyas
Cheers,
John