Re: [PLUG] Using wget to download all files from a web site (2)

2023-11-17 Thread Keith Lofstrom
On Fri, Nov 17, 2023 at 12:43:29PM -0800, Keith Lofstrom wrote:
...
> I "wget-ed" a website, and was soon contacted by a
> panicked/angry sysadmin watching their website brought
> to a crawl because their 5 mbps upload bandwidth was
> clobbered for hours by my scrape of their site.  My bad.

When you connect through the internet, packets flow both
ways: ACK packets tell the sending process which packets
arrived and do not need to be re-sent.

If the data packets you request travel down the same
asymmetric, bandwidth-limited channel as the web-surfing
and email ACK packets of the employees at the Portland
EPA office, they can't do their web-work, and they will
designate your office network connection a "toxic internet
packet super-fund site".  :-)

Just kidding.  I hope.

This is something we should all be aware of when we access
the internet.  Every process and system has constraints and
limits.  Neighborly net users should not heedlessly push 
too hard on those limits, because others will be impacted.

-

That said, in this PARTICULAR case,

https://www.publicdata.com/ 

... looks like a private company DESIGNED to provide bulk
data like you are downloading, so I am probably wrong IN
THIS PARTICULAR CASE.  You are probably NOT stepping on
any toes here.  However, you might learn something
helpful from the publicdata FAQ:

https://login.publicdata.com/faq.html

-

With all the high bandwidth bots roaming the web and
guzzling data at considerable expense to all of us, the
publicdata company may have processes that limit data
rates and thwart bots, so they don't need to purchase
as much bulk bandwidth from THEIR network providers. 

If wget pushes on publicdata.com limits in a bot-like
manner, publicdata server software may treat you like
a bot, and behave in frustrating (and unexplained) ways.
If they frustrate a bot, they need not say they are sorry.

There may be ways to rate-limit your bulk data request so
it doesn't trigger their rate limits and looks more like
an obsessed human user.  I am only hypothesizing; there
are web-provider process-management experts reading this
who know how incoming 15 GB requests are handled,
throttled, or thriftily ignored.  Please educate us!
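Along those lines, wget itself has built-in politeness knobs.  A minimal
sketch (these are real wget options; the specific values and the target
URL are just illustrative assumptions, not a recommendation from the
site):

```shell
#!/bin/sh
# Hypothetical throttled mirror: slow a bulk download so it behaves
# more like a patient human than a bot.
#   --wait=2          pause 2 seconds between retrievals
#   --random-wait     vary that pause (roughly 0.5x to 1.5x of --wait),
#                     so the timing is less obviously machine-generated
#   --limit-rate=200k cap transfer speed at about 200 KB/s, so the
#                     server's uplink isn't saturated for hours
#   --no-parent       stay inside the starting directory tree
wget --mirror --no-parent \
     --wait=2 --random-wait \
     --limit-rate=200k \
     "https://www.publicdata.com/"
```

A 15 GB pull at 200 KB/s takes roughly a day, so there is a real
trade-off: the gentler you are, the longer the job runs.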

Keith L.

(who remembers 300 baud modems, and long distance
toll rates)

-- 
Keith Lofstrom  kei...@keithl.com


Re: [PLUG] Using wget to download all files from a web site (2)

2023-11-18 Thread Ted Mittelstaedt



-Original Message-
From: PLUG  On Behalf Of Keith Lofstrom
Sent: Friday, November 17, 2023 7:20 PM


> There may be ways to rate-limit your bulk data request, so it doesn't
> trigger their rate-limits, and looks more like an obsessed human user.
> I hypothesize; there are web provider process management experts
> reading this, who know how incoming 15 GB requests are handled,
> throttled, or thriftily ignored.  Please educate us!

Yes.  You simply do your data-slurping late at night.  Top bandwidth usage
on the Internet is between 10am and 4pm in any given time zone, so if you
are hitting sites in the USA, things start getting busy at 6am PST (thanks
to the east coast), and by noon the North American network is extremely busy.

If you are a doofus who decides to do your slurping at lunchtime then maybe
you can understand when the various ISPs take a dim view of your idiocy and
shut you down.

But after 11PM it's quiet.  From 3AM PST to 6AM PST you have a window where
activity is very low, and you will only be competing with the likes of The
Internet Archive and its Wayback Machine, and the various corporate
"buffered cloud backup" schemes.
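If you don't want to be awake at 3AM yourself, cron can start the job in
that window for you.  A minimal sketch of a crontab entry (the wget flags
are real options; the schedule, log path, and URL are illustrative
assumptions):

```shell
# Hypothetical crontab line: kick off a throttled mirror at 3:00 AM
# every Saturday.  Cron fields: minute hour day-of-month month day-of-week.
# Install with `crontab -e`.
0 3 * * 6 wget --mirror --wait=2 --random-wait --limit-rate=200k "https://www.publicdata.com/" >> "$HOME/slurp.log" 2>&1
```

One design note: wget's `--continue` / `--mirror` behavior means a job
that gets cut off at 6AM can generally pick up where it left off the
next night rather than re-downloading everything.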

Of course, if you are hitting foreign sites, their networks will not
appreciate you; but since those networks aren't owned by your ISP, your
ISP doesn't give a tinker's damn.

Ted