Wget works great, with a couple of caveats:
1) When using wget, have something in the script sanity-check each download 
(file size usually works; for MP3s, also check that the first 3 bytes are the 
ASCII characters "ID3" -- see the sketch below).
Someone I work with did a mass download like the one you're describing from a 
huge music company (75 TB of MP3 files at 256 kbit/s, covering roughly 50% of 
the catalog), and found that about 10% of the downloads failed, leaving an 
HTML error page in the file instead of the actual MP3 content.
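A minimal sanity-check sketch (the size threshold and the *.mp3 glob are 
assumptions for illustration; note that an MP3 without an ID3v2 tag won't 
start with these bytes):

    #!/bin/sh
    # Flag downloads that are suspiciously small or that lack the "ID3"
    # magic bytes marking an ID3v2-tagged MP3.
    MIN_SIZE=100000                    # assumed threshold, in bytes
    for f in *.mp3; do
        size=$(wc -c < "$f")           # file size in bytes
        magic=$(head -c 3 "$f")        # first 3 bytes of the file
        if [ "$size" -lt "$MIN_SIZE" ] || [ "$magic" != "ID3" ]; then
            echo "SUSPECT: $f (size=$size)" >&2
        fi
    done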

2) Authentication can be very tricky, so test it by fetching a single page 
first (see the example below), and make sure that works before going crazy 
with recursive options.
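For example, a one-page test might look like this (the URL and credentials 
are hypothetical, and the site may want a login form plus cookies rather than 
HTTP auth):

    # Fetch one known page and eyeball the result before recursing.
    wget --user=USER --password=PASS -O test.html \
        "http://example.com/members/index.html"

    # If the site uses a login form, log in once and reuse the session cookie:
    wget --save-cookies=cookies.txt --keep-session-cookies \
        --post-data='user=USER&pass=PASS' "http://example.com/login"
    wget --load-cookies=cookies.txt "http://example.com/members/index.html"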

3) When retrieving recursively, SET THE DEPTH LIMIT to something small, like 
1 (see the example below).  Failure to do so can result in an error response 
turning the download process into a crawl-the-whole-web process.
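A minimal sketch (hypothetical URL; --no-parent also keeps wget from 
wandering above the starting directory):

    # -r turns on recursion; -l 1 (--level=1) follows links one hop deep only.
    wget -r -l 1 --no-parent "http://example.com/music/category1/"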

AZ RUNE wrote:
> Yes, once logged in, they are just links on a page.
> 
> Thanks,
> Brian
> 
> On Wed, May 19, 2010 at 2:24 PM, Dan Dubovik <dand...@gmail.com> wrote:
> 
>> wget?
>>
>> If there are simply links on the page to get, you can use the recursive
>> option:
>>
>>        -r
>>        --recursive
>>            Turn on recursive retrieving.
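>> For example (hypothetical URL; see the depth-limit caveat above):
>>
>>        wget -r -l 1 http://example.com/music/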
>>
>>
>> If you have a list of the URLs for the files to get:
>>        -i file
>>        --input-file=file
>>            Read URLs from file.  If - is specified as file, URLs are read
>> from the standard input.  (Use ./- to read from a file literally named -.)
>>
>>            If this function is used, no URLs need be present on the command
>> line.  If there are URLs both on the command line and in an input file,
>> those on the command line will be the first ones to be retrieved.  The file
>> need not be an HTML document (but no harm if it is)---it is enough if the
>> URLs are just listed sequentially.
>>
>>            However, if you specify --force-html, the document will be
>> regarded as html.  In that case you may have problems with relative links,
>> which you can solve either by adding "<base href="url">" to the documents
>> or by specifying --base=url on the command line.
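>> A minimal example (urls.txt is a hypothetical file listing one URL per
>> line):
>>
>>        wget -i urls.txt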
>>
>> On Wed, May 19, 2010 at 1:44 PM, AZ RUNE <arizona.r...@gmail.com> wrote:
>>
>>> I have a friend that does DJ work with a subscription to a closed music
>>> repository.
>>>
>>> In the repository there are 4 categories of music he wants to download,
>>> with 4,000+ songs per category.
>>>
>>> Is there a program that will do that automatically over HTTP, given the
>>> URL? Or would it have to be custom-built?
>>>
>>> Any ideas?
>>>
>>> --
>>> Brian Fields
>>> arizona.r...@gmail.com
>>>


