On 01/11/2014 01:09, Rob Dixon wrote:
> I would use the BBC server to do the search for me, after which there is
> little work to be done. For instance, if I look for all Book at Bedtime
> episodes with this URL
>
>      http://www.bbc.co.uk/radio/programmes/a-z/by/book%20at%20bedtime/player
>
> then I am taken to a page with a link to the series at
>
>      http://www.bbc.co.uk/programmes/b006qtlx/episodes/player?page=1
>
> through to `page=6`. That amounts to 52 programmes which, even on my
> meagre 13 megabit connection, takes less than ten seconds to fetch, and
> the results could be cached for a practically instantaneous response to a
> similar request in the future. There is also the possibility of writing
> a batch solution that makes a query only every minute or so and could be
> run continuously or overnight.

That's a neat idea! (I'd also been concerned with trying to recreate the RSS feeds for programme categories, so I'd focused on pulling everything.)

The search isn't perfect (e.g. try searching for "BBC News"), but you could use it to narrow the results and cut down the amount of scraping needed, then do better matching against title or synopsis in get_iplayer.
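
Something along these lines is roughly what I have in mind. It's a throwaway sketch rather than get_iplayer code: the series pid and page range come from your example above, and the pid regex is a guess at the markup, so treat it as illustration only.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Walk the paginated episode list for one series and collect programme
    # pids. Series pid and page range are taken from the example URLs above;
    # the pid pattern is an assumption about the page markup.
    my $series = 'b006qtlx';    # Book at Bedtime
    my $ua     = LWP::UserAgent->new( agent => 'episode-list-poc/0.1' );

    my %pid;
    for my $page ( 1 .. 6 ) {
        my $url  = "http://www.bbc.co.uk/programmes/$series/episodes/player?page=$page";
        my $resp = $ua->get($url);
        unless ( $resp->is_success ) {
            warn "$url: " . $resp->status_line . "\n";
            next;
        }
        my $html = $resp->decoded_content;
        # Each episode appears as a link to /programmes/<pid>
        $pid{$1}++ while $html =~ m{/programmes/([a-z0-9]{8})\b}g;
    }
    delete $pid{$series};    # drop the series' own pid if it shows up
    print "$_\n" for sort keys %pid;

The caching and the per-title or per-synopsis matching would then sit on top of a list like that.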

> I'm more than happy to write a proof of concept if you're interested. I
> have it half-written already just to get that timing information.
>
> The one thing that bothers me is the terms and conditions of the web
> site. I scanned through them quickly and couldn't find anything about
> robotic access, but it would be a first if there weren't anything there.
> If it's just a matter of obeying the /robots.txt then I'm more than
> happy to go ahead.


At a glance, robots.txt doesn't seem to disallow accessing the sections needed. In the terms of use, there is this, though:

"(d) You agree to use BBC Online Services and access, download, view and/or listen to BBC Content as supplied to you by the BBC and you may not, and you may not assist anyone to, or attempt to, reverse engineer, decompile, disassemble, adapt, modify, copy, reproduce, lend, hire, rent, perform, sub-license, make available to the public, create derivative works from, broadcast, distribute, commercially exploit, transmit or otherwise use in any way BBC Online Services and/or BBC Content in whole or in part except to the extent permitted in these Terms of Use, any relevant Additional Terms and at law."

If I'm downloading pages automatically and programmatically reading certain sections of the HTML, is that viewing it as supplied to me by the BBC?
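
On the robots.txt and rate-limiting side, LWP::RobotUA might do most of the polite work for us: it fetches and honours /robots.txt itself and, by default, waits a minute between requests to the same host, which also covers the "query only every minute or so" batch idea. A minimal sketch (agent name and contact address are placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::RobotUA;

    # A user agent that checks /robots.txt before each request and
    # rate-limits itself per host. Name and address are placeholders.
    my $ua = LWP::RobotUA->new(
        agent => 'episode-list-poc/0.1',
        from  => 'someone@example.org',
    );
    $ua->delay(1);    # wait at least one minute between requests to a host

    my $resp = $ua->get(
        'http://www.bbc.co.uk/programmes/b006qtlx/episodes/player?page=1');
    print $resp->is_success
        ? "fetched OK\n"
        : "refused or failed: " . $resp->status_line . "\n";

It doesn't answer the terms-of-use question, of course, but it would at least keep us on the right side of robots.txt.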
