I was wondering whether the script referred to here is available anywhere.
Thanks,
H

--- In [email protected], "tpowers2010" <wing...@...> wrote:
>
> I'm currently working on a Python 2.5 script to download all the stocks
> listed in the Yahoo Industry Browser <http://biz.yahoo.com/p/> by
> sector then industry.
>
> I basically do the same thing that is done by the Excel workbook found
> at http://icc-az.com/amibroker_files%5CStocks_XLS.zip . However, that
> page says "Since this using plain VBA for all extraction, it is very
> slow. Expect 12 hours to do an extract...".
>
> For comparison, my Python script currently takes about 8 minutes or so.
> The main reason is that I can get ticker, company name, sector, and
> industry without having to download the individual company profile
> pages. And, unlike the Excel solution which downloads entire webpages
> (including images), I only have to grab the basic html page.
>
> Using the Python 3rd party BeautifulSoup module
> <http://www.crummy.com/software/BeautifulSoup/> , it turns out it's
> pretty easy to extract the required information from the raw html
> (rather than making Excel convert webpages to spreadsheets).
>
> Finally, to get the exchange information, instead of having to read each
> company's profile page I use the
> http://finance.yahoo.com/d/quotes.csv?s=TICKERS&f=x URL with TICKERS
> replaced with a + separated list of ticker symbols to get the exchanges
> for 200 companies at once.
>
> A caveat is that it turns out that getting info from the Industry
> Browser pages alone surprisingly yields ticker symbols that are already
> incorrect! (This seems to happen for any stock whose exchange is listed
> as "n/a". My impression is that the newer Yahoo Industry Center
> <http://biz.yahoo.com/ic/ind_index.html> page is more accurate but
> slightly harder to parse.)
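
For anyone wanting to try the batched exchange lookup described above while the original script is unavailable, here is a minimal sketch of just the batching step. It is not tpowers2010's code: it is written for current Python 3 rather than 2.5, and the helper names `chunked` and `quote_urls` are made up for illustration. The URL pattern and the batch size of 200 come from the post. The sketch only builds the URLs; whether that Yahoo endpoint still answers is not something it checks.

```python
from urllib.parse import quote

# URL pattern quoted in the post; TICKERS becomes a '+'-separated list.
QUOTES_URL = "http://finance.yahoo.com/d/quotes.csv?s={tickers}&f=x"
BATCH_SIZE = 200  # "200 companies at once", per the post


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def quote_urls(tickers, batch_size=BATCH_SIZE):
    """Build one quotes.csv URL per batch of '+'-joined ticker symbols."""
    urls = []
    for batch in chunked(list(tickers), batch_size):
        # Escape each symbol on its own so the '+' separators stay literal.
        joined = "+".join(quote(symbol, safe="") for symbol in batch)
        urls.append(QUOTES_URL.format(tickers=joined))
    return urls
```

With roughly 7500 symbols this comes to 38 requests instead of one profile page per company, which is presumably where most of the speed-up comes from.
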
>
> Therefore to be absolutely sure that the tickers are valid, you end up
> having to make sure you can download each company's profile or quotes
> page. The only time I've tried doing that took about 3 hours. As a side
> benefit of this process you can scrape additional information on each
> company (such as number of employees). Only about 10 or so of the 7500+
> symbols were listed incorrectly on the main Industry Browser pages (all
> of them being OTC BB traded stocks).
>
> I'm thinking about using multiple threads to download say 10 pages at
> once to speed up this last process. Unfortunately, I didn't design the
> original code to be thread-safe so this will take some work.
>
> Once I have the basic stock information I spit out a .csv list (readable
> by Excel), broker.sectors, and broker.industries files. I also use a
> separate small Python script to initialize a new AmiBroker database. You
> have to manually update the Markets since there is apparently no way to
> do this from COM (but there are only 8 of them).
>
> One thing I noticed is that the broker.industries file used to
> initialize new databases seems to have an undocumented limit of about 38
> or 39 characters for Industry Name. The "Textile - Apparel Footwear &
> Accessories" industry gets truncated and a bogus industry gets added
> unless I first limit the industry name length.
>
> Also, Industries don't appear to be sorted correctly under their Sectors
> (I saw another post here that mentions the same thing).
>
> Anyway, this is all somewhat of a work in progress. It also is a
> command-line only script. There is no GUI associated with it. You'll
> have to be comfortable with installing ActiveState's free Python 2.5 for
> Windows distribution, installing the BeautifulSoup and mechanize
> modules, and running scripts from a Command Prompt.
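
On the idea of downloading about 10 pages at once: a rough sketch of how that might look today, assuming Python 3 (the `concurrent.futures` module used here did not exist in Python 2.5, so this is not how the original script could have done it). The names `fetch_page` and `fetch_all` are hypothetical. Each worker only downloads and returns bytes, and all results are gathered in the calling thread, which sidesteps most of the thread-safety worry mentioned in the post.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch_page(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_page, max_workers=10):
    """Fetch every URL using up to `max_workers` threads.

    Returns (pages, errors), two dicts keyed by URL. Workers share no
    mutable state; both dicts are filled in by the calling thread only.
    """
    pages, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                pages[url] = future.result()
            except Exception as exc:  # one bad ticker should not stop the run
                errors[url] = exc
    return pages, errors
```

URLs that end up in `errors` would be the candidates for the handful of invalid tickers described above; parsing of the downloaded pages can then stay single-threaded.
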
