No. While I tried his stuff I noticed that the US Stocks database seemed to be out of date. Since I knew this kind of thing is "simple" using Python I decided to write my own downloader from scratch.
Of course the devil is in the details and I still have a few issues to sort out.

--- In [email protected], "areehoi" <aree...@...> wrote:
>
> This is great news. Have you been in contact with Jim Swindle? I know he
> has been looking for someone to take over the updating of the US-Stocks
> database. He has provided a great service, so hopefully you can do the same as
> you've solved the main ingredient. Let us hear from you on progress on the
> project. Thanks for your interest and help.
>
> Dick H
>
> --- In [email protected], "tpowers2010" <wingusr@> wrote:
> >
> > I'm currently working on a Python 2.5 script to download all the stocks
> > listed in the Yahoo Industry Browser <http://biz.yahoo.com/p/> by
> > sector then industry.
> >
> > I basically do the same thing that is done by the Excel workbook found
> > at http://icc-az.com/amibroker_files%5CStocks_XLS.zip . However, that
> > page says "Since this using plain VBA for all extraction, it is very
> > slow. Expect 12 hours to do an extract...".
> >
> > For comparison, my Python script currently takes about 8 minutes or so.
> > The main reason is that I can get ticker, company name, sector, and
> > industry without having to download the individual company profile
> > pages. And, unlike the Excel solution which downloads entire webpages
> > (including images), I only have to grab the basic html page.
> >
> > Using the Python 3rd party BeautifulSoup module
> > <http://www.crummy.com/software/BeautifulSoup/> , it turns out it's
> > pretty easy to extract the required information from the raw html
> > (rather than making Excel convert webpages to spreadsheets).
> >
> > Finally, to get the exchange information, instead of having to read each
> > company's profile page I use the
> > http://finance.yahoo.com/d/quotes.csv?s=TICKERS&f=x URL with TICKERS
> > replaced with a + separated list of ticker symbols to get the exchanges
> > for 200 companies at once.
> >
> > A caveat is that getting info from the Industry Browser pages alone
> > surprisingly yields some ticker symbols that are already incorrect!
> > (This seems to happen for any stock whose exchange is listed as "n/a".)
> > My impression is that the newer Yahoo Industry Center
> > <http://biz.yahoo.com/ic/ind_index.html> page is more accurate but
> > slightly harder to parse.
> >
> > Therefore, to be absolutely sure that the tickers are valid, you end up
> > having to make sure you can download each company's profile or quotes
> > page. The only time I've tried doing that, it took about 3 hours. As a side
> > benefit of this process you can scrape additional information on each
> > company (such as number of employees). Only about 10 or so of the 7500+
> > symbols were listed incorrectly on the main Industry Browser pages (all
> > of them being OTC BB traded stocks).
> >
> > I'm thinking about using multiple threads to download, say, 10 pages at
> > once to speed up this last process. Unfortunately, I didn't design the
> > original code to be thread-safe so this will take some work.
> >
> > Once I have the basic stock information I spit out a .csv list (readable
> > by Excel), broker.sectors, and broker.industries files. I also use a
> > separate small Python script to initialize a new AmiBroker database. You
> > have to manually update the Markets since there is apparently no way to
> > do this from COM (but there are only 8 of them).
> >
> > One thing I noticed is that the broker.industries file used to
> > initialize new databases seems to have an undocumented limit of about 38
> > or 39 characters for Industry Name. The "Textile - Apparel Footwear &
> > Accessories" industry gets truncated and a bogus industry gets added
> > unless I first limit the industry name length.
> >
> > Also, Industries don't appear to be sorted correctly under their Sectors
> > (I saw another post here that mentions the same thing).
> >
> > Anyway, this is all somewhat of a work in progress. It is also a
> > command-line only script. There is no GUI associated with it. You'll
> > have to be comfortable with installing ActiveState's free Python 2.5 for
> > Windows distribution, installing the BeautifulSoup and mechanize
> > modules, and running scripts from a Command Prompt.
> >
>
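For anyone who wants to try the batched exchange lookup described in the quoted post, here is a minimal sketch of how the ticker list might be split into groups of 200 and turned into quotes.csv URLs. It is written for current Python 3 rather than the Python 2.5 the script targets, it only builds the URLs (no request is made), and the names chunked and build_exchange_urls are made up for illustration; they are not taken from the actual script.

```python
# Sketch only: builds the batched URLs, does not download anything.
# URL pattern and batch size of 200 come from the post; the helper
# names below are hypothetical.
QUOTES_URL = "http://finance.yahoo.com/d/quotes.csv?s={tickers}&f=x"
BATCH_SIZE = 200


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_exchange_urls(tickers, batch_size=BATCH_SIZE):
    """Return one URL per batch, with the tickers joined by '+'."""
    return [
        QUOTES_URL.format(tickers="+".join(batch))
        for batch in chunked(list(tickers), batch_size)
    ]


# Example with made-up symbols: 450 tickers need three requests.
symbols = ["SYM%d" % i for i in range(450)]
print(len(build_exchange_urls(symbols)))
```

With 7500+ symbols this comes to roughly 38 requests instead of one profile page per company, which is where most of the time saving would come from.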
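The industry-name workaround mentioned in the quoted post could look something like the sketch below. The limit of 38 is an assumption (the post only says "about 38 or 39 characters"), and clip_industry_name and find_collisions are hypothetical helper names. The second helper is there because clipping could, in principle, make two different industry names identical, which would be worth checking before writing the file.

```python
# Assumption: 38 is a guess at the limit; the post says "about 38 or 39".
MAX_INDUSTRY_NAME = 38


def clip_industry_name(name, limit=MAX_INDUSTRY_NAME):
    """Return `name` trimmed of whitespace and cut to at most `limit` characters."""
    return name.strip()[:limit].rstrip()


def find_collisions(names, limit=MAX_INDUSTRY_NAME):
    """Return clipped names that more than one original name maps to."""
    seen = {}
    for name in names:
        seen.setdefault(clip_industry_name(name, limit), set()).add(name.strip())
    return sorted(clipped for clipped, originals in seen.items() if len(originals) > 1)


print(clip_industry_name("Textile - Apparel Footwear & Accessories"))
```

If the real limit turns out to be 39, only the constant needs to change.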
