Well, well, well, Anthra, you are a clever person, aren't you!!!! I nearly fell over when I read your post. Would it help if we used another website to gather data? As you stated, the tables are not all that well structured. Well, I will give this one a go first, and if there is anything I can do for you, just ask and I will try my best. I really appreciate what you have done. Of course I will try to follow your code to see if any of it rubs off on me....LOL
Regards
Graham
"Anthra Norell" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
>
> ----- Original Message -----
> From: "Graham Feeley" <[EMAIL PROTECTED]>
> Newsgroups: comp.lang.python
> To: <python-list@python.org>
> Sent: Friday, July 28, 2006 5:11 PM
> Subject: Re: Newbie..Needs Help
>
>> Thanks Nick for the reply.
>> Of course my first post was a general posting to see if someone would be
>> able to help. Here is the website which holds the data I require:
>> http://www.aapracingandsports.com.au/racing/raceresultsonly.asp?storydate=27/07/2006&meetings=bdgo
>>
>> The fields required are as follows
>> NSW Tab
>> #     Win      Place
>> 2     $4.60    $2.40
>> 5              $2.70
>> 1              $1.30
>> Quin  $23.00
>> Tri   $120.70
>>
>> Field names are
>> Date ( not important )
>> Track................= Bendigo
>> RaceNo............on web page
>> Res1st...............2
>> Res2nd..............5
>> Res3rd..............1
>> Div1..................$4.60
>> DivPlc...............$2.40
>> Div2..................$2.70
>> Div3..................$1.30
>> DivQuin.............$23.00
>> DivTrif...............$120.70
>>
>> As you can see there are a total of 6 meetings involved, and I would need
>> to put in this parameter ( =bdgo) or ( =gosf); these are the meeting tracks.
>>
>> Hope this is more enlightening.
>> Regards
>> graham
>
> Graham,
>
> Only a few days ago I gave someone a push who had a very similar problem.
> I handed him code ready to run. I am doing it again for you.
> The site you use is much harder to interpret than the other one was,
> and so I took the opportunity to experimentally stretch the envelope of a
> new brain child of mine: a stream editor called SE. It is new, and so I
> also take the opportunity to demo it.
> One correspondent in the previous exchange was Paul McGuire, the
> author of 'pyparsing'. He made a good case for using 'pyparsing' in
> situations like yours. Unlike a stream editor, a parser reads structure
> in addition to data and can relate the data to its context.
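[The parser-versus-stream-editor distinction Frederic draws can be sketched with the standard library's HTML parser (`html.parser` in current Python; the module was named `HTMLParser` in the Python 2 of this thread). The markup and class name below are invented for illustration; this is not Frederic's SE approach:]

```python
# A minimal sketch of how a parser keeps context: it knows when it is
# inside a <td>, where a stream editor only sees a flat character stream.
# CellGrabber and the sample markup are invented for this illustration.
from html.parser import HTMLParser

class CellGrabber(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':                 # entering a cell: start a new field
            self.in_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':                 # leaving the cell
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:                # keep text only when inside a cell
            self.cells[-1] += data.strip()

g = CellGrabber()
g.feed('<table><tr><td>2</td><td>$4.60</td><td>$2.40</td></tr></table>')
print(g.cells)                          # ['2', '$4.60', '$2.40']
```

[Because the parser tracks which tag it is inside, each dividend is tied to its cell — exactly the context a flat stream edit discards.]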
> Analyzing the tables I noticed that they are poorly structured: the
> first column contains both data and ids. Some records are shorter than
> others, so column ids have to be guessed and hard coded. Missing data is
> sometimes a dash, sometimes nothing. The inconsistencies seem to be
> consistent, though, down the eight tables of the page, so they can be
> formalized with some confidence that they are systematic. If Paul could
> spend some time on this, I'd be much interested to see how he would
> handle the relative disorder.
> Another thought: the time one invests in developing a program should
> not exceed the time it can save overall (not talking about recreational
> programming). Web pages justify an extra measure of caution, because they
> may change at any time, and every time they do, the reader stops working
> and a revision becomes an unscheduled priority.
>
> So, here is your program. I write it so you can copy the whole thing to a
> file. Next copy SE from the Cheese Shop. Unzip it and put both SE.PY and
> SEL.PY where your Python programs are. Then 'execfile' the code in an
> IDLE window, call display_horse_race_data ('Bendigo', '27/07/2006') and
> see what happens. You'll have to wait ten seconds or so.
>
> Regards
>
> Frederic
>
> ######################################################################################
>
> TRACKS = { 'New Zealand'  : '',
>            'Bendigo'      : 'bdgo',
>            'Gosford'      : 'gosf',
>            'Northam'      : 'nthm',
>            'Port Augusta' : 'pta',
>            'Townsville'   : 'town',
>          }
>
>
> # This function does it all once all functions are loaded. If nothing
> # shows, the page has no data.
>
> def display_horse_race_data (track, date, clip_summary = 100):
>
>     """
>     track: e.g. 'Bendigo' or 'bdgo'
>     date: e.g. '27/07/2006'
>     clip_summary: each table has a long summary header.
>         The argument says how much of it to show.
>     """
>
>     if track [0].isupper ():
>         if TRACKS.has_key (track):
>             track = TRACKS [track]
>         else:
>             print 'No such track %s' % track
>             return
>     open ()
>     header, records = get_horse_race_data (track, date)
>     show_records (header, records, clip_summary)
>
>
> ######################################################################################
>
>
> import SE, urllib
>
> _is_open = 0
>
> def open ():
>
>     global _is_open
>
>     if not _is_open:   # Skip repeat calls
>
>         global Data_Filter, Null_Data_Marker, Tag_Stripper, Space_Deflator, CSV_Maker
>
>         # Making the following Editors is a step-by-step process, adding
>         # one element at a time and looking at what it does and what
>         # should be done next.
>         # Get pertinent data segments
>         header = ' "~(?i)Today\'s Results - .+?<div style="padding-top:5px;">~==*END*OF*HEADER*" '
>         race_summary = ' "~(?i)Race [1-9].*?</font><br>~==" '
>         data_segment = ' "~(?i)<table border=0 width=100% cellpadding=0 cellspacing=0>(.|\n)*?</table>~==*END*OF*SEGMENT*" '
>         Data_Filter = SE.SE (' <EAT> ' + header + race_summary + data_segment)
>
>         # Some data items are empty. Fill them with a dash.
>         mark_null_data = ' "~(?i)>\s* \s*</td>~=>-" '
>         Null_Data_Marker = SE.SE (mark_null_data + ' " = " ')
>
>         # Dump the tags
>         eat_tags = ' "~<(.|\n)*?>~=" '
>         eat_comments = ' "~<!--(.|\n)*?-->~=" '
>         Tag_Stripper = SE.SE (eat_tags + eat_comments + ' (13)= ')
>
>         # Visual inspection is easier without all those tabs and empty lines
>         Space_Deflator = SE.SE ('"~\n[\t ]+~=(10)" "~[\t ]+\n=(10)" | "~\n+~=(10)"')
>
>         # Translating line breaks to tabs will make a tab-delimited CSV
>         CSV_Maker = SE.SE ( '(10)=(9)' )
>
>         _is_open = 1   # Block repeat calls
>
>
> def close ():
>
>     """Call close () if you want to free up memory"""
>
>     global Data_Filter, Null_Data_Marker, Tag_Stripper, Space_Deflator, CSV_Maker
>     del Data_Filter, Null_Data_Marker, Tag_Stripper, Space_Deflator, CSV_Maker
>     urllib.urlcleanup ()
>     del urllib
>     del SE
>
>
> def get_horse_race_data (track, date):
>
>     """track: a track code from TRACKS, e.g. 'bdgo'
>     date: e.g. '27/07/2006'
>     The website shows partial data or none at all, probably depending on
>     race schedules. The relevance of the date in the url is unclear.
>     """
>
>     def make_url (track, date):
>         return 'http://www.aapracingandsports.com.au/racing/raceresultsonly.asp?storydate=%s&meetings=%s' % (date, track)
>
>     page = urllib.urlopen (make_url (track, date))
>     p = page.read ()
>     page.close ()
>     # When developing the program, don't get the file from the internet on
>     # each call. Download it and read it from the hard disk.
>
>     raw_data = Data_Filter (p)
>     raw_data_marked = Null_Data_Marker (raw_data)
>     raw_data_no_tags = Tag_Stripper (raw_data_marked)
>     raw_data_compact = Space_Deflator (raw_data_no_tags)
>     data = CSV_Maker (raw_data_compact)
>     header, tables = data.split ('*END*OF*HEADER*', 1)
>     records = tables.split ('*END*OF*SEGMENT*')
>     return header, records [:-1]
>
>
> def show_record (record, clip_summary = 100):
>
>     """clip_summary: None will display it all"""
>
>     # The records all have 55 fields.
>     # These are the relevant indexes:
>     SUMMARY               =  0
>     FIRST                 =  8
>     FIRST_NSWTAB_WIN      =  9
>     FIRST_NSWTAB_PLACE    = 10
>     FIRST_TABCORP_WIN     = 11
>     FIRST_TABCORP_PLACE   = 12
>     FIRST_UNITAB_WIN      = 13
>     FIRST_UNITAB_PLACE    = 14
>     SECOND                = 15
>     SECOND_NSWTAB_PLACE   = 17
>     SECOND_TABCORP_PLACE  = 19
>     SECOND_UNITAB_PLACE   = 21
>     THIRD                 = 22
>     THIRD_NSWTAB_PLACE    = 23
>     THIRD_TABCORP_PLACE   = 24
>     THIRD_UNITAB_PLACE    = 25
>     QUIN_NSWTAB_PLACE     = 28
>     QUIN_TABCORP_PLACE    = 30
>     QUIN_UNITAB_PLACE     = 32
>     EXACTA_NSWTAB_PLACE   = 35
>     EXACTA_TABCORP_PLACE  = 37
>     EXACTA_UNITAB_PLACE   = 39
>     TRI_NSWTAB_PLACE      = 41
>     TRI_TABCORP_PLACE     = 42
>     TRI_UNITAB_PLACE      = 43
>     DDOUBLE_NSWTAB_PLACE  = 46
>     DDOUBLE_TABCORP_PLACE = 48
>     DDOUBLE_UNITAB_PLACE  = 50
>     SUB_SCR_NSW           = 52
>     SUB_SCR_TABCORP       = 53
>     SUB_SCR_UNITAB        = 54
>
>     if clip_summary == None:
>         print record [SUMMARY]
>     else:
>         print record [SUMMARY] [:clip_summary] + '...'
>     print
>
>     # Your specification:
>     # Date ( not important )          -> In url and summary of first record
>     # Track................= Bendigo  -> In url and summary of first record
>     # RaceNo............on web page   -> In summary (index of record + 1?)
>     # Res1st...............2
>     # Res2nd..............5
>     # Res3rd..............1
>     # Div1..................$4.60
>     # DivPlc...............$2.40
>     # Div2..................$2.70
>     # Div3..................$1.30
>     # DivQuin.............$23.00
>     # DivTrif...............$120.70
>
>     print 'Res1st  > %s' % record [FIRST]
>     print 'Res2nd  > %s' % record [SECOND]
>     print 'Res3rd  > %s' % record [THIRD]
>     print 'Div1    > %s' % record [FIRST_NSWTAB_WIN]
>     print 'DivPlc  > %s' % record [FIRST_NSWTAB_PLACE]
>     print 'Div2    > %s' % record [SECOND_NSWTAB_PLACE]
>     print 'Div3    > %s' % record [THIRD_NSWTAB_PLACE]
>     print 'DivQuin > %s' % record [QUIN_NSWTAB_PLACE]
>     print 'DivTrif > %s' % record [TRI_NSWTAB_PLACE]
>
>     # Add others as you like from the list of index names above
>
>
> def show_records (header, records, clip_summary = 100):
>
>     print '\n%s\n' % header
>     for record in records:
>         show_record (record.split ('\t'), clip_summary)
>     print '\n'
>
>
> ##########################################################################
> #
> # show_records (header, records, 74) displays:
> #
> # Today's Results - 27/07/2006 BENDIGO
> #
> # Race 1 results:Carlsruhe Roadhouse Mdn Plate $11,000 2yo Maiden 1400m Appr...
> #
> # Res1st  > 2
> # Res2nd  > 5
> # Res3rd  > 1
> # Div1    > $4.60
> # DivPlc  > $2.40
> # Div2    > $2.70
> # Div3    > $1.30
> # DivQuin > $23.00
> # DivTrif > $120.70
> #
> #
> # Race 2 results:Gerard K. House P/L Mdn Plate $11,000 3yo Maiden 1400m Appr...
> #
> # Res1st  > 6
> # Res2nd  > 7
> # Res3rd  > 5
> # Div1    > $3.50
> # DivPlc  > $1.60
> # Div2    > $2.60
> # Div3    > $1.40
> # DivQuin > $18.60
> # DivTrif > $75.80
> #
> #
> # Race 3 results:Richard Cambridge Printers Mdn $11,000 3yo Maiden 1400m Appr...
> #
> # Res1st  > 11
> # Res2nd  > 12
> # Res3rd  > 1
> # Div1 ...
> #
> # ... etc
> #

-- http://mail.python.org/mailman/listinfo/python-list
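[The tail end of Frederic's pipeline — splitting on the *END*OF*HEADER* and *END*OF*SEGMENT* markers, then on tabs — can be tried in isolation on a made-up string. Only the marker strings and the split logic come from the code above; the race data and field positions below are invented (the real records have 55 fields):]

```python
# Toy stand-in for the output of CSV_Maker: fields are tab-delimited and
# the SE editors have left *END*OF*HEADER* / *END*OF*SEGMENT* markers.
# The race data here is invented for illustration.
data = ("Today's Results - 27/07/2006 BENDIGO*END*OF*HEADER*"
        "Race 1\t2\t$4.60\t$2.40*END*OF*SEGMENT*"
        "Race 2\t6\t$3.50\t$1.60*END*OF*SEGMENT*")

header, tables = data.split('*END*OF*HEADER*', 1)
records = tables.split('*END*OF*SEGMENT*')[:-1]   # last piece is empty

for record in records:
    fields = record.split('\t')        # one tab-delimited record per race
    print(fields[0], '->', fields[1])  # Race 1 -> 2, Race 2 -> 6
```

[This is why get_horse_race_data returns `records [:-1]`: the text ends with a segment marker, so the final piece of the split is an empty string.]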