[wtr-general] Re: Pulling hair out on screen scraping

Charley Baker Sat, 03 Jan 2009 09:39:50 -0800

Hi there,
  I'm not sure what you mean by Ruby and Watir being poorly documented. For
Ruby, the first edition of the Pickaxe book, which is comprehensive, is free
and available online. There are dozens of other tutorials, sites and blogs
about Ruby. Watir also has a lot of examples, a tutorial
(http://wiki.openqa.org/display/WTR/Tutorial) and other information on the
wiki. If there's something you feel is missing, don't hesitate to suggest it
or add it yourself.


  Oddly, your example doesn't use Watir at all. If you wanted to use Watir
to do the same thing, here are some possibilities:

# do something else with the span in the block if you want,
# e.g. assign some variables, etc.
browser.spans.each {|s| puts s.text}

# find the span by a regex and assign it to a variable
var = browser.span(:id, /ctl/).text
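
To make the Watir route a bit more concrete, here is a rough sketch rather
than a tested script. It assumes `browser` is a Watir browser you have
already created (e.g. with Watir::IE.new) and that the title spans follow the
id pattern from your post. The names BASE_URL, TITLE_ID, course_url and
span_texts are made up for illustration, and the Excel part is left out.

```ruby
# Base of the course page URL, taken from the posted script.
BASE_URL = "http://bookstore.umbc.edu/SelectCourses.aspx" \
           "?src=2&type=2&stoid=9&trm=Spring%2009&cid="

# Id pattern for the title spans, taken from the original post.
TITLE_ID = /rptCourses_ctl\d\d_rptItems_ctl\d\d_lblItemTxtTitle\z/

# Build the URL for one course. "#{...}" interpolates the value into the
# string; a literal "(colVal)" inside the quotes would be sent as-is.
def course_url(course_id)
  "#{BASE_URL}#{course_id}"
end

# Collect the text of every span whose id matches id_pattern.
# `browser` is assumed to respond to #spans, and each span to #id and
# #text, as Watir elements do.
def span_texts(browser, id_pattern)
  texts = []
  browser.spans.each do |span|
    texts << span.text if span.id =~ id_pattern
  end
  texts
end
```

With a real browser you would call browser.goto(course_url(course_id)) for
each course number read from the spreadsheet, then
span_texts(browser, TITLE_ID) to get the titles on that page.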

An interesting example using Hpricot and regexes to find book information -
ISBN, price, etc.
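
On the question of which title belongs to which ISBN, one possible approach
is to capture the ctl numbers from each span id and group the fields by them.
The sketch below uses plain Ruby regexps on a string instead of Hpricot, so
it is a swapped-in technique, not the example mentioned above. The sample
HTML, the titles and the second ISBN are invented for illustration; only the
id pattern and the first ISBN come from your post. ITEM_SPAN, SAMPLE_HTML and
books_from_html are hypothetical names.

```ruby
# One labelled span. Captures: course index, item index, field name, text.
ITEM_SPAN = /<span\s+id="rptCourses_ctl(\d\d)_rptItems_ctl(\d\d)_lblItemTxt(\w+)"[^>]*>([^<]+)<\/span>/

# Group every matching span by its (course, item) index pair so that the
# Title, ISBN, etc. of one book end up in the same hash.
def books_from_html(html)
  books = {}
  html.scan(ITEM_SPAN) do |course, item, field, text|
    (books[[course, item]] ||= {})[field] = text.strip
  end
  # Return the books in page order (sorted by their index pair).
  books.sort_by { |key, _fields| key }.map { |_key, fields| fields }
end

# Made-up sample page fragment with two books for one course.
SAMPLE_HTML = <<HTML
<span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtTitle" style="font-weight:bold;">Example Title One</span>
<span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtISBN">9780324574289</span>
<span id="rptCourses_ctl00_rptItems_ctl01_lblItemTxtTitle" style="font-weight:bold;">Example Title Two</span>
<span id="rptCourses_ctl00_rptItems_ctl01_lblItemTxtISBN">0000000000000</span>
HTML

books_from_html(SAMPLE_HTML).each do |book|
  puts "#{book['Title']} - #{book['ISBN']}"
end
```

Each hash can then be written to one spreadsheet row. With Watir, the string
to scan could come from browser.html, which includes the generated content.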

Scrubyt is another library for screen scraping, which internally uses either
Firewatir or Mechanize. Here's a link to some examples:
http://wiki.scrubyt.org/index.php?title=Tutorials

HTH,


Charley Baker
blog: http://charleybakersblog.blogspot.com/
Project Manager, Watir, http://wtr.rubyforge.org
QA Architect, Gap Inc Direct


On Sat, Jan 3, 2009 at 7:12 AM, Bissquitt <bissqu...@gmail.com> wrote:

>
> Forgot to include the code I have thus far. (Currently not working due
> to the Hpricot portion.)
>
> excel = WIN32OLE.new("excel.application")
> excel.visible = true
> workbook = excel.workbooks.open('E:\books\spring 09 classes.xls')
> worksheet=workbook.worksheets(1)
>
>
> contLoop = true
> row = 1
>
>
> while contLoop do
>      colVal = worksheet.Cells(row, 'a').Value
>      if (colVal) then
>          doc = Hpricot(open("http://bookstore.umbc.edu/SelectCourses.aspx?src=2&type=2&stoid=9&trm=Spring%2009&cid=(colVal)"))
>          a = doc.search("sp...@id='rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtTitle']").inner_text
>          worksheet.Cells(row, 'f').value = a
>
>
>      else
>          contLoop = false
>      end
>
>      row +=  1
>      sleep 1
> end
>
>
> On Jan 3, 8:32 am, Bissquitt <bissqu...@gmail.com> wrote:
> > Granted, I am new to Watir and Ruby in general, but I do have a
> > background in programming. My brief experience has been that Watir and
> > Ruby are awesome but VERY poorly documented, which is odd considering
> > the massive amount of web pages dedicated to them.
> >
> > Anyway, here is the issue I am having.
> >
> > I am trying to screen scrape book information from a college
> > bookstore's website. My first attempt was in PHP (and I had a full script
> > done for it), but then I realized that the site uses JavaScript to get info
> > from their database, so all I was scraping was the static HTML and I
> > missed the generated stuff I need.
> >
> > The script in theory:
> > opens an excel document,
> > looks at (A1) and goes to "www.website.com/(A1)" where (A1) is a
> > course number,
> > stores Title, ISBN and other info into B1, C1, D1 etc. (I also have to
> > take into account more than 1 book per class, though once I get the
> > first I should be able to do this.)
> > goes to (A2) and repeats.
> >
> > From what I have seen there are 2 ways to do this each with its own
> > problem.
> >
> > 1) use hpricot or some other parser to find the proper tag. This has 2
> > issues.
> >
> > <span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtISBN">9780324574289</span>
> > The second ctl00 iterates to ctl01 for the second book (I am hoping I
> > can just use a regexp inline).
> >
> > The second issue is that I have not been able to figure out how to
> > pick out a span tag. There are all sorts of commands for finding links
> > and tables and such, but I can't figure out how to pick out that
> > particular tag (specifically with Hpricot).
> >
> > 2) Load the entire page into a variable, strip out all new lines and
> > tabs, scan entire page for specific regexp
> > <span id="rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtTitle"
> > style="font-weight:bold;">[^<]+<\/span>
> > I know this works; I used rubulator to test it. It returns all titles
> > of books on the page. I do foresee an issue of which title belongs to
> > which other info if I do it that way, though.
> >
> > If an exact example is required I can give out all the info you
> > require, though I figured it would be more clutter than helpful. An
> > actual syntax example would be most helpful rather than just referring
> > me to a class definition, though I will take whatever is offered.
> >
> > Many thanks,
> > Michael
> >
>
