[wtr-general] Re: Pulling hair out on screen scraping

Charley Baker Sat, 03 Jan 2009 12:41:41 -0800

It can be a bit overwhelming to learn Ruby and various libraries at the same
time. I'd recommend taking a look at the Pickaxe book:
http://whytheluckystiff.net/ruby/pickaxe/   just to get some general
familiarity. There are other Ruby tutorials online as well as some good
books - The Ruby Way, Everyday Scripting, O'Reilly's Ruby book.

succ!, as you mention below, is a Ruby core method. Gotapi also has a good
searchable reference to the Ruby standard API: http://www.gotapi.com/html (click
on the Ruby Standard Packages). The Pickaxe book from the link above also has
an index of the core API, with examples for many of the methods.

Here's a link to the Watir rdocs in case you find them useful:
http://wtr.rubyforge.org/rdoc/ and a link to the supported elements (though
openqa is down right now):
http://wiki.openqa.org/display/WTR/Methods+supported+by+Element


Strange that the hpricot site is down now as well.

Another useful way to learn how to use libraries in Ruby is to take a look
at their unit tests. Watir has a large number of unit tests, and hpricot has
some too. They're located under your Ruby install directory, in gems.
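
If you aren't sure where that gems directory lives, RubyGems can tell you. This is only a rough sketch, assuming a standard RubyGems install. The method names are made up for illustration, and the directory names in the glob (test, unittests, spec) are guesses at common layouts, since not every gem ships its tests:

```ruby
require 'rubygems'

# Directory that holds the unpacked source of every installed gem.
def installed_gems_root
  File.join(Gem.dir, 'gems')
end

# Test directories shipped inside installed gems. The directory names
# are assumptions; layouts vary from gem to gem.
def gem_test_dirs
  Dir.glob(File.join(installed_gems_root, '*', '{test,unittests,spec}'))
end

puts installed_gems_root
gem_test_dirs.each { |dir| puts dir }
```

Reading through whatever this prints is a quick way to find working examples for a library you already have installed.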

Ruby comes with a couple of documentation tools: ri and rdoc. For the gems you
have installed locally, you can browse all of the rdocs by going to the command
line and typing:
gem server
Then browse to http://localhost:8808
ri can also be used from the command line:
ri String#succ!
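
To make succ! concrete, here is a small sketch in plain Ruby. The course names and the COLUMN_A hash are made-up stand-ins for a worksheet column, and first_empty_row is a hypothetical helper; the real loop from the "ruby on windows" site asks Excel for each cell through WIN32OLE instead.

```ruby
# String#succ! replaces the string with its successor, in place. The !
# is a Ruby naming convention for methods that modify the receiver.
line = '1'.dup   # .dup guarantees a mutable string
line.succ!       # line is now "2"

# Digits carry over the way you'd expect from counting.
nine = '9'.dup
nine.succ!       # nine is now "10"

# Stand-in for column A of a worksheet: row number => cell value.
# (Hypothetical data, not taken from the real spreadsheet.)
COLUMN_A = { '1' => 'CMSC 201', '2' => 'MATH 151', '3' => 'ENGL 100' }

# Step through the rows until a cell has no value.
def first_empty_row(cells)
  line = '1'.dup
  line.succ! while cells[line]
  line
end

puts first_empty_row(COLUMN_A)   # prints 4
```

So the loop in the example simply counts rows upward, as strings, until it reaches one with nothing in it.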

Additional responses inline:



On Sat, Jan 3, 2009 at 10:31 AM, Bissquitt <bissqu...@gmail.com> wrote:

>
> Regarding documentation, I read the Tutorial all the way through but
> it only hit on a few specific examples, leaving out other commands
> altogether. I've visited MANY Ruby and Watir sites and never once saw
> the .span command (does it just search for <span> tags? guess I'll
> google it after this post). I never even found a site listing all the
> Watir commands ( http://us.php.net/manual/en/function.abs.php as an
> example). In addition there are SO MANY tutorials and such online that
> are all so poorly done that finding a good one via Google is a
> needle in a haystack scenario, i.e. (oh great, you showed me that
> specific command, but showed me nothing about how that command works,
> so unless I want to use it exactly the way you used it, it's useless).
> My example here is the "ruby on windows" site. If I google for
> anything regarding Ruby and Excel I either get that site, or another
> site that just provides me a link to that site, and am forced to make
> do with that site in order to teach myself how to interact with
> Excel. The site itself lists a BUNCH of examples but leaves it up to
> you to try and pick apart the syntax to understand what it is doing.
> For example:
>
> line = '1'
> while worksheet.Range("a#{line}")['Value']
>   line.succ!
> end
> #line now holds row number of first empty row
>
> What on earth does .succ! do? It never tells me. The site, and most
> that I've seen, are written not to target new people and tutor them but
> to target advanced users with a more "so here's a cool way to approach
> the problem" approach. A simple "ok, here is the Excel class, here
> are the commands in it and what they do, here is a syntax example"
> would be far more helpful as it doesn't leave anything out. I'm still
> not sure if it's possible to return what row the active cell is on.
>

Excel is a strange one. :) Agreed that most sites assume a basic familiarity
with Ruby, and with the links above you should be able to get into it fairly
quickly. Accessing Excel is done through its COM interface, so one of the
best sources of documentation is actually the Microsoft Excel VBA help file.
There's a link to the standalone version of it somewhere on the internets if
you don't have it installed. There are some Excel libraries on our wiki, as
well as a project on Rubyforge called Rasta, which use Excel. You can browse
through the source code for those.


>
> ...Which is when I decided to ask actual people and ended up here.
> (thanks again btw)
>
>
> ...After that long-winded response: I was trying to use Watir to
> scrape the page because I was having issues with the javascript
> not being executed before the scrape (when I did it in php) and
> figured that a driven web browser would be sure to get it...hence
> Watir.


Yep, makes sense. Watir is great at testing JavaScript-heavy sites and ajaxy
stuff, since it works with the generated DOM instead of the page source.


>
>
> The reason my example was not using Watir is because I was unable to
> find any documentation on how to do what I needed. I saw the
> browser.links and browser.table but those were the only 2 I found;
> there was no "here is a list of the commands" as I mentioned above.
> Consequently I found even less on hpricot since all I get is a 404 on
> its main site, and every other site links to it, so whether or not it
> was documented is irrelevant; all I have to work with is trying to
> piece together other people's code and work with it.
>
> I don't quite follow your first example since I am barely familiar
> with Ruby syntax (though it appears to be similar to Java). What is the
> |s| ?



 browser.spans.each {|s| puts s.text}
Ruby differs from Java, C++ and some other languages in this respect:
it uses internal iterators instead of the external iterators used in Java.
In this example, spans is a collection of the spans on the page; similar
collections exist for other html elements (divs, links, lis, etc.). The
collection is enumerable with each. each takes a block (surrounded by curly
braces or do...end), iterates through the items and passes each one to the
block as |s|. So each span is passed in and assigned to the local variable s,
and then you can do what you want with it.
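
The same mechanics can be tried without a browser. A minimal sketch in plain Ruby, where an array of strings stands in for the spans on a page (the sample values are made up) and each_item is a hypothetical method written only to show roughly what each does for you:

```ruby
# Plain strings standing in for the spans on a page (made-up data).
span_texts = ['Intro to Ruby', '9780324574289', 'In stock']

# Brace form: each hands every item to the block, where it is named s.
span_texts.each { |s| puts s }

# do...end form: same behaviour, usually used for multi-line blocks.
span_texts.each do |s|
  puts s.upcase
end

# Roughly what each does internally: walk the items and yield each one
# to whatever block the caller supplied.
def each_item(items)
  index = 0
  while index < items.length
    yield items[index]   # runs the caller's block with this item
    index += 1
  end
end

each_item(span_texts) { |s| puts s.length }
```

The loop lives inside the collection, and your code only supplies the body. That is the "internal iterator" idea, as opposed to asking a Java Iterator for the next element yourself.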



>
> Your second example seems to be much closer to what I need since there
> are MANY spans on the page but only a handful matching the regexp
> pattern I gave above.
>
> Would you be able to break down the second example for me?
>
> var = browser.span(:id, /ctl/).text
>
> I know:
> var is the variable being stored into
> browser is the watir browser object being driven
> I'm guessing span just looks for span tags?
> I'm also guessing that (:id, /ctl/) looks for any span tag with an id
> matching /ctl/ ? (this is where I'm not following you as much)
> What does the : in your example do? What exactly is the second
> argument doing, what are the slashes?
> And what does the .text at the end do?


Sure, the basic syntax for elements in Watir is browser.html_element(:how,
what)
In this case that breaks down to:
- call the span method on browser
- how - we want to find an element by its id attribute, so we use a symbol,
denoted by a leading colon, to specify id - :id. Symbols are essentially
lightweight strings; think of one as a pointer to a string rather than a new
string created in some memory space. :symbol points to that one string during
the entire program execution (if that doesn't make sense then accept it on
faith for now :) )
- what - since we're looking for an id, it's generally a string or a regex.
'foo' would be a string, and would find a span that has an id of 'foo'.
The slashes in this case create a regex, so this finds a span with ctl in its
id (the first one matching will be returned). // in Ruby creates a regular
expression object. You can see this by typing //.class in irb.
- .text - returns the text inside the span that was found
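
Both points can be checked in plain Ruby without Watir. A small sketch: first_match is a made-up helper that only imitates the lookup and is not part of Watir, and 'pageHeader' is an invented id added for contrast, while the other two ids follow the pattern quoted in this thread.

```ruby
# A symbol with a given name is always the same object ...
puts :id.equal?(:id)                             # prints true
# ... while two separately built strings are distinct objects.
puts String.new('id').equal?(String.new('id'))   # prints false

# Slashes build a Regexp, just as quotes build a String.
puts(/ctl/.class)                                # prints Regexp

SPAN_IDS = [
  'pageHeader',
  'rptCourses_ctl00_rptItems_ctl00_lblItemTxtISBN',
  'rptCourses_ctl00_rptItems_ctl01_lblItemTxtISBN'
]

# Stand-in for the lookup: return the first id matching the what
# argument. === means "matches" for a Regexp and "equals" for a String.
def first_match(ids, what)
  ids.find { |id| what === id }
end

puts first_match(SPAN_IDS, /ctl/)         # regex: first id containing ctl
puts first_match(SPAN_IDS, 'pageHeader')  # string: exact match only
```

With the regex, the first id containing ctl wins, which is why browser.span(:id, /ctl/) hands back only one span even when several would match.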



>
> Sorry for being rather dense but I have barely dealt with web
> programming before. I've spent my life doing C++, Java, and BASIC so
> I'm pretty much trying to stumble into a final product as gracefully
> as I can.
>
> Michael
>
>
No worries. I came to Ruby from C/C++, Java, Perl, PHP, etc. The basics
should be easy to learn; the power of some of its features will take some
time to sink in. Hopefully some of this helps.




>
>
> On Jan 3, 12:37 pm, "Charley Baker" <charley.ba...@gmail.com> wrote:
> > Hi there,
> >   I'm not sure what you mean by Ruby and Watir being poorly documented.
> > For Ruby, the first edition of the Pickaxe book, which is comprehensive,
> > is free and available online. There are dozens of other tutorials, sites
> > and blogs about Ruby. Watir also has a lot of examples, a tutorial
> > (http://wiki.openqa.org/display/WTR/Tutorial) and other information on the
> > wiki. If there's something you feel is missing, don't hesitate to suggest
> > it or add it yourself.
> >
> >   Oddly, your example doesn't use Watir at all. If you wanted to use
> > Watir to do the same thing, here are some possibilities:
> >
> > browser.spans.each {|s| puts s.text}   #do something else with the span
> > # in the block if you want - e.g. assign some variables, etc
> > var = browser.span(:id, /ctl/).text       #find the span by a regex and
> > # assign it to a variable
> >
> > An interesting example using hpricot and regexs to find book information
> > - ISBN, price, etc.
> >
> > Scrubyt is another library for screen scraping which internally uses
> > either Firewatir or Mechanize, here's a link to some examples:
> > http://wiki.scrubyt.org/index.php?title=Tutorials
> >
> > HTH,
> >
> > Charley Baker
> > blog:http://charleybakersblog.blogspot.com/
> > Project Manager, Watir,http://wtr.rubyforge.org
> > QA Architect, Gap Inc Direct
> >
> >
> >
> > On Sat, Jan 3, 2009 at 7:12 AM, Bissquitt <bissqu...@gmail.com> wrote:
> >
> > > forgot to include the code I have thus far. (currently not working due
> > > to the Hpricot portion)
> >
> > > excel = WIN32OLE.new("excel.application")
> > > excel.visible = true
> > > workbook = excel.workbooks.open('E:\books\spring 09 classes.xls')
> > > worksheet=workbook.worksheets(1)
> >
> > > contLoop = true
> > > row = 1
> >
> > > while contLoop do
> > >      colVal = worksheet.Cells(row, 'a').Value
> > >      if (colVal) then
> > >          doc = Hpricot(open("http://bookstore.umbc.edu/
> > > SelectCourses.aspx?src=2&type=2&stoid=9&trm=Spring%2009&cid=<
> http://bookstore.umbc.edu/SelectCourses.aspx?src=2&type=2&stoid=9&trm...>
> > > (colVal)"))
> > >          a = doc.search("sp...@id='rptCourses_ctl00_rptItems_ctl\d
> > > \d_lblItemTxtTitle']").inner_text
> > >          worksheet.Cells(row, 'f').value = a
> >
> > >      else
> > >          contLoop = false
> > >      end
> >
> > >      row +=  1
> > >      sleep 1
> > > end
> >
> > > On Jan 3, 8:32 am, Bissquitt <bissqu...@gmail.com> wrote:
> > > > Granted I am new to Watir and Ruby in general but I do have a
> > > > background of programming. My brief experience has been that Watir
> > > > and Ruby are awesome but VERY poorly documented, which is odd
> > > > considering the massive amount of web pages dedicated to it.
> >
> > > > anyway, here is the issue I am having.
> >
> > > > I am trying to screen scrape book information from a college
> > > > bookstores website. My first attempt was php (and I had a full script
> > > > done for it) then realized that the site uses javascript to get info
> > > > from their database and all I was scraping was the static HTML and
> > > > missed the generated stuff I need.
> >
> > > > The script in theory:
> > > > opens an excel document,
> > > > looks at (A1) and goes to "www.website.com/(A1)" where (A1) is a
> > > > course number,
> > > > stores Title, ISBN and other info into B1, C1, D1 etc (I also have to
> > > > take into account more than 1 book per class) though once I get the
> > > > first I should be able to do this.
> > > > goes to (A2) and repeats.
> >
> > > > From what I have seen there are 2 ways to do this each with its own
> > > > problem.
> >
> > > > 1) use hpricot or some other parser to find the proper tag. This has
> > > > 2 issues.
> >
> > > > <span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtISBN">9780324574289</span>
> > > > The second ctl00 iterates to ctl01 for the second book (I am hoping
> > > > I can just use regexp in line)
> >
> > > > The second issue is that I have not been able to figure out how to
> > > > pick out a span tag. There are all sorts of commands for finding
> > > > links and tables and such but I can't figure out how to pick out that
> > > > particular tag (specifically with hpricot)
> >
> > > > 2) Load the entire page into a variable, strip out all new lines and
> > > > tabs, scan entire page for specific regexp
> > > > <span id="rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtTitle"
> > > > style="font-weight:bold;">[^<]+<\/span>
> > > > I know this works, I used rubulator to test it. It returns all titles
> > > > of books on the page. I do foresee an issue of which title belongs to
> > > > which other info if I do it that way though.
> >
> > > > If an exact example is required I can give out all the info you
> > > > require, though I figured it would be more clutter than helpful. An
> > > > actual syntax example would be most helpful rather than just referring
> > > > me to a class definition, though I will take whatever is offered.
> >
> > > > Many thanks,
> > > > Michael
> >
>

