I need to extract the indicated (bold and underlined) numbers from lines
taken from web pages.

Of course, I don't know ahead of time the location or length of each number.
What I do know is the tag that follows it: "Friends", "Reviews", etc. Ideally,
I would end up with:

Value   Variable
108     Friends
151     Reviews
5       Review Updates
NA      First            <-- assuming here that "First" did not show up on any line
etc.
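
One possible approach, given as a minimal sketch rather than a definitive answer: since the tags are known in advance, a regular expression can capture the run of digits immediately before each tag and fall back to NA when the tag never appears. The sample lines are abbreviated from the problem lines (the userid is replaced by a placeholder X), and the names `page_lines`, `tags`, and `extract_count` are hypothetical, introduced only for illustration.

```r
# Sample input, abbreviated from the problem lines (userid replaced by "X")
page_lines <- c(
  "\t\t\t<li id=\"friendCount\"><a href=\"/user_details_friends?userid=X\">108 Friends</a></li>",
  "\t\t\t<li id=\"reviewCount\"><a href=\"/user_details_reviews_self?userid=X\">151 Reviews</a></li>",
  "\t\t\t\t<li id=\"updatesCount\">5 Review Updates</li>"
)

# Tags known ahead of time; "First" is deliberately absent from page_lines
tags <- c("Friends", "Reviews", "Review Updates", "First")

# Return the integer immediately preceding `tag`, or NA if no line contains it
extract_count <- function(tag, x) {
  pattern <- paste0("([0-9]+)[[:space:]]+", tag, "\\b")
  m <- regmatches(x, regexec(pattern, x))
  hits <- Filter(function(v) length(v) > 0, m)   # keep only lines that matched
  if (length(hits) == 0) return(NA_integer_)
  as.integer(hits[[1]][2])                       # element 2 is the captured number
}

counts <- vapply(tags, extract_count, integer(1), x = page_lines)
result <- data.frame(Value = unname(counts), Variable = tags)
print(result)
```

This sketch assumes each tag appears at most once and that the tags contain no regular-expression metacharacters; only the first matching line is used for each tag.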

Line [7] is particularly troublesome: it requires extracting three numbers,
2022 (Useful), 1591 (Funny), and 1756 (Cool).
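
For a line holding several number/label pairs, one option (again only a sketch) is to collect every "digits, whitespace, word" match with `gregexpr()` and then split each match into its two parts. The variable names `line7`, `pairs`, `parts`, and `votes` are hypothetical. The approach assumes that digits elsewhere in the line, such as those inside the image URL, are never followed by whitespace and a word.

```r
# Line [7] as a single string
line7 <- paste0(
  "<p id=\"review_votes\" class=\"smaller\"><img ",
  "src=\"http://s3-media2.ak.yelpcdn.com/assets/0/www/img/cf265851428e/ico/reviewVotes.gif\" ",
  "alt=\"\"> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>"
)

# Find every "<digits><whitespace><word>" occurrence in the line
pairs <- regmatches(line7, gregexpr("[0-9]+[[:space:]]+[A-Za-z]+", line7))[[1]]

# Split each match into its number and its label
parts <- strsplit(pairs, "[[:space:]]+")
votes <- data.frame(
  Value    = as.integer(vapply(parts, `[`, character(1), 1)),
  Variable = vapply(parts, `[`, character(1), 2)
)
print(votes)
```

Because the pattern keeps only a single word after each number, multi-word labels such as "Review Updates" would be truncated; for those, matching against the known tags is the safer choice.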
============== Extraction problem lines ===========

[1] "\t\t\t<li id=\"friendCount\"><a href=\"/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ\">108 Friends</a></li>"

[2] "\t\t\t<li id=\"reviewCount\"><a href=\"/user_details_reviews_self?userid=--T8djg0nrb_yMMMA3Y0jQ\">151 Reviews</a></li>"

[3] "\t\t\t\t<li id=\"updatesCount\">5 Review Updates</li>"

[4] "\t\t\t\t<li id=\"ftrCount\"><a href=\"/user_details_reviews_self?review_filter=first&amp;userid=--T8djg0nrb_yMMMA3Y0jQ\">1 First</a></li>"

[5] "\t\t\t\t<li id=\"fanCount\">2 Fans</li>"

[6] "\t\t\t\t<li id=\"localPhotoCount\"><a href=\"/user_local_photos?userid=--T8djg0nrb_yMMMA3Y0jQ\">54 Local Photos</a></li>"

[7] <p id="review_votes" class="smaller"><img src="http://s3-media2.ak.yelpcdn.com/assets/0/www/img/cf265851428e/ico/reviewVotes.gif" alt=""> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
