Yes, you are getting into the world of custom code that purges all the interface controls from a page. This is sometimes called 'screen-scraping'. Several web sites use Flash or Java to prevent it, especially for copyrighted images and graphics.
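On the curly quotes, here is a minimal sketch of the kind of conversion involved. It assumes Mac-encoded text, where numToChar(210) and numToChar(211) are the opening and closing curly double quotes. The handler name straightenQuotes is made up for illustration, and theSearchResult is simply the variable name from your script. Treat it as a sketch under those assumptions, not the exact code from either of our scripts.

```livecode
function straightenQuotes pText
   -- ASSUMPTION: Mac-encoded text, so 210 and 211 are the curly double quotes
   replace numToChar(210) with quote in pText
   replace numToChar(211) with quote in pText
   return pText
end straightenQuotes

-- usage, e.g. before splitting the text on quote:
--   put straightenQuotes(theSearchResult) into theSearchResult
--   set the itemDelimiter to quote
```

With every quote character normalized to char(34) first, items delimited by quote come out the same no matter how the page author typed them.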
I have several custom solutions that adjust for individual site formats.

The reason for converting the curly quotes is that you were using 'quote' as an item delimiter. Some authors tweak their code in Microsoft Word or another word processor and then copy and paste it, which mixes in quote characters that are not char(34) where you expect plain quotes.

Hope this helps

Jim Ault
Las Vegas

-----------
On 11/16/05 2:52 AM, "Thomas Fischer" <[EMAIL PROTECTED]> wrote:

> Hi Jim,
>
> thank you for the hints.
>
>> Thomas, another technique you might try is locating and using the href =
>> “www.url.com” string.
>> This may not be suitable for your purpose, however.
>> It also does not explain the debugger anomaly you saw.
> Yes, I am still waiting for an explanation of that one.
>
>> ...
>> This assumes there could be various forms of HTML code, where the
>> programmer uses returns to make it look good to the eye,
>> but the browser simply ignores them.
>> Thus the following (3) URL variations are identical to the browser
>> ...
>
> If this is an arbitrary web page, things can become fairly complicated. I
> don't know what the correct rules are for the interpretation of a line break
> in the html source.
> I know that
> <A HREF="www.bigbusiness.com/
> product334.html">
> as well as
> <A
> HREF="www.bigbusiness.com/product334.html">
> are interpreted as
> <A HREF="www.bigbusiness.com/product334.html">
> so a line break may be interpreted as empty or as space.
>
> In my case I am looking into Google results, which are pretty standardized,
> and I don't want _all_ links, but only those to the found pages. And these
> tend to be the first word in quotes after "<p class=g>".
>
> In the general setting one would have to gather examples of the weird things
> that may happen, but in any case one would have to get rid of returns and
> extract the <BASE HREF="..."> information if present.
>
> From my experience
> - no need to worry about numtochar(210) and numtochar(211), these are
> interpreted as characters, not as quotes
> - but there may be links with no quotes at all (will work with Firefox
> anyway).
>
> For bulk processing (e.g. harvesting entire web sites) I would shy away from
> regular expressions (unless speed is improved dramatically) and try something
> like
>
> replace numToChar(10) with empty in theSearchResult
> replace numToChar(13) with empty in theSearchResult
> replace "href =" with return in theSearchResult
> replace "href=" with return in theSearchResult
> -- (replace is case insensitive by default)
> repeat for each line myLine in theSearchResult
>    put word 1 of myLine & return after foundURLs
> end repeat
>
> And by the way, I think that something along these lines will be a better
> solution to my first problem as well, getting rid of any array.
>
> All the best
> Thomas
>
> _______________________________________________
> use-revolution mailing list
> use-revolution@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription
> preferences:
> http://lists.runrev.com/mailman/listinfo/use-revolution