Re: Speeding up get URL

2008-08-05 Thread Dave Cragg


On 4 Aug 2008, at 13:00, Alex Tweedly wrote:


You should be able to achieve that using 'load URL' - set off a  
number of 'load's going and then by checking the URLstatus you can  
process them as they have finished arriving to your machine; and as  
the number of outstanding requested URLs decreases, set off the next  
batch of 'load's.


In this case, Alex, I don't think there will be a noticeable  
difference. Although "load" can make simultaneous requests, requests  
to the same domain are queued, and are sent in turn after the previous  
request returns. As "get" will also re-use open connections, I think  
the result will be pretty much the same.


I like the idea of running a script on the server. (I'm behind a  
fairly slow DSL connection.)


Cheers
Dave




Re: Speeding up get URL

2008-08-05 Thread Alex Tweedly

Sarah Reichelt wrote:

On Mon, Aug 4, 2008 at 12:35 AM, Shari <[EMAIL PROTECTED]> wrote:
  

Goal:  Get a long list of website URLS, parse a bunch of data from each
page, if successful delete the URL from the list, if not put the URL on a
different list.  I've got it working but it's slow.  It takes about an hour
per 10,000 urls.  I sell tshirts.  Am using this to create informational
files for myself which will be frequently updated.  I'll probably be running
this a couple times a month and expect my product line to just keep on
growing.  I'm currently at about 40,000 products but look forward to the day
of hundreds of thousands :-)  So speed is my need... (Yes, if you're
interested my store is in the signature, opened it last December :-)

How do I speed this up?



Shari, I think the delay will be due to the connection to the server,
not your script, so there may not be a lot you can do about it.

I did have one idea: can you try getting more than one URL at the same
time? If you build a list of the URLs to check, then have a script
that grabs the first one on the list, and sends a non-blocking request
to that site, with a message to call when the data has all arrived.
While waiting, start loading the next site and so on. Bookmark
checking software seems to work like this.

  
You should be able to achieve that using 'load URL' - set off a number 
of 'load's going and then by checking the URLstatus you can process them 
as they have finished arriving to your machine; and as the number of 
outstanding requested URLs decreases, set off the next batch of 'load's.
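
In script terms, that batching idea might look roughly like this (an untested sketch: the handler names, the batch size of 8, and parsePage - a stand-in for Shari's existing parsing code - are all invented):

   on fetchAll pToDoList
      local tPending, tFailed, tURL
      repeat for each line tURL in pToDoList
         load URL tURL
         put tURL & cr after tPending
         -- once enough loads are outstanding, service them before queuing more
         repeat while the number of lines of tPending >= 8
            servicePending tPending, tFailed
         end repeat
      end repeat
      repeat while tPending is not empty -- drain the last few
         servicePending tPending, tFailed
      end repeat
      return tFailed -- the URLs that never arrived
   end fetchAll

   on servicePending @pPending, @pFailed
      local tStill, tURL, tPage
      repeat for each line tURL in pPending
         if the URLStatus of tURL is "cached" then
            put URL tURL into tPage -- comes from the local cache, no second hit
            unload URL tURL
            parsePage tPage -- "do the stuff"
         else if the URLStatus of tURL is "error" \
               or the URLStatus of tURL is "timeout" then
            put tURL & cr after pFailed
         else
            put tURL & cr after tStill -- still queued or downloading
         end if
      end repeat
      put tStill into pPending
      wait 50 milliseconds with messages -- let the downloads progress
   end servicePending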


But the likelihood is that this would only make a small difference - the 
majority of the time is probably due to either the server response times 
and/or the delay in simply downloading all those bytes to your machine.  
Out of interest I'd be inclined to count the number of bytes transferred 
per URL and see if that is a significant percentage of your connection 
capacity.
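
One rough way to get that number would be to time a small sample first, along these lines (untested sketch; the handler name, the sample-size parameter and the answer dialog are just for illustration):

   on measureSample pURLList, pSampleSize
      local tStart, tTotalBytes, tPage
      put the milliseconds into tStart
      put 0 into tTotalBytes
      repeat with i = 1 to pSampleSize
         put URL (line i of pURLList) into tPage
         add the length of tPage to tTotalBytes -- bytes fetched so far
      end repeat
      answer tTotalBytes div pSampleSize && "bytes per page," && \
            round(tTotalBytes / ((the milliseconds - tStart) / 1000)) && "bytes per second"
   end measureSample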


Are you running these from a machine behind a (relatively) slow Internet 
connection, such as a DSL or cable modem ?
If so, you might get a big improvement by converting the script into a 
CGI script, and running it on your own web-hosting server; that would 
give you an effective bandwidth based on the ISP, rather than on a slow 
DSL-like connection. (I have a vaguely similar script I run from my site 
that is approx 1000x faster than running it from home on an 8Mb/s 
DSL - the lower latency helps as much as the increased bandwidth). But 
beware - if there are any issues with looking like a DoS attack, or 
sending too many requests per second, this might be much more likely to 
trigger them; you may also run into issues with usage of CPU and/or 
bandwidth on your hosting-ISP.
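
For what it's worth, a Rev CGI version of the fetch loop can be fairly small. Something along these lines (untested sketch: the engine path on the first line, the file names and parsePage are all guesses that would need adapting to the host):

   #!/usr/local/bin/revolution -ui
   on startup
      put "Content-Type: text/plain" & cr & cr -- HTTP header, then a blank line
      put URL "file:todo.txt" into tToDoList -- uploaded to the server beforehand
      repeat for each line tURL in tToDoList
         put URL tURL into tPage
         put parsePage(tPage) & cr after tReport -- "do the stuff" server-side
      end repeat
      put tReport into URL "file:report.txt" -- download this one file afterwards
      put "processed" && the number of lines of tToDoList && "URLs"
   end startup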

Would opening a socket and reading from the socket be any faster? I
don't imagine that it would be, but it might be worth checking.

The other option is just to adjust things so it is not intrusive e.g.
have it download the sites overnight and save them all for processing
when you are ready, or have a background app that does the downloading
slowly (so it doesn't overload your system).
  
On that same idea, but taking it further (maybe too far) - how 
absolutely up-to-date does the info need to be when you run the script ?
Could you process a few thousand URLs per night, caching either the URLs 
as files locally, or caching the extracted data from them. Then when you 
want to run your script, you use all the cached data - so some of it is 
right up to date, while other parts may be up to a few days old.  You 
may also know, or be able to find out, which of the URLs tend to change 
frequently, and therefore bias the background processing accordingly.
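
A crude version of that nightly caching could be as simple as this (untested sketch; the cache folder location and the per-night limit are arbitrary choices):

   on cacheSomePages pURLList, pMaxTonight
      local tFolder, tFile, tDone, tURL
      put specialFolderPath("documents") & "/pagecache" into tFolder
      if there is not a folder tFolder then create folder tFolder
      put 0 into tDone
      repeat for each line tURL in pURLList
         put tFolder & "/" & urlEncode(tURL) & ".html" into tFile
         if there is a file tFile then next repeat -- already cached
         put URL tURL into URL ("file:" & tFile)
         add 1 to tDone
         if tDone >= pMaxTonight then exit repeat
      end repeat
      -- (to refresh stale pages, delete cache files older than a few days first)
   end cacheSomePages

The main run would then read pages from the cache folder and only hit the web for anything missing or known to be stale.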




And, finally, a couple of trivial issues.



# toDoList needs to have the successful URLs deleted, and failed URLs moved to a different list
# that's why p down to 1, for the delete
# URLS are standard http://www.somewhere.com/somePage

   repeat with p = the number of lines of toDoList down to 1
      put url (line p of toDoList) into tUrl
      # don't want to use *it* because there's a long script that follows
      # *it* is too easily changed, though I've heard *it* is faster than *put*
      # do the stuff
      if doTheStuffWorked then
         delete line p of toDoList
      else put p & return after failedList
      updateProgressBar # another slowdown but necessary, gives a count of how many left to do
   end repeat

I don't fully understand this (??). What you describe is doing BOTH 
delete the successful ones, and ALSO save the failed ones - so at the 
end, toDoList should finish up the same as failedList. But what your 
pseudo-code actually does is save the indexes of the failed URLs - which 
become invalid once you delete lower numbered lines; I think you 
intended to do

else put (line p of toDoList) after failedList

If you are saving the failedList, then there is no need to modify the 
toDoList - so I'd simply c

Re: Speeding up get URL

2008-08-05 Thread Brian Yennie

Shari,

I'm not sure there is much you can do to speed up the fetching of  
URLs, but my two suggestions would be:


1) See if you can process more than one download at a time - this will  
be more complex to code, but may be a bit faster so that 1 slow  
download doesn't affect another. Of course it will still scale just as  
poorly.


2) Assuming there isn't much you can do to download the pages faster,  
perhaps you could look into having the script run on a schedule at odd  
hours when you are not in front of the computer. For example, if it  
ran incrementally downloading pages for an hour every night at 2AM you  
might not notice all of the time going in to it.


Hope that helps a little.

I'd do that in a heartbeat if they had a way.  They used to, but at  
this time the only offering they have is for affiliates, and it has  
severe limitations.  I just got done checking it out and it isn't  
designed for what I need.  I might be able to "fudge" it and I will  
give fudging a try.  But if the fudge fails I'm back to the way that  
works.  Though based on some math calcs the fudge could work out  
faster... worth a look.


What's frustrating is that there is a way and a whole bunch of  
shopkeepers have access to that way, but newer shopkeepers do not  
have access.


Shari


One suggestion is to send an email to the support group for the  
"one domain"
and ask if there is a better way to get the info you want.  You  
would have
to say you are willing to register and play by their rules, but  
this could
get all the data you want in 20-30 minutes.  They might even allow  
you to

download one file from their FTP site.

If you are in the business of making more money for them, they will  
likely

help you.  They may already have such a service.

Jim Ault
Las Vegas



--
Dogs and bears, sports and cars, and patriots t-shirts
http://www.villagetshirts.com
WlND0WS and MAClNT0SH shareware games
http://www.gypsyware.com






Re: Speeding up get URL

2008-08-04 Thread Shari

Cable modem, yes.  CGI, I don't know a word of the language.

So even your hosting ISP can get involved?  Lordy,  I had no idea 
there were so many pitfalls.  I can understand their issues however, 
knowing how much spam I get from people who are probably using 
similar searches for bad things.  It's disgusting to open your 
mailbox and have hundreds of junk mail just to get to the few genuine 
ones.  Opens up the whole issue, how do you allow someone who isn't 
using it for a bad purpose versus someone who is.  Do you disallow 
all?  Do you allow all?  Or do you find some way for folks to 
register and "show" what they're doing with it?  So yes, I do see 
their point and it is a valid one.  I don't mind registering with 
Google or Yahoo if their API's do the thing.


It's a lot like the whole shareware issue.  Once upon a time you 
could put it out there and hope for honest folks to throw you a bone. 
The truth is that even honest folks don't give a thought to sharing 
software with their friends and family, they don't even think about 
it.  So you MUST build in some incentive if you want to get paid. 
Whole studies have been done on this and in its weird way it's 
similar.  The folks who would have paid anyway sometimes get offended 
by whatever incentives are built in to ensure you get your bones.


By that token the folks using searches for "honest" purposes versus 
those using them for spam and so forth... never considered the 
possibility of being mistaken for a bad guy.


Good catch on the "delete from success list" by the way.  I missed 
that, might not be a huge difference but when the list is very long 
it could be.  Actually, if delete were not needed I could switch to 
"repeat for each" which is always faster.


Every little speed up helps it along :-)

Shari


Are you running these from a machine behind a (relatively) slow 
Internet connection, such as a DSL or cable modem ?
If so, you might get a big improvement by converting the script into 
a CGI script, and running it on your own web-hosting server; that 
would give you an effective bandwidth based on the ISP, rather than 
on a slow DSL-like connection. (I have a vaguely similar script I 
run from my site that is approx 1000x faster than running it 
from home on an 8Mb/s DSL - the lower latency helps as much as the 
increased bandwidth). But beware - if there are any issues with 
looking like a DoS attack, or sending too many requests per second, 
this might be much more likely to trigger them; you may also run 
into issues with usage of CPU and/or bandwidth on your hosting-ISP.



--
  Dogs and bears, sports and cars, and patriots t-shirts
  http://www.villagetshirts.com
 WlND0WS and MAClNT0SH shareware games
 http://www.gypsyware.com


Re: Speeding up get URL

2008-08-04 Thread Shari
Search engines have API's?  I did not know that.  I will definitely 
look into this.  I didn't realize I had so many different options to 
choose from.  Options are good, very good indeed :-)


Thank you!

Shari



I believe most of the major search engines have APIs for returning search
results as XML.  I certainly used Yahoo to do this before.

You'd need to look at the APIs for search to see if they contain features
that would work for the kind of information you're trying to find.

Bernard



--
  Dogs and bears, sports and cars, and patriots t-shirts
  http://www.villagetshirts.com
 WlND0WS and MAClNT0SH shareware games
 http://www.gypsyware.com


Re: Speeding up get URL

2008-08-04 Thread Mark Smith
Very good point about doing it from a remote server - if the speed  
difference were great, then an hourly-paid Amazon EC2 server might be  
just the job...


Mark

On 4 Aug 2008, at 13:13, Alex Tweedly wrote:


If so, you might get a big improvement by converting the script  
into a CGI script, and running it on your own web-hosting server;  
that would give you an effective bandwidth based on the ISP, rather  
than on a slow DSL-like connection. (I have a vaguely similar  
script I run from my site that is approx 1000x

...


Re: Speeding up get URL

2008-08-04 Thread Bernard Devlin
I think you've had a lot of good suggestions for solving this problem.
However, depending on the kind of data you're trying to parse out (and the
frequency with which that data changes), you might be better to let Google
or Yahoo do the search (using the kind of advanced search like:

"some meaningful phrase" OR "another meaningful phrase" allinurl:
somedomain.com

The search engine would then return the results to you, and you could then
proceed to download the actual pages that match, and parse them further.

I believe most of the major search engines have APIs for returning search
results as XML.  I certainly used Yahoo to do this before.

You'd need to look at the APIs for search to see if they contain features
that would work for the kind of information you're trying to find.
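
As a rough illustration only - the endpoint below is a placeholder, not any engine's real API address, so the actual query format has to come from the engine's developer documentation:

   put urlEncode(quote & "some meaningful phrase" & quote && \
         "allinurl:somedomain.com") into tQuery
   put "http://api.searchengine.example/search?output=xml&q=" & tQuery into tRequest
   put URL tRequest into tXML
   put revXMLCreateTree(tXML, false, true, false) into tTree
   -- walk the tree (revXMLFirstChild, revXMLNodeContents, etc.) to pull out
   -- the matching page URLs, then download and parse just those pages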

Bernard


Re: Speeding up get URL

2008-08-04 Thread Rick Harrison


On Aug 4, 2008, at 4:17 AM, Shari wrote:

One service provider that I extract data from does not want more  
than one
hit every 50 seconds in order to be of service to hundreds of  
simultaneous
users, so they protect themselves from "denial of service attacks"  
that

overload their machines.



Hi Shari,

What URL are you harvesting all the data from?

Most affiliate type websites usually don't want users
doing this activity.  If you've found one that allows it
we'd like to know more about it!

Thanks in advance!

Rick


Re: Speeding up get URL

2008-08-04 Thread Alex Tweedly
Sorry if this message comes through twice - first attempt might have 
failed, so I'm resending from a different account.


Sarah Reichelt wrote:

On Mon, Aug 4, 2008 at 12:35 AM, Shari <[EMAIL PROTECTED]> wrote:
  

Goal:  Get a long list of website URLS, parse a bunch of data from each
page, if successful delete the URL from the list, if not put the URL on a
different list.  I've got it working but it's slow.  It takes about an hour
per 10,000 urls.  I sell tshirts.  Am using this to create informational
files for myself which will be frequently updated.  I'll probably be running
this a couple times a month and expect my product line to just keep on
growing.  I'm currently at about 40,000 products but look forward to the day
of hundreds of thousands :-)  So speed is my need... (Yes, if you're
interested my store is in the signature, opened it last December :-)

How do I speed this up?



Shari, I think the delay will be due to the connection to the server,
not your script, so there may not be a lot you can do about it.

I did have one idea: can you try getting more than one URL at the same
time? If you build a list of the URLs to check, then have a script
that grabs the first one on the list, and sends a non-blocking request
to that site, with a message to call when the data has all arrived.
While waiting, start loading the next site and so on. Bookmark
checking software seems to work like this.

  
You should be able to achieve that using 'load URL' - set off a number 
of 'load's going and then by checking the URLstatus you can process them 
as they have finished arriving to your machine; and as the number of 
outstanding requested URLs decreases, set off the next batch of 'load's.


But the likelihood is that this would only make a small difference - the 
majority of the time is probably due to either the server response times 
and/or the delay in simply downloading all those bytes to your machine.  
Out of interest I'd be inclined to count the number of bytes transferred 
per URL and see if that is a significant percentage of your connection 
capacity.


Are you running these from a machine behind a (relatively) slow Internet 
connection, such as a DSL or cable modem ?
If so, you might get a big improvement by converting the script into a 
CGI script, and running it on your own web-hosting server; that would 
give you an effective bandwidth based on the ISP, rather than on a slow 
DSL-like connection. (I have a vaguely similar script I run from my site 
that is approx 1000x faster than running it from home on an 8Mb/s 
DSL - the lower latency helps as much as the increased bandwidth). But 
beware - if there are any issues with looking like a DoS attack, or 
sending too many requests per second, this might be much more likely to 
trigger them; you may also run into issues with usage of CPU and/or 
bandwidth on your hosting-ISP.

Would opening a socket and reading from the socket be any faster? I
don't imagine that it would be, but it might be worth checking.

The other option is just to adjust things so it is not intrusive e.g.
have it download the sites overnight and save them all for processing
when you are ready, or have a background app that does the downloading
slowly (so it doesn't overload your system).
  
On that same idea, but taking it further (maybe too far) - how 
absolutely up-to-date does the info need to be when you run the script ?
Could you process a few thousand URLs per night, caching either the URLs 
as files locally, or caching the extracted data from them. Then when you 
want to run your script, you use all the cached data - so some of it is 
right up to date, while other parts may be up to a few days old.  You 
may also know, or be able to find out, which of the URLs tend to change 
frequently, and therefore bias the background processing accordingly.




And, finally, a couple of trivial issues.



# toDoList needs to have the successful URLs deleted, and failed URLs moved to a different list
# that's why p down to 1, for the delete
# URLS are standard http://www.somewhere.com/somePage

   repeat with p = the number of lines of toDoList down to 1
      put url (line p of toDoList) into tUrl
      # don't want to use *it* because there's a long script that follows
      # *it* is too easily changed, though I've heard *it* is faster than *put*
      # do the stuff
      if doTheStuffWorked then
         delete line p of toDoList
      else put p & return after failedList
      updateProgressBar # another slowdown but necessary, gives a count of how many left to do
   end repeat

I don't fully understand this (??). What you describe is doing BOTH 
delete the successful ones, and ALSO save the failed ones - so at the 
end, toDoList should finish up the same as failedList. But what your 
pseudo-code actually does is save the indexes of the failed URLs - which 
become invalid once you delete lower numbered lines; I think you 
intended to do

else put (line p of toDoList) after failedList

Re: Speeding up get URL

2008-08-04 Thread Shari

One service provider that I extract data from does not want more than one
hit every 50 seconds in order to be of service to hundreds of simultaneous
users, so they protect themselves from "denial of service attacks" that
overload their machines.


I did notice that even with their affiliate XML file access there's a 
mention of "too many requests per second" which would produce an 
error.  So they probably do have some sort of built in safety valve.


By the way, thank you Jim for prodding me to look into their back 
door.  I will be getting some of the data that way.


Shari
--
  Dogs and bears, sports and cars, and patriots t-shirts
  http://www.villagetshirts.com
 WlND0WS and MAClNT0SH shareware games
 http://www.gypsyware.com


Re: Speeding up get URL

2008-08-04 Thread Shari
Good suggestions, Sarah.  Thank you!  I've settled on a solution 
that's going to partly go in the back door (retrieving their XML data 
via their affiliate door) and partly go in the front door (get or 
load url).  So I'll parse what I can from their affiliate XML files 
and do the rest the other way.  That should cut down on the load to 
them, speed it up, and still give me all the data I'm seeking.  It 
should cut the pages for get/load url down to a small percentage of 
the total retrieve.


I'm excited for the end result as this is going to really be 
beneficial to my business :-)


Two more days to hopefully find out if I sold Herman Silver on their 
own copy of Rev... It kills me to do things in Excel there that could 
be done sooo much better with Rev... Excel is so limiting when you 
have too many sheets tying in to other sheets I've discovered.  And 
all day long I'm thinking gee, this would be so much better with 
Revolution.


:-)
Shari



Shari, I think the delay will be due to the connection to the server,
not your script, so there may not be a lot you can do about it.

I did have one idea: can you try getting more than one URL at the same
time? If you build a list of the URLs to check, then have a script
that grabs the first one on the list, and sends a non-blocking request
to that site, with a message to call when the data has all arrived.
While waiting, start loading the next site and so on. Bookmark
checking software seems to work like this.

Would opening a socket and reading from the socket be any faster? I
don't imagine that it would be, but it might be worth checking.

The other option is just to adjust things so it is not intrusive e.g.
have it download the sites overnight and save them all for processing
when you are ready, or have a background app that does the downloading
slowly (so it doesn't overload your system).

Cheers,
Sarah



--
  Dogs and bears, sports and cars, and patriots t-shirts
  http://www.villagetshirts.com
 WlND0WS and MAClNT0SH shareware games
 http://www.gypsyware.com


Re: Speeding up get URL

2008-08-03 Thread Jim Ault
The major limitation for your case is that each request sent to a web server
is dependent on the response time from that web server.  Some servers
intentionally return with a delay to control bandwidth demands and load
balancing, especially if some of their hosted customers are downloading
videos or games or music or flash files.

You could add a timer to your process and find out which sites return the
slowest, but this may not be true every time you access the same site.

The rate of 10,000 per hour sounds about right since most product pages on
the internet are served from a database and are not static pages.

One service provider that I extract data from does not want more than one
hit every 50 seconds in order to be of service to hundreds of simultaneous
users, so they protect themselves from "denial of service attacks" that
overload their machines.

One of my hosting companies does not charge extra for bandwidth, but has set
the load balancing so that I could not serve videos effectively.  Works
great for blogs and low-volume pages.  Not good for music and games.

I would just let the app run overnight since you are only doing it a couple
times a month.  Of course you could run the app on two computers to double
the speed, then merge the results.

Another step might be to see if the sites have a listing page with your
desired data.  This might produce 10-50 products on one page rather than
10-50 different web pages.

Jim Ault
Las Vegas


On 8/3/08 7:35 AM, "Shari" <[EMAIL PROTECTED]> wrote:

> Goal:  Get a long list of website URLS, parse a bunch of data from
> each page, if successful delete the URL from the list, if not put the
> URL on a different list.  I've got it working but it's slow.  It
> takes about an hour per 10,000 urls.  I sell tshirts.  Am using this
> to create informational files for myself which will be frequently
> updated.  I'll probably be running this a couple times a month and
> expect my product line to just keep on growing.  I'm currently at
> about 40,000 products but look forward to the day of hundreds of
> thousands :-)  So speed is my need... (Yes, if you're interested my
> store is in the signature, opened it last December :-)
> 
> How do I speed this up?
> 
> # toDoList needs to have the successful URLs deleted, and failed URLs
> moved to a different list
> # that's why p down to 1, for the delete
> # URLS are standard http://www.somewhere.com/somePage
> 
>repeat with p = the number of lines of toDoList down to 1
>put url (line p of toDoList) into tUrl
># don't want to use *it* because there's a long script that follows
># *it* is too easily changed, though I've heard *it* is faster than
> *put*
># do the stuff
>if doTheStuffWorked then
>   delete line p of toDoList
>else put p & return after failedList
>updateProgressBar # another slowdown but necessary, gives a
> count of how many left to do
> end repeat




Re: Speeding up get URL

2008-08-03 Thread Sarah Reichelt
On Mon, Aug 4, 2008 at 12:35 AM, Shari <[EMAIL PROTECTED]> wrote:
> Goal:  Get a long list of website URLS, parse a bunch of data from each
> page, if successful delete the URL from the list, if not put the URL on a
> different list.  I've got it working but it's slow.  It takes about an hour
> per 10,000 urls.  I sell tshirts.  Am using this to create informational
> files for myself which will be frequently updated.  I'll probably be running
> this a couple times a month and expect my product line to just keep on
> growing.  I'm currently at about 40,000 products but look forward to the day
> of hundreds of thousands :-)  So speed is my need... (Yes, if you're
> interested my store is in the signature, opened it last December :-)
>
> How do I speed this up?

Shari, I think the delay will be due to the connection to the server,
not your script, so there may not be a lot you can do about it.

I did have one idea: can you try getting more than one URL at the same
time? If you build a list of the URLs to check, then have a script
that grabs the first one on the list, and sends a non-blocking request
to that site, with a message to call when the data has all arrived.
While waiting, start loading the next site and so on. Bookmark
checking software seems to work like this.

Would opening a socket and reading from the socket be any faster? I
don't imagine that it would be, but it might be worth checking.

The other option is just to adjust things so it is not intrusive e.g.
have it download the sites overnight and save them all for processing
when you are ready, or have a background app that does the downloading
slowly (so it doesn't overload your system).

Cheers,
Sarah


Re: Speeding up get URL

2008-08-03 Thread Shari
I'd do that in a heartbeat if they had a way.  They used to, but at 
this time the only offering they have is for affiliates, and it has 
severe limitations.  I just got done checking it out and it isn't 
designed for what I need.  I might be able to "fudge" it and I will 
give fudging a try.  But if the fudge fails I'm back to the way that 
works.  Though based on some math calcs the fudge could work out 
faster... worth a look.


What's frustrating is that there is a way and a whole bunch of 
shopkeepers have access to that way, but newer shopkeepers do not 
have access.


Shari



One suggestion is to send an email to the support group for the "one domain"
and ask if there is a better way to get the info you want.  You would have
to say you are willing to register and play by their rules, but this could
get all the data you want in 20-30 minutes.  They might even allow you to
download one file from their FTP site.

If you are in the business of making more money for them, they will likely
help you.  They may already have such a service.

Jim Ault
Las Vegas



--
  Dogs and bears, sports and cars, and patriots t-shirts
  http://www.villagetshirts.com
 WlND0WS and MAClNT0SH shareware games
 http://www.gypsyware.com


Re: Speeding up get URL

2008-08-03 Thread Shari

Noel,

I've done a bit of research and I don't think they have such issues. 
Several folks are doing similar things very publicly (the website is 
aware of it) and it doesn't seem to be a problem.  Usually if 
something is disallowed you'll find it referenced very clearly in 
their user forums.


For instance, images and words that are disallowed on your products :-)

If somebody does something and it's considered bad by the company, 
it's usually well documented for others to find and read about.


So I think I'm okay along this path.  We're all in the business to 
make money and if we, the shopkeepers, find ways to improve our 
sales, then the company that supports us makes more money, too :-)


If they were smart they'd be offering us built in tools so we 
wouldn't have to roll our own.  But you know, there's nothing better 
than Rev for rolling your own!


Keep that doggie rolling...

Shari


Yes, something like what you are describing could easily be confused 
with a DOS attack.


DOS attacks are done by flooding a server with requests for webpages 
to the point that the server crashes due to its inability to process 
all the requests.


Even if you are not considered a DOS attack, the company may not 
appreciate the bandwidth that is being used for you to continually 
index their site and may deny your IP address access at some point.


 - Noel



--
  Dogs and bears, sports and cars, and patriots t-shirts
  http://www.villagetshirts.com
 WlND0WS and MAClNT0SH shareware games
 http://www.gypsyware.com


Re: Speeding up get URL

2008-08-03 Thread Jim Ault
Noel is correct.

Even Google will ban IP addresses of those machines that will execute too
many searches in a short time.  One answer is to use proxy servers, but that
is a more complex process.

One suggestion is to send an email to the support group for the "one domain"
and ask if there is a better way to get the info you want.  You would have
to say you are willing to register and play by their rules, but this could
get all the data you want in 20-30 minutes.  They might even allow you to
download one file from their FTP site.

If you are in the business of making more money for them, they will likely
help you.  They may already have such a service.

Jim Ault
Las Vegas

On 8/3/08 9:56 AM, "Noel" <[EMAIL PROTECTED]> wrote:

> Yes, something like what you are describing could easily be confused
> with a DOS attack.
> 
> DOS attacks are done by flooding a server with requests for webpages
> to the point that the server crashes due to its inability to process
> all the requests.
> 
> Even if you are not considered a DOS attack, the company may not
> appreciate the bandwidth that is being used for you to continually
> index their site and may deny your IP address access at some point.
> 
>   - Noel
> 
> At 10:38 AM 8/3/2008, you wrote:
>> It's always one domain, the same domain, and I have no control over
>> the domain or its hosting company.  The domain itself probably has
>> millions of pages.  Anybody can sell products thru them, and they
>> make it very easy to do so.  So there are probably thousands (or
>> more) folks with massive quantities of products for sale.
>> 
>> The worst thing will be if 40,000 products takes 4.5 hours (my last
>> two runs confirmed that), once it gets to hundreds of thousands it
>> could take *days*.  That would be bad.
>> 
>> I'm having other options for partial runs, but there's always got to
>> be the ability for a full run.
>> 
>> I'm not familiar with the ins and outs of a DOS attack, you mean
>> something like this could be confused for one?
>> 
>>> Just a thought...  One factor might be if your list has the same domain
>>> appearing as a contiguous block, the web server may be detecting that you
>>> are not a human browsing and slow down the transfer rate.  One of my hosting
>>> companies does this because they had bad experiences with denial of service
>>> attacks.
>>> 
>>> Hope this helps.
>>> 
>>> Jim Ault
>>> Las Vegas
>> 
>> 
>> --
>>   Dogs and bears, sports and cars, and patriots t-shirts
>>   http://www.villagetshirts.com
>>  WlND0WS and MAClNT0SH shareware games
>>  http://www.gypsyware.com
> 




Re: Speeding up get URL

2008-08-03 Thread Noel
Yes, something like what you are describing could easily be confused 
with a DOS attack.


DOS attacks are done by flooding a server with requests for webpages 
to the point that the server crashes due to its inability to process 
all the requests.


Even if you are not considered a DOS attack, the company may not 
appreciate the bandwidth that is being used for you to continually 
index their site and may deny your IP address access at some point.


 - Noel

At 10:38 AM 8/3/2008, you wrote:
It's always one domain, the same domain, and I have no control over 
the domain or its hosting company.  The domain itself probably has 
millions of pages.  Anybody can sell products thru them, and they 
make it very easy to do so.  So there are probably thousands (or 
more) folks with massive quantities of products for sale.


The worst thing will be if 40,000 products takes 4.5 hours (my last 
two runs confirmed that), once it gets to hundreds of thousands it 
could take *days*.  That would be bad.


I'm having other options for partial runs, but there's always got to 
be the ability for a full run.


I'm not familiar with the ins and outs of a DOS attack, you mean 
something like this could be confused for one?



Just a thought...  One factor might be if your list has the same domain
appearing as a contiguous block, the web server may be detecting that you
are not a human browsing and slow down the transfer rate.  One of my hosting
companies does this because they had bad experiences with denial of service
attacks.

Hope this helps.

Jim Ault
Las Vegas



--
  Dogs and bears, sports and cars, and patriots t-shirts
  http://www.villagetshirts.com
 WlND0WS and MAClNT0SH shareware games
 http://www.gypsyware.com




Re: Speeding up get URL

2008-08-03 Thread Shari
It's always one domain, the same domain, and I have no control over 
the domain or its hosting company.  The domain itself probably has 
millions of pages.  Anybody can sell products thru them, and they 
make it very easy to do so.  So there are probably thousands (or 
more) folks with massive quantities of products for sale.


The worst thing will be if 40,000 products takes 4.5 hours (my last 
two runs confirmed that), once it gets to hundreds of thousands it 
could take *days*.  That would be bad.


I'm having other options for partial runs, but there's always got to 
be the ability for a full run.


I'm not familiar with the ins and outs of a DOS attack, you mean 
something like this could be confused for one?



Just a thought...  One factor might be if your list has the same domain
appearing as a contiguous block, the web server may be detecting that you
are not a human browsing and slow down the transfer rate.  One of my hosting
companies does this because they had bad experiences with denial of service
attacks.

Hope this helps.

Jim Ault
Las Vegas



--
  Dogs and bears, sports and cars, and patriots t-shirts
  http://www.villagetshirts.com
 WlND0WS and MAClNT0SH shareware games
 http://www.gypsyware.com


Re: Speeding up get URL

2008-08-03 Thread Jim Ault
The major limitation for your case is that each request sent to a web server
is dependent on the response time from that web server.  Some servers
intentionally return with a delay to control bandwidth demands and load
balancing, especially if some of their hosted customers are downloading
videos or games or music or flash files.

You could add a timer to your process and find out which sites return the
slowest, but this may not be true every time you access the same site.

The rate of 10,000 per hour sounds about right since most product pages on
the internet are served from a database and are not static pages.

One service provider that I extract data from does not want more than one
hit every 50 seconds in order to be of service to hundreds of simultaneous
users, so they protect themselves from "denial of service attacks" that
overload their machines.

One of my hosting companies does not charge extra for bandwidth, but has set
the load balancing so that I could not serve videos effectively.  Works
great for blogs and low-volume pages.  Not good for music and games.

I would just let the app run overnight since you are only doing it a couple
times a month.  Of course you could run the app on two computers to double
the speed, then merge the results.

Another step might be to see if the sites have a listing page with your
desired data.  This might produce 10-50 products on one page rather than
10-50 different web pages.

Just a thought...  One factor might be if your list has the same domain
appearing as a contiguous block, the web server may be detecting that you
are not a human browsing and slow down the transfer rate.  One of my hosting
companies does this because they had bad experiences with denial of service
attacks.

Hope this helps.

Jim Ault
Las Vegas


On 8/3/08 7:35 AM, "Shari" <[EMAIL PROTECTED]> wrote:

> Goal:  Get a long list of website URLS, parse a bunch of data from
> each page, if successful delete the URL from the list, if not put the
> URL on a different list.  I've got it working but it's slow.  It
> takes about an hour per 10,000 urls.  I sell tshirts.  Am using this
> to create informational files for myself which will be frequently
> updated.  I'll probably be running this a couple times a month and
> expect my product line to just keep on growing.  I'm currently at
> about 40,000 products but look forward to the day of hundreds of
> thousands :-)  So speed is my need... (Yes, if you're interested my
> store is in the signature, opened it last December :-)
> 
> How do I speed this up?
> 
> # toDoList needs to have the successful URLs deleted, and failed URLs
> moved to a different list
> # that's why p down to 1, for the delete
> # URLS are standard http://www.somewhere.com/somePage
> 
>repeat with p = the number of lines of toDoList down to 1
>put url (line p of toDoList) into tUrl
># don't want to use *it* because there's a long script that follows
># *it* is too easily changed, though I've heard *it* is faster than
> *put*
># do the stuff
>if doTheStuffWorked then
>   delete line p of toDoList
>else put p & return after failedList
>updateProgressBar # another slowdown but necessary, gives a
> count of how many left to do
> end repeat




Re: Speeding up get URL

2008-08-03 Thread Shari

I wonder if using "load" URL might be faster?

sims


I haven't tried it.  The docs made it seem like the wrong choice as 
the url must be fully loaded for the handler to continue.  I check 
this by looking for  in the fetched url.


According to the docs "load" downloads the url in the background and 
doesn't wait for the complete download before continuing the script.


I'm not sure how to counter this.  Would you create an array to hold 
the url address and contents?  Each address being a key, and each 
contents being the contents of the page?  And then run the parsing 
handler on the array?


So that while it's parsing this url, in the background it's still 
loading other urls into the array?
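
In outline, that might look something like this (untested sketch; gPageData and the handler names are invented):

   global gPageData -- array: key = the url, value = the page source

   on startFetching pURLList
      repeat for each line tURL in pURLList
         load URL tURL with message "pageArrived"
      end repeat
   end startFetching

   on pageArrived pURL, pStatus
      if pStatus is "cached" then
         put URL pURL into gPageData[pURL] -- served from the cache, no second hit
         unload URL pURL
         -- parse gPageData[pURL] now, or wait until everything has arrived
      end if
   end pageArrived

Note that loads to the same domain are still sent out one at a time, so the main gain is overlapping the downloading with the parsing rather than multiplying the connections.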





From the docs:
The load command is non-blocking, so it does not stop the current 
handler while the download is completed. The handler continues while 
the load command downloads the URL in the background. You can monitor 
the download by checking the URLStatus function periodically.


Caution!  Avoid using the wait command in a handler after executing 
the load command. Since the load command is non-blocking, it may 
still be running when your handler reaches the wait command. And 
since the load command is part of the Internet library and is 
implemented in a handler, the wait command will stop the download 
process if it is executed while the download is still going on. In 
particular, do not use constructions like the following, which will 
sit forever without downloading the file:


  load URL myURL
  wait until the URLStatus of myURL is "cached" -- DON'T DO THIS

The file is downloaded into a local cache. It does not remain 
available after the application quits; the purpose of the cache is to 
speed up access to the specified URL, not to store it permanently. 
You can use a URL even if it is not in the cache, so use of the load 
command is optional.

<<



>>
handler pauses until Revolution is finished accessing the URL. Since 
fetching a web page may take some time due to network lag, accessing 
URLs may take long enough to be noticeable to the user. To avoid this 
delay, use the load command (which is non-blocking) to cache web 
pages before you need them.

<<


--
  Dogs and bears, sports and cars, and patriots t-shirts
  http://www.villagetshirts.com
 WlND0WS and MAClNT0SH shareware games
 http://www.gypsyware.com


Re: Speeding up get URL

2008-08-03 Thread Jim Sims


On Aug 3, 2008, at 4:35 PM, Shari wrote:

Goal:  Get a long list of website URLS, parse a bunch of data from  
each page, if successful delete the URL from the list, if not put  
the URL on a different list.  I've got it working but it's slow.  It  
takes about an hour per 10,000 urls.  I sell tshirts.  Am using this  
to create informational files for myself which will be frequently  
updated.  I'll probably be running this a couple times a month and  
expect my product line to just keep on growing.  I'm currently at  
about 40,000 products but look forward to the day of hundreds of  
thousands :-)  So speed is my need... (Yes, if you're interested my  
store is in the signature, opened it last December :-)


How do I speed this up?




I wonder if using "load" URL might be faster?


sims

ClipaSearch Pro
http://www.einspine.com

Across Platforms - Code and Culture
http://www.ezpzapps.com/blog/






Speeding up get URL

2008-08-03 Thread Shari
Goal:  Get a long list of website URLS, parse a bunch of data from 
each page, if successful delete the URL from the list, if not put the 
URL on a different list.  I've got it working but it's slow.  It 
takes about an hour per 10,000 urls.  I sell tshirts.  Am using this 
to create informational files for myself which will be frequently 
updated.  I'll probably be running this a couple times a month and 
expect my product line to just keep on growing.  I'm currently at 
about 40,000 products but look forward to the day of hundreds of 
thousands :-)  So speed is my need... (Yes, if you're interested my 
store is in the signature, opened it last December :-)


How do I speed this up?

# toDoList needs to have the successful URLs deleted, and failed URLs moved to a different list
# that's why p down to 1, for the delete
# URLS are standard http://www.somewhere.com/somePage

   repeat with p = the number of lines of toDoList down to 1
      put url (line p of toDoList) into tUrl
      # don't want to use *it* because there's a long script that follows
      # *it* is too easily changed, though I've heard *it* is faster than *put*
      # do the stuff
      if doTheStuffWorked then
         delete line p of toDoList
      else put p & return after failedList
      updateProgressBar # another slowdown but necessary, gives a count of how many left to do
   end repeat
--
  Dogs and bears, sports and cars, and patriots t-shirts
  http://www.villagetshirts.com
 WlND0WS and MAClNT0SH shareware games
 http://www.gypsyware.com