On 4 Aug 2008, at 13:00, Alex Tweedly wrote:
You should be able to achieve that using 'load URL' - set off a
number of 'load's and then, by checking the URLStatus, you can
process them as they finish arriving on your machine; and as the
number of outstanding requested URLs decreases
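Roughly what that looks like in Revolution script, as an untested sketch
(the handler and variable names are just placeholders):

local sFailedURLs

-- kick off several downloads without waiting for any of them
on startDownloads pURLList
   repeat for each line tURL in pURLList
      load URL tURL with message "urlArrived"
   end repeat
end startDownloads

-- libURL sends this message as each download finishes
on urlArrived pURL, pStatus
   if pStatus is "cached" then
      put URL pURL into tPage   -- read the cached copy
      -- ...parse tPage here...
      unload URL pURL           -- free the cache entry
   else
      -- pStatus is "error" or "timeout" when a download fails
      put pURL & return after sFailedURLs
   end if
end urlArrived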
Sarah Reichelt wrote:
On Mon, Aug 4, 2008 at 12:35 AM, Shari <[EMAIL PROTECTED]> wrote:
Goal: Get a long list of website URLs, parse a bunch of data from each
page, if successful delete the URL from the list, if not put the URL on a
different list. I've got it working but it's slow. It tak
Shari,
I'm not sure there is much you can do to speed up the fetching of
URLs, but my two suggestions would be:
1) See if you can process more than one download at a time - this will
be more complex to code, but may be a bit faster so that 1 slow
download doesn't affect another. Of course
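One way to sketch suggestion 1 (untested; the limit of 5 and the handler
names are made up) is to keep a fixed number of loads in flight and top
the queue up each time one finishes:

local sPending       -- URLs still waiting to be fetched
local sInFlight      -- how many loads are currently running
constant kMaxAtOnce = 5

on startBatch pURLList
   put pURLList into sPending
   put 0 into sInFlight
   fillQueue
end startBatch

on fillQueue
   repeat while sInFlight < kMaxAtOnce and sPending is not empty
      put line 1 of sPending into tURL
      delete line 1 of sPending
      load URL tURL with message "oneDone"
      add 1 to sInFlight
   end repeat
end fillQueue

on oneDone pURL, pStatus
   subtract 1 from sInFlight
   -- parse on "cached", log pURL on "error" or "timeout"
   unload URL pURL
   fillQueue   -- start the next download
end oneDone

That way a single slow page only ties up one of the slots instead of
holding up the whole run.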
Cable modem, yes. CGI, I don't know a word of the language.
So even your hosting ISP can get involved? Lordy, I had no idea
there were so many pitfalls. I can understand their issues however,
knowing how much spam I get from people who are probably using
similar searches for bad things. I
Search engines have APIs? I did not know that. I will definitely
look into this. I didn't realize I had so many different options to
choose from. Options are good, very good indeed :-)
Thank you!
Shari
I believe most of the major search engines have APIs for returning search
results as
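If one of those APIs fits, the fetch itself is just another URL call, with
the results coming back as structured data instead of an HTML page to
scrape. A hedged sketch (the endpoint, parameter name and query below are
purely illustrative; the real URL, query syntax and key handling come from
the engine's API docs):

put "http://api.example-search.com/results?format=xml&q=" into tBase
put tBase & URLEncode("site:example.com funny tshirt") into tRequest
put URL tRequest into tResults
if the result is empty then
   -- tResults is XML (or JSON), which is much easier to parse
   -- than a scraped results page
end if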
Very good point about doing it from a remote server - if the speed
difference were great, then an hourly-paid Amazon EC2 server might be
just the job...
Mark
On 4 Aug 2008, at 13:13, Alex Tweedly wrote:
If so, you might get a big improvement by converting the script
into a CGI script, an
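For what it's worth, the CGI version is mostly the same script with the
stack front end stripped off. A very rough sketch (the engine path and the
-ui switch depend on how the engine is installed on the host):

#!/usr/local/bin/revolution -ui
on startup
   put "Content-Type: text/plain" & CRLF & CRLF
   -- the fetch now runs over the server's connection, not the cable modem
   put URL "http://www.example.com/somepage.html" into tPage
   put tPage   -- whatever is "put" goes back to the caller
end startup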
I think you've had a lot of good suggestions for solving this problem.
However, depending on the kind of data you're trying to parse out (and the
frequency with which that data changes), you might be better to let Google
or Yahoo do the search (using the kind of advanced search like:
"some meaning
On Aug 4, 2008, at 4:17 AM, Shari wrote:
One service provider that I extract data from does not want more than one
hit every 50 seconds in order to be of service to hundreds of simultaneous
users, so they protect themselves from "denial of service attacks" that
overload their machines.
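The simplest way to respect a limit like that is to sleep between requests,
e.g. (a bare-bones sketch, assuming one hit every 50 seconds is what they
want):

on fetchPolitely pURLList
   repeat for each line tURL in pURLList
      put URL tURL into tPage
      if the result is empty then
         -- ...parse tPage here...
      end if
      wait 50 seconds with messages   -- "with messages" keeps the app responsive
   end repeat
end fetchPolitely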
Sorry if this message comes through twice - first attempt might have
failed, so I'm resending from a different account.
Sarah Reichelt wrote:
On Mon, Aug 4, 2008 at 12:35 AM, Shari <[EMAIL PROTECTED]> wrote:
Goal: Get a long list of website URLs, parse a bunch of data from each
page, if s
One service provider that I extract data from does not want more than one
hit every 50 seconds in order to be of service to hundreds of simultaneous
users, so they protect themselves from "denial of service attacks" that
overload their machines.
I did notice that even with their affiliate XML fi
Good suggestions, Sarah. Thank you! I've settled on a solution
that's going to partly go in the back door (retrieving their XML data
via their affiliate door) and partly go in the front door (get or
load url). So I'll parse what I can from their affiliate XML files
and do the rest the other
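Reading the affiliate XML is mostly a job for the revXML library. A rough
sketch (the feed URL parameter and the node path are invented for
illustration):

on readAffiliateFeed pFeedURL
   put URL pFeedURL into tXML
   if the result is not empty then exit readAffiliateFeed   -- download failed
   put revXMLCreateTree(tXML, false, true, false) into tTree
   if tTree is not an integer then exit readAffiliateFeed   -- parse failed
   put revXMLNodeContents(tTree, "/products/product/title") into tTitle
   -- ...use tTitle and whatever else the feed carries...
   revXMLDeleteTree tTree
end readAffiliateFeed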
The major limitation for your case is that each request sent to a web server
is dependent on the response time from that web server. Some servers
intentionally return with a delay to control bandwidth demands and load
balancing, especially if some of their hosted customers are downloading
videos or
On Mon, Aug 4, 2008 at 12:35 AM, Shari <[EMAIL PROTECTED]> wrote:
> Goal: Get a long list of website URLs, parse a bunch of data from each
> page, if successful delete the URL from the list, if not put the URL on a
> different list. I've got it working but it's slow. It takes about an hour
> per
I'd do that in a heartbeat if they had a way. They used to, but at
this time the only offering they have is for affiliates, and it has
severe limitations. I just got done checking it out and it isn't
designed for what I need. I might be able to "fudge" it and I will
give fudging a try. But
Noel,
I've done a bit of research and I don't think they have such issues.
Several folks are doing similar things very publicly (the website is
aware of it) and it doesn't seem to be a problem. Usually if
something is disallowed you'll find it referenced very clearly in
their user forums.
Noel is correct.
Even Google will ban the IP addresses of machines that execute too
many searches in a short time. One answer is to use proxy servers, but that
is a more complex process.
One suggestion is to send an email to the support group for the "one domain"
and ask if there is a bet
Yes, something like what you are describing could easily be confused
with a DOS attack.
DOS attacks are done by flooding a server with requests for webpages
to the point that the server crashes due to its inability to process
all the requests.
Even if you are not considered a DOS attack, the
It's always one domain, the same domain, and I have no control over
the domain or its hosting company. The domain itself probably has
millions of pages. Anybody can sell products thru them, and they
make it very easy to do so. So there are probably thousands (or
more) folks with massive quan
I wonder if using "load" URL might be faster?
sims
I haven't tried it. The docs made it seem like the wrong choice as
the url must be fully loaded for the handler to continue. I check
this by looking for the closing tag in the fetched url.
According to the docs "load" downloads the url in the background
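For the record, "load" does not make the handler wait; it caches the page
in the background and the script carries on. Something like this small,
untested sketch is how you would come back for it later:

load URL tURL
-- ...other work, or a "send ... in <time>" polling loop, goes here...
if the URLStatus of tURL is "cached" then
   put URL tURL into tPage   -- served from the cache, no second hit
   unload URL tURL
end if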
On Aug 3, 2008, at 4:35 PM, Shari wrote:
Goal: Get a long list of website URLs, parse a bunch of data from
each page, if successful delete the URL from the list, if not put
the URL on a different list. I've got it working but it's slow. It
takes about an hour per 10,000 urls. I sell ts
Goal: Get a long list of website URLs, parse a bunch of data from
each page, if successful delete the URL from the list, if not put the
URL on a different list. I've got it working but it's slow. It
takes about an hour per 10,000 urls. I sell tshirts. Am using this
to create informational
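For comparison, the kind of serial loop being described boils down to
something like this (an untested sketch; parsePage is a stand-in for
whatever the real extraction does):

on processList pURLList
   put empty into tFailed
   repeat for each line tURL in pURLList
      put URL tURL into tPage
      if the result is empty and parsePage(tPage) is not empty then
         -- parsed fine, so this URL simply drops off the list
      else
         put tURL & return after tFailed   -- goes on the "try again" list
      end if
   end repeat
   return tFailed
end processList

function parsePage pPage
   -- placeholder: return whatever data the real script extracts
   return word 1 to 10 of pPage
end parsePage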