[Robots] LWP (was RE: Re: better language for writing a Spider ?)
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
> Behalf Of Sean M. Burke

...

> At 10:36 2002-03-14 -0800, Nick Arnett wrote:
> >[with Python] I'm not seeing long, mysterious time-outs as I occasionally
> >did with LWP.
>
> I have never run into this problem, but I have a dim memory that you may be
> alluding to what is a known bug not with LWP, but with old versions (long
> since fixed in modern Perls and/or CPAN) of the socket libraries in IO::*.

I'm very diligent about updating, so I doubt that I was seeing an old bug. What I would see was a series of time-outs, usually no more than 10 in a row (I limited retries to 10, with increasing delays between them in case the server was busy). But I should make it clear that the bug producing the error message out of expat was completely separate.

> >Following an LWP request through the debugger is a long and convoluted
> >journey...
>
> Are you referring to perl -d, or LWP::Debug?

Sorry for not specifying. I was using the ActiveState graphical debugger on Windows, although sometimes the code was actually running on Linux. Same behavior on both, though. I did give LWP::Debug a shot, but still couldn't see where the error code was getting introduced. Wish I could recall better specifics, but it's been a few weeks. As I recall, the server was returning an error, suggesting that there was something malformed about the request I sent it, and that error was being mistranslated in the expat DLL... and I recall having trouble even figuring out where expat got involved in the mess.

> Maybe I should write an addendum to "lwpcook.pod" on figuring out what's
> going wrong, when something does go wrong. The current lwpcook really
> needs an overhaul, and once my book /Perl & LWP/ is done (hopefully it'll
> be in press within a few weeks), I hope to send up some big doc patches to
> LWP, at the very least revamping lwpcook and then going into each class and
> noting in the docs whether a typical user needs to bother knowing about
> it. (E.g., you need to know about HTTP::Response; you do /not/ need to
> know about LWP::Protocol.)

That would really be good. Examples, examples, examples. I learn by doing, not by reading, and I think there are a fair number of people like me out there.

> In short, if people want to see improvements to LWP, email me and say what
> you want done, and I'll either try my hand at implementing it, or I'll pass
> it on to someone more capable.

A retry mechanism would be terrific. Mine is fairly straightforward. The parameters are a maximum number of tries, a delay factor that optionally rises with each try, and a logging method that records as much detail as possible about each failure. The logging is where some work on the internals would be helpful, to disambiguate error messages as much as possible. Perhaps a simple way to kick in LWP::Debug with appropriate parameters and log the results if repeated failures occur? I always want to see exactly what the outgoing request was and the server's actual response, so I know whether the request is munged or the server is being difficult... not that that's always clear.

Nick

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".
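The retry mechanism described in this message (a maximum number of tries, a delay that can rise with each try, and a log entry for every failure) might look roughly like the sketch below. It is written in Java rather than Perl, it is not part of LWP, and the names `RetrySketch`, `withRetries` and `delayMillis` are made up for illustration. The geometric growth of the delay is an assumption; the message only says the delay "optionally rises".

```java
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical helper, not an LWP or JDK API.
class RetrySketch {

    // Delay to wait after the given number of failures.
    // factor == 1.0 keeps the delay constant; factor > 1.0 makes it rise.
    static long delayMillis(long baseMillis, double factor, int failures) {
        return Math.round(baseMillis * Math.pow(factor, failures - 1));
    }

    // Calls request up to maxTries times and records every failure in log.
    static <T> T withRetries(Callable<T> request, int maxTries, long baseMillis,
                             double factor, List<String> log) {
        if (maxTries < 1) {
            throw new IllegalArgumentException("maxTries must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                return request.call();
            } catch (Exception e) {
                last = e;
                // Keep as much detail as we have about this failure.
                log.add("attempt " + attempt + " of " + maxTries + " failed: " + e);
            }
            if (attempt < maxTries) {
                try {
                    Thread.sleep(delayMillis(baseMillis, factor, attempt));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
        throw new IllegalStateException("gave up after " + maxTries + " tries", last);
    }
}
```

A caller would pass the actual fetch as the `request` argument and, after a final failure, dump the collected log lines together with the outgoing request and the server's response.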
[Robots] Re: better language for writing a Spider ?
Sean M. Burke wrote:
> In short, if people want to see improvements to LWP, email me and say what
> you want done

For robots, you need a call that says "fetch this URL, but get a maximum of XX bytes and spend a maximum of YY seconds doing it." The return status should tell you whether it finished or timed out, and how many bytes were actually retrieved.

BTW, have the LWP timeouts been fixed? As recently as early 2000, they were known to generally not work.

-Tim
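To make the requested call concrete, here is a rough sketch of the core loop: read at most XX bytes, spend at most YY milliseconds, and report how it ended and how many bytes arrived. It is Java, not LWP, and `BoundedFetchSketch`, `readBounded`, `Status` and `Result` are hypothetical names. One caveat: the deadline is only checked between reads, so a real robot would presumably also need a socket-level read timeout (for example `URLConnection.setReadTimeout`) so that a single blocked read cannot overrun the budget.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not an LWP or JDK API.
class BoundedFetchSketch {

    enum Status { COMPLETE, BYTE_LIMIT, TIMED_OUT, FAILED }

    static final class Result {
        final Status status;
        final byte[] body;

        Result(Status status, byte[] body) {
            this.status = status;
            this.body = body;
        }

        int bytesRead() {
            return body.length;
        }
    }

    // Reads from in until end of stream, maxBytes, or maxMillis, whichever comes first.
    static Result readBounded(InputStream in, int maxBytes, long maxMillis) {
        long start = System.nanoTime();
        long budgetNanos = maxMillis * 1_000_000L;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (true) {
                if (System.nanoTime() - start >= budgetNanos) {
                    return new Result(Status.TIMED_OUT, out.toByteArray());
                }
                int room = maxBytes - out.size();
                if (room <= 0) {
                    // We stopped at the cap; the resource may or may not have had more.
                    return new Result(Status.BYTE_LIMIT, out.toByteArray());
                }
                int n = in.read(buf, 0, Math.min(buf.length, room));
                if (n < 0) {
                    return new Result(Status.COMPLETE, out.toByteArray());
                }
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            // Keep whatever was read before the failure.
            return new Result(Status.FAILED, out.toByteArray());
        }
    }
}
```

Returning the partial body along with the status lets the robot decide for itself whether a truncated page is still worth indexing.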
[Robots] Re: better language for writing a Spider ?
srinivas mohan wrote:
> can you help me suggesting any open source compilers
> to compile my java code to native code...

I suggest that this is unlikely to help. Whenever a computer program is not running fast enough, the first step MUST BE to measure it and understand why. Use a profiler. Or write a logfile with lots of timestamps. What is your robot spending its time doing? Until you know this, any time spent trying to make it go faster is wasted.

-Tim
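As one way to follow the "lots of timestamps" advice without a full profiler, a robot could wrap each phase of its work (DNS lookup, fetch, parse, store) in a small timer and print the totals. The sketch below is only an illustration; `PhaseTimerSketch` and its methods are hypothetical names, and the phase labels are whatever the caller chooses.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper for coarse timing; not a replacement for a real profiler.
class PhaseTimerSketch {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();
    private final Map<String, Integer> callsByPhase = new LinkedHashMap<>();

    // Runs work, adds its elapsed time to the named phase, and returns its result.
    <T> T time(String phase, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
            callsByPhase.merge(phase, 1, Integer::sum);
        }
    }

    long totalNanos(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    int calls(String phase) {
        return callsByPhase.getOrDefault(phase, 0);
    }

    // One line per phase, in the order the phases were first seen.
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanosByPhase.entrySet()) {
            sb.append(String.format("%s: %d calls, %.3f ms total%n",
                    e.getKey(), calls(e.getKey()), e.getValue() / 1e6));
        }
        return sb.toString();
    }
}
```

If the report shows most of the time going to network waits rather than to parsing, compiling to native code is unlikely to change much.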
[Robots] Re: SV: matching and "User-Agent:" in robots.txt
Here is how you set it in Java. Of course, there is a lot more code involved, such as try/catch blocks and stuff.

    String referer = "http://spider.desertrealm.com";
    String user_agent = "DesertRealm.com; 0.2; [J];";
    URL hp = new URL(currentURL);
    URLConnection hpCon = hp.openConnection();
    HttpURLConnection uc = (HttpURLConnection)hpCon;
    // Request headers have to be set before the connection is actually made.
    hpCon.setRequestProperty ("Referer", referer);
    hpCon.setRequestProperty ("User-Agent", user_agent);

Thanks,
Brian

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 15, 2002 1:58 AM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: [Robots] Re: SV: matching and "User-Agent:" in robots.txt

On Thursday, March 14, 2002, at 10:59 , Thomas Huber wrote:

>
> > How can the UA be set in Java?
>
>> Create a user-agent object thus:
>>
>> "$ua = LWP::RobotUA->new('Banjo/1.1','http://nowhere.int/banjo.html
>> [EMAIL PROTECTED]')

There is no explicit setUserAgent call. However, you can set the header itself. I forget the exact call, but it's something like urlConnection.setHeader("User-Agent", "Banjo/1.1"). If you want to set the From header you'll have to use this method too.

-- Mike
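For reference, a self-contained version of the same idea, with the exception handling filled in and the From header set the same way, might look like the sketch below. The class and method names (`RobotRequestSketch`, `prepare`) are made up, and the From address used in the usage note is a placeholder. The sketch assumes an http or https URL, since the cast to `HttpURLConnection` would fail for other schemes. Note that `openConnection()` only creates the connection object; nothing is sent until the connection is used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper; assumes the URL uses http or https.
class RobotRequestSketch {

    // Builds a not-yet-connected request carrying User-Agent and From headers.
    static HttpURLConnection prepare(String url, String userAgent, String from) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            // Headers must be set before the connection is made.
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setRequestProperty("From", from);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as `RobotRequestSketch.prepare("http://example.com/", "Banjo/1.1", "banjo@example.com")` returns a connection whose `getRequestProperty("User-Agent")` is `Banjo/1.1`.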
[Robots] Re: better language for writing a Spider ?
At 10:36 2002-03-14 -0800, Nick Arnett wrote:
>[with Python] I'm not seeing long, mysterious time-outs as I occasionally
>did with LWP.

I have never run into this problem, but I have a dim memory that you may be alluding to what is a known bug not with LWP, but with old versions (long since fixed in modern Perls and/or CPAN) of the socket libraries in IO::*.

>Following an LWP request through the debugger is a long and convoluted
>journey...

Are you referring to perl -d, or LWP::Debug?

Maybe I should write an addendum to "lwpcook.pod" on figuring out what's going wrong, when something does go wrong. The current lwpcook really needs an overhaul, and once my book /Perl & LWP/ is done (hopefully it'll be in press within a few weeks), I hope to send up some big doc patches to LWP, at the very least revamping lwpcook and then going into each class and noting in the docs whether a typical user needs to bother knowing about it. (E.g., you need to know about HTTP::Response; you do /not/ need to know about LWP::Protocol.)

In short, if people want to see improvements to LWP, email me and say what you want done, and I'll either try my hand at implementing it, or I'll pass it on to someone more capable.

LWP is not the product of a massive bureaucracy, but of few enough people that you could fit all of us in a phone booth. We're all manically busy, to varying degrees (companies to run, children to raise, books/articles/modules to write, etc.), but we do at times manage to do what needs doing, if it's pointed out clearly enough to stand out from the torrent of email messages (which I find incessantly discouraging) that manage no better than "halo I try to use LWP with hotmel but not work plz hlp k thx".

--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
[Robots] Re: SV: matching and "User-Agent:" in robots.txt
On Thursday, March 14, 2002, at 10:59 , Thomas Huber wrote:

>
> > How can the UA be set in Java?
>
>> Create a user-agent object thus:
>>
>> "$ua = LWP::RobotUA->new('Banjo/1.1','http://nowhere.int/banjo.html
>> [EMAIL PROTECTED]')

There is no explicit setUserAgent call. However, you can set the header itself. I forget the exact call, but it's something like urlConnection.setHeader("User-Agent", "Banjo/1.1"). If you want to set the From header you'll have to use this method too.

-- Mike
[Robots] Re: matching and "UserAgent:" in robots.txt
I dug around more in Perl LWP's WWW::RobotRules module and the short story is that the bug I found exists, but that it's not as bad as I thought.

If you set up a user agent with the name "Foobar/1.23", a WWW::RobotRules object actually /does/ currently know to strip off the "/1.23" (this happens in the 'agent' method, not in the is_me method where I expected it).

The current bug surfaces only when your user-agent name is more than one word; if your user-agent name is "Foobar/1.23 [[EMAIL PROTECTED]]", the current 'agent' method's logic says "well, it doesn't end in '/number.number', so there's no version to strip off".

So I'm going to send Gisle Aas a patch so that the first word, minus any version suffix, is what's used for matching. It's just a matter of adding a line saying:

    $name = $1 if $name =~ m/(\S+)/; # get first word

in the 'agent' method.

--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
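For anyone doing the same normalization outside Perl, the rule described here (take the first word, then drop any version suffix) can be sketched in Java as below. This is an illustration of the rule as stated in the message, not a port of WWW::RobotRules; `AgentNameSketch` and `shortName` are hypothetical names, and treating everything from the first "/" onward as the version suffix is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the rule; not a port of WWW::RobotRules.
class AgentNameSketch {
    // Same idea as the Perl m/(\S+)/: the first run of non-whitespace characters.
    private static final Pattern FIRST_WORD = Pattern.compile("(\\S+)");

    // Reduces a full user-agent string to the short name used for matching.
    static String shortName(String agent) {
        String name = agent;
        Matcher m = FIRST_WORD.matcher(name);
        if (m.find()) {
            name = m.group(1); // get first word
        }
        // Assumption: the version suffix is everything from the first "/" on.
        int slash = name.indexOf('/');
        return slash >= 0 ? name.substring(0, slash) : name;
    }
}
```

With this rule, both "Foobar/1.23" and a multi-word name such as "Foobar/1.23 [bot@example.org]" (a placeholder address) reduce to "Foobar".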
[Robots] Re: better language for writing a Spider ?
On Thu, 14 Mar 2002, Erick Thompson wrote:

> I would suggest using C# if you are using the Windows platform. It's quite
> fast, and MS provides a program to convert from Java to C#[1], so you may
> save a bunch of redevelopment time. Even if you're not on Windows, there
> will soon be options to still use .net, such as Rotor[2] and Mono[3].
>
> On a related note, I am working on a spider in C#, and if a lot of other
> people are working on new spiders as well, perhaps we should look at starting
> a .net open source project, based on something like the BSD/X11 license (to
> allow commercial inclusion).
>
> Erick
>
> [1]
> http://msdn.microsoft.com/vstudio/downloads/jca/default.asp
>
> [2]
> http://www.oreillynet.com/pub/a/dotnet/2002/03/04/rotor.html
>
> [3]
> http://www.go-mono.com/

---

Hi,

Why not pre-compile the Java code to native code?
-> No redevelopment time at all.

Regards,
Achim Dreyer

---
A. Dreyer, UNIX System Administrator and Internet Security Consultant