[Robots] LWP (was RE: Re: better language for writing a Spider ?)

2002-03-15 Thread Nick Arnett




> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
> Sean M. Burke

...

> At 10:36 2002-03-14 -0800, Nick Arnett wrote:
> >[with Python] I'm not seeing long, mysterious time-outs as I occasionally
> >did with LWP.
>
> I have never run into this problem, but I have a dim memory that you may be
> alluding to what is a known bug not with LWP, but with old versions (long
> since fixed in modern Perls and/or CPAN) of the socket libraries in IO::*.

I'm very diligent about updating, so I doubt I was seeing an old bug.
What I would see was a series of time-outs, usually no more than 10 in
a row (I limited re-tries to 10, with increasing delays between them in case
it was a server-busy issue).  But I should make it clear that the bug
producing the error message out of expat was completely separate.

> >Following an LWP request through the debugger is a long and convoluted
> >journey...
>
> Are you referring to perl -d, or LWP::Debug?

Sorry for not specifying.  I was using the ActiveState graphical debugger on
Windows, although sometimes the code was actually running on Linux.  Same
behavior on both, though.  I did give LWP::Debug a shot, but still couldn't see
where the error code was getting introduced.  Wish I could recall better
specifics, but it's been a few weeks.  As I recall, the server was returning
an error, suggesting that there was something malformed about the request I
sent it, and that error was being mistranslated in the expat DLL... and I
recall having trouble even figuring out where expat got involved in the
mess.

> Maybe I should write an addendum to "lwpcook.pod" on figuring out what's
> going wrong, when something does go wrong.  The current lwpcook really
> needs an overhaul, and once my book /Perl & LWP/ is done (hopefully it'll
> be in press within a few weeks), I hope to send up some big doc patches to
> LWP, at the very least revamping lwpcook and then going into each class and
> noting in the docs whether a typical user needs to bother knowing about
> it.  (E.g., you need to know about HTTP::Response; you do /not/ need to
> know about LWP::Protocol.)

That would really be good.  Examples, examples, examples.  I learn by doing,
not by reading, and I think there are a fair number of people like me out
there.

> In short, if people want to see improvements to LWP, email me and say what
> you want done, and I'll either try my hand at implementing it, or I'll pass
> it on to someone more capable.

A re-try mechanism would be terrific.  Mine is fairly straightforward.  The
parameters are a max number of tries, a delay factor that optionally rises
with each try, and a logging method that details as much as possible about
each failure.  The latter is where some work on the internals would be
helpful, to disambiguate error messages as much as possible.  Perhaps a
simple way to kick in LWP::Debug with appropriate parameters and log the
results if repeated failures occur?  I always want to see exactly what the
outgoing request was and the server's actual response, so I know whether the
request is munged or the server is being difficult... not that that's always
clear.

Nick
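
The re-try scheme described above (a maximum number of tries, a delay
that can rise with each try, and a log entry for every failure) could be
sketched roughly as follows.  This is Java rather than Perl, and the
class name Retry, its method names, and its parameters are all invented
for illustration; nothing here is part of LWP.

```java
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical retry helper; none of these names come from LWP. */
final class Retry {
    /** Delay before the next try: baseMillis * growth^(attempt - 1). */
    static long delayMillis(long baseMillis, double growth, int attempt) {
        return Math.round(baseMillis * Math.pow(growth, attempt - 1));
    }

    /**
     * Runs the task up to maxTries times.  Each failure is appended to
     * log with its attempt number and exception; we then sleep before
     * the next try.  The last failure is rethrown as the cause.
     */
    static <T> T run(Callable<T> task, int maxTries, long baseMillis,
                     double growth, List<String> log) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                log.add("attempt " + attempt + " failed: " + e);
                if (attempt < maxTries) {
                    Thread.sleep(delayMillis(baseMillis, growth, attempt));
                }
            }
        }
        throw new Exception("gave up after " + maxTries + " tries", last);
    }
}
```

A growth factor of 1.0 gives a constant delay; 2.0 doubles it after each
failure.  A fuller version would presumably log the outgoing request and
the raw server response as well, which is the part that needs help from
the library internals.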


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a message 
to "[EMAIL PROTECTED]".



[Robots] Re: better language for writing a Spider ?

2002-03-15 Thread Tim Bray


Sean M. Burke wrote:

> In short, if people want to see improvements to LWP, email me and say what 
> you want done


For robots, you need a call that says "fetch this URL, but get a maximum
of XX bytes and spend a maximum of YY seconds doing it."  Return status
should tell you whether it finished or timed out, and how many bytes
were actually retrieved.

BTW, have the LWP timeouts been fixed?  As recently as early 2000, they
were known to generally not work.  -Tim





[Robots] Re: better language for writing a Spider ?

2002-03-15 Thread Tim Bray


srinivas mohan wrote:

> Can you help me by suggesting any open-source compilers
> to compile my Java code to native code...


I suggest that this is unlikely to help.  Whenever a computer program
is not running fast enough, the first step MUST BE to measure it and
understand why.  Use a profiler.  Or write a logfile with lots of
timestamps.  What is your robot spending its time doing?  Until you
know this, any time spent trying to make it go faster is wasted.
  -Tim
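
As a rough illustration of the "logfile with lots of timestamps" idea,
the sketch below accumulates wall-clock time per named phase of a robot's
main loop.  PhaseTimer and the phase names in the usage note are
hypothetical; a real profiler will give a far more detailed picture.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical phase timer for a crawler loop. */
final class PhaseTimer {
    private final Map<String, Long> totals = new LinkedHashMap<>();

    /** Runs one step and adds its elapsed time to the named phase. */
    void time(String phase, Runnable step) {
        long start = System.nanoTime();
        try {
            step.run();
        } finally {
            // Record the time even if the step throws.
            totals.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total nanoseconds spent in a phase, or 0 if it never ran. */
    long totalNanos(String phase) {
        return totals.getOrDefault(phase, 0L);
    }

    /** One line per phase, in first-seen order: name, then whole milliseconds. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            sb.append(e.getKey()).append(' ')
              .append(e.getValue() / 1_000_000).append(" ms\n");
        }
        return sb.toString();
    }
}
```

Wrapping, say, the DNS lookup, the fetch, the parse, and the database
write in separate time() calls and printing report() every few hundred
pages should show quickly whether the robot is bound by CPU or, as is
more often the case for a spider, by waiting on the network.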





[Robots] Re: SV: matching and "User-Agent:" in robots.txt

2002-03-15 Thread Brian Broderick


Here is how you set it in Java.  Of course, there is a lot more code
involved such as try/catch blocks and stuff.

// Needs: import java.net.HttpURLConnection; import java.net.URL;
String referer = "http://spider.desertrealm.com";
String user_agent = "DesertRealm.com; 0.2; [J];";

URL hp = new URL(currentURL);
// For http: URLs, openConnection() returns an HttpURLConnection
HttpURLConnection uc = (HttpURLConnection) hp.openConnection();

// Set request headers before connecting
uc.setRequestProperty("Referer", referer);
uc.setRequestProperty("User-Agent", user_agent);

Thanks,

Brian

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 
Sent: Friday, March 15, 2002 1:58 AM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: [Robots] Re: SV: matching and "User-Agent:" in robots.txt



On Thursday, March 14, 2002, at 10:59 , Thomas Huber wrote:

>
>
> How can the UA be set in Java?
>
>> Create a user-agent object thus:
>>
>> "$ua = LWP::RobotUA->new('Banjo/1.1','http://nowhere.int/banjo.html
>> [EMAIL PROTECTED]')

There is no explicit setUserAgent call. However, you can set the header 
itself. I forget the exact call, but it's something like 
urlConnection.setHeader("User-Agent", "Banjo/1.1"). If you want to set 
the From header you'll have to use this method too.

--
Mike





[Robots] Re: better language for writing a Spider ?

2002-03-15 Thread Sean M. Burke


At 10:36 2002-03-14 -0800, Nick Arnett wrote:
>[with Python] I'm not seeing long, mysterious time-outs as I occasionally 
>did with LWP.

I have never run into this problem, but I have a dim memory that you may be 
alluding to what is a known bug not with LWP, but with old versions (long 
since fixed in modern Perls and/or CPAN) of the socket libraries in IO::*.

>Following an LWP request through the debugger is a long and convoluted
>journey...

Are you referring to perl -d, or LWP::Debug?

Maybe I should write an addendum to "lwpcook.pod" on figuring out what's 
going wrong, when something does go wrong.  The current lwpcook really 
needs an overhaul, and once my book /Perl & LWP/ is done (hopefully it'll 
be in press within a few weeks), I hope to send up some big doc patches to 
LWP, at the very least revamping lwpcook and then going into each class and 
noting in the docs whether a typical user needs to bother knowing about 
it.  (E.g., you need to know about HTTP::Response; you do /not/ need to 
know about LWP::Protocol.)

In short, if people want to see improvements to LWP, email me and say what 
you want done, and I'll either try my hand at implementing it, or I'll pass 
it on to someone more capable.  LWP is not the product of a massive 
bureaucracy, but of few enough people that you could fit all of us in a 
phone booth.  We're all manically busy, to varying degrees (companies to 
run, children to raise, books/articles/modules to write, etc.), but we do 
at times manage to do what needs doing, if it's pointed out clearly enough 
to stand out from the torrent of email messages (which I find incessantly 
discouraging) that manage no better than "halo I try to use LWP with hotmel 
but not work plz hlp k thx".


--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/





[Robots] Re: SV: matching and "User-Agent:" in robots.txt

2002-03-15 Thread mmoran



On Thursday, March 14, 2002, at 10:59 , Thomas Huber wrote:

>
>
> How can the UA be set in Java?
>
>> Create a user-agent object thus:
>>
>> "$ua = LWP::RobotUA->new('Banjo/1.1','http://nowhere.int/banjo.html
>> [EMAIL PROTECTED]')

There is no explicit setUserAgent call. However, you can set the header 
itself. I forget the exact call, but it's something like 
urlConnection.setHeader("User-Agent", "Banjo/1.1"). If you want to set 
the From header you'll have to use this method too.

--
Mike
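
For reference, the header-setting method on java.net.URLConnection is
setRequestProperty; the class has no setHeader method.  A minimal sketch
that sets both User-Agent and From follows.  The class name RobotHeaders
and the example values in the usage note are placeholders.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

/** Sketch only: RobotHeaders is a made-up helper name. */
final class RobotHeaders {
    /**
     * Creates a connection object and sets identifying headers on it.
     * openConnection() does not contact the server; the request goes
     * out only when the caller connects or reads the response.
     */
    static URLConnection prepare(String url, String userAgent, String from)
            throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setRequestProperty("From", from);
        return conn;
    }
}
```

A call such as prepare("http://example.com/", "Banjo/1.1",
"banjo-admin@example.com") returns a connection whose headers can be
checked with getRequestProperty before anything is sent.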





[Robots] Re: matching and "UserAgent:" in robots.txt

2002-03-15 Thread Sean M. Burke


I dug around more in Perl LWP's WWW::RobotRules module and the short story 
is that the bug I found exists, but that it's not as bad as I thought.
If you set up a user agent with the name "Foobar/1.23", a WWW::RobotRules 
object actually /does/ currently know to strip off the "/1.23" (this 
happens in the 'agent' method, not in the is_me method where I expected it).

The current bug surfaces only when your user-agent name is more than one 
word; if your user-agent name is "Foobar/1.23 [[EMAIL PROTECTED]]", the 
current 'agent' method's logic says "well, it doesn't end in 
'/number.number', so there's no version to strip off".

So I'm going to send Gisle Aas a patch so that the first word, minus any 
version suffix, is what's used for matching.  It's just a matter of adding 
a line saying:
 $name = $1 if $name =~ m/(\S+)/; # get first word
in the 'agent' method.
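
For anyone doing the same matching outside Perl, the rule described here
(take the first word, then drop any version suffix) could be sketched in
Java as below.  This is an analogue, not the WWW::RobotRules code, and it
assumes a version suffix is a slash followed by digits and dots.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Java analogue of the matching rule; not WWW::RobotRules code. */
final class AgentName {
    private static final Pattern FIRST_WORD = Pattern.compile("(\\S+)");
    // Assumption: a version suffix is "/" followed by digits and dots.
    private static final Pattern VERSION = Pattern.compile("/[0-9][0-9.]*$");

    /** Returns the short name to compare against robots.txt User-agent lines. */
    static String shortName(String userAgent) {
        Matcher m = FIRST_WORD.matcher(userAgent);
        String name = m.find() ? m.group(1) : "";   // get first word
        return VERSION.matcher(name).replaceFirst("");
    }
}
```

With this, both "Foobar/1.23" and "Foobar/1.23 [bot@example.com]" reduce
to "Foobar", which is the behavior the patch is meant to give.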


--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/





[Robots] Re: better language for writing a Spider ?

2002-03-15 Thread Achim Dreyer


On Thu, 14 Mar 2002, Erick Thompson wrote:

> I would suggest using C# if you are using the Windows platform. It's quite
> fast, and MS provides a program to convert from Java to C#[1], so you may
> save a bunch of redevelopment time. Even if you're not on Windows, there
> will soon be options to still use .net, such as Rotor[2] and Mono[3].
>
> On a related note, I am working on a spider in C#, and if a lot of other
> people are working on new spiders as well, perhaps we should look at starting
> a .net open source project, based on something like the BSD/X11 license (to
> allow commercial inclusion).
>
> Erick
>
> [1]
> http://msdn.microsoft.com/vstudio/downloads/jca/default.asp
>
> [2]
> http://www.oreillynet.com/pub/a/dotnet/2002/03/04/rotor.html
>
> [3]
> http://www.go-mono.com/

---
Hi,

Why not pre-compile the Java code to native code?
-> No redevelopment time at all.


Regards,
Achim Dreyer

---
A. Dreyer, UNIX System Administrator and Internet Security Consultant


