[Robots] Re: better language for writing a Spider ?

2002-03-15 Thread Achim Dreyer


On Thu, 14 Mar 2002, Erick Thompson wrote:

 I would suggest using C# if you are using the Windows platform. It's quite
 fast, and MS provides a program to convert from Java to C#[1], so you may
 save a bunch of redevelopment time. Even if you're not on Windows, there
 will soon be options to still use .NET, such as Rotor[2] and Mono[3].

 On a related note, I am working on a spider in C#, and if a lot of other
 people are working on new spiders as well, perhaps we should look at starting
 a .NET open source project, based on something like the BSD/X11 license (to
 allow commercial inclusion).

 Erick

 [1]
 http://msdn.microsoft.com/vstudio/downloads/jca/default.asp

 [2]
 http://www.oreillynet.com/pub/a/dotnet/2002/03/04/rotor.html

 [3]
 http://www.go-mono.com/

---
Hi,

Why not precompile the Java code to native code?
- No redevelopment time at all.


Regards,
Achim Dreyer

---
A. Dreyer, UNIX System Administrator and Internet Security Consultant



--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].



[Robots] Re: better language for writing a Spider ?

2002-03-15 Thread Tim Bray


Sean M. Burke wrote:

 In short, if people want to see improvements to LWP, email me and say what 
 you want done


For robots, you need a call that says "fetch this URL, but get a maximum
of XX bytes and spend a maximum of YY seconds doing it."  The return status
should tell you whether it finished or timed out, and how many bytes
were actually retrieved.
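
A minimal Python sketch of such a call, using only the standard library.
The names fetch_limited and FetchResult and the status strings are
illustrative assumptions, not part of LWP or any existing library.  The
socket timeout applies to each blocking operation rather than to the whole
transfer, so the time limit is approximate: a stalled read can overrun the
deadline by up to max_seconds.

```python
import socket
import time
import urllib.error
import urllib.request
from typing import NamedTuple


class FetchResult(NamedTuple):
    status: str      # "complete", "truncated", "timeout", or "error"
    body: bytes      # whatever was retrieved before stopping
    bytes_read: int  # len(body), reported separately for convenience


def fetch_limited(url, max_bytes, max_seconds, chunk_size=8192):
    """Fetch url, stopping after max_bytes or roughly max_seconds."""
    deadline = time.monotonic() + max_seconds
    chunks, total, status = [], 0, "complete"
    try:
        # timeout= bounds each socket operation, not the whole download,
        # hence the explicit deadline check inside the loop.
        with urllib.request.urlopen(url, timeout=max_seconds) as resp:
            while True:
                if time.monotonic() >= deadline:
                    status = "timeout"
                    break
                if total >= max_bytes:
                    # At the cap: read one more byte to tell an exact fit
                    # from a body that has more data waiting.
                    if resp.read(1):
                        status = "truncated"
                    break
                chunk = resp.read(min(chunk_size, max_bytes - total))
                if not chunk:
                    break  # end of body
                chunks.append(chunk)
                total += len(chunk)
    except urllib.error.URLError as exc:
        # Connection-phase timeouts arrive wrapped in URLError.
        timed_out = isinstance(exc.reason, socket.timeout)
        status = "timeout" if timed_out else "error"
    except socket.timeout:
        status = "timeout"
    except OSError:
        status = "error"
    return FetchResult(status, b"".join(chunks), total)
```

A caller would check result.status first and then decide whether a
partial result.body is still worth parsing, for example after
fetch_limited(url, max_bytes=100000, max_seconds=10).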

BTW, have the LWP timeouts been fixed?  As recently as early 2000, they
were known to generally not work.  -Tim





[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread Nick Arnett


Having worked in Perl and Python, I'll recommend Python.  Although I haven't
been using it for long, I'm definitely more productive with it.  Performance
seems fine, though I haven't really pushed hard on it.  I'm not seeing long,
mysterious time-outs as I occasionally did with LWP.  And I hit some weird
bug in LWP a few weeks ago, which resulted in a strange error message that I
eventually discovered was coming out of the expat DLL for XML.  Instead of
retrieving the page I wanted, it was misinterpreting a server error.  I wish
I could be more specific, but I never did figure out what was really going
on.  Following an LWP request through the debugger is a long and convoluted
journey...

Nick

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of srinivas mohan
 Sent: Thursday, March 14, 2002 9:48 AM
 To: [EMAIL PROTECTED]
 Subject: [Robots] better language for writing a Spider ?



 Hello,

 I am working on robot development in Java.
 We are developing a search engine, and almost the
 complete engine is developed.
 We used Java for the development, but the performance
 of the Java API in fetching web pages is too low.
 Basically, we developed our own URL connection, as
 some features, like timeouts, are not
 supported by the java.net.URLConnection API.

 Though there are better spiders in Java, like
 Mercator, we could not achieve better performance
 with our product.

 Now, as the performance is low, we want to redevelop
 our spider in a language like C or Perl and use
 it with our existing product.

 I would be thankful if anyone could help me choose
 the better language, where I can get better
 performance.

 Thanks in advance,
 Mohan






[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread Otis Gospodnetic


 I am working on robot development in Java.
 We are developing a search engine, and almost the
 complete engine is developed.
 We used Java for the development, but the performance
 of the Java API in fetching web pages is too low.
 Basically, we developed our own URL connection, as
 some features, like timeouts, are not
 supported by the java.net.URLConnection API.

Look at Java 1.4; it addresses these issues (socket timeouts,
non-blocking I/O, etc.).

 Though there are better spiders in Java, like
 Mercator, we could not achieve better performance
 with our product.

I thought Mercator numbers were pretty good, no?

 Now, as the performance is low, we want to redevelop
 our spider in a language like C or Perl and use
 it with our existing product.

You could look at Python; Ultraseek was/is written in it, from what I
remember.
Also, obviously Perl has been used for writing big crawlers, so you can
use that, too.

 I would be thankful if anyone could help me choose
 the better language, where I can get better performance.

Of course, the choice of a language is not a performance panacea.

Otis





[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread Tim Bray


At 09:47 AM 14/03/02 -0800, srinivas mohan wrote:
Now, as the performance is low, we want to redevelop
our spider in a language like C or Perl and use
it with our existing product.

I would be thankful if anyone could help me choose
the better language, where I can get better
performance.

You'll never get better performance until you understand why you
had lousy performance before.  It's not obvious to me why Java
should get in the way.

I've written two very large robots and used Perl both times.
There were two good reasons to choose Perl:

- A robot fetches pages, analyzes them, and manages a database
  of been-processed and to-process URLs.  The fetching involves no CPU.
  The database is probably the same in whatever language you use.
  Thus the leftover computation is picking apart pages looking
  for URLs and BASE values and so on... Perl is hard to beat
  for that type of code.
- Time-to-market was critical.  Using Perl means you have to write
  much less code than in Java or C or whatever, so you get done
  quicker.
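
As a rough illustration of the "picking apart pages" step, here is a
minimal sketch in Python rather than Perl, and using the standard
library's HTML parser rather than regular expressions.  The names
LinkExtractor and extract_links are hypothetical, and the sketch only
looks at A and BASE tags; a real robot would also handle frames, image
maps, redirects and so on.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect raw href values and the first BASE href, if any."""

    def __init__(self, page_url):
        super().__init__()
        self.base_url = page_url  # used unless the page declares a BASE
        self.base_seen = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "base" and href and not self.base_seen:
            # Only the first BASE element counts.
            self.base_url = urljoin(self.base_url, href)
            self.base_seen = True
        elif tag == "a" and href:
            self.hrefs.append(href)


def extract_links(page_url, html):
    """Return absolute http(s) links found in html, without duplicates."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    parser.close()
    links = []
    for href in parser.hrefs:
        # Resolve against the BASE value and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(parser.base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links
```

Links are resolved only after the whole page has been parsed, so a BASE
element applies to every link regardless of where it appears.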

It's not clear that you can write a robot to run faster than a
well-done Perl one.  It is clear you can write one that's much
more maintainable; Perl makes it too easy to write obfuscated code.
Another disadvantage of Perl is the large memory footprint: since
a robot needs to be highly parallel, you probably can't afford to
have a Perl process per execution thread.
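
One way to avoid a process per execution thread is to run many fetches
as threads inside a single process, which works because fetching is
mostly waiting on the network.  A minimal Python sketch under stated
assumptions: crawl_batch is a hypothetical name, and fetch is any
caller-supplied function that takes a URL and returns a result or raises.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_batch(urls, fetch, max_workers=20):
    """Run fetch(url) for every URL on a shared pool of threads.

    Returns a dict mapping each URL to fetch's return value, or to the
    exception it raised, so one bad page does not stop the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each pending future back to the URL it is fetching.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results
```

All threads share one interpreter and one copy of the code, so the memory
cost per concurrent fetch is far smaller than a full process each.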

Next time I might go with Python.  Its regexp engine isn't quite
as fast, but the maintainability is better.  -Tim





[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread Tim Bray


At 10:36 AM 14/03/02 -0800, Nick Arnett wrote:

  I wish
I could be more specific, but I never did figure out what was really going
on.  Following an LWP request through the debugger is a long and convoluted
journey...

I totally agree with Nick that when LWP works, it's OK, but when
it doesn't, debugging is beyond the scope of mere mortals.  And
it just doesn't do timeouts or input throttling.  I tried to
get it to do timeouts and it didn't; I went and found the appropriate
discussion group, and half the messages were from people having trouble
with timeouts... mind you, that was early 2000, so maybe things have
improved? -Tim





[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread srinivas mohan


Hello,

Thank you for the suggestions on selecting a language
for writing a spider.
I have decided to go with Python, but I would still like to
test my Java spider compiled to native code for the
Windows platform and check for any improvement.

Can you suggest any open source compilers
to compile my Java code to native code?

Thanks in advance,
Mohan

