[Robots] Re: better language for writing a Spider ?
On Thu, 14 Mar 2002, Erick Thompson wrote:

> I would suggest using C# if you are on the Windows platform. It's quite
> fast, and MS provides a program to convert from Java to C# [1], so you may
> save a bunch of redevelopment time. Even if you're not on Windows, there
> will soon be options to still use .NET, such as Rotor [2] and Mono [3].
>
> On a related note, I am working on a spider in C#, and if a lot of other
> people are working on new spiders as well, perhaps we should look at
> starting a .NET open source project, based on something like the BSD/X11
> license (to allow commercial inclusion).
>
> Erick
>
> [1] http://msdn.microsoft.com/vstudio/downloads/jca/default.asp
> [2] http://www.oreillynet.com/pub/a/dotnet/2002/03/04/rotor.html
> [3] http://www.go-mono.com/

Hi,

Why not pre-compile the Java code to native code? No redevelopment time at all.

Regards,
Achim Dreyer

--
A. Dreyer, UNIX System Administrator and Internet Security Consultant

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send help in the body of a message to [EMAIL PROTECTED].
[Robots] Re: better language for writing a Spider ?
Sean M. Burke wrote:

> In short, if people want to see improvements to LWP, email me and say
> what you want done.

For robots, you need a call that says "fetch this URL, but get a maximum of XX bytes and spend a maximum of YY seconds doing it." The return status should tell you whether it finished or timed out, and how many bytes were actually retrieved.

BTW, have the LWP timeouts been fixed? As recently as early 2000, they were known to generally not work.

-Tim
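A minimal sketch, in Python rather than Perl, of the kind of call Tim asks for: fetch a URL, stop after a byte cap or a time budget, and report what happened. The names `FetchResult`, `read_bounded`, and `fetch_bounded`, and the status strings, are hypothetical and invented for illustration; this is not LWP's API. One known limitation is stated in the comments: the deadline is only checked between reads, so the socket timeout is what bounds any single blocking read.

```python
import io
import socket
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class FetchResult:
    status: str      # "complete", "byte_limit", or "timed_out"
    bytes_read: int  # how many bytes were actually retrieved
    body: bytes


def read_bounded(stream, max_bytes, max_seconds, chunk_size=8192,
                 clock=time.monotonic):
    """Read from a file-like object until EOF, the byte cap, or the deadline.

    The deadline is checked between reads only. A single blocking read is
    bounded by the stream's own socket timeout, if it has one.
    """
    deadline = clock() + max_seconds
    chunks = []
    total = 0
    while total < max_bytes:
        if clock() >= deadline:
            return FetchResult("timed_out", total, b"".join(chunks))
        try:
            chunk = stream.read(min(chunk_size, max_bytes - total))
        except socket.timeout:
            return FetchResult("timed_out", total, b"".join(chunks))
        if not chunk:  # EOF before the cap: the whole body was read
            return FetchResult("complete", total, b"".join(chunks))
        chunks.append(chunk)
        total += len(chunk)
    # The cap was reached; the body may or may not have more data.
    return FetchResult("byte_limit", total, b"".join(chunks))


def fetch_bounded(url, max_bytes, max_seconds):
    """Fetch a URL with a byte cap and an approximate time budget."""
    start = time.monotonic()
    try:
        # timeout= applies to each blocking socket operation, not the total.
        with urllib.request.urlopen(url, timeout=max_seconds) as response:
            remaining = max(max_seconds - (time.monotonic() - start), 0)
            return read_bounded(response, max_bytes, remaining)
    except socket.timeout:
        return FetchResult("timed_out", 0, b"")
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, socket.timeout):
            return FetchResult("timed_out", 0, b"")
        raise


if __name__ == "__main__":
    # Offline demonstration with an in-memory stream instead of a socket.
    demo = read_bounded(io.BytesIO(b"x" * 100), max_bytes=10, max_seconds=5)
    print(demo.status, demo.bytes_read)
```

A robot would call something like `fetch_bounded(url, max_bytes=200_000, max_seconds=30)` and decide from `status` whether to keep, retry, or skip the page; the limits shown are arbitrary.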
[Robots] Re: better language for writing a Spider ?
Having worked in Perl and Python, I'll recommend Python. Although I haven't been using it for long, I'm definitely more productive with it. Performance seems fine, though I haven't really pushed hard on it. I'm not seeing long, mysterious time-outs as I occasionally did with LWP.

And I hit some weird bug in LWP a few weeks ago, which resulted in a strange error message that I eventually discovered was coming out of the expat DLL for XML. Instead of retrieving the page I wanted, it was misinterpreting a server error. I wish I could be more specific, but I never did figure out what was really going on. Following an LWP request through the debugger is a long and convoluted journey...

Nick

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of srinivas mohan
Sent: Thursday, March 14, 2002 9:48 AM
To: [EMAIL PROTECTED]
Subject: [Robots] better language for writing a Spider ?

Hello,

I am working on robot development in Java. We are developing a search engine; almost the complete engine is developed. We used Java for the development, but the performance of the Java API in fetching web pages is too low. Basically, we developed our own URL connection, because java.net.URLConnection does not support some features we need, such as timeouts.

Though there are better spiders in Java, like Mercator, we could not achieve better performance with our product.

Now, as the performance is low, we want to redevelop our spider in a language like C or Perl and use it with our existing product.

I will be thankful if anyone can help me choose the better language, where I can get better performance.

Thanks in advance,
Mohan
[Robots] Re: better language for writing a Spider ?
> I am working on robot development in Java. We are developing a search
> engine; almost the complete engine is developed. We used Java for the
> development, but the performance of the Java API in fetching web pages
> is too low. Basically, we developed our own URL connection, because
> java.net.URLConnection does not support some features we need, such as
> timeouts.

Look at Java 1.4; it addresses these issues (socket timeouts, non-blocking I/O, etc.).

> Though there are better spiders in Java, like Mercator, we could not
> achieve better performance with our product.

I thought Mercator's numbers were pretty good, no?

> Now, as the performance is low, we want to redevelop our spider in a
> language like C or Perl and use it with our existing product.

You could look at Python; Ultraseek was/is written in it, from what I remember. Also, obviously Perl has been used for writing big crawlers, so you can use that, too.

> I will be thankful if anyone can help me choose the better language,
> where I can get better performance.

Of course, the choice of a language is not a performance panacea.

Otis
[Robots] Re: better language for writing a Spider ?
At 09:47 AM 14/03/02 -0800, srinivas mohan wrote:

> Now as the performance is low..we wanted to redevelop our spider..in a
> language like c or perl...and use it with our existing product.. I will
> be thankful if any one can help me choosing the better language..where
> i can get better performance..

You'll never get better performance until you understand why you had lousy performance before. It's not obvious to me why Java should get in the way.

I've written two very large robots and used Perl both times. There were two good reasons to choose Perl:

- A robot fetches pages, analyzes them, and manages a database of been-processed and to-process. The fetching involves no CPU. The database is probably the same in whatever language you use. Thus the leftover computation is picking apart pages looking for URLs and BASE values and so on... Perl is hard to beat for that type of code.
- Time-to-market was critical. Using Perl means you have to write much less code than in Java or C or whatever, so you get done quicker.

It's not clear that you can write a robot to run faster than a well-done Perl one. It is clear you can write one that's much more maintainable; Perl makes it too easy to write obfuscated code.

Another disadvantage of Perl is the large memory footprint: since a robot needs to be highly parallel, you probably can't afford to have a Perl process per execution thread.

Next time I might go with Python. Its regexp engine isn't quite as fast, but the maintainability is better.

-Tim
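The two pieces of work Tim lists, picking URLs and BASE values out of a page and tracking what has been processed versus what is still to process, can be sketched in Python roughly as follows. This is an illustration under assumptions, not anyone's actual robot: it swaps the regex approach Tim mentions for the standard library's `html.parser`, keeps the "database" as an in-memory set and queue, and the names `LinkExtractor`, `extract_links`, and `Frontier` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags and honor the first <base href>."""

    def __init__(self, page_url):
        super().__init__()
        self.base_url = page_url  # used unless the page declares a <base>
        self.base_seen = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if not href:
            return
        if tag == "base" and not self.base_seen:
            self.base_url = urljoin(self.base_url, href)
            self.base_seen = True
        elif tag == "a":
            self.hrefs.append(href)


def extract_links(page_url, html):
    """Return absolute http(s) links found in html, fragments removed."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    parser.close()
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(parser.base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


class Frontier:
    """In-memory stand-in for the been-processed / to-process database."""

    def __init__(self, seeds=()):
        self.seen = set()      # every URL ever queued
        self.queue = deque()   # URLs still to fetch, in FIFO order
        for url in seeds:
            self.add(url)

    def add(self, url):
        """Queue url unless it was seen before; return True if queued."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

A crawl loop would pull a URL with `next_url()`, fetch it, and pass each result of `extract_links()` to `add()`. A real robot would replace the set and queue with persistent storage and add politeness rules such as robots.txt handling, which this sketch leaves out.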
[Robots] Re: better language for writing a Spider ?
At 10:36 AM 14/03/02 -0800, Nick Arnett wrote:

> I wish I could be more specific, but I never did figure out what was
> really going on. Following an LWP request through the debugger is a long
> and convoluted journey...

I totally agree with Nick that when LWP works, it's OK, but when it doesn't, debugging is beyond the scope of mere mortals. And it just doesn't do timeouts or input throttling. I tried to get it to do timeouts and it didn't; I went and found the appropriate discussion group, and half the messages were about trouble with timeouts... Mind you, that was early 2000; maybe things have improved?

-Tim
[Robots] Re: better language for writing a Spider ?
Hello,

Thank you for the suggestions on selecting a language for writing a spider. I have decided to go with Python, but I would still like to test my Java spider compiled to native code for the Windows platform and check for any improvement. Can you suggest any open source compilers for compiling my Java code to native code?

Thanks in advance,
Mohan

--- Tim Bray [EMAIL PROTECTED] wrote:

> At 10:36 AM 14/03/02 -0800, Nick Arnett wrote:
>
> > I wish I could be more specific, but I never did figure out what was
> > really going on. Following an LWP request through the debugger is a
> > long and convoluted journey...
>
> I totally agree with Nick that when LWP works, it's OK, but when it
> doesn't, debugging is beyond the scope of mere mortals. And it just
> doesn't do timeouts or input throttling. I tried to get it to do
> timeouts and it didn't; I went and found the appropriate discussion
> group, and half the messages were about trouble with timeouts... Mind
> you, that was early 2000; maybe things have improved?
>
> -Tim