Tobias N. Sasse wrote:
And most importantly, the bug: with an increasing number of pages I receive
zillions of
"java.net.BindException: Address already in use: connect"
Just to let you know, this bug could be fixed. The cause of the problem was
that I had been testing under Windows XP, and this crappy OS only uses a
range of ~4000 ephemeral ports for TCP/IP connections and needs up to 4
minutes to release them again. Thus new connections are refused once those
4000 ports are saturated, until the OS clears them.
Under a Linux environment I did not encounter these problems. The design
questions remain open and I am looking forward to your feedback!
Some interesting numbers:
With a dual-core 2x1.8 GHz CPU, 2 GB RAM, S-ATA drive
Client: Java 6
Server: local Apache HTTPD 2 (standard config)
I reach a maximum of 146 pages/second. Note: the crawler is doing
heavy string processing and strips out all HTML tags and similar markup. As
I am working on a search engine, I only want the plain text of a
website; the HTML, scripts, images and so on are not relevant
for my search algorithms, so I delete them. You could cut back the
string processing, which would push the pages/second rate even further,
but then there will be more I/O, because the files we write are larger.
With kind regards,
Tobi