org.apache.commons.io.FileUtils
I need a recursive file delete for cleaning up after a JUnit test. There is one in Commons IO (org.apache.commons.io): FileUtils.deleteDirectory(File directory). I wonder whether I should add org.apache.commons.io as a new jar in lib, arrange a libtest directory for jars used only by JUnit tests, or write yet another recursive delete and put it in src/test/org/apache/nutch/util. If org.apache.commons.io were included, then ndfs/DF.java could almost be replaced by FileSystemUtils.freeSpace(path), but DF also reports disk space capacity. Paul
Re: org.apache.commons.io.FileUtils
Paul Baclace wrote: I need a recursive file delete for cleaning up after a JUnit test. I just now spotted org.apache.nutch.fs.LocalFileSystem.delete(File f), which does what I want (recursive, local delete). So no need for Commons IO. Paul
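For reference, a recursive local delete of this kind is only a few lines of plain java.io code. A minimal sketch (a hypothetical standalone helper for illustration, not the actual LocalFileSystem implementation):

```java
import java.io.File;

public class RecursiveDelete {
    /** Deletes f and, if it is a directory, everything beneath it. */
    public static boolean delete(File f) {
        if (f.isDirectory()) {
            File[] children = f.listFiles();   // may be null on I/O error
            if (children != null) {
                for (File child : children) {
                    delete(child);             // depth-first: empty the directory first
                }
            }
        }
        return f.delete();                     // now f is a plain file or an empty dir
    }
}
```

Suitable for a JUnit tearDown(): delete the test's scratch directory and assert the call returned true.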
Re: nutch downloads
Joshua, We have received your message. I'm only remotely involved with Nutch, so I'm prodding other committers to Nutch to please update the links to take advantage of the mirroring system in place. Please - someone reply back volunteering to correct this ASAP. Erik On Oct 11, 2005, at 6:51 PM, Joshua Slive wrote: Downloads from nutch are now by far the biggest consumers of bandwidth on apache.org. This is because nutch is not following apache download policies and is not using our mirroring system at all. Please correct this right away. See http://www.apache.org/dev/mirrors.html for more details. Please confirm that you have received this message and are taking action. Thanks. Joshua (Apache mirror guy)
[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331877 ] Fuad Efendi commented on NUTCH-109: --- Ok, I'll do it tonight; I believe fetcher.server.delay means "wait for a response from the server, then throw a timeout exception". I can also execute 1000 threads, so we will have a fair comparison even with fetcher.server.delay=50 seconds (fair because, with so many threads, we will probably get 20 requests per second: 20 * 50 = 1000).

Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation --- Key: NUTCH-109 URL: http://issues.apache.org/jira/browse/NUTCH-109 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7, 0.8-dev, 0.6, 0.7.1 Environment: Nutch: Windows XP, J2SE 1.4.2_09 Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53 Reporter: Fuad Efendi Attachments: protocol-httpclient-innovation-0.1.0.zip

1. A TCP connection costs a lot, not only for Nutch and the end-point web servers, but also for intermediary network equipment.
2. The web server creates a client thread and hopes that Nutch really uses HTTP/1.1, or at least that Nutch sends Connection: close before closing the JVM socket with Socket.close() ...

I need to perform very objective tests, probably 2-3 days; the new plugin crawled/parsed 23,000 pages in 1,321 seconds; it seems the existing http plugin needs a few days... I am using a separate network segment with Windows XP (Nutch) and Suse Linux (Apache HTTPD + 120,000 pages). Please find attached a new plugin based on http://www.innovation.ch/java/HTTPClient/ Please note: class HttpFactory contains a cache of HTTPConnection objects; each object runs in its own thread; each object is absolutely thread-safe, so we can send multiple GET requests using a single instance:

private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);

I'll add more comments after finishing tests... -- This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Gal Nitzan wrote: Hi Andrzej, Yes, it seems like a good option. However, it is GPL, and I noticed in one of the posts that this license is no good for apache.org :). If you refer to the brics automaton library, it's BSD-licensed. What I mentioned in one of the posts is that the Innovation HTTPClient is LGPL, and hence not acceptable for apache.org. -- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web; Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331897 ] Fuad Efendi commented on NUTCH-109: --- Oops... need to learn more! [protocol-httpclient] Http.java is a singleton; it uses MultiThreadedHttpConnectionManager and a single instance of HttpClient for all hosts and all threads.
Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Andrzej Bialecki wrote: 100k regexps is still a lot, so I'm not totally sure it would be much faster, but perhaps worth checking. I have worked with this type of technology before (minimized, determinized FSAs constructed from large sets of string expressions) and it should be very fast to perform lookups, even in large, complex FSAs. Construction of the FSA can be time consuming and should probably be done offline, not at fetcher startup time, so that it is only performed once for a number of fetcher runs. Doug
RE: Q about exact match counts?
Anyone answer this question? I see in the Hits class that there's a boolean totalIsExact attribute, but this becomes false only when deduplication (per site) occurs during the search. And I see that underneath Nutch, Lucene will obtain the documents for only the top hits. But does Nutch/Lucene return exact match counts for every query? How does this scale to very large indexes? Thanks! DaveG -Original Message- From: Gaulin, Mark [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 05, 2005 2:04 PM To: nutch-dev@lucene.apache.org Subject: Q about exact match counts? Hi I was wondering if Nutch returns exact match counts or estimated match counts, or if this behavior is configurable. The reason I ask is that on a very large index I'd rather get the top 10 results right, and get them quickly, than spend time getting the total number of matching documents exactly right. (I assume that's why Google reports totals as "1 to 10 of about 1,000,000".) Is this something that Nutch/Lucene can do? Thanks Mark
Re: nutch downloads
Erik Hatcher wrote: Please - someone reply back volunteering to correct this ASAP. My bad. I'm fixing this right now. In 24 hours all Nutch downloads should be through the mirrors. Sorry! Doug
Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Doug Cutting wrote: Andrzej Bialecki wrote: 100k regexps is still a lot, so I'm not totally sure it would be much faster, but perhaps worth checking. I have worked with this type of technology before (minimized, determinized FSAs constructed from large sets of string expressions) and it should be very fast to perform lookups, even in large, complex FSAs. Construction of the FSA can be time consuming and should probably be done offline, not at fetcher startup time, so that it is only performed once for a number of fetcher runs. Guess what... this library supports (de)serialization of automata, so they can be compiled once, and then just stored/loaded. -- Best regards, Andrzej Bialecki http://www.sigram.com Contact: info at sigram dot com
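To make the compile-once/store/load idea concrete, here is a toy version of the same pattern: a hand-rolled trie over literal strings (not the actual automaton library's API) that is built once "offline", serialized, and then loaded by the fetcher for O(length) lookups.

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Toy deterministic automaton (a trie) over a fixed set of strings.
// Real FSA libraries also handle regexps and minimization; this only
// illustrates build-once, store, load, then fast lookups.
public class StringSetAutomaton implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Map<Character, StringSetAutomaton> next = new HashMap<>();
    private boolean accepting;

    /** Adds one string to the set (construction phase, done offline). */
    public void add(String s) {
        StringSetAutomaton node = this;
        for (char c : s.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new StringSetAutomaton());
        }
        node.accepting = true;
    }

    /** Lookup: one map hop per character, independent of set size. */
    public boolean accepts(String s) {
        StringSetAutomaton node = this;
        for (char c : s.toCharArray()) {
            node = node.next.get(c);
            if (node == null) return false;
        }
        return node.accepting;
    }

    /** Serialize the compiled automaton so fetcher runs can reuse it. */
    public void store(OutputStream out) throws IOException {
        try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
            oos.writeObject(this);
        }
    }

    public static StringSetAutomaton load(InputStream in)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(in)) {
            return (StringSetAutomaton) ois.readObject();
        }
    }
}
```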
[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331913 ] Fuad Efendi commented on NUTCH-109: --- I was totally wrong and unfair: Have you seen Kelvin Tan's patch? You should take a look, it's in JIRA, and addresses some of the HTTP/1.1 issues that you are concerned about. http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg01037.html I need to perform tests with Kelvin Tan's patch too.
suspicious outlink count
202443 Pages consumed: 13 (at index 13). Links fetched: 233386.
202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.

If there is a maxoutlinks limit already specified in the XML config, why does Nutch bother counting anything over it?
keep count of selected url
Hi all, I was interested in keeping count of the number of times each URL is selected by a user. The problem is not how to know which page was clicked, but how to store each page/number-of-clicks tuple. What is the best way to store this information, and to let Nutch use it? Can you please give me some advice on how to implement this? Which parts of Nutch, and which classes, do I have to modify? What is the best way to do this? Thank you very much!! Menoz -- Free Software Enthusiast Debian Powered Linux User #332564 http://menoz.homelinux.org
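One generic starting point, sketched with nothing but the JDK (this is not an existing Nutch class; the names are made up for illustration): keep a thread-safe URL-to-count map in the search front end, increment it from the click handler, and flush it to a simple tab-separated file that a ranking step can read back later.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical click counter: thread-safe in-memory counts with a
// plain-text dump that another process can load and use for ranking.
public class ClickCounter {
    private final ConcurrentHashMap<String, AtomicLong> counts =
            new ConcurrentHashMap<>();

    /** Records one click on url; returns the new count. */
    public long click(String url) {
        return counts.computeIfAbsent(url, u -> new AtomicLong())
                     .incrementAndGet();
    }

    /** Current count for url, 0 if never clicked. */
    public long get(String url) {
        AtomicLong c = counts.get(url);
        return c == null ? 0 : c.get();
    }

    /** Dumps "url TAB count" lines, one per URL. */
    public void save(Writer w) throws IOException {
        for (Map.Entry<String, AtomicLong> e : counts.entrySet()) {
            w.write(e.getKey() + "\t" + e.getValue().get() + "\n");
        }
        w.flush();
    }
}
```

In a real deployment the click handler would be a redirect servlet in front of the search results, and the saved counts could be folded into a scoring field at index time.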
clustering strategies
I think it would be nice to have a few cluster strategies on the wiki. It seems there are at least three separate needs: CPU, storage and bandwidth, and I think the more those could be cleanly spread to different boxes, the better. I am imagining a breakdown that lists, by priority, how things should be broken out, so someone could look at the list and say: OK, I have three good boxes, I should make the best box do x, the second best do y, etc. There could also be case studies of how different folks did their own implementations and what their crawl/query times were like.

I have a small cluster (up to 15 boxes) and would like to start to play around and see how things go under different strategies. I also have about a million pages of local content, so I can hammer things pretty hard without even leaving my network. I know that may not match normal conditions, but it could hopefully remove a variable or two (network latency, slow sites) to keep things simple, at least to start. I think it is also a decent goal to be able to crawl/index my pages in a night (say eight hours), which would be around 35 pages/second. If that isn't a reasonable goal, I would like to hear why not.

For each strategy, we could have a set of confs describing how to set things up. I can picture a GUI that lists box roles (crawler, mapper, whatever) and the boxes available. Users could drag and drop their boxes onto roles, and confs could then be generated. I think it could make for rather easy design/implementation of clusters that could otherwise get rather complicated. I can do drag/drop and interpolate into templates in JavaScript, so I could envision a rather simple page. Maybe we could even store the cluster setup in XML, and have a script that takes the XML and draws the cluster. Then when people report slowness or the like, they could also post their cluster setup.

I think when users come to Nutch, they come with a set of boxes. I think it would be nice for them to see what has worked for such a set of boxes in the past and be able to easily implement such a strategy. Kind of the "one hour from download to spidering" vision. Just a few thoughts. Earl
[jira] Created: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
OpenSearchServlet outputs illegal xml characters Key: NUTCH-110 URL: http://issues.apache.org/jira/browse/NUTCH-110 Project: Nutch Type: Bug Components: searcher Versions: 0.7 Environment: linux, jdk 1.5 Reporter: [EMAIL PROTECTED] OpenSearchServlet does not check the text it outputs for illegal XML characters; depending on the search result, it's possible for OSS to output XML that is not well-formed. For example, if the text has the form-feed (FF) character in it -- i.e. the ASCII character at position 12 (decimal) -- the produced XML will show the FF character as '&#12;'. The character reference '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Attachment: fixIllegalXmlChars.patch Attached patch runs all XML text through a check for bad XML characters. This patch is brutal, silently dropping illegal characters. The patch was made after hunting through xalan, the JDK, and Nutch itself for a method that would do the above filtering; I was unable to find any such method -- perhaps an oversight on my part?
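For reference, a filter of the kind the patch describes can be sketched against the XML 1.0 Char production (Tab, LF, CR, and most of the Basic Multilingual Plane). This is an illustrative standalone class, not the attached patch; note it also drops surrogate code units, so supplementary-plane characters (legal in XML) are lost too.

```java
// Sketch of a "drop illegal XML characters" filter per the XML 1.0
// Char production: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD].
public class XmlCharFilter {
    /** True if c may appear in a well-formed XML 1.0 document.
        Simplification: surrogates (0xD800-0xDFFF) are rejected, so
        supplementary-plane characters encoded as pairs get dropped. */
    public static boolean isLegal(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    /** Returns s with every illegal character silently dropped. */
    public static String strip(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isLegal(c)) sb.append(c);
        }
        return sb.toString();
    }
}
```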
RE: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Hi, I'm not an XML expert by any means, but wouldn't it be simpler to just wrap any text where illegal chars are possible in a <![CDATA[ ]]> section? That way, the offending characters won't be dropped and the process won't be lossy, no? If the CDATA method won't work, and there's no other way to solve the problem without losing text, then your patch has my +1. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member, Modeling and Data Management Systems Section (387), Data Management Systems and Technologies Group, Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. -Original Message- From: [EMAIL PROTECTED] (JIRA) [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 12, 2005 5:19 PM To: nutch-dev@incubator.apache.org Subject: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
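One caveat worth checking here: the XML 1.0 Char production applies inside CDATA sections too, so wrapping the text in CDATA should not make a form feed legal. A quick check with the JDK's built-in parser (assuming it enforces the Char production, as Xerces does):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class CdataCheck {
    /** Returns true if the given XML string parses as well-formed. */
    public static boolean wellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Form feed (U+000C) is outside the XML 1.0 Char production,
        // so it is rejected even inside a CDATA section.
        System.out.println(wellFormed("<a><![CDATA[x\u000Cy]]></a>"));
        System.out.println(wellFormed("<a><![CDATA[x y]]></a>"));
    }
}
```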
[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331950 ] Fuad Efendi commented on NUTCH-109: --- Please see the attachment for more details. In order to be fair (protocol-http uses a single shared Socket per host), I tried to modify this line in the new plugin, HttpFactory.java:

private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 1);

It was 3 before. However, with http.clients.per.host=1 the new plugin stops in a deadlock. I tried a few times; it always stops after 3-4 minutes. So the results below are with http.clients.per.host=3 for the new plugin (as it was before), but the new plugin didn't pass the test, just a baseline.

New Test Results:
===
1. PROTOCOL-HTTP: 910,549,682 bytes (size on disk, WebDB+Segments), 1,201,908 milliseconds
2. PROTOCOL-HTTPCLIENT: 935,856,675 bytes, 1,261,064 milliseconds
999. PROTOCOL-HTTPCLIENT-INNOVATION: 936,152,532 bytes, 1,305,377 milliseconds

nutch-site.xml:
<property><name>fetcher.server.delay</name><value>0</value></property>
<property><name>fetcher.threads.per.host</name><value>20</value></property>
<property><name>http.timeout</name><value>3</value></property>
<property><name>http.content.limit</name><value>-1</value></property>

Client: IBM ThinkPad T42p, 2GHz, 2Gb, Windows XP, J2SE 1.4.2_09
Server: Suse Linux 9.3, Apache HTTPD 2.0.53-9.5, Worker
Command: bin\nutch7 crawl url3.txt -dir crawl005 -threads 20 -depth 6 (modified crawl without indexing)
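As a rough cross-check of the reported numbers, the byte totals and wall-clock times translate into comparable throughput figures (plain arithmetic only; note the byte totals are on-disk sizes of WebDB+Segments, not pure network traffic, so these are approximations):

```java
public class Throughput {
    /** KB per second, given total bytes and elapsed milliseconds. */
    static double kbPerSec(long bytes, long millis) {
        return bytes / 1024.0 / (millis / 1000.0);
    }

    public static void main(String[] args) {
        // Figures taken from the test results in the comment above.
        System.out.printf("protocol-http:                  %.0f KB/s%n",
                kbPerSec(910_549_682L, 1_201_908L));
        System.out.printf("protocol-httpclient:            %.0f KB/s%n",
                kbPerSec(935_856_675L, 1_261_064L));
        System.out.printf("protocol-httpclient-innovation: %.0f KB/s%n",
                kbPerSec(936_152_532L, 1_305_377L));
    }
}
```

All three land within about 5% of each other, which matches the comment's conclusion that the run was a baseline rather than a clear win.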
[jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ] Fuad Efendi updated NUTCH-109: -- Attachment: test_results.txt