Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-10 Thread ogjunk-nutch
Hi Gal,

I'm curious about the memory consumption of the cache and the speed of
retrieval of an item from the cache, when the cache has 100k domains in
it.

Thanks,
Otis


--- Gal Nitzan [EMAIL PROTECTED] wrote:

 Hi Michael,
 
 At the moment I have about 3,000 domains in my DB. I didn't time the
 performance, but even 100k domains shouldn't have much impact, since
 each domain is fetched only once from the database into the cache.
 There may be a small performance hit above 100k domains (depending on
 the number of cache elements defined in the XML file).
 
 After a few teething problems, the plugin works nicely and I don't
 notice any impact.
 
 Regards,
 
 Gal
 
 
 Michael Ji wrote:
  hi,
 
  How is performance affected if the size of the domain list
  reaches 10,000?
 
  Michael Ji
 
  --- Gal Nitzan (JIRA) [EMAIL PROTECTED] wrote:
 

   [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
 
   Gal Nitzan updated NUTCH-100:
   -----------------------------
 
   type: Improvement  (was: New Feature)
  Description: 
  Hi,
 
  I have written a new plugin, based on the URLFilter
  interface: urlfilter-db.
 
  The purpose of this plugin is to filter domains,
  i.e. I would like to crawl the world but fetch
  only certain domains.
 
  The plugin uses a caching system (SwarmCache, easier
  to deploy than JCS) and, on the back end, a database.
 
  for each url:
      filter is called
  end for
 
  filter:
      get the domain name from the url
      call cache.get(domain)
      if not in cache, try the database
      if in the database, cache it and return the url
      return null
  end filter
 
  The plugin reads the cache size, JDBC driver,
  connection string, table to use, and domain field
  from nutch-site.xml.
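
  As a concrete illustration of the pseudocode above, here is a minimal
  sketch of such a filter. Nutch's URLFilter interface does expose a
  single filter(String) method, but everything else below -- the class
  name, the table/column names, and the plain LinkedHashMap standing in
  for SwarmCache -- is an assumption for illustration, not the attached
  plugin's actual code:

    import java.net.URL;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class DbUrlFilterSketch {

        // LRU map standing in for SwarmCache; in the real plugin the
        // size comes from nutch-site.xml.
        private final Map<String, Boolean> cache =
            new LinkedHashMap<String, Boolean>(1024, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                    return size() > 100000;
                }
            };

        private final Connection conn;

        public DbUrlFilterSketch(Connection conn) {
            this.conn = conn;
        }

        // Mirrors URLFilter.filter(String): return the url to accept it,
        // null to reject it. (Unlike the pseudocode, this also caches
        // negative answers to avoid repeated DB lookups.)
        public String filter(String urlString) {
            try {
                String domain = new URL(urlString).getHost();
                Boolean allowed = cache.get(domain);
                if (allowed == null) {           // not in cache: try the DB
                    allowed = inDatabase(domain);
                    cache.put(domain, allowed);
                }
                return allowed ? urlString : null;
            } catch (Exception e) {
                return null;
            }
        }

        private boolean inDatabase(String domain) throws SQLException {
            // Table and column names are illustrative only.
            PreparedStatement ps =
                conn.prepareStatement("SELECT 1 FROM domains WHERE domain = ?");
            try {
                ps.setString(1, domain);
                ResultSet rs = ps.executeQuery();
                return rs.next();
            } finally {
                ps.close();
            }
        }
    }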
 
 
 
  Environment: All Nutch versions  (was: MapRed)
 
  Fixed some issues
  clean up
  Added a patch for Subversion
 
  
  New plugin urlfilter-db
  -----------------------
 
          Key: NUTCH-100
          URL: http://issues.apache.org/jira/browse/NUTCH-100
      Project: Nutch
         Type: Improvement
   Components: fetcher
     Versions: 0.8-dev
  Environment: All Nutch versions
     Reporter: Gal Nitzan
     Priority: Trivial
  Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz
  
 



Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-10 Thread Andrzej Bialecki

[EMAIL PROTECTED] wrote:

Hi Gal,

I'm curious about the memory consumption of the cache and the speed of
retrieval of an item from the cache, when the cache has 100k domains in
it.


Slightly off-topic, but I hope this is relevant to the original reason 
for creating this plugin...


There is a BSD-licensed library, based on finite automata, that 
implements a large subset of regexps. It is reported to be scalable 
and very fast (the published benchmarks are certainly impressive):


http://www.brics.dk/~amoeller/automaton/

I suggest doing some tests with 100k regexps to see if it survives.
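
As a starting point, a test harness might look like the sketch below. The
dk.brics.automaton classes (RegExp, Automaton, RunAutomaton) are from the
library linked above; the synthetic domain patterns and the incremental
union strategy are placeholder assumptions -- the cost of building the
union is itself part of what such a test should measure:

  import dk.brics.automaton.Automaton;
  import dk.brics.automaton.RegExp;
  import dk.brics.automaton.RunAutomaton;

  public class AutomatonStressTest {
      public static void main(String[] args) {
          // Union 100k synthetic domain patterns into one automaton.
          Automaton union = Automaton.makeEmpty();
          for (int i = 0; i < 100000; i++) {
              RegExp re = new RegExp("(.*\\.)?site" + i + "\\.com");
              union = union.union(re.toAutomaton());
          }
          union.determinize();
          union.minimize();

          // RunAutomaton compiles the DFA into a table for fast matching.
          RunAutomaton matcher = new RunAutomaton(union);

          long start = System.nanoTime();
          boolean hit = matcher.run("www.site99999.com");
          System.out.println(hit + " in " + (System.nanoTime() - start) + " ns");
      }
  }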


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



reprocessing hanging tasks

2005-10-10 Thread Stefan Groschupf

Hi,
I tried to understand the jobtracker code.
Hmm, more than 1000 lines of code in just one class. :-( This makes  
the code very difficult to understand.


Anyway, I'm missing a mechanism to reprocess hanging tasks. Maybe I  
just didn't find the code, but I did invest some time looking for it.
As the Google paper describes, the original MapReduce re-executes  
tasks that may still be running but are much slower than the other  
tasks because of hardware problems.
Since I've noticed that the tasktracker isn't that stable yet, I would  
really love to have such a reprocessing mechanism.
Actually, I've seen that tasks are reprocessed when the tasktracker  
crashes and stops returning reports, or when the tasktracker reports  
a task failure.
But if, for example, the network speed of a fetching map task is very  
slow, the job itself takes forever.


I would suggest adding start and finish times to the task object and  
setting these values when the status changes.
We could then calculate the average processing time of a task based  
on these values.
Next, we'd have a configurable minimum fraction of finished tasks  
before we start reprocessing; for example, 80% of tasks need to be done.
Furthermore, we'd have a configurable threshold: if the processing  
time of a task exceeds threshold * average processing time, we simply  
re-execute the task on another tasktracker, along the lines of the  
sketch below.
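
A minimal sketch of this heuristic (the Task class and all names below
are invented for illustration, not the actual jobtracker types):

  import java.util.List;

  class SpeculativeSketch {
      static final double MIN_FINISHED_FRACTION = 0.80; // e.g. 80% done first
      static final double SLOWNESS_THRESHOLD = 3.0;     // threshold * avg time

      // Illustrative task record, not Nutch's real Task class.
      static class Task {
          long startTime, finishTime;
          boolean done;
      }

      static boolean shouldReexecute(Task t, List<Task> all, long now) {
          long finished = all.stream().filter(x -> x.done).count();
          if (finished < MIN_FINISHED_FRACTION * all.size()) {
              return false;                     // too early to speculate
          }
          double avg = all.stream().filter(x -> x.done)
              .mapToLong(x -> x.finishTime - x.startTime)
              .average().orElse(Double.MAX_VALUE);
          return !t.done && (now - t.startTime) > SLOWNESS_THRESHOLD * avg;
      }
  }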


What do people think?
Did I miss the section in the jobtracker where this is done, or would  
people be interested in a patch implementing this mechanism?


Stefan 


Re: reprocessing hanging tasks

2005-10-10 Thread Stefan Groschupf

Doug,
I have definitely run into problems several times where tasktrackers  
were sending heartbeat messages but had stopped processing the job.
For example, no new pages were fetched and the pages/sec statistic  
got slower and slower.
Personally, I think it makes more sense for the jobtracker to decide  
whether a task is over the average processing time and needs to be  
re-executed or not.
The last section of the Google paper covers this issue; they report  
performance improvements from re-executing tasks that run over a  
specific time.


Maybe we misunderstand each other: I don't mean tasks that crash, I  
mean tasks that are 20 times slower on one machine than the other  
tasks on the other machines.


Stefan


On 10.10.2005, at 20:16, Doug Cutting wrote:


Stefan Groschupf wrote:

Did I miss the section in the jobtracker where this is done, or  
would people be interested in a patch implementing this mechanism?




This is mostly already implemented.  The tasktracker fails tasks  
that do not update their status within a configurable timeout.  
Task status is updated each time a task reads an input, writes an  
output, or calls the Reporter.setStatus() method.  The jobtracker  
will retry failed tasks up to four times.


The mapred-based fetcher also should not hang.  It will exit even  
when it has hung threads.  So the task timeout should be set to the  
maximum amount of time that any single page should require to fetch  
& parse.  By default it is set to 10 minutes.


Doug
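
For reference, an override of the timeout Doug mentions would go in
nutch-site.xml, roughly as below; the property name is an assumption
based on the mapred code of this era, not verified against any
particular Nutch version:

  <!-- Per-task status timeout in milliseconds (property name assumed). -->
  <property>
    <name>mapred.task.timeout</name>
    <value>600000</value> <!-- the 10-minute default Doug mentions -->
  </property>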




---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net




Re: reprocessing hanging tasks

2005-10-10 Thread Doug Cutting

Stefan Groschupf wrote:
 Maybe we misunderstand each other: I don't mean tasks that crash, I
 mean tasks that are 20 times slower on one machine than the other
 tasks on the other machines.


Ah, I call that speculative re-execution.  Nutch does not yet 
implement that.


I don't think speculative re-execution of tasks would help much with 
fetching, since a fetch task that is slow on one machine will probably 
be slow on another.  What would probably make the fetcher faster is to 
use Thread.kill() on fetcher threads which have exceeded a timeout, and 
then replace them with a new Fetcher thread.
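
Standard Java has no Thread.kill(); the closest options are the
cooperative Thread.interrupt() or the deprecated Thread.stop(). A
minimal sketch of the replace-a-hung-thread pattern using interrupt(),
with a made-up fetch task:

  public class HungThreadReplaceSketch {
      public static void main(String[] args) throws InterruptedException {
          long timeoutMs = 600000;            // per-page fetch budget
          Runnable fetchTask = () -> {
              // fetch & parse one page (placeholder)
          };
          Thread worker = new Thread(fetchTask, "fetcher-0");
          worker.start();
          worker.join(timeoutMs);             // wait up to the timeout
          if (worker.isAlive()) {
              worker.interrupt();             // best effort; threads blocked
                                              // in socket I/O may ignore it
              Thread replacement = new Thread(fetchTask, "fetcher-0b");
              replacement.start();
          }
      }
  }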


Speculative re-execution is among the list of features we'd like to add, 
but it is not a high priority for me.


Doug


fetch speed issue

2005-10-10 Thread AJ Chen
Another observation: when the same-size fetch list and the same number of
threads were used, the fetcher started at different speeds in different
runs, ranging from 200kb/s to 1200kb/s. I'm using DSL at home, so this
variation in download speed could be due to variation in the DSL
connection. On a stable connection like T1 or fiber, I would expect the
fetcher to start at the same speed every time. Could someone using a T1
line or fiber connection verify that the fetcher always starts at a
similar speed? Given a large enough number of threads, does your fetcher
always reliably achieve the maximum speed, i.e. use the full bandwidth of
the connection?

Thanks,
AJ


Re: Re[2]: what contributes to fetch slowing down

2005-10-10 Thread Daniele Menozzi
On 03:36:45 03/Oct, Michael wrote:
 3mbit, 100 threads = 15 pages/sec
 cpu is low during fetch, so it's a bandwidth limit.

Yes, CPU is low, and even memory is quite free. But with 10MB in/out
I cannot obtain good results (and I do not parse the results, I simply
fetch them).
If I use 100 threads, I can download pages at 500KB/s for about 5 seconds,
but after that the download rate falls to 0. If I set 20 threads, I can
download at 200KB/s for 4-5 minutes, and the rate initially seems very
stable. But after these few minutes the rate starts to get lower and
lower, and tends toward zero pages/s.

Whatever thread count I choose, the rate _always_ decreases, until it
reaches 1/2 pages/s.
I've tried 2 different machines, but the problem is always the same.

Can you please give me some advice?
Thank you
Daniele



-- 
  Free Software Enthusiast
 Debian Powered Linux User #332564 
 http://menoz.homelinux.org


Re: what contributes to fetch slowing down

2005-10-10 Thread Daniele Menozzi
On 09:59:45 03/Oct, Doug Cutting wrote:
 I suspect threads are hanging, probably in the parser,

I tried not parsing, but without good results.
If I use 100 threads, I can download pages at 500KB/s for about 5 seconds,
but after that the download rate falls to 0. If I set 20 threads, I can
download at 200KB/s for 4-5 minutes, and the rate initially seems very
stable. But after these few minutes the rate starts to get lower and
lower, and tends toward zero pages/s.


 but sometimes TCP 
 connections get stuck too.  

If this is the problem, how can I fix it? And how can I recognize it?

Thanks!
-z

-- 
  Free Software Enthusiast
 Debian Powered Linux User #332564 
 http://menoz.homelinux.org


[jira] Created: (NUTCH-109) Nutch - Fetcher - HTTP - Performance Testing & Tuning

2005-10-10 Thread Fuad Efendi (JIRA)
Nutch - Fetcher - HTTP - Performance Testing & Tuning
-----------------------------------------------------

 Key: NUTCH-109
 URL: http://issues.apache.org/jira/browse/NUTCH-109
 Project: Nutch
Type: Improvement
  Components: fetcher  
Versions: 0.7, 0.6, 0.7.1, 0.8-dev
 Environment: Nutch: Windows XP, J2SE 1.4.2_09
Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
Reporter: Fuad Efendi


1. A TCP connection costs a lot, not only for Nutch and the end-point web 
servers, but also for intermediary network equipment.

2. The web server creates a client thread and hopes that Nutch really uses 
HTTP/1.1, or at least that Nutch sends "Connection: close" before closing 
the socket via Socket.close() in the JVM...

I need to perform very objective tests, probably 2-3 days; the new plugin 
crawled/parsed 23,000 pages in 1,321 seconds; it seems the existing 
http plugin would need a few days...

I am using a separate network segment with Windows XP (Nutch) and Suse Linux 
(Apache HTTPD + 120,000 pages).

Please find attached new plugin based on 
http://www.innovation.ch/java/HTTPClient/

Please note: 

Class HttpFactory contains a cache of HTTPConnection objects shared across 
threads; each object is absolutely thread-safe, so we can send multiple GET 
requests using a single instance:

   private static int CLIENTS_PER_HOST = 
       NutchConf.get().getInt("http.clients.per.host", 3);
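
To illustrate the keep-alive behavior this relies on, here is a minimal
sketch of reusing one HTTPConnection for several requests. HTTPConnection,
its Get() method, and HTTPResponse are from the innovation.ch library
linked above; the host and paths are made up:

   import HTTPClient.HTTPConnection;
   import HTTPClient.HTTPResponse;

   public class KeepAliveSketch {
       public static void main(String[] args) throws Exception {
           HTTPConnection con = new HTTPConnection("localhost");
           for (String path : new String[] { "/page1.html", "/page2.html" }) {
               HTTPResponse rsp = con.Get(path);  // reuses the TCP connection
               System.out.println(path + " -> " + rsp.getStatusCode()
                   + ", " + rsp.getData().length + " bytes");
           }
       }
   }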

I'll add more comments after finishing tests...






[jira] Updated: (NUTCH-109) Nutch - Fetcher - HTTP - Performance Testing & Tuning

2005-10-10 Thread Fuad Efendi (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]

Fuad Efendi updated NUTCH-109:
--

Attachment: protocol-httpclient-innovation-0.1.0.zip

New plugin; you may experiment with commenting out this code in HttpFactory:

    static {
        CookieModule.setCookiePolicyHandler(null);
    }






Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

2005-10-10 Thread Robert Benea
On 10/6/05, Dawid Weiss [EMAIL PROTECTED] wrote:


  That would be great. I already looked at the code base in the plug-in
  directory, and it seems you use this call to get the clustering results:

  controller.query("lingo-nmf-km-3", "pseudo-query", requestParams);

  Am I right?

  Anyway, I want the type of algorithm used for clustering to be picked
  up from the XML file; it should be easy to do so.

 Yes, it is quite easy -- the controller above can be instantiated from
 an XML file or from a Beanshell script using a local controller
 component (not in the Nutch codebase yet). There are unit tests of that
 controller in Carrot2 CVS, but it was added recently, so I haven't had
 time to integrate it into a solid working example.


Hi Dawid,

I was able to hack the Clusterer class and make it work for STC; here is my
hack ;-)

// Clustering component here.
LocalComponentFactory stcFactory = new LocalComponentFactoryBase() {
    public LocalComponent getInstance() {
        HashMap defaults = new HashMap();

        // These are adjustment settings for the clustering algorithm...
        // You can play with them, but the values below are our 'best guess'
        // settings that we acquired experimentally.
        defaults.put("lsi.threshold.clusterAssignment", 0.150);
        defaults.put("lsi.threshold.candidateCluster", 0.775);

        // TODO: this should eventually be replaced with documents from Nutch
        // tagged with a language tag. There is no need to determine the
        // language of a document again.
        // (Note: 'defaults' is not actually passed to the STC component here.)
        return new STCLocalFilterComponent();
    }
};
controller.addLocalComponentFactory("filter.lingo-old", stcFactory);

But I have two questions:

1. AHC doesn't have any local filter that implements LocalFilterComponent,
RawClusterProducer, and so on. How can I achieve that? From a very
superficial point of view, it seems that nobody uses the AHC class?
2. How do the stopwords and stemmers work for STC?


There is one potential problem that I see -- Nutch plugins require
 explicit JAR references. If you want to switch between algorithms you'll
 need to either put all Carrot2 JARs in the descriptor, put them on the
 CLASSPATH before Nutch starts, or do some other trickery with class
 loading.


I just put the stc.jar in the lib directory; I will optimize it later ;-).

Cheers,
R.

I won't be able to help you until next week, but then I'll try to
 find some time to prepare an example of how the scriptable
 controller is used (or look at the unit tests; the component is called
 carrot2-local-controller).

 Dawid