[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

2006-04-20 Thread Christophe Noel (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12375300 ]

Christophe Noel commented on NUTCH-173:
---

There are TENS of us Nutch users relying on this precious patch.

Most Nutch users are not building whole-web search engines (too much hardware 
needed) but want to develop dedicated search engines.

We sometimes crawl 1,000, sometimes 25,000 web servers, and having 25,000 
entries in prefix-urlfilter really slows down the crawl.

This patch is NEEDED!

Christophe Noël
CETIC
Belgium

 PerHost Crawling Policy ( crawl.ignore.external.links )
 ---

  Key: NUTCH-173
  URL: http://issues.apache.org/jira/browse/NUTCH-173
  Project: Nutch
 Type: New Feature

   Components: fetcher
 Versions: 0.7.1, 0.7, 0.8-dev
 Reporter: Philippe EUGENE
 Priority: Minor
  Attachments: patch.txt, patch08.txt

 There are two major ways to crawl in Nutch:
 Intranet crawl: forbid everything, allow a few hosts.
 Whole-web crawl: allow everything, forbid a few things.
 I propose a third type of crawl.
 Directory crawl: the purpose of this crawl is to manage a few thousand 
 hosts without maintaining pattern rules in UrlFilterRegexp.
 I made two patches, covering 0.7, 0.7.1 and 0.8-dev.
 I propose a new boolean property in nutch-site.xml, 
 crawl.ignore.external.links, which defaults to false.
 By default this new feature does not modify the behavior of the Nutch crawler.
 When you set this property to true, the crawler does not fetch links that 
 point outside the host.
 So the crawl is limited to the hosts that you inject at the beginning of the 
 crawl.
 I know there are some proposals for new crawl policies using the CrawlDatum 
 in the 0.8-dev branch. 
 This feature could be an easy way to quickly add a new crawl feature to 
 Nutch, while waiting for a better way to improve crawl policy.
 I am posting two patches.
 Sorry for my very poor English 
 --
 Philippe
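
The behavior described above can be sketched as a simple host comparison. This is only an illustration under stated assumptions, not the code from patch.txt or patch08.txt: the class ExternalLinkFilter and its methods hostOf and filterOutlinks are hypothetical names, and the real patch may hook into the fetcher differently. The boolean parameter stands in for the proposed crawl.ignore.external.links property.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Nutch or of the attached patches. */
final class ExternalLinkFilter {

    /** Returns the lower-cased host of a URL, or null if the URL cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /**
     * Keeps only the outlinks whose host equals the host of the page they were found on.
     * When ignoreExternalLinks is false (the proposed default), every outlink is kept.
     */
    static List<String> filterOutlinks(String fromUrl, List<String> outlinks,
                                       boolean ignoreExternalLinks) {
        if (!ignoreExternalLinks) {
            return new ArrayList<>(outlinks);
        }
        String fromHost = hostOf(fromUrl);
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            // Unparseable links and links to other hosts are dropped.
            if (fromHost != null && fromHost.equals(hostOf(link))) {
                kept.add(link);
            }
        }
        return kept;
    }
}
```

With this kind of check, the crawl never leaves the injected hosts, and no per-host entry is needed in a URL filter file.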

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-74) French Analyzer Plugin

2005-07-19 Thread Christophe Noel (JIRA)
French Analyzer Plugin
--

 Key: NUTCH-74
 URL: http://issues.apache.org/jira/browse/NUTCH-74
 Project: Nutch
Type: New Feature
 Environment: Nutch
Reporter: Christophe Noel
 Attachments: analyze-french.zip

This is a DRAFT of a new plugin for French analysis (all Java files come from 
the Lucene project sandbox)... It includes an ISO Latin-1 accent filter, 
plural form removal, ...
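
To give an idea of what these two steps do to a token, here is a minimal sketch. It is not the Lucene sandbox code shipped in analyze-french.zip: the class FrenchTokenNormalizer and its methods are hypothetical, accent removal is done with java.text.Normalizer rather than a Lucene filter, and the plural rule is deliberately naive (a real French stemmer handles far more cases).

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical illustration; not the analyzer contained in analyze-french.zip. */
final class FrenchTokenNormalizer {

    /** Removes accents by decomposing characters (NFD) and dropping combining marks. */
    static String stripAccents(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /** Naive plural removal: drops one trailing "s" or "x" from tokens longer than 3 characters. */
    static String stripPlural(String token) {
        if (token.length() > 3 && (token.endsWith("s") || token.endsWith("x"))) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    /** Lower-cases, strips accents, then strips the plural ending. */
    static String normalize(String token) {
        return stripPlural(stripAccents(token.toLowerCase(Locale.FRENCH)));
    }
}
```

The same normalization would have to be applied both at indexing time and to query terms, otherwise accented queries would not match the stripped index terms.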

Analyze-french should be used instead of NutchDocumentAnalysis, as described by 
Jerome Charron in the New Language Identifier project. It should also be used 
as a query parser in the Nutch searcher.

We are missing an EXTENSION POINT for including this kind of plugin in Nutch. 
Could anyone help me build this new extension point, please?




[jira] Updated: (NUTCH-71) Search web page doesn't not focus on query input

2005-07-12 Thread Christophe Noel (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-71?page=all ]

Christophe Noel updated NUTCH-71:
-

Attachment: searchQueryFocus.patch

Focus patch for search.html (fr, en) and search.jsp.

 Search web page doesn't not focus on query input
 

  Key: NUTCH-71
  URL: http://issues.apache.org/jira/browse/NUTCH-71
  Project: Nutch
 Type: Bug
   Components: searcher
 Reporter: Christophe Noel
 Priority: Minor
  Attachments: searchQueryFocus.patch

 In search.html and search.jsp, the keyboard cursor is not focused on the 
 form's query input.
 I've made a patch for the en and fr versions of search.html and for search.jsp.
