[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )
[ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12375300 ]

Christophe Noel commented on NUTCH-173:
---

There are tens of us Nutch users using this precious patch. Most Nutch users are not building a whole-web search engine (too much hardware needed) but want to develop dedicated search engines. We sometimes crawl 1000, sometimes 25000 web servers, and having 25000 entries in prefix-urlfilter really slows down the crawl. This patch is NEEDED!

Christophe Noël
CETIC
Belgium

PerHost Crawling Policy ( crawl.ignore.external.links )
---

Key: NUTCH-173
URL: http://issues.apache.org/jira/browse/NUTCH-173
Project: Nutch
Type: New Feature
Components: fetcher
Versions: 0.7.1, 0.7, 0.8-dev
Reporter: Philippe EUGENE
Priority: Minor
Attachments: patch.txt, patch08.txt

There are two major ways to crawl in Nutch:
Intranet crawl: forbid everything, allow a few hosts.
Whole-web crawl: allow everything, forbid a few things.

I propose a third type of crawl:
Directory crawl: the purpose of this crawl is to manage a few thousand hosts without managing rule patterns in UrlFilterRegexp.

I made two patches, for 0.7 and 0.7.1 and for 0.8-dev.

I propose a new boolean property in nutch-site.xml: crawl.ignore.external.links, with a default value of false. By default, this new feature does not modify the behavior of the Nutch crawler. When you set this property to true, the crawler does not fetch links that point outside the host. The crawl is therefore limited to the hosts that you inject at the beginning of the crawl.

I know there are some proposals for new crawl policies using the CrawlDatum in the 0.8-dev branch. This feature could be an easy way to quickly add a new crawl feature to Nutch while waiting for a better way to improve crawl policy.

I am posting two patches. Sorry for my very poor English.
--
Philippe

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
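
The behavior proposed for crawl.ignore.external.links can be illustrated with a minimal Python sketch. This is not the code from patch.txt or patch08.txt; the function names `host_of` and `filter_outlinks`, and the example hosts, are hypothetical and only model the described rule: with the flag off nothing changes, with the flag on only outlinks on the same host as the source page are kept.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def filter_outlinks(page_url, outlinks, ignore_external_links=False):
    """Keep every outlink by default; keep only same-host outlinks when the flag is set.

    Hypothetical helper that models the described property; not Nutch code.
    """
    if not ignore_external_links:
        return list(outlinks)
    page_host = host_of(page_url)
    return [link for link in outlinks if host_of(link) == page_host]


links = ["http://www.example.org/b.html", "http://other.example.com/"]
same_host_only = filter_outlinks("http://www.example.org/a.html", links, True)
```

Under this rule the set of crawled hosts never grows beyond the injected ones, which is why no per-host entry in a URL filter would be needed.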
fetcher.thread.per.host not working ??
Hello,

Something is wrong with threads per host. Only one thread should fetch from a given host at a time, so why do I get these two connect timeouts (15 s) at 13:15:15? This is not normal, and as a result I get about 1000 errors when I crawl about 1400 pages.

*Here is the log:*

051121 131515 17 fetch of http://www.forum.math.ulg.ac.be/doc/WMC.html?SESSID=b03ef782492f97b0507f7281ce8088cb failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
051121 131515 17 fetching http://ssel.vub.ac.be/viewcvs/viewcvs.py/cvs/cocompose/MakeAllDist.sh?rev=1.2view=log
051121 131516 70 fetch of http://www.forum.math.ulg.ac.be/doc/WMC.html?SESSID=5e078d23b09e35fa3cb57563f3edac93 failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
051121 131516 70 fetching http://www.forum.math.ulg.ac.be/viewthread.html?SESSID=a514546281e260e97b0d4ffef7d3fe67id=18614
051121 131516 47 fetching http://alexandrie.droit.fundp.ac.be/Record.htm? idlist=287record=19145278280919634500
051121 131516 57 fetch of http://www.forum.math.ulg.ac.be/viewsection.html?SESSID=7384e568d8d3c2fd5b8cfacc11baa9a9id=2 failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
051121 131516 57 fetching http://www.bib.ucl.ac.be/cgi-bin/chameleon?search=KEYWORDfunction=INITREQSourceScreen=INITREQsessionid=256721skin=gandalfconf=.%2fchameleon.conflng=fr-beitemu1=1003u1=1003t1=Wanko%20Nankam,%20Carolinepos=1prevpos=1beginsrch=1
051121 131516 17 fetching http://www.forum.math.ulg.ac.be/viewsection.html?SESSID=08f49a63e3045f6481cba10cbe996eaa
051121 131516 42 fetching http://www.iagr.ucl.ac.be/planning/processus-staff/
051121 131516 42 fetching http://www.iagr.ucl.ac.be/staff/
051121 131516 65 fetch of http://www.forum2.math.ulg.ac.be/viewsection.html?SESSID=fa1e4296b3df0eed81bbc60b98a371f3id=11 failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
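
One way to check whether several threads really hit the same host is to group the failure lines of such a log by host. The Python sketch below assumes that each line starts with a date, a time and a numeric thread id, which is only an interpretation of the excerpt above; the function name `failing_threads_by_host` is hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed layout of a failure line (an interpretation of the log excerpt):
# "<yymmdd> <hhmmss> <thread id> fetch of <url> failed with: ..."
FAILURE = re.compile(r"^(\d{6}) (\d{6}) (\d+) fetch of (\S+) failed with: ")


def failing_threads_by_host(lines):
    """Map each host to the set of thread ids that logged a failed fetch for it."""
    threads = defaultdict(set)
    for line in lines:
        match = FAILURE.match(line.strip())
        if match:
            _date, _time, thread_id, url = match.groups()
            host = urlsplit(url).hostname or ""
            threads[host].add(int(thread_id))
    return dict(threads)


sample = [
    "051121 131515 17 fetch of http://www.forum.math.ulg.ac.be/doc/WMC.html?SESSID=b03ef782492f97b0507f7281ce8088cb failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out",
    "051121 131515 17 fetching http://ssel.vub.ac.be/viewcvs/viewcvs.py/cvs/cocompose/MakeAllDist.sh?rev=1.2view=log",
    "051121 131516 70 fetch of http://www.forum.math.ulg.ac.be/doc/WMC.html?SESSID=5e078d23b09e35fa3cb57563f3edac93 failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out",
]
by_host = failing_threads_by_host(sample)
```

If one host maps to more than one thread id for timeouts logged within the same second or two, those connections must have overlapped, since each had already been waiting for the full connect timeout.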
Crawling unpolite problem
Hello,

I'm fetching about 150 web servers in Belgium. My total bandwidth usage is around 2 Mbit/s. Today I had a big problem: a phone call from the Belgian government saying I'm bringing down their web server. I'm crawling with impolite parameters (fetcher.server.delay = 0.5, threads.per.host = 15 and http.max.delay = 1500). What are the best parameters for a polite crawler with threads.per.host = 1?

Thank you very much for your answer.

Christophe Noel
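
What one request at a time per host plus a fixed server delay means in practice can be sketched as follows. This is an illustrative Python model, not Nutch's fetcher; the class name `PoliteScheduler` and the 5-second delay are assumptions chosen for the example, not values recommended anywhere in this thread.

```python
from urllib.parse import urlsplit


class PoliteScheduler:
    """Track, per host, the earliest time the next request may start."""

    def __init__(self, server_delay):
        self.server_delay = server_delay  # seconds between requests to one host
        self.next_allowed = {}            # host -> earliest next start time

    def wait_time(self, url, now):
        """Return how long to wait before fetching url, and reserve that slot."""
        host = urlsplit(url).hostname or ""
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.server_delay
        return start - now


scheduler = PoliteScheduler(server_delay=5.0)  # illustrative value only
first = scheduler.wait_time("http://a.example/1", now=100.0)
second = scheduler.wait_time("http://a.example/2", now=100.0)
other = scheduler.wait_time("http://b.example/", now=100.0)
```

Requests to different hosts are never delayed by one another, so total throughput across 150 servers can stay high while each individual server sees at most one request per delay interval.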
[jira] Created: (NUTCH-74) French Analyzer Plugin
French Analyzer Plugin
--

Key: NUTCH-74
URL: http://issues.apache.org/jira/browse/NUTCH-74
Project: Nutch
Type: New Feature
Environment: Nutch
Reporter: Christophe Noel
Attachments: analyze-french.zip

This is a DRAFT of a new plugin for French analysis (all Java files come from the Lucene project sandbox). It includes an ISO Latin-1 accent filter, plural form removal, ...

Analyze-french should be used instead of NutchDocumentAnalysis, as described by Jerome Charron in the New Language Identifier project. It should also be used as a query parser in the Nutch searcher.

We are missing an EXTENSION POINT to include this kind of plugin in Nutch. Could anyone help me build this new extension point, please?
[jira] Updated: (NUTCH-71) Search web page doesn't not focus on query input
[ http://issues.apache.org/jira/browse/NUTCH-71?page=all ]

Christophe Noel updated NUTCH-71:
-

Attachment: searchQueryFocus.patch

Focus patch for search.html (fr, en) and search.jsp.

Search web page doesn't not focus on query input

Key: NUTCH-71
URL: http://issues.apache.org/jira/browse/NUTCH-71
Project: Nutch
Type: Bug
Components: searcher
Reporter: Christophe Noel
Priority: Minor
Attachments: searchQueryFocus.patch

In search.html and search.jsp, the keyboard cursor is not focused on the query input of the form. I've made a patch for the en and fr search.html and for search.jsp.