spam detect

2007-07-09 Thread anton
Hello! Does Nutch have any modules for spam detection? Does anyone know where I can find any information (blogs, articles, FAQ) about it?

RE: How to get score in search.jsp

2007-02-14 Thread Anton Potekhin
I have found a solution: I added a score variable to Hit. -Original Message- From: Anton Potekhin [mailto:[EMAIL PROTECTED] Sent: Wednesday, February 14, 2007 10:48 AM To: nutch-dev@lucene.apache.org Subject: How to get score in search.jsp Importance: High Hi Nutch Gurus! I have

How to get score in search.jsp

2007-02-13 Thread Anton Potekhin
Hi Nutch Gurus! I have a small problem. I need to make some changes to search.jsp: I need to get the first 50 results and sort them in a different way. To sort, I will change the score of each result with the formula new_score = nutch_score + domain_score_from_my_db. But I don't understand how to get
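The re-ranking described in this message can be sketched in plain Java. This is only an illustration under stated assumptions: `ScoredHit`, `rerank`, and the `domainScores` map are hypothetical names, not Nutch classes or the `search.jsp` API, and the per-domain scores are assumed to have been loaded from the poster's own database beforehand.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are Nutch classes.
class RerankSketch {

    // Minimal stand-in for a search hit: the page's host plus the engine's score.
    static final class ScoredHit {
        final String host;
        final float nutchScore;

        ScoredHit(String host, float nutchScore) {
            this.host = host;
            this.nutchScore = nutchScore;
        }
    }

    // new_score = nutch_score + domain_score_from_my_db (0 if the host is unknown).
    static float newScore(ScoredHit hit, Map<String, Float> domainScores) {
        return hit.nutchScore + domainScores.getOrDefault(hit.host, 0.0f);
    }

    // Takes the first `limit` hits and returns them sorted by new_score, highest first.
    static List<ScoredHit> rerank(List<ScoredHit> hits, Map<String, Float> domainScores, int limit) {
        List<ScoredHit> top = new ArrayList<>(hits.subList(0, Math.min(limit, hits.size())));
        top.sort(Comparator.comparingDouble((ScoredHit h) -> newScore(h, domainScores)).reversed());
        return top;
    }
}
```

Calling `RerankSketch.rerank(hits, domainScores, 50)` would cover the "first 50 results" case; how the original score is obtained from Nutch itself is not shown here.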

deep limitation

2006-11-06 Thread anton
Does Nutch 0.7.2 have any depth limit? I added a few pages. I need to process these pages and all pages located within 3 (for example) clicks of the added pages. I think I have explained it clearly ;-)
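The idea of "all pages within N clicks of the added pages" is a depth-limited breadth-first traversal. The sketch below shows it in plain Java over an in-memory link graph; `withinDepth` and the `outlinks` map are hypothetical names for illustration and do not correspond to any Nutch 0.7.2 setting or class.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: a depth-limited breadth-first walk, not Nutch code.
class DepthLimitSketch {

    // Returns every URL reachable from the seeds in at most maxDepth clicks,
    // mapped to its click distance from the nearest seed.
    static Map<String, Integer> withinDepth(List<String> seeds,
                                            Map<String, List<String>> outlinks,
                                            int maxDepth) {
        Map<String, Integer> depth = new LinkedHashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        for (String seed : seeds) {
            if (depth.putIfAbsent(seed, 0) == null) {
                queue.add(seed);
            }
        }
        while (!queue.isEmpty()) {
            String url = queue.remove();
            int d = depth.get(url);
            if (d == maxDepth) {
                continue; // do not follow links from pages already at the limit
            }
            for (String next : outlinks.getOrDefault(url, Collections.emptyList())) {
                // The first visit is the shortest path, so later visits are ignored.
                if (depth.putIfAbsent(next, d + 1) == null) {
                    queue.add(next);
                }
            }
        }
        return depth;
    }
}
```

With `maxDepth = 3`, a chain a -> b -> c -> d -> e starting at seed a would yield a, b, c, and d, but not e.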

RE: indexing problem

2006-09-07 Thread anton
Nutch is not compatible with the latest Hadoop from svn. Nutch works correctly with the latest Hadoop from svn after some small tuning ;-)

indexing problem

2006-09-06 Thread anton
I've got the latest versions of Nutch (0.9-dev) and Hadoop (trunk) from svn. When I try to index I get the following error: java.lang.ClassCastException: org.apache.nutch.parse.ParseData at org.apache.nutch.indexer.Indexer$InputFormat$1.next(Indexer.java:92) at

limitation

2006-09-04 Thread anton
How can I limit the number of pages processed from each domain? And how do I set up Nutch to crawl only the domains I added (i.e. make Nutch ignore external links)? If Nutch doesn't allow this, what algorithm would be best for it? P.S. Nutch ver. 0.7
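One possible algorithm for both requirements is a URL filter that keeps a whitelist of hosts and a per-host counter. The sketch below is plain Java, not a Nutch plugin or configuration option; `HostLimitSketch` and `accept` are hypothetical names, and the host name is used as a simple stand-in for "domain".

```java
import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a whitelist plus per-host page cap, not Nutch code.
class HostLimitSketch {
    private final Set<String> allowedHosts = new HashSet<>();
    private final int maxPagesPerHost;
    private final Map<String, Integer> accepted = new HashMap<>();

    HostLimitSketch(Set<String> allowedHosts, int maxPagesPerHost) {
        for (String host : allowedHosts) {
            this.allowedHosts.add(host.toLowerCase(Locale.ROOT));
        }
        this.maxPagesPerHost = maxPagesPerHost;
    }

    // Returns true if the URL belongs to an allowed host that is still under its cap.
    boolean accept(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        if (host == null) {
            return false; // relative or opaque URI
        }
        host = host.toLowerCase(Locale.ROOT);
        if (!allowedHosts.contains(host)) {
            return false; // external link: ignore
        }
        int count = accepted.getOrDefault(host, 0);
        if (count >= maxPagesPerHost) {
            return false; // this host has reached its page limit
        }
        accepted.put(host, count + 1);
        return true;
    }
}
```

Every discovered link would be passed through `accept` before being queued for fetching; links that return `false` are dropped.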

Fetch error

2006-08-30 Thread anton
I updated Hadoop, but now I get the following error at the fetch step (reduce): 06/08/29 08:31:20 INFO mapred.TaskTracker: task_0003_r_00_3 0.3334% reduce copy (6 of 6 at 11.77 MB/s) 06/08/29 08:31:20 WARN /: /getMapOutput.jsp?map=task_0003_m_02_0reduce=1: java.lang.IllegalStateException

RE: Fetch error

2006-08-30 Thread anton
The previous error was from the tasktracker log. In the jobtracker log I now see the following error: 06/08/30 01:04:07 INFO mapred.TaskInProgress: Error from task_0001_r_00_1: java.lang.AbstractMethodError: org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(Lorg/apache/hadoop/fs/FileS

RE: problem with nutch

2006-08-25 Thread anton
I tried to start the job tracker without Tomcat. -Original Message- From: Chris Stephens [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 23, 2006 6:16 PM To: nutch-dev@lucene.apache.org Subject: Re: problem with nutch Importance: High This is probably a better question for the user list.

RE: problem with nutch

2006-08-25 Thread anton
To be exact: when I started the job tracker on the given server, only the namenode was loaded. None of the ports from hadoop-default.xml are in use. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Friday, August 25, 2006 10:48 AM To: nutch-dev@lucene.apache.org Subject: RE:

RE: problem with nutch

2006-08-25 Thread anton
In addition, please pay attention to the following part of the log: 06/08/25 05:07:59 WARN servlet.WebApplicationContext: Web application not found /spider_kakle_mapred/spider/conf:/spider_ 06/08/25 05:07:59 WARN servlet.WebApplicationContext: Configuration error on

problem with nutch

2006-08-23 Thread anton
When I try to start Nutch 0.8 I get errors. How can I solve this problem? JobTracker log: ...Skipped... 06/08/23 05:19:40 INFO mapred.JobTracker: Property 'sun.cpu.endian' is little 06/08/23 05:19:40 INFO mapred.JobTracker: Property 'sun.cpu.isalist' is 06/08/23 05:19:40 INFO util.Credential:

some questions

2006-08-18 Thread anton
I plan to use Nutch 0.8 on several computers with DFS. But I'm worried about Nutch's free disk space requirements. For example, suppose I have 1) a server with the job tracker and namenode 2) 5 servers with task trackers and 20 GB HDDs 3) 5 servers with datanodes and 20 GB HDDs as well

RE: nutch

2006-08-02 Thread anton
My settings: <property> <name>mapred.local.dir</name> <value>/hadoop/mapred/local</value> <description>The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o.</description> </property>

Problem opening checksum file

2006-06-22 Thread anton
I create a file on DFS (for example, with the filename done). Then I try to copy this file from DFS to the local filesystem. As a result I get the file in the local filesystem, along with an error: Problem opening checksum file: /user/root/crawl/done. Ignoring with exception org.apache.hadoop.ipc.RemoteException: jav

search speed

2006-06-15 Thread anton
I am using DFS. My index contains 3706249 documents. Currently a search takes from 2 to 4 seconds (I tested with a query of 3 search terms). Tomcat runs on a box with dual Opteron 2.4 GHz CPUs and 16 GB RAM. I think search is very slow now. Can we make search faster? What factors influence

free disk space

2006-06-14 Thread anton
I'm using nutch v.0.8 and have 3 computers. Two of them have datanode and tasktracker running, another one has name node and jobtracker running. Do I need more disk space with tasktrackers and jobtracker running, as the number of pages processed is growing along with the size of database? Would

No space left on device

2006-06-14 Thread anton
I'm using Nutch v.0.8 and have 3 computers. One of my tasktrackers always goes down. This occurs during indexing (index crawl/indexes). The server with the crashed tasktracker now has 53G of free disk space and only 11G used. How can I solve this problem? Why does the tasktracker require so much free

RE: No space left on device

2006-06-14 Thread anton
Yes, I use DFS. How do I configure Nutch to solve the disk space problem? How do I control the number of smaller files? -Original Message- From: Dennis Kubes [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 14, 2006 5:46 PM To: nutch-dev@lucene.apache.org Subject: Re: No space left on device

RE: resolving IP in...

2006-06-07 Thread anton
Does anyone know where I can download Nutch version 0.8? I can't find it :( http://svn.apache.org/repos/asf/lucene/nutch/trunk/

summary

2006-06-05 Thread anton
My Nutch processed the pages http://www.abc-internet.net/lavinia-lingerie/Lingerie.htm and http://www.abc-internet.net/pamperedpassions-pampered_passions/Lingerie.htm. When I search for the term lingerie, Nutch brings up results with a bad summary (... Lingerie, Lingerie, Lingerie,

RE: summary

2006-06-05 Thread anton
It's not a problem with Nutch! Are you trying spamdexing? Yes, I understand this... But how do I fight this spam? -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Monday, June 5, 2006 11:43 AM To: nutch-dev@lucene.apache.org Subject: summary My Nutch processed

error

2006-05-22 Thread anton
I updated some plugins... And now I get errors in the Tomcat log: May 22, 2006 3:28:50 AM org.apache.nutch.plugin.PluginRepository init SEVERE: org.apache.nutch.plugin.PluginRuntimeException: Plugin (summary-basic), extension point: org.apache.nutch.searcher.Summarizer does not exist. How do I fix this

to count the number of pages from each domain

2006-05-05 Thread anton
We tried to develop a solution to count the number of pages from each domain. We planned to do it as follows: the map had the following input: k - UTF8 (URL of the page), v - CrawlDatum; and the following output: k - UTF8 (domain of the page), v - UrlAndPage, which implemented Writable (a structure which contained the URL of the page and
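The overall shape of such a count (derive a domain key per URL, then sum per key) can be sketched in plain Java without any Hadoop or Nutch classes. This is an assumption-laden illustration, not the poster's job: `DomainCountSketch`, `hostOf`, and `countByHost` are hypothetical names, the host name stands in for "domain", and the `CrawlDatum` and `UrlAndPage` values from the message are left out.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: the map/reduce idea in plain Java, not Hadoop code.
class DomainCountSketch {

    // "Map" step: derive the grouping key (host) from a page URL.
    // URI.create throws IllegalArgumentException for malformed URLs.
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    // "Reduce" step: count how many URLs fall under each host key.
    static Map<String, Long> countByHost(List<String> urls) {
        return urls.stream()
                .collect(Collectors.groupingBy(
                        DomainCountSketch::hostOf,
                        TreeMap::new,
                        Collectors.counting()));
    }
}
```

In a real MapReduce job the grouping by key is done by the framework between the map and reduce phases; here `groupingBy` plays that role in memory.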

JobTrackerInfoServer and nutch*.jar

2006-05-01 Thread anton
Why do JSP scripts launched under JobTrackerInfoServer not see classes from nutch*.jar? How do I point JobTrackerInfoServer to nutch*.jar?

new parameters

2006-04-28 Thread anton
We see new parameters in hadoop-default.xml: dfs.replication.max, dfs.replication.min. What do these parameters mean?

RE: exception

2006-04-27 Thread anton
nightly build of Hadoop to see if it works any better. Doug Anton Potehin wrote: What means error of following type : java.rmi.RemoteException: java.io.IOException: Cannot obtain additional block for file /user/root/crawl/indexes/index/_0.prx

update crawldb

2006-04-24 Thread Anton Potehin
How do we update info about links already added to the db? In particular, we need to update the status of some of the links. What classes should we use to read the info about each link stored in the db and then update its status? We use the trunk branch of Nutch.

mapred.map.tasks

2006-04-20 Thread Anton Potehin
<property> <name>mapred.map.tasks</name> <value>2</value> <description>The default number of map tasks per job. Typically set to a prime several times greater than number of available hosts. Ignored when mapred.job.tracker is local.</description> </property> We have a question on this

RE: question about crawldb

2006-04-19 Thread anton
-Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 18, 2006 8:02 PM To: nutch-dev@lucene.apache.org Subject: Re: question about crawldb Importance: High Anton Potehin wrote: 1.We have found these flags in CrawlDatum class: public static

question about crawldb

2006-04-18 Thread Anton Potehin
1. We have found these flags in CrawlDatum class: public static final byte STATUS_SIGNATURE = 0; public static final byte STATUS_DB_UNFETCHED = 1; public static final byte STATUS_DB_FETCHED = 2; public static final byte STATUS_DB_GONE = 3; public static final byte

mapred branch

2006-04-10 Thread Anton Potehin
Where is the mapred branch of Nutch located now?

image search

2006-04-10 Thread Anton Potehin
Has anybody tried to create an image search based on Nutch?

Killing lines

2005-12-06 Thread anton
Here is a snippet from the TaskTracker log file: 051206 090643 Task task_r_qegmsh timed out. Killing. 051206 090646 Task task_r_qegmsh timed out. Killing. 051206 090649 Task task_r_qegmsh timed out. Killing. 051206 090652 Task task_r_qegmsh timed out. Killing. 051206 090655 Task task_r_qegmsh

mapred crawl

2005-11-23 Thread Anton Potehin
We used Nutch for whole-web crawling. In an infinite loop we run these tasks: 1) bin/nutch generate db segmentsPath -topN 1 2) bin/nutch fetch segment name 3) bin/nutch updatedb db segment name 4) bin/nutch analyze db segment name 5) bin/nutch index segment name 6) bin/nutch dedup segments

About tomcat

2005-11-21 Thread Anton Potehin
We have come to the conclusion that we need to restart the webapp for new results to appear in search. How do we do this correctly without restarting Tomcat? After Tomcat has been running for a long time, we get a "too many open files" error. Maybe this is the result of restarting the webapp with the touch command on web.xml? By now before tomcat

jobdetails.jsp and jobtracker.jsp

2005-11-21 Thread anton
How do we use jobtracker.jsp and jobdetails.jsp? Do they need Tomcat? When I try to run jobdetails.jsp with Tomcat, it returns an error: java.lang.NullPointerException at org.apache.jsp.m.jobdetails_jsp._jspService(org.apache.jsp.m.jobdetails_jsp: 53) at

RE: jobdetails.jsp and jobtracker.jsp

2005-11-21 Thread anton
They don't need Tomcat? But then what must we type in the browser address bar? http://host_jobtracker:port_jobtracer/jobtracker/jobtracker.jsp ? -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Monday, November 21, 2005 12:46 PM To: nutch-dev@lucene.apache.org Subject:

RE: jobdetails.jsp and jobtracker.jsp

2005-11-21 Thread anton
Why do we need the mapred.map.tasks parameter to be greater than the number of available hosts? If we set it equal to the number of hosts, we get the negative progress percentages problem.

RE: mapred.map.tasks

2005-11-21 Thread anton
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111. In nutch-site.xml I specified these parameters: 1) On both machines: <property> <name>fs.default.name</name> <value>192.168.0.250:9009</value> <description>The name of the default file system. Either the literal string local or

rank system

2005-11-08 Thread Anton Potehin
What about scoring in mapred? I have looked at crawl/crawl.java but I did not find anything concerned with calculating page scores. Does mapred use a ranking system somehow? Is it possible to use mapred for clustering whole-web crawling, or does it work with intranet crawling only?

RE: rank system

2005-11-08 Thread anton
-dev@lucene.apache.org Subject: Re: rank system Pre score calculation is done in the indexer. Yes it works with complete webcrawls as well, and it works very well for that. :-) Stefan Am 08.11.2005 um 11:22 schrieb Anton Potehin: What about scoring in mapred? I have looked crawl/crawl.java

questions

2005-11-08 Thread Anton Potehin
After I looked through Crawl.java I broke all the tasks down into several phases: 1) Inject - here we add web links into crawlDb 2) Generate segment - here we create a data segment 3) Fetching 4) Parse segment 5) Update crawlDb - here the information is added from segment

RE: questions

2005-11-08 Thread anton
. you will find some presentation slides in the wiki, HTH Stefan Am 08.11.2005 um 14:31 schrieb Anton Potehin: After I looked thru Crawl.java I exploded all tasks for several phases: 1) Inject - here we add web-links into crawlDb 2) Generate segment - here we create data segment