Hello!
Does Nutch have any modules for spam detection?
Does anyone know where I can find any information (blogs, articles, FAQs)
about it?
I have found a solution. I've added a score variable to Hit.
-Original Message-
From: Anton Potekhin [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 14, 2007 10:48 AM
To: nutch-dev@lucene.apache.org
Subject: How to get score in search.jsp
Importance: High
Hi Nutch Gurus!
I have a small problem. I need to make some changes to search.jsp: I need
to get the first 50 results and sort them in a different way. To sort, I will
change the score of each result with the formula new_score = nutch_score +
domain_score_from_my_db. But I don't understand how to get
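The re-sorting described above could be sketched like this in plain Java. The `Result` holder and the `domainScore` lookup are placeholders for whatever structure search.jsp gives you and for your own DB query; this is not actual Nutch API:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: re-sort the first N results by
// new_score = nutch_score + domain_score_from_my_db.
public class RescoreSketch {
    static class Result {
        final String url;
        final float nutchScore;
        Result(String url, float nutchScore) {
            this.url = url;
            this.nutchScore = nutchScore;
        }
    }

    // Hypothetical stand-in for the per-domain score from the database.
    static float domainScore(String url) {
        return url.contains("example.org") ? 1.0f : 0.0f;
    }

    // Take the first n results and sort them by the combined score, descending.
    static Result[] rescoreTopN(Result[] results, int n) {
        Result[] top = Arrays.copyOf(results, Math.min(n, results.length));
        Arrays.sort(top, new Comparator<Result>() {
            public int compare(Result a, Result b) {
                float sa = a.nutchScore + domainScore(a.url);
                float sb = b.nutchScore + domainScore(b.url);
                return Float.compare(sb, sa); // higher combined score first
            }
        });
        return top;
    }
}
```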
Does Nutch 0.7.2 have any depth limitation?
I added a few pages. I need to process these pages and all pages located
3 (for example) clicks away from the added pages.
I hope I explained it clearly ;-)
Nutch is not compatible with the latest hadoop from svn.
Nutch works correctly with the latest hadoop from svn after some small tuning ;-)
I've got the latest versions of nutch (0.9-dev) and hadoop (trunk) from svn.
When I try to index I get the following error:
java.lang.ClassCastException: org.apache.nutch.parse.ParseData
at org.apache.nutch.indexer.Indexer$InputFormat$1.next(Indexer.java:92)
at
How can I limit the number of pages processed from each domain? And how can I
set up nutch to crawl only the domains I added (i.e. make nutch ignore external
links)? If nutch doesn't allow this, what algorithm would be best for it?
p.s. nutch ver. 0.7
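For the "crawl only my domains" part, the usual nutch 0.7 approach is the URL filter file (conf/crawl-urlfilter.txt, or regex-urlfilter.txt depending on the tool used): accept patterns for your own domains followed by a catch-all reject. A sketch, where the domain names are placeholders:

```
# accept anything within the listed domains (placeholder names)
+^http://([a-z0-9]*\.)*my-domain-one\.com/
+^http://([a-z0-9]*\.)*my-domain-two\.com/
# reject everything else
-.
```

This only restricts which URLs enter the db; a per-domain page limit is not something I recall 0.7 offering out of the box.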
I updated hadoop, but now I get the following error on the fetch step (reduce):
06/08/29 08:31:20 INFO mapred.TaskTracker: task_0003_r_00_3 0.3334%
reduce copy (6 of 6 at 11.77 MB/s)
06/08/29 08:31:20 WARN /:
/getMapOutput.jsp?map=task_0003_m_02_0&reduce=1:
java.lang.IllegalStateException
The previous error is from the tasktracker log. In the jobtracker log I now see
the following error:
06/08/30 01:04:07 INFO mapred.TaskInProgress: Error from
task_0001_r_00_1: java.lang.AbstractMethodError: org.apache.n
utch.fetcher.FetcherOutputFormat.getRecordWriter(Lorg/apache/hadoop/fs/FileS
I tried to start the job tracker without tomcat.
-Original Message-
From: Chris Stephens [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 23, 2006 6:16 PM
To: nutch-dev@lucene.apache.org
Subject: Re: problem with nutch
Importance: High
This is probably a better question for the user list.
To be exact: when I started the job tracker, only the namenode was loaded on
the given server. None of the ports from hadoop-default.xml were in use.
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Friday, August 25, 2006 10:48 AM
To: nutch-dev@lucene.apache.org
Subject: RE:
In addition, please pay attention to the following part of the log:
06/08/25 05:07:59 WARN servlet.WebApplicationContext: Web application not
found /spider_kakle_mapred/spider/conf:/spider_
06/08/25 05:07:59 WARN servlet.WebApplicationContext: Configuration error on
When I try to start nutch 0.8 I get errors. How can I solve this problem?
JobTracker log:
...skipped...
06/08/23 05:19:40 INFO mapred.JobTracker: Property 'sun.cpu.endian' is
little
06/08/23 05:19:40 INFO mapred.JobTracker: Property 'sun.cpu.isalist' is
06/08/23 05:19:40 INFO util.Credential:
I plan to use nutch 0.8 on several computers with DFS. But I'm worried
about nutch's requirements for free HDD space.
For example, suppose I have:
1) a server with the jobtracker and namenode
2) 5 servers with tasktrackers and 20 GB HDDs
3) 5 servers with datanodes, also with 20 GB HDDs
My settings:
<property>
  <name>mapred.local.dir</name>
  <value>/hadoop/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files. May be a comma-separated list of
  directories on different devices in order to spread disk i/o.
  </description>
</property>
I created a file on dfs (for example, with the filename done). Then I tried to
copy this file from dfs to the local filesystem. As a result, I got the file in
the local filesystem together with this error:
Problem opening checksum file: /user/root/crawl/done. Ignoring with
exception org.apache.hadoop.ipc.RemoteException: jav
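For reference, the copy itself can be done from the DFS shell. These command names are as I recall them for Hadoop of that era; check `bin/hadoop dfs -help` in your build:

```
# copy a file out of DFS to the local filesystem
bin/hadoop dfs -get /user/root/crawl/done /tmp/done
# equivalently
bin/hadoop dfs -copyToLocal /user/root/crawl/done /tmp/done
```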
I am using dfs. My index contains 3706249 documents. Presently, searching
takes from 2 to 4 seconds (I tested with a query of 3 search terms).
Tomcat runs on a box with dual Opteron 2.4 GHz CPUs and 16 GB RAM. I think
search is very slow now.
Can we make search faster?
What factors influence
I'm using nutch v.0.8 and have 3 computers. Two of them have a datanode and
tasktracker running, and the other has the namenode and jobtracker running. Do I
need more disk space on the machines with tasktrackers and the jobtracker, as
the number of pages processed grows along with the size of the database? Would
I'm using nutch v.0.8 and have 3 computers.
One of my tasktrackers always goes down.
This occurs during indexing (index crawl/indexes). On the server with the
crashed tasktracker, 53G of disk space is now free and only 11G is used.
How can I solve this problem? Why does the tasktracker require so much free
Yes, I use dfs.
How do I configure nutch to solve the disk space problem? How do I control
the number of smaller files?
-Original Message-
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 14, 2006 5:46 PM
To: nutch-dev@lucene.apache.org
Subject: Re: No space left on device
Does anyone know where I can download nutch version 0.8? I can't find it
anywhere :(
http://svn.apache.org/repos/asf/lucene/nutch/trunk/
My Nutch processed the pages
http://www.abc-internet.net/lavinia-lingerie/Lingerie.htm and
http://www.abc-internet.net/pamperedpassions-pampered_passions/Lingerie.htm.
When I search for the term lingerie, nutch brings up results
with a bad summary (... Lingerie, Lingerie, Lingerie,
It's not a problem with Nutch!
Did you try spamdexing?
Yes. I understand this... But how do we fight this spam?
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Monday, June 5, 2006 11:43 AM
To: nutch-dev@lucene.apache.org
Subject: summary
I updated some plugins... And now I get errors in the tomcat log:
May 22, 2006 3:28:50 AM org.apache.nutch.plugin.PluginRepository init
SEVERE: org.apache.nutch.plugin.PluginRuntimeException: Plugin
(summary-basic), extension point: org.apache.nutch.searcher.Summarizer does
not exist.
How do I fix this
We tried to develop a solution to count the number of pages from each
domain.
We thought to do it like this:
map - has the following input: k - UTF8 (url of page), v - CrawlDatum, and
the following output: k - UTF8 (domain of page), v - UrlAndPage implementing
Writable (a structure which contains the url of the page and
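The counting logic behind that map/reduce split could be sketched in plain Java like this (the Hadoop job classes themselves are omitted; this only shows keying each page by its domain and summing, which is what the reduce side amounts to):

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Sketch: count pages per domain, as in the map output keyed by domain.
public class DomainCount {
    // Extract the host part that would serve as the map output key.
    static String domainOf(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null; // unparsable URLs are skipped
        }
    }

    // Equivalent of the reduce step: sum the pages seen per domain.
    static Map<String, Integer> countByDomain(Iterable<String> urls) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String u : urls) {
            String d = domainOf(u);
            if (d == null) continue;
            Integer c = counts.get(d);
            counts.put(d, c == null ? 1 : c + 1);
        }
        return counts;
    }
}
```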
Why do jsp scripts launched under JobTrackerInfoServer not see classes from
nutch*.jar? How can I point the JobTrackerInfoServer to use nutch*.jar?
We see new parameters in hadoop-default.xml: dfs.replication.max and
dfs.replication.min.
What do these parameters mean?
nightly build of Hadoop to see if
it works any better.
Doug
Anton Potehin wrote:
What does an error of the following type mean:
java.rmi.RemoteException: java.io.IOException: Cannot obtain additional
block for file /user/root/crawl/indexes/index/_0.prx
How do we update info about links already added to the db? In particular, we
need to update the status of some of the links. What classes should we use to
read the info about each link stored in the DB and then update its status? We
use the trunk branch of Nutch.
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job. Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is local.
  </description>
</property>
We have a question on this
-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 18, 2006 8:02 PM
To: nutch-dev@lucene.apache.org
Subject: Re: question about crawldb
Importance: High
Anton Potehin wrote:
1. We have found these flags in CrawlDatum class:
public static final byte STATUS_SIGNATURE = 0;
public static final byte STATUS_DB_UNFETCHED = 1;
public static final byte STATUS_DB_FETCHED = 2;
public static final byte STATUS_DB_GONE = 3;
public static final byte
Where is the mapred branch of nutch located now?
Has anybody tried to create an image search based on nutch?
Here is a snippet from the TaskTracker log file:
051206 090643 Task task_r_qegmsh timed out. Killing.
051206 090646 Task task_r_qegmsh timed out. Killing.
051206 090649 Task task_r_qegmsh timed out. Killing.
051206 090652 Task task_r_qegmsh timed out. Killing.
051206 090655 Task task_r_qegmsh
We use nutch for whole-web crawling.
In an infinite loop we run these tasks:
1) bin/nutch generate db segmentsPath -topN 1
2) bin/nutch fetch segment name
3) bin/nutch updatedb db segment name
4) bin/nutch analyze db segment name
5) bin/nutch index segment name
6) bin/nutch dedup segments
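The loop above could be scripted roughly as follows. The way the newest segment is discovered via `ls` is an assumption, and the exact 0.7-era argument forms should be checked against the `bin/nutch` usage output:

```
#!/bin/sh
# infinite whole-web crawl loop (sketch; db/segments paths assumed)
DB=db
SEGMENTS=segments
while true; do
  bin/nutch generate $DB $SEGMENTS -topN 1
  segment=`ls -d $SEGMENTS/* | tail -1`   # newest segment
  bin/nutch fetch $segment
  bin/nutch updatedb $DB $segment
  bin/nutch analyze $DB $segment
  bin/nutch index $segment
  bin/nutch dedup $SEGMENTS
done
```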
We came to the conclusion that we need to restart the webapp for new results
to appear in search.
How do we do this correctly without restarting tomcat?
After tomcat runs for a long time, we get a "too many open files" error. Maybe
this is a result of restarting the webapp with the touch command on web.xml?
By now, before tomcat
How do I use jobtracker.jsp and jobdetails.jsp?
Do they need tomcat?
When I try to start jobdetails.jsp with tomcat, it returns an error:
java.lang.NullPointerException
at
org.apache.jsp.m.jobdetails_jsp._jspService(org.apache.jsp.m.jobdetails_jsp:
53)
at
If they don't need tomcat, then what must we type in the browser address bar?
http://host_jobtracker:port_jobtracker/jobtracker/jobtracker.jsp ?
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Monday, November 21, 2005 12:46 PM
To: nutch-dev@lucene.apache.org
Subject:
Why do we need the parameter mapred.map.tasks to be greater than the number of
available hosts? If we set it equal to the number of hosts, we get the negative
progress percentages problem.
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111.
In nutch-site.xml I specified these parameters:
1) On both machines:
<property>
  <name>fs.default.name</name>
  <value>192.168.0.250:9009</value>
  <description>The name of the default file system. Either the
  literal string local or
What about scoring in mapred? I have looked at crawl/Crawl.java but I did
not find anything concerned with calculating page scores. Does
mapred use a ranking system somehow?
Is it possible to use mapred for clustered whole-web crawling, or does it
work with intranet crawling only?
-dev@lucene.apache.org
Subject: Re: rank system
Pre score calculation is done in the indexer.
Yes it works with complete webcrawls as well, and it works very well
for that. :-)
Stefan
On 08.11.2005 at 11:22, Anton Potehin wrote:
What about scoring in mapred? I have looked crawl/crawl.java
After I looked through Crawl.java, I split all the work into several phases:
1) Inject - here we add web links into the crawlDb
2) Generate segment - here we create a data segment
3) Fetching
4) Parse segment
5) Update crawlDb - here the information from the segment is added
.
you will find some presentation slides in the wiki,
HTH
Stefan
On 08.11.2005 at 14:31, Anton Potehin wrote:
After I looked through Crawl.java, I split all the work into several
phases:
1) Inject - here we add web links into the crawlDb
2) Generate segment - here we create a data segment