RE: nutch

2006-08-02 Thread anton
My settings:


<property>
  <name>mapred.local.dir</name>
  <value>/hadoop/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a comma-separated list of
  directories on different devices in order to spread disk i/o.
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property>



The device mounted on "/" has 115G of free space.

[EMAIL PROTECTED] /]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             133G   13G  113G  11% /

Anybody have other ideas?

-Original Message-
From: Sami Siren [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 02, 2006 6:01 PM
To: nutch-dev@lucene.apache.org
Subject: Re: nutch
Importance: High

Most probably you have run out of space in the tmp (local) filesystem.

use properties like


<property>
  <name>mapred.system.dir</name>
  <value></value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value></value>
</property>


in hadoop-site.xml to get over this problem.


[EMAIL PROTECTED] wrote:

>I forgot ;-) One more question:
>Is this a problem with nutch or with hadoop?
>
>-Original Message-
>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
>Sent: Wednesday, August 02, 2006 11:38 AM
>To: nutch-dev@lucene.apache.org
>Subject: nutch
>Importance: High
>
>I use nutch 0.8 (mapred). Nutch is running on 3 servers.
>When nutch tries to index a segment, I get an error on the tasktracker:
>





[jira] Updated: (NUTCH-336) Harvested links shouldn't get db.score.injected in addition to inbound contributions

2006-08-02 Thread Chris Schneider (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-336?page=all ]

Chris Schneider updated NUTCH-336:
--

Attachment: NUTCH-336.patch.txt

Here's a patch that fixes the problem. It separates a new injectionScore API 
out from the initialScore API.

> Harvested links shouldn't get db.score.injected in addition to inbound 
> contributions
> 
>
> Key: NUTCH-336
> URL: http://issues.apache.org/jira/browse/NUTCH-336
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.8
>Reporter: Chris Schneider
>Priority: Minor
> Attachments: NUTCH-336.patch.txt
>
>
> Currently (even with Stefan's fix for NUTCH-324), harvested links have their 
> initial scores set to db.score.injected + (sum of inbound contributions * 
> db.score.link.[internal | external]), but this will place (at least external) 
> harvested links even higher than injected URLs on the fetch list. Perhaps 
> more importantly, this effect cascades.
> As a simple example, if each page in A->B->C->D has exactly one external link 
> and only A is injected, then D will receive an initial score of at least 
> (4*db.score.injected) with the default db.score.link.external of 1.0. Higher 
> values of db.score.injected and db.score.link.external obviously exacerbate 
> this problem.
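
To make the cascade concrete, here is a small illustrative sketch (plain
Java, not Nutch code; the constants merely mirror the db.score.* property
names and their default values):

// Hypothetical sketch of the cascade described above, not Nutch code.
public class ScoreCascade {
    public static void main(String[] args) {
        double scoreInjected = 1.0;      // db.score.injected
        double scoreLinkExternal = 1.0;  // db.score.link.external default
        double score = scoreInjected;    // A is injected
        for (char page = 'B'; page <= 'D'; page++) {
            // The bug: each harvested link gets db.score.injected PLUS its
            // inbound contribution, so the total grows at every hop.
            score = scoreInjected + score * scoreLinkExternal;
            System.out.println(page + ": " + score);
        }
        // Prints B: 2.0, C: 3.0, D: 4.0, i.e. D ends up with at least
        // 4 * db.score.injected, as described above.
    }
}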

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: nutch

2006-08-02 Thread Sami Siren

Most probably you have run out of space in the tmp (local) filesystem.

use properties like


<property>
  <name>mapred.system.dir</name>
  <value></value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value></value>
</property>


in hadoop-site.xml to get over this problem.


[EMAIL PROTECTED] wrote:


I forgot ;-) One more question:
Is this a problem with nutch or with hadoop?

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 02, 2006 11:38 AM

To: nutch-dev@lucene.apache.org
Subject: nutch
Importance: High

I use nutch 0.8 (mapred). Nutch is running on 3 servers.
When nutch tries to index a segment, I get an error on the tasktracker:




nutch/lucene question..

2006-08-02 Thread bruce
hi.

I have a possible project where I'm looking at extracting information from
various public/college websites. I don't need to index the text/content of
the sites; I do need to extract specific information.

As an example, a site might have a course schedule page, which in turn has
links to the departments page, which in turn has links to the class
information page. As a tree structure this would be:
  course listings by semester
    departments for the semester
      classes of each department
        class information

Obviously nutch/lucene has the ability to crawl a given site, but does it
have the ability to somehow link/maintain a given relationship to the
upstream page for a given piece of information?

For my needs, I need to maintain the semester as I follow the link to the
department, etc. This approach allows me to then store the complete course
information in a db, so I can then iterate through the course information.

I can accomplish this now by creating a unique crawling app for each
school. My curiosity is whether nutch/lucene can provide a basic crawling
engine that I could then plug into for my specific needs. I'm also curious
as to the amount of additional development that would be required for my
needs...

thanks

-bruce
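
As an illustrative answer, here is a minimal hypothetical sketch (not the
Nutch API; the class names and URL are invented) of a crawl frontier that
carries upstream context, such as the semester, down to each page it
fetches:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Nutch code: every queued URL carries the
// context inherited from its upstream pages, so a class-information page
// can be stored in the db together with its semester and department.
public class ContextCrawl {
    static class Task {
        final String url;
        final Map<String, String> context; // inherited from parent pages
        Task(String url, Map<String, String> context) {
            this.url = url;
            this.context = context;
        }
    }

    public static void main(String[] args) {
        Deque<Task> frontier = new ArrayDeque<>();
        frontier.add(new Task("http://example.edu/schedule", new HashMap<>()));
        while (!frontier.isEmpty()) {
            Task t = frontier.poll();
            // A real crawler would fetch t.url and parse its outlinks here.
            // For each outlink, copy the parent's context and add whatever
            // this level contributes before enqueueing the child, e.g.:
            Map<String, String> childContext = new HashMap<>(t.context);
            childContext.put("semester", "fall 2006"); // illustrative value
            // frontier.add(new Task(outlinkUrl, childContext));
            System.out.println(t.url + " -> " + childContext);
        }
    }
}

Whether this sits in a per-school plugin on top of nutch's crawler or in a
standalone app, the key point is that the relationship travels with the
link rather than being reconstructed afterwards.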



[jira] Updated: (NUTCH-266) hadoop bug when doing updatedb

2006-08-02 Thread Renaud Richardet (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-266?page=all ]

Renaud Richardet updated NUTCH-266:
---

Attachment: patch.diff

Thank you Sami,

We had a similar problem with Win XP and were able to fix it by using 
hadoop-nightly.jar. However, because of some changes in Hadoop 
(http://issues.apache.org/jira/browse/HADOOP-252), Nutch would not compile 
anymore. The attached patch will solve this. Let us know if there is a better 
way.


> hadoop bug when doing updatedb
> --
>
> Key: NUTCH-266
> URL: http://issues.apache.org/jira/browse/NUTCH-266
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8
> Environment: windows xp, JDK 1.4.2_04
>Reporter: Eugen Kochuev
> Attachments: patch.diff
>
>
> I constantly get the following error message
> 060508 230637 Running job: job_pbhn3t
> 060508 230637 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
> 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
> 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
> 060508 230637 job_pbhn3t
> java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
> at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
> at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
> at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
> at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)
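
For what it's worth, a tiny hypothetical sketch of the guard visible at
the top of the trace (FileUtil.checkDest): if the target file already
exists, however it came to be left behind, the copy/rename refuses to
proceed and the job fails:

import java.io.File;
import java.io.IOException;

// Hypothetical illustration only; mirrors the "Target ... already exists"
// check seen in the stack trace above.
public class CheckDestDemo {
    static void checkDest(File dst) throws IOException {
        if (dst.exists()) {
            throw new IOException("Target " + dst + " already exists");
        }
    }
    public static void main(String[] args) throws IOException {
        File dst = File.createTempFile("map_qjp7tf", ".out"); // stale file
        checkDest(dst); // throws, like FileUtil.checkDest in the trace
    }
}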

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




RE: nutch

2006-08-02 Thread anton
I forgot ;-) One more question:
Is this a problem with nutch or with hadoop?

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 02, 2006 11:38 AM
To: nutch-dev@lucene.apache.org
Subject: nutch
Importance: High

I use nutch 0.8 (mapred). Nutch is running on 3 servers.
When nutch tries to index a segment, I get an error on the tasktracker:


nutch

2006-08-02 Thread anton
I use nutch 0.8 (mapred). Nutch is running on 3 servers.
When nutch tries to index a segment, I get an error on the tasktracker:
060727 215111 task_0025_r_00_1  SEVERE FSError from child
060727 215111 task_0025_r_00_1 org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
060727 215111 task_0025_r_00_1  at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSystem.java:152)
060727 215111 task_0025_r_00_1  at org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:69)
060727 215111 task_0025_r_00_1  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:98)
060727 215111 task_0025_r_00_1  at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
060727 215111 task_0025_r_00_1  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
060727 215111 task_0025_r_00_1  at java.io.DataOutputStream.write(DataOutputStream.java:90)
060727 215111 task_0025_r_00_1  at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:192)
060727 215111 task_0025_r_00_1  at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:873)
060727 215111 task_0025_r_00_1  at org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:760)
060727 215111 task_0025_r_00_1  at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:696)
060727 215111 task_0025_r_00_1  at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:522)
060727 215111 task_0025_r_00_1  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:316)
060727 215111 task_0025_r_00_1  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:755)
060727 215111 task_0025_r_00_1 Caused by: java.io.IOException: No space left on device
060727 215111 task_0025_r_00_1  at java.io.FileOutputStream.writeBytes(Native Method)
060727 215111 task_0025_r_00_1  at java.io.FileOutputStream.write(FileOutputStream.java:260)
060727 215111 task_0025_r_00_1  at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSystem.java:150)
060727 215111 task_0025_r_00_1  ... 12 more


But on the server with the tasktracker, free space on the HDD is 115G. I
tried to get the segment from dfs; the segment occupies 2.4G on the HDD.
Why do I get these errors? Can anybody help me solve this problem?