RE: nutch
My settings:

  mapred.local.dir = /hadoop/mapred/local
      The local directory where MapReduce stores intermediate data files.
      May be a comma-separated list of directories on different devices
      in order to spread disk i/o.

  mapred.system.dir = /hadoop/mapred/system
      The shared directory where MapReduce stores control files.

The device mounted on "/" has 115G free:

[EMAIL PROTECTED] /]# df -h
Filesystem   Size  Used  Avail  Use%  Mounted on
/dev/sda2    133G   13G   113G   11%  /

Anybody have other ideas?

-----Original Message-----
From: Sami Siren [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 02, 2006 6:01 PM
To: nutch-dev@lucene.apache.org
Subject: Re: nutch
Importance: High

Most probably you have run out of space in the tmp (local) filesystem. Use properties like

  mapred.system.dir
  mapred.local.dir

in hadoop-site.xml to get over this problem.

[EMAIL PROTECTED] wrote:
> I forgot ;-) One more question:
> Is this a problem with nutch or with hadoop?
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, August 02, 2006 11:38 AM
> To: nutch-dev@lucene.apache.org
> Subject: nutch
> Importance: High
>
> I use nutch 0.8 (mapred). Nutch is started on 3 servers.
> When nutch tries to index a segment I get an error on the tasktracker:
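For reference, a minimal hadoop-site.xml sketch along the lines Sami suggests, using the paths quoted above. The hadoop.tmp.dir entry is an assumption on my part: it is the base for Hadoop's other temporary files, and relocating it off a small /tmp partition is often what actually cures "No space left on device" even when the root filesystem has room.

  <?xml version="1.0"?>
  <configuration>
    <!-- Local directory for intermediate MapReduce data; may be a
         comma-separated list of directories to spread disk i/o. -->
    <property>
      <name>mapred.local.dir</name>
      <value>/hadoop/mapred/local</value>
    </property>
    <!-- Shared directory where MapReduce stores control files. -->
    <property>
      <name>mapred.system.dir</name>
      <value>/hadoop/mapred/system</value>
    </property>
    <!-- Assumption, not from the thread: base directory for other
         temporary files, kept off the /tmp partition. -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/hadoop/tmp</value>
    </property>
  </configuration>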
[jira] Updated: (NUTCH-336) Harvested links shouldn't get db.score.injected in addition to inbound contributions
[ http://issues.apache.org/jira/browse/NUTCH-336?page=all ]

Chris Schneider updated NUTCH-336:
----------------------------------
    Attachment: NUTCH-336.patch.txt

Here's a patch that fixes the problem. It separates a new injectionScore API out from the initialScore API.

> Harvested links shouldn't get db.score.injected in addition to inbound
> contributions
> ----------------------------------------------------------------------
>
>              Key: NUTCH-336
>              URL: http://issues.apache.org/jira/browse/NUTCH-336
>          Project: Nutch
>       Issue Type: Bug
>       Components: fetcher
> Affects Versions: 0.8
>         Reporter: Chris Schneider
>         Priority: Minor
>      Attachments: NUTCH-336.patch.txt
>
> Currently (even with Stefan's fix for NUTCH-324), harvested links have their
> initial scores set to db.score.injected + (sum of inbound contributions *
> db.score.link.[internal | external]), but this will place (at least external)
> harvested links even higher than injected URLs on the fetch list. Perhaps
> more importantly, this effect cascades.
> As a simple example, if each page in A->B->C->D has exactly one external link
> and only A is injected, then D will receive an initial score of at least
> (4*db.score.injected) with the default db.score.link.external of 1.0. Higher
> values of db.score.injected and db.score.link.external obviously exacerbate
> this problem.
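To make the cascade concrete, here is a hypothetical sketch of the separation the patch describes. The names and values are illustrative only, not the actual Nutch ScoringFilter API; it just reproduces the 4*db.score.injected arithmetic from the A->B->C->D example above.

  // Illustrative sketch of NUTCH-336: only injected URLs should get
  // db.score.injected; harvested links should start from inbound
  // contributions alone. Names and values are hypothetical.
  public class ScoreExample {

    static final float DB_SCORE_INJECTED = 1.0f;       // db.score.injected
    static final float DB_SCORE_LINK_EXTERNAL = 1.0f;  // db.score.link.external

    // Before the patch: every newly discovered URL gets the injected
    // score on top of its inbound link contributions.
    static float oldInitialScore(float inboundContributions) {
      return DB_SCORE_INJECTED + inboundContributions * DB_SCORE_LINK_EXTERNAL;
    }

    // After the patch: injection score is a separate notion...
    static float injectionScore() {
      return DB_SCORE_INJECTED;
    }

    // ...and harvested links start from inbound contributions alone.
    static float newInitialScore(float inboundContributions) {
      return inboundContributions * DB_SCORE_LINK_EXTERNAL;
    }

    public static void main(String[] args) {
      // Chain A->B->C->D, one external link per page, only A injected.
      float a = injectionScore();            // 1.0
      float dOld = oldInitialScore(oldInitialScore(oldInitialScore(a)));
      System.out.println("old score of D: " + dOld);  // 4.0 = 4*db.score.injected

      float dNew = newInitialScore(newInitialScore(newInitialScore(a)));
      System.out.println("new score of D: " + dNew);  // 1.0, no cascade
    }
  }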
Re: nutch
Most probably you have run out of space in the tmp (local) filesystem. Use properties like

  mapred.system.dir
  mapred.local.dir

in hadoop-site.xml to get over this problem.

[EMAIL PROTECTED] wrote:
> I forgot ;-) One more question:
> Is this a problem with nutch or with hadoop?
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, August 02, 2006 11:38 AM
> To: nutch-dev@lucene.apache.org
> Subject: nutch
> Importance: High
>
> I use nutch 0.8 (mapred). Nutch is started on 3 servers.
> When nutch tries to index a segment I get an error on the tasktracker:
nutch/lucene question..
Hi. I have a possible project where I'm looking at extracting information from various public/college websites. I don't need to index the text/content of the sites; I do need to extract specific information.

As an example, a site might have a course schedule page, which in turn has links to the departments page, which in turn has links to the class information page. As a tree structure this would be:

  course listings by semester
    departments for the semester
      classes of each department
        class information

Obviously nutch/lucene has the ability to crawl a given site; does it have the ability to somehow link/maintain a given relationship to the upstream page for a given piece of information? For my needs, I need to keep track of the semester as I follow the links down to the department, etc. This approach would allow me to store the complete course information in a db, so I can then iterate through it.

I can accomplish this now by creating a unique crawling app for each school. My curiosity is whether nutch/lucene can provide a basic crawling engine that I could then plug into for my specific needs. I'm also curious about the amount of additional development that would be required.

Thanks
-bruce
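For comparison, a minimal sketch of the core idea in plain Java (not the Nutch plugin API): each queued page carries the upstream context, such as the semester, it was discovered under. The URL and the fetchLinks helper are hypothetical placeholders.

  import java.util.ArrayDeque;
  import java.util.Deque;
  import java.util.List;
  import java.util.Map;

  // Sketch of a context-carrying crawl: every page remembers the
  // upstream context (e.g. the semester) of the page that linked to it.
  public class ContextCrawler {

    record Page(String url, Map<String, String> context) {}

    public static void main(String[] args) {
      Deque<Page> frontier = new ArrayDeque<>();
      // Seed: the course-listings page, tagged with its semester
      // (hypothetical URL for illustration).
      frontier.push(new Page("http://school.example/courses/fall2006",
                             Map.of("semester", "fall2006")));

      while (!frontier.isEmpty()) {
        Page page = frontier.pop();
        for (String link : fetchLinks(page.url())) {
          // Child pages inherit the parent's context, so a class-info
          // page still knows which semester it belongs to.
          frontier.push(new Page(link, page.context()));
        }
        // ...extract department/class fields here and store them in a
        // db, keyed by page.context().get("semester").
      }
    }

    // Placeholder: a real crawler would fetch the page and parse out
    // its outlinks with an HTML parser.
    static List<String> fetchLinks(String url) {
      return List.of();
    }
  }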
[jira] Updated: (NUTCH-266) hadoop bug when doing updatedb
[ http://issues.apache.org/jira/browse/NUTCH-266?page=all ]

Renaud Richardet updated NUTCH-266:
-----------------------------------
    Attachment: patch.diff

Thank you Sami. We had a similar problem with Win XP and were able to fix it by using hadoop-nightly.jar. However, because of some changes in Hadoop (http://issues.apache.org/jira/browse/HADOOP-252), Nutch would not compile anymore. The attached patch will solve this. Let us know if there is a better way.

> hadoop bug when doing updatedb
> ------------------------------
>
>              Key: NUTCH-266
>              URL: http://issues.apache.org/jira/browse/NUTCH-266
>          Project: Nutch
>       Issue Type: Bug
> Affects Versions: 0.8
>      Environment: windows xp, JDK 1.4.2_04
>         Reporter: Eugen Kochuev
>      Attachments: patch.diff
>
> I constantly get the following error message:
>
> 060508 230637 Running job: job_pbhn3t
> 060508 230637 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
> 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
> 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
> 060508 230637 job_pbhn3t
> java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
>         at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
>         at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
>         at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
>         at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)
RE: nutch
I forgot ;-) One more question: is this a problem with nutch or with hadoop?

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 02, 2006 11:38 AM
To: nutch-dev@lucene.apache.org
Subject: nutch
Importance: High

I use nutch 0.8 (mapred). Nutch is started on 3 servers.
When nutch tries to index a segment I get an error on the tasktracker:
nutch
I use nutch 0.8 (mapred). Nutch is started on 3 servers. When nutch tries to index a segment I get this error on the tasktracker:

060727 215111 task_0025_r_00_1 SEVERE FSError from child
060727 215111 task_0025_r_00_1 org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
060727 215111 task_0025_r_00_1   at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSystem.java:152)
060727 215111 task_0025_r_00_1   at org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:69)
060727 215111 task_0025_r_00_1   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:98)
060727 215111 task_0025_r_00_1   at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
060727 215111 task_0025_r_00_1   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
060727 215111 task_0025_r_00_1   at java.io.DataOutputStream.write(DataOutputStream.java:90)
060727 215111 task_0025_r_00_1   at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:192)
060727 215111 task_0025_r_00_1   at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:873)
060727 215111 task_0025_r_00_1   at org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:760)
060727 215111 task_0025_r_00_1   at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:696)
060727 215111 task_0025_r_00_1   at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:522)
060727 215111 task_0025_r_00_1   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:316)
060727 215111 task_0025_r_00_1   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:755)
060727 215111 task_0025_r_00_1 Caused by: java.io.IOException: No space left on device
060727 215111 task_0025_r_00_1   at java.io.FileOutputStream.writeBytes(Native Method)
060727 215111 task_0025_r_00_1   at java.io.FileOutputStream.write(FileOutputStream.java:260)
060727 215111 task_0025_r_00_1   at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSystem.java:150)
060727 215111 task_0025_r_00_1   ... 12 more

But the server running the tasktracker has 115G of free space on its HDD. I tried getting the segment from dfs; the segment occupies 2.4G on the HDD. Why do I get these errors? Can anybody help me solve this problem?