[jira] Commented: (NUTCH-335) Pdf summary corrupt issue
[ http://issues.apache.org/jira/browse/NUTCH-335?page=comments#action_12424855 ]

Siddharudh nadgeri commented on NUTCH-335:
------------------------------------------

I searched before, but there was no solution there, only the alternative you have given. If you know of any PDF properties that have to be changed, please let me know.

Thanks

Pdf summary corrupt issue
-------------------------
Key: NUTCH-335
URL: http://issues.apache.org/jira/browse/NUTCH-335
Project: Nutch
Issue Type: Bug
Environment: As it is a web application, it is not necessary
Reporter: Siddharudh nadgeri

I am using the Nutch search, but for PDF documents it gives a summary made of garbage like
!!###!$%$%$###'$$ ($$$
Please provide the solution.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
fetcher improvements (was: Re: 0.8 much slower than 0.7)
Stefan Groschupf wrote:

> Hi,
> I have some code using a queue-based mechanism and Java NIO. In my tests it is 4 times faster than the existing fetcher.
> But:
> + I need to fix some more bugs
> + we need to refactor the robots.txt part, since it is not usable outside the http protocols yet.

IMO, the code for politeness should also be taken out of http and made protocol independent.

> + the fetcher does not support pluggable protocols - only http.
> I see two ways to go. Refactor the existing robots.txt parser and handler, but this is a big change.

We should do the refactoring, because it would also greatly benefit the current fetcher if we could schedule fetching of robots.txt before we try to get the content itself, e.g. fetch robots.txt for the first 100 or so sites and after that start fetching content and the unseen robots.txt files for sites still on the queue (just an example).

> Or I might prefer to reimplement robots.txt parsing and handling; this would require some more time from me.
> In general we should move this discussion to nutch-dev, since there are more side effects we should discuss.

Now we have it here.

> The new fetcher should be an alternative and we should not just remove the old fetcher.

+1

--
Sami Siren
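The scheduling idea in the reply above (robots.txt first, content afterwards) can be sketched roughly as follows. This is a minimal illustration, not code from either fetcher: the class RobotsFirstQueue and its methods are hypothetical, sites are keyed by scheme and authority so that nothing is tied to http, and politeness delays are left out entirely.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: per-site queues that schedule robots.txt before content. */
public class RobotsFirstQueue {
    // One FIFO queue of content URLs per site, in order of first appearance.
    private final Map<String, Deque<String>> perSite = new LinkedHashMap<>();

    /** Adds a content URL to the queue of its site (scheme + authority). */
    public void add(String url) {
        URI uri = URI.create(url);
        String site = uri.getScheme() + "://" + uri.getAuthority();
        Deque<String> queue = perSite.get(site);
        if (queue == null) {
            queue = new ArrayDeque<>();
            perSite.put(site, queue);
        }
        queue.add(url);
    }

    /**
     * Returns a fetch order: robots.txt for every known site first,
     * then content URLs taken round-robin across sites.
     */
    public List<String> fetchOrder() {
        List<String> order = new ArrayList<>();
        for (String site : perSite.keySet()) {
            order.add(site + "/robots.txt");
        }
        // Work on copies so that calling this method does not drain the queues.
        List<Deque<String>> queues = new ArrayList<>();
        for (Deque<String> queue : perSite.values()) {
            queues.add(new ArrayDeque<>(queue));
        }
        boolean added = true;
        while (added) {
            added = false;
            for (Deque<String> queue : queues) {
                String next = queue.poll();
                if (next != null) {
                    order.add(next);
                    added = true;
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        RobotsFirstQueue q = new RobotsFirstQueue();
        q.add("http://a.example/1");
        q.add("http://a.example/2");
        q.add("http://b.example/x");
        for (String url : q.fetchOrder()) {
            System.out.println(url);
        }
    }
}
```

A real fetcher would also have to parse each robots.txt before releasing that site's content URLs and enforce a per-site delay; the sketch only shows the ordering.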
[jira] Resolved: (NUTCH-318) log4j not proper configured, readdb doesnt give any information
[ http://issues.apache.org/jira/browse/NUTCH-318?page=all ]

Sami Siren resolved NUTCH-318.
-----------------------------
Fix Version/s: 0.8.1
Resolution: Fixed
Assignee: Sami Siren

Marking this as resolved because it is now working OK in a single-node config.

log4j not proper configured, readdb doesnt give any information
---------------------------------------------------------------
Key: NUTCH-318
URL: http://issues.apache.org/jira/browse/NUTCH-318
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Assigned To: Sami Siren
Priority: Critical
Fix For: 0.8.1, 0.9

In the latest 0.8 sources the readdb command doesn't dump any information anymore. This is related to the misconfigured log4j.properties file.
Changing:
log4j.rootLogger=INFO,DRFA
to:
log4j.rootLogger=INFO,DRFA,stdout
dumps the information to the console, but not in a nice way.
What makes me wonder is that this information should also be in the log file, but it isn't, so there may be problems here as well.
Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?
[jira] Commented: (NUTCH-266) hadoop bug when doing updatedb
[ http://issues.apache.org/jira/browse/NUTCH-266?page=comments#action_12424930 ]

Sami Siren commented on NUTCH-266:
----------------------------------

Just adding a reminder: there are two options to get this fixed, either use a patched version of hadoop-0.4.0 or wait until hadoop-0.5.0.

hadoop bug when doing updatedb
------------------------------
Key: NUTCH-266
URL: http://issues.apache.org/jira/browse/NUTCH-266
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
Environment: Windows XP, JDK 1.4.2_04
Reporter: Eugen Kochuev

I constantly get the following error message:

060508 230637 Running job: job_pbhn3t
060508 230637 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
060508 230637 job_pbhn3t
java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
    at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
    at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
    at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)
[jira] Created: (NUTCH-336) Harvested links shouldn't get db.score.injected in addition to inbound contributions
Harvested links shouldn't get db.score.injected in addition to inbound contributions
------------------------------------------------------------------------------------
Key: NUTCH-336
URL: http://issues.apache.org/jira/browse/NUTCH-336
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8
Reporter: Chris Schneider
Priority: Minor

Currently (even with Stefan's fix for NUTCH-324), harvested links have their initial scores set to db.score.injected + (sum of inbound contributions * db.score.link.[internal | external]), but this will place (at least external) harvested links even higher than injected URLs on the fetch list.

Perhaps more importantly, this effect cascades. As a simple example, if each page in the link chain A-B-C-D has exactly one external link and only A is injected, then D will receive an initial score of at least (4*db.score.injected) with the default db.score.link.external of 1.0. Higher values of db.score.injected and db.score.link.external obviously exacerbate this problem.
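The cascade in the A-B-C-D example can be checked with a small calculation. The sketch below is only an illustration of the formula quoted in the report, not Nutch's scoring code: the class and method names are hypothetical, a db.score.injected of 1.0 is an assumed value, and it assumes a page with a single outlink contributes its whole score to the page it links to.

```java
/** Hypothetical sketch of the score cascade described in NUTCH-336. */
public class ScoreCascade {
    /**
     * Initial scores along a chain in which only the first page is injected
     * and every page has exactly one external link to the next page.
     * Assumption (not taken from Nutch source): a page with one outlink
     * passes its full score on as the inbound contribution.
     */
    public static double[] chainScores(double injected, double linkExternal, int length) {
        if (length <= 0) {
            return new double[0];
        }
        double[] scores = new double[length];
        scores[0] = injected; // page A is injected and has no inbound links
        for (int i = 1; i < length; i++) {
            double inbound = scores[i - 1];
            // Formula from the report: db.score.injected + inbound * db.score.link.external
            scores[i] = injected + inbound * linkExternal;
        }
        return scores;
    }

    public static void main(String[] args) {
        String[] names = {"A", "B", "C", "D"};
        double[] scores = chainScores(1.0, 1.0, names.length);
        for (int i = 0; i < scores.length; i++) {
            System.out.println(names[i] + " = " + scores[i]);
        }
    }
}
```

Under these assumptions the scores grow as 1.0, 2.0, 3.0, 4.0, so D ends up at 4*db.score.injected, matching the figure given in the report.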
[jira] Created: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file
Fetcher ignores the fetcher.parse value configured in config file
-----------------------------------------------------------------
Key: NUTCH-337
URL: http://issues.apache.org/jira/browse/NUTCH-337
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8, 0.9
Reporter: Jeremy Huylebroeck
Priority: Trivial

When Fetcher is called from the command line and the noParsing parameter is given, everything is fine.
If noParsing is not given, the value in nutch-site.xml (or nutch-default.xml) should be used, but instead true is always passed to the call to fetch. It should be the value from the conf.
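The behaviour the report asks for could look roughly like the sketch below. It is not the Fetcher source: the class ParseFlag is hypothetical, a plain Map stands in for the loaded configuration, the "-noParsing" spelling of the flag is an assumption, and so is the fallback to true when fetcher.parse is not set at all.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of resolving the parse flag as NUTCH-337 requests. */
public class ParseFlag {
    /**
     * Decides whether the fetcher should parse.
     * conf stands in for the merged nutch-default.xml / nutch-site.xml values.
     */
    public static boolean resolveParsing(String[] args, Map<String, String> conf) {
        for (String arg : args) {
            if ("-noParsing".equals(arg)) { // assumed spelling of the flag
                return false; // an explicit command-line request wins
            }
        }
        // No flag given: use the configured value rather than a hard-coded true.
        // Falling back to "true" when the key is missing is an assumption.
        return Boolean.parseBoolean(conf.getOrDefault("fetcher.parse", "true"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("fetcher.parse", "false");
        System.out.println(resolveParsing(new String[] {"segment"}, conf));
        System.out.println(resolveParsing(new String[] {"segment", "-noParsing"}, conf));
    }
}
```

With fetcher.parse set to false in the stand-in configuration, both calls in main print false; the bug described in the report corresponds to the first call returning true regardless of the configuration.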