[jira] Commented: (NUTCH-335) Pdf summary corrupt issue

2006-08-01 Thread Siddharudh nadgeri (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-335?page=comments#action_12424855 ] 

Siddharudh nadgeri commented on NUTCH-335:
--

I searched before, but the solution was not there. Regarding the alternative 
you have given: if you know of any PDF properties that have to be changed, 
please let me know.

Thanks

 Pdf summary corrupt issue
 -

 Key: NUTCH-335
 URL: http://issues.apache.org/jira/browse/NUTCH-335
 Project: Nutch
  Issue Type: Bug
 Environment: As it is a web application, it is not necessary
Reporter: Siddharudh nadgeri

 I am using Nutch search, but for PDF documents it gives a garbage summary 
 such as
 !!###!$%$%$###'$$ ($$$
 Please provide a solution.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




fetcher improvements (was: Re: 0.8 much slower than 0.7)

2006-08-01 Thread Sami Siren

Stefan Groschupf wrote:

Hi,
I have some code using queue based mechanism and java nio.
In my tests it is 4 times faster than the existing fetcher.

But:
+ I need to fix some more bugs
+ we need to refactor the robots.txt part since it is not usable 
outside the http protocol yet.


IMO, the politeness code should also be taken out of http
and made protocol independent.


+ the fetcher does not support pluggable protocols - only http.

I see two ways to go.
Refactor the existing robots.txt parser and handler, but this is a big 
change.


We should do the refactoring, because it would also greatly benefit the 
current fetcher if we could schedule fetching of robots.txt before we try 
to get the content itself, e.g. fetch robots.txt for the first hundred or 
so sites, and after that start fetching content and the unseen robots.txt 
files for sites still in the queue (just an example).


Or, as I might prefer, reimplement robots.txt parsing and handling; this 
would require some more time from me.


In general we should move this discussion to nutch-dev since there 
are more side effects we should discuss.


Now we have it here.

The new fetcher should be an alternative and we should not just remove 
the old fetcher.


+1

--
 Sami Siren


[jira] Resolved: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-08-01 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-318?page=all ]

Sami Siren resolved NUTCH-318.
--

Fix Version/s: 0.8.1
   Resolution: Fixed
 Assignee: Sami Siren

Marking this as resolved because it now works OK in a single-node configuration.

 log4j not proper configured, readdb doesnt give any information
 ---

 Key: NUTCH-318
 URL: http://issues.apache.org/jira/browse/NUTCH-318
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
 Assigned To: Sami Siren
Priority: Critical
 Fix For: 0.8.1, 0.9


 In the latest 0.8 sources the readdb command doesn't dump any information 
 anymore. 
 This is related to the misconfigured log4j.properties file. 
 Changing:
 log4j.rootLogger=INFO,DRFA
 to:
 log4j.rootLogger=INFO,DRFA,stdout
 dumps the information to the console, but not in a nice way. 
 What makes me wonder is that this information should also be in the log 
 file, but it isn't, so there may be problems here as well.
 Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and 
 hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?





[jira] Commented: (NUTCH-266) hadoop bug when doing updatedb

2006-08-01 Thread Sami Siren (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-266?page=comments#action_12424930 ] 

Sami Siren commented on NUTCH-266:
--

Just adding a reminder:

There are two options to get this fixed: use a patched version of 
hadoop-0.4.0, or wait for hadoop-0.5.0.

 hadoop bug when doing updatedb
 --

 Key: NUTCH-266
 URL: http://issues.apache.org/jira/browse/NUTCH-266
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
 Environment: windows xp, JDK 1.4.2_04
Reporter: Eugen Kochuev

 I constantly get the following error message
 060508 230637 Running job: job_pbhn3t
 060508 230637 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
 060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
 060508 230637 job_pbhn3t
 java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
 at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
 at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
 at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
 at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
 Exception in thread main java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
 at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)





[jira] Created: (NUTCH-336) Harvested links shouldn't get db.score.injected in addition to inbound contributions

2006-08-01 Thread Chris Schneider (JIRA)
Harvested links shouldn't get db.score.injected in addition to inbound 
contributions


 Key: NUTCH-336
 URL: http://issues.apache.org/jira/browse/NUTCH-336
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Chris Schneider
Priority: Minor


Currently (even with Stefan's fix for NUTCH-324), harvested links have their 
initial scores set to db.score.injected + (sum of inbound contributions * 
db.score.link.[internal | external]), but this will place (at least external) 
harvested links even higher than injected URLs on the fetch list. Perhaps more 
importantly, this effect cascades.

As a simple example, if each page in A->B->C->D has exactly one external link 
and only A is injected, then D will receive an initial score of at least 
(4*db.score.injected) with the default db.score.link.external of 1.0. Higher 
values of db.score.injected and db.score.link.external obviously exacerbate 
this problem.
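
The cascade in this example can be reproduced with a few lines of arithmetic. The following is only a sketch, not Nutch's scoring code: it assumes each page passes its whole score along its single external link, and the names initial_scores, injected and link_external are hypothetical stand-ins for db.score.injected and db.score.link.external.

```python
def initial_scores(chain_length, injected=1.0, link_external=1.0):
    """Initial scores along a chain where only the first page is injected.

    Assumption (not taken from Nutch's code): every page has exactly one
    external outlink and passes its whole score along it.
    """
    scores = [injected]  # the injected page gets db.score.injected only
    for _ in range(chain_length - 1):
        inbound = scores[-1]  # single inbound contribution from the previous page
        # behaviour described in the issue: injected score plus weighted inbound
        scores.append(injected + inbound * link_external)
    return scores


print(initial_scores(4))  # [1.0, 2.0, 3.0, 4.0]
```

Under these assumptions the fourth page ends up at 4.0, that is 4*db.score.injected, and every further hop adds another db.score.injected on top.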





[jira] Created: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-01 Thread Jeremy Huylebroeck (JIRA)
Fetcher ignores the fetcher.parse value configured in config file
-

 Key: NUTCH-337
 URL: http://issues.apache.org/jira/browse/NUTCH-337
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.9
Reporter: Jeremy Huylebroeck
Priority: Trivial


When Fetcher is called from the command line with the noParsing parameter, 
everything is fine.
If noParsing is not given, the value from nutch-site.xml (or 
nutch-default.xml) should be used, but true is always passed to the 
call to fetch.
It should be the value from the configuration.
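
The intended precedence could be sketched as follows. This is an illustration in Python, not the Fetcher's actual code: the function name resolve_parsing, the dict standing in for the loaded configuration, the exact spelling "-noParsing" of the argument, and the fallback default of true are all assumptions.

```python
def resolve_parsing(args, conf, default=True):
    """Decide whether the fetcher should parse while fetching.

    args: command-line arguments (list of strings).
    conf: dict standing in for values loaded from nutch-site.xml /
          nutch-default.xml (hypothetical representation).
    """
    if "-noParsing" in args:  # assumed spelling of the noParsing parameter
        return False  # an explicit command-line flag wins
    value = conf.get("fetcher.parse")
    if value is None:
        return default  # assumed fallback when the property is missing
    return str(value).strip().lower() == "true"


print(resolve_parsing(["segment"], {"fetcher.parse": "false"}))  # False
print(resolve_parsing(["segment", "-noParsing"], {"fetcher.parse": "true"}))  # False
print(resolve_parsing(["segment"], {"fetcher.parse": "true"}))  # True
```

The first call shows the case the issue is about: without the flag, the configured value decides rather than a hard-coded true.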

