Hi, I am having problems with search results in Nutch. I am using nutch-0.7.2 and crawling only hand-picked pages. I have crawled 5 sites in total.
Many of my results come from one site (let's say Site#A), and they are always ranked higher than results from the other 4 sites. When I click on 'Show all hits', the results on the FIRST 10 PAGES are all from Site#A; only the last results are from the other sites.

I believe the reason is:
a) Site#A has a lot of internal links to its own pages on every page.
b) For the query I typed, there are 63 results in total, and 55 of them come from Site#A. Hence, their score is high.

Q1: Do you know how I can solve this problem?
Q2: Since I am hand-picking the exact pages, which fields can I
    a) reduce/disable? Like URL, anchor... what else?
    b) increase/enable? Like title... what else? Since the focus is going to be on content, which fields do I need to increase?
    c) Can I change some class or Nutch behavior other than this?
Q3: I have already crawled 5 sites. Should I recrawl these sites with these values set?
Q4: If I don't need to recrawl, do I need to run some other operations, like re-indexing?
Q5: Will it help if I move to a newer version of Nutch? Should I recrawl after moving to the newer version, since the previous crawl data is not binary compatible?

Any help will be appreciated. Thanks.

--
View this message in context: http://www.nabble.com/nutch-search-results-problem-tf3648963.html#a10192478
Sent from the Nutch - User mailing list archive at Nabble.com.
