[Nutch-dev] Agency services

2006-08-29 Thread sz7788a
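To the company manager or finance officer: Hello! Our company can act as an agent for many kinds of invoices. You are welcome to call to inquire or discuss: 13824373512, Mr. Lin. Invoices can be checked and verified online. We look forward to sincere cooperation with you! Apologies if this message disturbs you. Thank you. Email: [EMAIL PROTECTED]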

Re: [Nutch-dev] Hadoop job question

2006-08-29 Thread HUYLEBROECK Jeremy RD-ILAB-SSF
Thanks for the pointer. It does the job perfectly! -----Original Message----- From: Dennis Kubes [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 29, 2006 7:59 AM To: nutch-dev@lucene.apache.org Subject: Re: Hadoop job question Although it is kinda hacking the system you may be able to do it in

Re: [Nutch-dev] [Nutch Wiki] Update of "RunNutchInEclipse" by UrosG

2006-08-29 Thread Uroš Gruber
Stefan Groschupf wrote: > Hi, > >> + You may have problems with some imports in parse-mp3 and parse-rtf >> plugins. Because of incompatibility with apache licence they were >> left from sources. You can find it here: >> + >> + http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/

Re: [Nutch-dev] [Nutch Wiki] Update of "RunNutchInEclipse" by UrosG

2006-08-29 Thread Stefan Groschupf
Hi, > + You may have problems with some imports in parse-mp3 and parse- > rtf plugins. Because of incompatibility with apache licence they > were left from sources. You can find it here: > + > + http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/ > lib/ > + > + http://nutch.cvs.

Re: [Nutch-dev] books (and articles) about search engine algorithms

2006-08-29 Thread Andrzej Bialecki
Mladen Adamovic wrote: > Hi! > > I want to get more insight into various search engine algorithms. I > have wide knowledge of standard data structures & algorithms > (hashvalues, trees, graphs, etc.). I thought that Lucene would be > good place to start to seek for information and indeed I've f

[Nutch-dev] books (and articles) about search engine algorithms

2006-08-29 Thread Mladen Adamovic
Hi! I want to get more insight into various search engine algorithms. I have wide knowledge of standard data structures & algorithms (hash values, trees, graphs, etc.). I thought that Lucene would be a good place to start looking for information and indeed I've found some decent information at N

Re: [Nutch-dev] Hadoop job question

2006-08-29 Thread Dennis Kubes
Although it is kinda hacking the system you may be able to do it in the map method by writing a custom MapRunner and having an object that lives in the MapRunner but that you set into each mapper instance. Dennis HUYLEBROECK Jeremy RD-ILAB-SSF wrote: > I currently have a MR task that reads a Se
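The pattern Dennis describes (a runner that owns one object and sets it into the mapper before feeding it records) can be modelled without Hadoop. The following is a minimal sketch in plain Python, not Hadoop's MapRunner API: the names SharedCounter, CountingMapper, CustomMapRunner and set_shared are all hypothetical and exist only to illustrate the idea.

```python
class SharedCounter:
    """Hypothetical object that lives in the runner and is shared with the mapper."""

    def __init__(self):
        self.seen = 0


class CountingMapper:
    """Hypothetical mapper; the runner injects the shared object before mapping."""

    def __init__(self):
        self.shared = None

    def set_shared(self, shared):
        self.shared = shared

    def map(self, key, value, output):
        # Update state owned by the runner, then emit a transformed record.
        self.shared.seen += 1
        output.append((key, value.upper()))


class CustomMapRunner:
    """Hypothetical runner: owns the shared object and drives the map loop."""

    def __init__(self, mapper_factory):
        self.shared = SharedCounter()
        self.mapper_factory = mapper_factory

    def run(self, records):
        mapper = self.mapper_factory()
        mapper.set_shared(self.shared)  # set the runner's object into the mapper
        output = []
        for key, value in records:
            mapper.map(key, value, output)
        return output


runner = CustomMapRunner(CountingMapper)
result = runner.run([("a", "x"), ("b", "y")])
```

After the run, the runner still holds the shared object, so whatever the mapper accumulated in it is available once all records have been processed.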

Re: [Nutch-dev] Missing pages & anchor text

2006-08-29 Thread Doug Cook
Hi Stefan, Yes, you're right. The index built without deduping does not have the first instance of the problem (though of course, it's also filled with duplicates, so it has other problems). It still shows the problems with missing redirects, though this could be something else (will investigate

[Nutch-dev] Nutch internals

2006-08-29 Thread Uroš Gruber
Hi, I am making some changes in CrawlDatum, but there are some things I don't quite understand. My idea is to add an int hop field to CrawlDatum and set it to 0 in Injector. Then, after fetching, the hop of other URLs can be calculated as the parent URL's hop + 1. I am trying to find where new URLs are added to the webDB. If somebody could expl
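The hop idea in this message can be illustrated independently of Nutch. The sketch below is plain Python over a hypothetical in-memory link graph, not Nutch code: assign_hops, the example URLs, and the choice to keep the first (smallest) hop when a URL is reached more than once are all assumptions made for illustration.

```python
from collections import deque


def assign_hops(seeds, outlinks):
    """Give injected seeds hop 0 and every newly found URL its parent's hop + 1."""
    hops = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        parent = queue.popleft()
        for child in outlinks.get(parent, []):
            # Assumption: a URL keeps the hop from its first discovery.
            if child not in hops:
                hops[child] = hops[parent] + 1
                queue.append(child)
    return hops


# Hypothetical link graph: page -> list of outlinks.
links = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://b.example/", "http://c.example/"],
}
hops = assign_hops(["http://a.example/"], links)
```

Because the traversal is breadth-first, the first discovery of a URL is also its shortest distance from a seed, which is why keeping the first value is enough in this model.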

Re: [Nutch-dev] Missing pages & anchor text

2006-08-29 Thread Stefan Groschupf
Hi Doug, I'm pretty sure that your problem is related to the deduping of your index. In general the hash of the content of a page is used as the key for the dedup tool. We ran into the forwarding problem in another case as well. https://issues.apache.org/jira/browse/NUTCH-353 So maybe we should
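To see why keying on a content hash can make a page disappear, here is a minimal sketch in plain Python. It is not the Nutch dedup tool: the function dedup_by_content, the use of MD5, the example URLs, and the rule of keeping the first URL seen for each hash are all assumptions chosen for illustration.

```python
import hashlib


def dedup_by_content(pages):
    """Keep one URL per distinct content hash (assumption: the first one seen)."""
    kept = {}
    for url, content in pages:
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        kept.setdefault(digest, url)  # later URLs with the same hash are dropped
    return list(kept.values())


# Hypothetical case: a forwarding URL and its target carry identical content.
pages = [
    ("http://example.com/old", "same body"),
    ("http://example.com/new", "same body"),
    ("http://example.com/other", "different body"),
]
result = dedup_by_content(pages)
```

In this model the forwarding source and its target collapse into a single entry, so whichever URL loses the tie is missing from the index even though it was fetched.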