Recrawling in Nutch 2.x

2014-05-24 Thread Ali rahmani
Dear Guys, we are working on a search engine, and we have to use version 2.x (due to its ability to connect to HBase). We tried tens of re-crawling scripts but none of them works. Is there any re-crawling script for Nutch 2.x? We also added "db.fetch.interval.default" to the "nutch-site.xml" file but

Re: Importance of Score

2014-05-24 Thread Sebastian Nagel
Hi Vangelis, > Cons: Scoring is not used for selection Domains (hosts) at the start of a > region > (mapper input) have the highest chance to get selected. > > I guess that the first line is wrong and should be updated. Afaics, that belongs to section "Things for future development", resp. "Sug

Re: Indexing Metatags

2014-05-24 Thread Sebastian Nagel
Hi Michael, does it work if metatags in "index.parse.md" are lowercased? index.parse.md metatag.groupsallowed,metatag.gtitle See https://issues.apache.org/jira/browse/NUTCH-1561 Sorry, that's an open issue for one year now. If you find time to review the patch, would be great! Thanks, Sebas

Nutch fetch local files with arbitrary mapped URLs

2014-05-24 Thread Martin Aesch
Hi all, I have a bunch of HTML files sitting in my file system. I know the http:// URL of each HTML file. If I just fetch from my file system, I will have file:// URLs, but I would like to map them to the http:// address or to any arbitrary address. Is there any halfway non-hackish possibility f