Nes Yarug wrote:
> Hi all,
>
> I'm new to Nutch and I have a few questions that I hope to get some answers
> on. Thanks in advance for any replies.
>
> I want to use Nutch to index a web site I'm maintaining. I've followed the
> tutorial for intranet crawling and used a list of links (17420 links to 8710
> pages, each page has two unique links) from my site to crawl initially.
Actually, you don't need to provide a full list of links to Nutch. You 
can let it discover links as it crawls your site, and constrain them 
using crawl-urlfilter.txt and regex-urlfilter.txt.
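
To make the filtering idea concrete, here is a rough Python sketch of how "+regex" / "-regex" rules of the kind found in crawl-urlfilter.txt can constrain a crawl. It assumes the rules are tried top to bottom, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected; please check that against your Nutch version. The host www.example.com and the helper names parse_rules and accepts are placeholders made up for this illustration, not part of Nutch.

```python
import re

# Hypothetical rule set in the "+regex" / "-regex" style.
# www.example.com is a placeholder host, not a real configuration.
RULES = r"""
# skip URLs containing characters that usually mark queries
-[?*!@=]
# accept everything on the placeholder host
+^http://www\.example\.com/
# reject anything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the URL wins."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: no matching rule means the URL is dropped
```

With rules like these, seeding the crawl with just the site's front page should be enough, since links discovered on the placeholder host pass the filter and everything else is dropped.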
> The command I used was:
>
> bin/nutch crawl urls -dir crawl -depth 20 -topN 100
>
> The crawl completed, but I'm sure that when I was testing the search it has
> not indexed a lot of pages. What I understand from the following command it
> only indexed 1527 of 21378 pages:
>
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:     21378
> retry 0:        20878
> retry 1:        487
> retry 2:        10
> retry 3:        3
> min score:      0.014
> avg score:      84.405266
> max score:      37106.03
> status 1 (DB_unfetched):        19848
> status 2 (DB_fetched):  1527
> status 3 (DB_gone):     3
> CrawlDb statistics: done
>
>
> Now my questions:
>
> 1) Will Nutch automatically continue to index the rest of the URLs even
> though the initial crawl finished (through some internal scheduler of some
> sort)?
No, not on its own. You will need to refetch, or better, increase the 
depth, until all your pages are fetched.
>
> 2) All of my site's pages at the moment are contained in two languages (each
> page has exactly two languages, the lang attribute on the html tag of each
> page contains the language identifier). When searching, is there a way to
> only return pages in a specific language? I know the Nutch UI is localised,
> but it will still return pages in English if my UI language is German for
> example. I want it to return German pages only (<html lang="de">) when
> searching through the German UI. Is that possible?
Try using "lang:" in your query. I'm not sure it's working, though...
From the javadoc: "LanguageQueryFilter.java should handle "lang:" 
query clauses, causing them to search the "lang" field indexed by 
LanguageIdentifier" (see also LanguageIndexingFilter.java).

HTH,
Renaud


-- 
renaud richardet                           +1 617 230 9112
renaud <at> oslutions.com         http://www.oslutions.com


_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
