Re: [Nutch-general] New to Nutch, a few questions

Nes Yarug Wed, 31 Jan 2007 02:59:03 -0800

Thank you everyone for your replies.

I have set up the recrawl script from
http://wiki.apache.org/nutch/IntranetRecrawl and it has now been running
for over 12 hours, so I guess it will index many more pages.


That leaves the question about language-specific search. I have tried adding
the lang: clause to my search query by appending lang:en, but that does not
return any results (as if lang:en became part of the actual query text).
The URL then looks like this:
search.jsp?query=help+lang%3Aen&hitsPerPage=10&lang=en
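
To make clear what that URL actually carries, here is a minimal sketch that
decodes it with Python's standard library. It only shows what the servlet
receives; it says nothing about how Nutch itself interprets the clause.

```python
from urllib.parse import urlsplit, parse_qs

# The search URL from this message, unchanged.
url = "search.jsp?query=help+lang%3Aen&hitsPerPage=10&lang=en"

# parse_qs turns '+' into a space and '%3A' into ':'.
params = parse_qs(urlsplit(url).query)

for name, values in params.items():
    print(name, "=", values[0])
```

So the clause arrives inside the query parameter as the text "help lang:en",
next to a separate lang=en parameter.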

Has anyone used a language-specific search before? Do I need to add a new
(hidden) input field on the search form to specify the language instead of
appending it to the query? That would be my preference anyway, as I want the
language-specific search to be transparent to the user.
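
One way to keep the language restriction out of the user's sight would be to
append the clause on the server side before the query is submitted. The
sketch below is only an illustration: build_search_url is a hypothetical
helper, not part of Nutch, and it assumes the searcher honors the lang:
clause, which this thread has not yet confirmed.

```python
from urllib.parse import urlencode

def build_search_url(user_query, lang, hits_per_page=10):
    """Hypothetical helper: add a lang: clause to whatever the user typed."""
    full_query = f"{user_query} lang:{lang}"
    # urlencode escapes the space as '+' and the colon as '%3A'.
    return "search.jsp?" + urlencode(
        {"query": full_query, "hitsPerPage": hits_per_page}
    )

# The user only typed "help"; the German UI adds the restriction.
print(build_search_url("help", "de"))
```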

Again, many thanks for any replies,
Nes

On 1/30/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:


Nes Yarug wrote:
> Hi all,
>
> I'm new to Nutch and I have a few questions that I hope to get some
> answers on. Thanks in advance for any replies.
>
> I want to use Nutch to index a web site I'm maintaining. I've followed
> the tutorial for intranet crawling and used a list of links (17420 links
> to 8710 pages, each page has two unique links) from my site to crawl
> initially.
Actually, you don't need to provide a full list of links to Nutch. You
can let it discover links as it crawls your site, and constrain them
using crawl-urlfilter.txt and regex-urlfilter.txt.
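
As a rough illustration of how such filter rules constrain a crawl, the
sketch below mimics the "+regex" / "-regex" rule style in Python. It assumes
the first matching rule decides and that a URL matching no rule is rejected.
The rules and the example.com domain are made up for the example, and this
is not Nutch's own implementation.

```python
import re

# Hypothetical rules: '+' accepts, '-' rejects, checked top to bottom.
RULES = [
    r"-\.(gif|jpg|png|css|js)$",                # skip images and assets
    r"+^http://([a-z0-9]*\.)*example\.com/",    # stay inside the site
    r"-.",                                      # reject everything else
]

def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL starts with '+'."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: no matching rule means the URL is dropped

for candidate in ("http://www.example.com/page.html",
                  "http://www.example.com/logo.gif",
                  "http://other.org/"):
    print(candidate, accepts(candidate))
```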
> The command I used was:
>
> bin/nutch crawl urls -dir crawl -depth 20 -topN 100
>
> The crawl completed, but when I tested the search I could tell that it
> had not indexed a lot of pages. As I understand the following output, it
> only indexed 1527 of 21378 pages:
>
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:     21378
> retry 0:        20878
> retry 1:        487
> retry 2:        10
> retry 3:        3
> min score:      0.014
> avg score:      84.405266
> max score:      37106.03
> status 1 (DB_unfetched):        19848
> status 2 (DB_fetched):  1527
> status 3 (DB_gone):     3
> CrawlDb statistics: done
>
>
> Now my questions:
>
> 1) Will Nutch automatically continue to index the rest of the URLs even
> though the initial crawl finished (through some internal scheduler of
> some sort)?
You will need to refetch, or better: increase the depth, until "all your
pages" are fetched.
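
A quick back-of-the-envelope check of the numbers above, as a sketch: it
assumes -topN caps the pages fetched per round and -depth is the number of
rounds, and it ignores any URLs discovered later.

```python
import math

depth, top_n = 20, 100        # from: bin/nutch crawl urls -dir crawl -depth 20 -topN 100
fetched = 1527                # status 2 (DB_fetched)
unfetched = 19848             # status 1 (DB_unfetched)

# Upper bound on pages one crawl run could fetch with these settings.
max_fetched = depth * top_n

# Additional rounds needed at the same -topN to drain the unfetched URLs.
extra_rounds = math.ceil(unfetched / top_n)

print("upper bound per crawl:", max_fetched)
print("fetched so far:", fetched)
print("extra rounds at topN=100:", extra_rounds)
```

With these settings a single run cannot fetch more than 2000 pages, which is
consistent with the 1527 fetched; raising -topN reduces the number of rounds
needed far more than raising -depth alone.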
>
> 2) All of my site's pages are currently available in two languages (each
> page is available in exactly two languages; the lang attribute on the
> html tag of each page contains the language identifier). When searching,
> is there a way to only return pages in a specific language? I know the
> Nutch UI is localised, but it will still return pages in English if my UI
> language is German, for example. I want it to return German pages only
> (<html lang="de">) when searching through the German UI. Is that possible?
Try using "lang:" in your query; I'm not sure it works, though...
From the javadoc: "LanguageQueryFilter.java should handle "lang:"
query clauses, causing them to search the "lang" field indexed by
LanguageIdentifier" (see also LanguageIndexingFilter.java).

HTH,
Renaud


--
renaud richardet                           +1 617 230 9112
renaud <at> oslutions.com         http://www.oslutions.com

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
