As Zaheed pointed out, "You need to activate index-more and query-more plugin in nutch-site.xml"
So, copy the "plugin.includes" entry from nutch-default.xml, add index-more and query-lang to it, and insert it in your nutch-site.xml. You should end up with something like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|more)|query-(basic|site|url|lang)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

HTH,
Renaud

Nes Yarug wrote:
> Oops, my previous post should read "I have NOT explicitly activated those
> plugins"
>
> On 1/31/07, Nes Yarug <[EMAIL PROTECTED]> wrote:
>>
>> I have explicitly activated those plugins. Could you tell me how to do
>> that with an example as I looked through conf/nutch-default.xml and
>> couldn't find any references to it. I'm using 0.8.1 by the way. They are
>> enabled in the build I guess as default.properties is listing them:
>>
>> #
>> # Indexing Filter Plugins
>> #
>> plugins.index=\
>> org.apache.nutch.indexer.basic*:\
>> org.apache.nutch.indexer.more*
>>
>> #
>> # Query Filter Plugins
>> #
>> plugins.query=\
>> org.apache.nutch.searcher.basic*:\
>> org.apache.nutch.searcher.more*:\
>> org.apache.nutch.searcher.site*:\
>> org.apache.nutch.searcher.url*
>>
>> Many thanks,
>> Nes
>>
>> On 1/31/07, Zaheed Haque <[EMAIL PROTECTED]> wrote:
>> >
>> > Unless you haven't yet.. You need to activate index-more and
>> > query-more plugin in nutch-site.xml
>> >
>> > You can also check the "explain link" from the search results page and
>> > you will see "lang" is missing if you haven't activated the index-more
>> > and query-more plugin..
>> >
>> > Cheers
>> >
>> > On 1/31/07, Nes Yarug <[EMAIL PROTECTED]> wrote:
>> > > Thank you everyone for your replies.
>> > >
>> > > I have implemented the recrawl script from
>> > > http://wiki.apache.org/nutch/IntranetRecrawl and that is still running for
>> > > over 12 hours so I guess that would index much more pages.
>> > >
>> > > Leaves the question about language specific search. I have tried adding the
>> > > lang: clause to my search query by appending lang:en but that is not
>> > > returning any results (as if lang:en would become part of the actual query).
>> > > The url then looks like this:
>> > > search.jsp?query=help+lang%3Aen&hitsPerPage=10&lang=en
>> > >
>> > > Anyone has used a language specific search before, do I need to add a new
>> > > (hidden) input field on the search form to specify the language instead of
>> > > appending it to the query? That would be my preference anyway, as I want the
>> > > language specific search to be transparent to the user.
>> > >
>> > > Again, many thanks for any replies,
>> > > Nes
>> > >
>> > > On 1/30/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
>> > > >
>> > > > Nes Yarug wrote:
>> > > > > Hi all,
>> > > > >
>> > > > > I'm new to Nutch and I have a few questions that I hope to get some
>> > > > > answers on. Thanks in advance for any replies.
>> > > > >
>> > > > > I want to use Nutch to index a web site I'm maintaining. I've followed
>> > > > > the tutorial for intranet crawling and used a list of links (17420 links
>> > > > > to 8710 pages, each page has two unique links) from my site to crawl
>> > > > > initially.
>> > > > Actually, you don't need to provide a full list of links to Nutch.
>> > > > You can let it discover links as it crawls your site, and constrain them
>> > > > using crawl-urlfilter.txt and regex-urlfilter.txt
>> > > > > The command I used was:
>> > > > >
>> > > > > bin/nutch crawl urls -dir crawl -depth 20 -topN 100
>> > > > >
>> > > > > The crawl completed, but I'm sure that when I was testing the search
>> > > > > it has not indexed a lot of pages. What I understand from the following
>> > > > > command it only indexed 1527 of 21378 pages:
>> > > > >
>> > > > > CrawlDb statistics start: crawl/crawldb
>> > > > > Statistics for CrawlDb: crawl/crawldb
>> > > > > TOTAL urls: 21378
>> > > > > retry 0: 20878
>> > > > > retry 1: 487
>> > > > > retry 2: 10
>> > > > > retry 3: 3
>> > > > > min score: 0.014
>> > > > > avg score: 84.405266
>> > > > > max score: 37106.03
>> > > > > status 1 (DB_unfetched): 19848
>> > > > > status 2 (DB_fetched): 1527
>> > > > > status 3 (DB_gone): 3
>> > > > > CrawlDb statistics: done
>> > > > >
>> > > > >
>> > > > > Now my questions:
>> > > > >
>> > > > > 1) Will Nutch automatically continue to index the rest of the URLs even
>> > > > > though the initial crawl finished (through some internal scheduler of
>> > > > > some sorts)?
>> > > > You will need to refetch, or better: increase the depth, until "all your
>> > > > pages" are fetched.
>> > > > >
>> > > > > 2) All of my site's pages at the moment are contained in two languages
>> > > > > (each page has exactly two languages, the lang attribute on the html tag
>> > > > > of each page contains the language identifier). When searching, is there
>> > > > > a way to only return pages in a specific language? I know the Nutch UI
>> > > > > is localised, but it will still return pages in english if my UI
>> > > > > language is German for example.
>> > > > > I want it to return German pages only (<html lang="de">) when
>> > > > > searching through the German UI. Is that possible?
>> > > > try using "lang:" in your query, I'm not sure it's working, though...
>> > > > From the javadoc: "LanguageQueryFilter.java should handle "lang:"
>> > > > query clauses, causing them to search the "lang" field indexed by
>> > > > LanguageIdentifier" (see also LanguageIndexingFilter.java).
>> > > >
>> > > > HTH,
>> > > > Renaud
>> > > >
>> > > >
>> > > > --
>> > > > renaud richardet +1 617 230 9112
>> > > > renaud <at> oslutions.com http://www.oslutions.com
>> > > >
>> > > >
>> > >
>> > >
>> >
>>
>

--
renaud richardet +1 617 230 9112
renaud <at> oslutions.com http://www.oslutions.com

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
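
A quick way to sanity-check a plugin.includes value like the one at the top of this message is to test plugin directory names against it as a regular expression, since the property description says any plugin not matching the expression is excluded. The sketch below is plain Python for illustration only, not Nutch code: the helper name is_included is made up, and it assumes the expression has to match the whole plugin name rather than a substring.

```python
import re

# Value copied from the <value> element quoted in this message.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|more)"
    "|query-(basic|site|url|lang)|summary-basic|scoring-opic"
)


def is_included(plugin_name: str, includes: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin name matches the includes expression.

    Assumption (hypothetical helper, not part of Nutch): the expression
    must match the entire plugin directory name, not just part of it.
    """
    return re.fullmatch(includes, plugin_name) is not None


if __name__ == "__main__":
    # Print which of a few plugin names the expression would let through.
    for name in ("index-basic", "index-more", "query-lang", "query-more"):
        status = "included" if is_included(name) else "excluded"
        print(f"{name}: {status}")
```

Under that assumption, the value as posted matches index-more and query-lang but not query-more, so if query-more is the plugin you are after (as in Zaheed's message), it would have to be added to the query-(...) group as well.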
