Re: [Nutch-general] New to Nutch, a few questions

Renaud Richardet Wed, 31 Jan 2007 11:37:19 -0800

As Zaheed pointed out, "You need to activate index-more and query-more 
plugin in nutch-site.xml"


So, copy the "plugin.includes" entry from nutch-default.xml, add
index-more and query-lang, and insert it into your nutch-site.xml. You
should have something like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|more)|query-(basic|site|url|lang)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
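
A quick way to sanity-check a value like the one above is to test plugin names against it. This is only a sketch in Python, not Nutch code: it assumes the expression has to match the whole plugin name, and the candidate names in the list are examples picked for illustration.

```python
import re

# The plugin.includes value from the property above.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|more)"
    r"|query-(basic|site|url|lang)|summary-basic|scoring-opic"
)

def included(plugin_names, pattern=PLUGIN_INCLUDES):
    """Return the names accepted by a whole-string match (an assumption)."""
    rx = re.compile(pattern)
    return [name for name in plugin_names if rx.fullmatch(name)]

# Example names for illustration only, not a listing of a real plugins directory.
candidates = ["index-basic", "index-more", "query-lang", "query-more", "parse-pdf"]
print(included(candidates))
```

With this value, query-more is not matched; if you want it as well, "more" would have to be added to the query-(...) group.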

HTH,
Renaud


Nes Yarug wrote:
> Oops, my previous post should read "I have NOT explicitly activated
> those plugins"
>
> On 1/31/07, Nes Yarug <[EMAIL PROTECTED]> wrote:
>>
>> I have explicitly activated those plugins. Could you tell me how to do
>> that, with an example? I looked through conf/nutch-default.xml and
>> couldn't find any references to it. I'm using 0.8.1, by the way. They are
>> enabled in the build, I guess, as default.properties lists them:
>>
>> #
>> # Indexing Filter Plugins
>> #
>> plugins.index=\
>>    org.apache.nutch.indexer.basic*:\
>>    org.apache.nutch.indexer.more*
>>
>> #
>> # Query Filter Plugins
>> #
>> plugins.query=\
>>    org.apache.nutch.searcher.basic*:\
>>    org.apache.nutch.searcher.more*:\
>>    org.apache.nutch.searcher.site*:\
>>    org.apache.nutch.searcher.url*
>>
>> Many thanks,
>> Nes
>>
>> On 1/31/07, Zaheed Haque <[EMAIL PROTECTED]> wrote:
>> >
>> > Unless you have done so already.. You need to activate index-more and
>> > query-more plugin in nutch-site.xml
>> >
>> > You can also check the "explain" link on the search results page; you
>> > will see that "lang" is missing if you haven't activated the index-more
>> > and query-more plugins.
>> >
>> > Cheers
>> >
>> > On 1/31/07, Nes Yarug <[EMAIL PROTECTED]> wrote:
>> > > Thank you everyone for your replies.
>> > >
>> > > I have implemented the recrawl script from
>> > > http://wiki.apache.org/nutch/IntranetRecrawl and it has been running
>> > > for over 12 hours, so I guess that will index many more pages.
>> > >
>> > > That leaves the question about language-specific search. I have tried
>> > > adding the lang: clause to my search query by appending lang:en, but
>> > > that is not returning any results (as if lang:en had become part of the
>> > > actual query). The URL then looks like this:
>> > > search.jsp?query=help+lang%3Aen&hitsPerPage=10&lang=en
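
For reference, the URL above is what you get when "help lang:en" is form-encoded: the space becomes "+" and the colon becomes "%3A". A minimal Python sketch that reproduces it; the function name search_url and its defaults are made up for illustration, and the parameter names are simply copied from the URL in this thread.

```python
from urllib.parse import urlencode

def search_url(terms, lang, hits_per_page=10, page="search.jsp"):
    """Build a search URL with a lang: clause appended to the query terms."""
    # Parameter names are taken from the URL quoted in this thread;
    # what the separate lang parameter does is not confirmed here.
    params = {
        "query": f"{terms} lang:{lang}",
        "hitsPerPage": hits_per_page,
        "lang": lang,
    }
    return f"{page}?{urlencode(params)}"

url = search_url("help", "en")
print(url)
```

So the encoding itself looks fine; whether the lang: clause is honoured depends on the query plugin being active.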
>> > >
>> > > Has anyone used a language-specific search before? Do I need to add a
>> > > new (hidden) input field on the search form to specify the language
>> > > instead of appending it to the query? That would be my preference
>> > > anyway, as I want the language-specific search to be transparent to
>> > > the user.
>> > >
>> > > Again, many thanks for any replies,
>> > > Nes
>> > >
>> > > On 1/30/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
>> > > >
>> > > > Nes Yarug wrote:
>> > > > > Hi all,
>> > > > >
>> > > > > I'm new to Nutch and I have a few questions that I hope to get
>> > > > > some answers on. Thanks in advance for any replies.
>> > > > >
>> > > > > I want to use Nutch to index a web site I'm maintaining. I've
>> > > > > followed the tutorial for intranet crawling and used a list of
>> > > > > links (17420 links to 8710 pages, each page has two unique links)
>> > > > > from my site to crawl initially.
>> > > > Actually, you don't need to provide a full list of links to Nutch.
>> > > > You can let it discover links as it crawls your site, and constrain
>> > > > them using crawl-urlfilter.txt and regex-urlfilter.txt.
>> > > > > The command I used was:
>> > > > >
>> > > > > bin/nutch crawl urls -dir crawl -depth 20 -topN 100
>> > > > >
>> > > > > The crawl completed, but from testing the search I'm sure that it
>> > > > > has not indexed a lot of pages. From what I understand of the
>> > > > > following output, it only indexed 1527 of 21378 pages:
>> > > > >
>> > > > > CrawlDb statistics start: crawl/crawldb
>> > > > > Statistics for CrawlDb: crawl/crawldb
>> > > > > TOTAL urls:     21378
>> > > > > retry 0:        20878
>> > > > > retry 1:        487
>> > > > > retry 2:        10
>> > > > > retry 3:        3
>> > > > > min score:      0.014
>> > > > > avg score:       84.405266
>> > > > > max score:      37106.03
>> > > > > status 1 (DB_unfetched):        19848
>> > > > > status 2 (DB_fetched):  1527
>> > > > > status 3 (DB_gone):     3
>> > > > > CrawlDb statistics: done
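
To put numbers on the statistics above: a small Python sketch that parses the status lines and computes the fetched fraction. It also prints depth times topN from your command, since -topN caps the number of pages fetched per round, so 20 rounds of 100 can fetch at most 2000 pages. The helper name parse_stats is made up for illustration.

```python
# Lines copied from the CrawlDb statistics quoted above.
STATS = """\
TOTAL urls:     21378
status 1 (DB_unfetched):        19848
status 2 (DB_fetched):  1527
status 3 (DB_gone):     3
"""

def parse_stats(text):
    """Map each 'label: number' line to an integer count."""
    counts = {}
    for line in text.splitlines():
        key, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            counts[key.strip()] = int(value)
    return counts

counts = parse_stats(STATS)
fetched = counts["status 2 (DB_fetched)"]
total = counts["TOTAL urls"]
print(f"fetched {fetched} of {total} ({fetched / total:.1%})")

# Upper bound on fetched pages for: -depth 20 -topN 100
depth, top_n = 20, 100
print(f"at most {depth * top_n} pages with these settings")
```

That is roughly 7% fetched, which fits under the 2000-page ceiling; a larger -topN (or leaving it out) should let each round fetch more.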
>> > > > >
>> > > > >
>> > > > > Now my questions:
>> > > > >
>> > > > > 1) Will Nutch automatically continue to index the rest of the URLs
>> > > > > even though the initial crawl finished (through some internal
>> > > > > scheduler of some sort)?
>> > > > You will need to refetch, or better: increase the depth, until "all
>> > > > your pages" are fetched.
>> > > > >
>> > > > > 2) All of my site's pages at the moment are available in two
>> > > > > languages (each page has exactly two languages; the lang attribute
>> > > > > on the html tag of each page contains the language identifier).
>> > > > > When searching, is there a way to only return pages in a specific
>> > > > > language? I know the Nutch UI is localised, but it will still
>> > > > > return pages in English if my UI language is German, for example.
>> > > > > I want it to return German pages only (<html lang="de">) when
>> > > > > searching through the German UI. Is that possible?
>> > > > Try using "lang:" in your query; I'm not sure it's working, though...
>> > > > From the javadoc: "LanguageQueryFilter.java should handle "lang:"
>> > > > query clauses, causing them to search the "lang" field indexed by
>> > > > LanguageIdentifier" (see also LanguageIndexingFilter.java).
>> > > >
>> > > > HTH,
>> > > > Renaud
>> > > >
>> > > >
>> > > > --
>> > > > renaud richardet                           +1 617 230 9112
>> > > > renaud <at> oslutions.com         http://www.oslutions.com
>> > > >
>> > > >
>> > >
>> > >
>> >
>>
>>
>


-- 
renaud richardet                           +1 617 230 9112
renaud <at> oslutions.com         http://www.oslutions.com


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
