Re: nutch 1.3 solrindex empty content field

Jann Forrer Mon, 19 Sep 2011 07:55:16 -0700

Hi

Thanks for your fast help.


On 09/19/2011 04:26 PM, lewis john mcgibbney wrote:

Does this solve your problem, Jann?

No, unfortunately not. I changed the content entry in the Nutch schema
   runtime/local/conf/schema.xml
and in the Solr schema
   example/solr/conf/schema.xml
to
<field name="content" type="text" stored="true" indexed="true"/>
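Because the change has to be present in both schema copies, it can help to check the attributes programmatically rather than by eye. The following is a minimal sketch, not part of Nutch or Solr: the function name `field_flags` and the inline sample schema are hypothetical, and a real check would read the two schema.xml files named above instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample for illustration only; a real check would load
# runtime/local/conf/schema.xml and example/solr/conf/schema.xml.
SAMPLE_SCHEMA = """<schema name="nutch">
  <fields>
    <field name="url" type="url" stored="true" indexed="true"/>
    <field name="content" type="text" stored="false" indexed="true"/>
  </fields>
</schema>"""


def field_flags(schema_xml, field_name):
    """Return (stored, indexed) for the named <field>, or None if absent."""
    root = ET.fromstring(schema_xml)
    for field in root.iter("field"):
        if field.get("name") == field_name:
            # Assumption of this sketch: a missing attribute counts as False.
            return (field.get("stored") == "true",
                    field.get("indexed") == "true")
    return None


print(field_flags(SAMPLE_SCHEMA, "content"))
```

Running the same function over both files and comparing the results would show whether one copy still carries the old setting.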

After that I deleted the whole crawl directory and the Solr data directory
and tried to re-index using:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

But I still got no results from a simple search. Looking at the content field in the Solr admin
page, I got:

Field Type: text

Properties: Indexed, Tokenized, Stored

Schema: Indexed, Tokenized, Stored

Position Increment Gap: 100

Index Analyzer: org.apache.solr.analysis.TokenizerChain Details

Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

1. org.apache.solr.analysis.StopFilterFactory
   args:{words: stopwords.txt ignoreCase: true luceneMatchVersion: LUCENE_31 }
2. org.apache.solr.analysis.WordDelimiterFilterFactory
   args:{splitOnCaseChange: 1 generateNumberParts: 1 catenateWords: 1 luceneMatchVersion: LUCENE_31 generateWordParts: 1 catenateAll: 0 catenateNumbers: 1 }
3. org.apache.solr.analysis.LowerCaseFilterFactory
   args:{luceneMatchVersion: LUCENE_31 }
4. org.apache.solr.analysis.EnglishPorterFilterFactory
   args:{protected: protwords.txt luceneMatchVersion: LUCENE_31 }
5. org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
   args:{luceneMatchVersion: LUCENE_31 }


Query Analyzer: org.apache.solr.analysis.TokenizerChain Details

Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:


Docs: 0

BTW, I crawled http://www.rauchfrei.uzh.ch/ and tried to search for "Passivrauchen", a word occurring on the index page.
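To separate "nothing was indexed into content" from "the search is not looking at content", the query can name the field explicitly. This is a minimal sketch that only builds the request URL for Solr's standard /select handler. The helper name `build_select_url` is hypothetical, and the base URL is the one used in the solrindex command above.

```python
from urllib.parse import urlencode


def build_select_url(base_url, field, term, rows=10):
    """Build a Solr /select URL that searches one named field."""
    params = {
        "q": f"{field}:{term}",  # field-qualified query, e.g. content:Passivrauchen
        "fl": "url",             # return only the url field of each hit
        "rows": rows,
        "wt": "json",
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


print(build_select_url("http://127.0.0.1:8983/solr/", "content", "Passivrauchen"))
```

If the same term matches with `field="url"` or another populated field but not with `field="content"`, that would point at the content field rather than at the query itself.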


Jann

Is this worth filing an issue for? It is rather trivial to address but
could help more users unfamiliar with the specifics of the Nutch (or Solr) schema(s).

On Mon, Sep 19, 2011 at 3:06 PM, Markus Jelsma
<[email protected]> wrote:

*previous sent by accident

On Monday 19 September 2011 15:58:35 lewis john mcgibbney wrote:

Yes, I think what Markus has pointed out is the problem, Jann. This means
you need to change the stored and indexed values to true and re-index your
data.

Markus, out of interest, do you know the pros and cons if we were to make
this the default in the Nutch schema? For example, with small indexes I
wouldn't imagine there would be much difference; with non-trivially sized
indexes, however, I would imagine it would be a different story...

The index size becomes roughly 2.1 times larger (~*2.1), depending on
analyzers etc. (stopwords mostly). However, users that set up very large
indexes are expected to be at least intermediate Solr users and to have a
proper understanding of the schema.

They will toggle settings as they see fit, whereas new users won't but
still expect output.

Any thoughts?

On Mon, Sep 19, 2011 at 2:54 PM, Markus Jelsma
<[email protected]> wrote:

Check line 79 of your Solr schema:

http://svn.apache.org/viewvc/nutch/branches/branch-1.3/conf/schema.xml?view=markup

Maybe we should configure the field to be stored in 1.4. I can imagine
this causes a lot of headaches for new users. Also highlighting will
never work with unstored fields.

On Monday 19 September 2011 11:02:17 Jann Forrer wrote:

Hi

I tried to run Nutch 1.3 together with Solr 3.x according to
http://wiki.apache.org/nutch/NutchTutorial.

That worked as described, but if I try to search the index using the
Solr admin interface I always get an empty result.

http://localhost:8983/solr/admin/schema.jsp

Using the Schema Browser I see entries in different fields (e.g. the
url field) but the content field is empty. I
was looking for a similar problem on the mailing list but I didn't
find a solution for this problem.

Here is what I did:

1.) ./bin/nutch crawl urls -dir crawl -depth 3 -topN 5
2.) Dumped the segment (./bin/nutch readseg -dump
    crawl/segments/20110916124747 test). The script
    also dumped the content of the web pages. All seems to be OK here.
3.) Copied the Nutch schema.xml to the Solr conf directory
4.) bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
    crawl/linkdb crawl/segments/*
5.) Then tried to search using http://localhost:8983/solr/admin/
    but didn't find any HTML content.
    However, if there was a PDF file to crawl, its content was found.
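The solrindex call in step 4 takes the crawldb, the linkdb and the segment directories as separate arguments, so a missing space silently merges two paths. A minimal sketch of building the argument list explicitly is shown below. The helper names `find_segments` and `solrindex_argv` are hypothetical, and the directory layout is assumed to be the `crawl` directory produced in step 1.

```python
import glob


def find_segments(crawl_dir):
    """Expand crawl/segments/* ourselves, since no shell does it for us."""
    return sorted(glob.glob(f"{crawl_dir}/segments/*"))


def solrindex_argv(solr_url, crawl_dir, segment_dirs):
    """Return the argument list for `bin/nutch solrindex`, one path per item."""
    return ["bin/nutch", "solrindex", solr_url,
            f"{crawl_dir}/crawldb", f"{crawl_dir}/linkdb", *segment_dirs]


argv = solrindex_argv("http://127.0.0.1:8983/solr/", "crawl",
                      ["crawl/segments/20110916124747"])
print(" ".join(argv))
```

Passing such a list to `subprocess.run(argv, check=True)` from the Nutch runtime directory would run the same command as step 4, with each path guaranteed to be its own argument.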

BTW, using Nutch 1.1 and Solr 1.4.1 everything worked as expected. I could
use these versions, but I am upgrading
from an older Nutch version and it would be nice if I could use the
newer version where Nutch and Solr
are better integrated.

Any ideas what might be wrong?

Jann

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350




--
Jann Forrer
Informatikdienste
Universität Zürich
Winterthurerstr. 190
CH-8057 Zürich

oooO   mail:  [email protected]
(  )   phone: +41 44 63 56772
 \ (   fax:   +41 44 63 54505
  \_)  http://www.id.uzh.ch
