Re: Data not always returned

2011-06-08 Thread Jerome Renard
Hi Erick

On Tue, Jun 7, 2011 at 11:42 PM, Erick Erickson erickerick...@gmail.com wrote:
 Well, this is odd. Several questions:

 1. What do your logs show? I'm wondering if somehow some data is getting
    rejected. I have no idea why that would be, but if you're seeing indexing
    exceptions that would explain it.
 2. On the admin/stats page, are maxDocs and numDocs the same in the
    success/failure case? And are they equal to 40,000?
 3. What does debugQuery=on show in the two cases? I'd expect it to be
    identical, but...
 4. In the admin/schema browser, look at your three fields and see if things
    like unique-terms are identical.
 5. Are the rows being returned before indexing in the same order? I'm
    wondering if somehow you're getting documents overwritten by having the
    same id (uniqueKey).
 6. Have you poked around with Luke to see what, if anything, is dissimilar?

 These are shots in the dark, but my supposition is that somehow you're
 not indexing what you expect. The questions above might give us a clue
 where to look next.
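
The overwrite scenario in question 5 can be checked before anything reaches Solr. Here is a minimal sketch, assuming the rows read from postgres are available as dictionaries and that an `id` key maps to the schema's uniqueKey. The field name and the sample rows are assumptions for illustration, not taken from the attached schema:

```python
from collections import Counter


def find_duplicate_ids(rows, key="id"):
    """Return {id: count} for every id that appears more than once in rows."""
    counts = Counter(row[key] for row in rows)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


def expected_num_docs(rows, key="id"):
    """Number of documents left if rows sharing an id overwrite each other."""
    return len({row[key] for row in rows})


# Hypothetical rows standing in for what the import script reads from postgres.
rows = [
    {"id": 1, "name": "Paris"},
    {"id": 2, "name": "Lyon"},
    {"id": 2, "name": "Lille"},  # same id as Lyon: one of the two would be lost
]

print(find_duplicate_ids(rows))
print(expected_num_docs(rows))
```

If the second number is lower than the number of rows fetched, numDocs on the admin/stats page should be lower than expected by the same amount.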


You were right, I found a nasty problem with the indexer and postgres which
prevented some documents from being indexed. Once I fixed this problem everything
worked fine.

Thanks a lot for your support.

Best Regards,

-- 
Jérôme


Data not always returned

2011-06-07 Thread Jerome Renard
Hi all,

I have a problem with my index. Even though I always index the same
data over and over again, a handful of searches (always the same ones,
as they are issued by a unit test suite) do not return the same results
from one run to the next. Sometimes I get 3 successes and 2 failures,
sometimes it is the other way around; it is unpredictable.

Here is what I am trying to do:

I created a new Solr core with its own solrconfig.xml and schema.xml.
This core stores a list of towns which I plan to use with an
auto-suggestion system, using ngrams (no Suggester).

The indexing process is always the same:
1. the import script deletes all documents in the core:
   <delete><query>*:*</query></delete> followed by <commit/>
2. the import script fetches data from postgres, 100 rows at a time
3. the import script adds these 100 documents and sends a <commit/>
4. once all the rows (around 40 000) have been imported, the script
   sends an <optimize/> request
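
The steps above can be sketched as follows. This is only an illustration of the sequence of XML update messages, not the actual import script: the `post` callable (whatever sends a message to the core's update handler), the function names, and the row layout are all assumptions.

```python
import xml.etree.ElementTree as ET

DELETE_ALL = "<delete><query>*:*</query></delete>"
COMMIT = "<commit/>"
OPTIMIZE = "<optimize/>"


def build_add_message(rows):
    """Serialize a batch of rows (dicts) as one Solr <add> message."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def batches(rows, size=100):
    """Yield successive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def index_all(rows, post, size=100):
    """Send the full delete / add / commit / optimize sequence through `post`."""
    post(DELETE_ALL)
    post(COMMIT)
    for batch in batches(rows, size):
        post(build_add_message(batch))
        post(COMMIT)
    post(OPTIMIZE)


# Dry run: collect the messages in a list instead of sending them anywhere.
sent = []
index_all([{"id": i, "name": "town%d" % i} for i in range(250)], sent.append)
print(len(sent), "messages")
```

Collecting the messages in a list first makes it easy to compare what the script would send across two runs before looking at Solr itself.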

Here is what happens:
I run the indexer once and search for 'foo': I get the results I expect, but
if I search for 'bar' I get nothing.
I reindex once again and search for 'foo': I get nothing, but if I
search for 'bar' I get results.
The search is made on the name field, which is a pretty common
TextField with ngrams.

I also tried physically removing the index (rm -rf path/to/index) and
reindexing everything, but still not all searches work: sometimes the 'foo'
search works, sometimes the 'bar' one.

I tried a lot of different things but now I am running out of ideas.
This is why I am asking for help.

Some useful information:
Solr version : 3.1.0
Solr Implementation Version: 3.1.0 1085815 - grantingersoll -
2011-03-26 18:00:07
Lucene Implementation Version: 3.1.0 1085809 - 2011-03-26 18:06:58
Java 1.5.0_24 on Mac OS X
solrconfig.xml and schema.xml are attached

Thanks in advance for your help.


schema.xml.gz
Description: GNU Zip compressed data


solrconfig.xml.gz
Description: GNU Zip compressed data


Re: Weird behaviour with phrase queries

2011-01-26 Thread Jerome Renard
Hi Erick,

On Tue, Jan 25, 2011 at 1:38 PM, Erick Erickson erickerick...@gmail.com wrote:

 Frankly, this puzzles me. It *looks* like it should be OK. One warning, the
 analysis page sometimes is a bit misleading, so beware of that.

 But the output of your queries makes it look like the query is parsing as
 you expect, which leaves the question of whether your index contains what
 you think it does. You might get a copy of Luke, which allows you to
 examine what's actually in your index instead of what you think is in
 there. Sometimes there are surprises here!


Bingo! Some data was missing from the index. Indexing it obviously fixed the
problem.


 I didn't mean to re-index your whole corpus, I was thinking that you could
 just index a few documents in a test index so you have something small to
 look at.

 Sorry I can't spot what's happening right away.


No worries, thanks for your support :)

-- 
Jérôme


Weird behaviour with phrase queries

2011-01-24 Thread Jerome Renard
Hi,

I have a problem with phrase queries: from time to time I do not get any
results, whereas I know I should get something back.

The search is run against a field of type text, whose definition is
available at the following URL:
- http://pastebin.com/Ncem7M8z

This field is defined with the following configuration:
<field name="meta_text" type="text" indexed="true" stored="true"
       multiValued="true" termVectors="true"/>

I use the following request handler:
<requestHandler name="custom" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">meta_text</str>
    <str name="pf">meta_text</str>
    <str name="bf"/>
    <str name="mm">1&lt;1 2&lt;-1 5&lt;-2 7&lt;60%</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
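
To make the mm setting in this handler easier to reason about, here is a small sketch that evaluates a minimum-should-match spec for a given number of optional clauses. It assumes the `lt;` fragments in the value are `<` characters whose entity encoding was damaged, and it follows the conditional-spec rules of the dismax mm parameter as I understand them from the documentation. It is a re-implementation for illustration, not Solr's own code, and the function names are made up:

```python
def _apply(clause_count, value):
    """Apply one simple mm expression: an integer or a percentage."""
    if value.endswith("%"):
        percent = int(value[:-1])
        # The percentage is truncated toward zero before being applied.
        amount = clause_count * abs(percent) // 100
        return clause_count - amount if percent < 0 else amount
    number = int(value)
    # A negative number means "all clauses except this many".
    return clause_count + number if number < 0 else number


def min_should_match(clause_count, spec):
    """Return how many optional clauses must match for the given spec."""
    result = clause_count
    for part in spec.split():
        if "<" in part:
            bound, value = part.split("<", 1)
            # A conditional only applies above its bound; stop at the first
            # bound that the clause count does not exceed.
            if clause_count <= int(bound):
                break
        else:
            value = part
        result = _apply(clause_count, value)
    return max(0, min(clause_count, result))


SPEC = "1<1 2<-1 5<-2 7<60%"
for n in (1, 2, 3, 5, 6, 8):
    print(n, "clauses ->", min_should_match(n, SPEC), "must match")
```

With this spec a two-term query only needs one term to match, which is worth keeping in mind when comparing result counts between queries of different lengths.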

Depending on the kind of phrase query I use I get either exactly what I am
looking for or nothing.

The index content is all French, so I suspected a problem with accents, but
phrase queries containing é and è characters, such as académie or
ingénieur, do work.

As you will see, the text type uses SnowballPorterFilterFactory configured
for English. I plan to fix that by using the correct language for the
index (French) and the following protwords: http://bit.ly/i8JeX6 .

Apart from this mistake with the stemmer, did I do something else wrong?
Did I overlook something? What could explain why I do not always get
results for my phrase queries?

Thanks in advance for your feedback.

Best Regards,

--
Jérôme


Re: Weird behaviour with phrase queries

2011-01-24 Thread Jerome Renard
Erick,

On Mon, Jan 24, 2011 at 9:57 PM, Erick Erickson erickerick...@gmail.com wrote:

 Hmmm, I don't see any screen shots. Several things:
 1. If your stopword file has comments, I'm not sure what the effect would
    be.


Ha, I thought comments were supported in stopwords.txt


 2. Something's not right here, or I'm being fooled again. Your withresults
    xml has this line:
    <str name="parsedquery">+DisjunctionMaxQuery((meta_text:"ecol d ingenieur")~0.01) ()</str>
    and your noresults has this line:
    <str name="parsedquery">+DisjunctionMaxQuery((meta_text:"academi charpenti")~0.01) DisjunctionMaxQuery((meta_text:"academi charpenti"~100)~0.01)</str>

 the empty () in the first one often means you're NOT going to your
 configured dismax parser in solrconfig.xml. Yet that doesn't square with
 your custom qt, so I'm puzzled.

 Could we see your raw query string on the way in? It's almost as if you
 defined qt in one and defType in the other, which are not equivalent.
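
One way to rule out a qt/defType mix-up is to build both raw query strings from the same code, so that only the phrase differs. A minimal sketch follows; the base URL is the usual default for a local Solr and is an assumption here, as are the function name and the sample phrases:

```python
from urllib.parse import urlencode


def select_url(base_url, phrase, handler="custom", debug=True):
    """Build a /select URL for a quoted phrase sent to a named request handler."""
    params = {"q": '"%s"' % phrase, "qt": handler}
    if debug:
        # debugQuery=on adds the parsedquery section to the response.
        params["debugQuery"] = "on"
    return base_url + "/select?" + urlencode(params)


BASE_URL = "http://localhost:8983/solr"  # assumed default, adjust to your setup

# Hypothetical phrases; substitute the ones from the failing and passing tests.
print(select_url(BASE_URL, "ecole ingenieur"))
print(select_url(BASE_URL, "academie charpentier"))
```

Both URLs carry the same qt value, so any difference in the parsedquery output then comes from the phrase itself.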


You are right, I fixed this problem (my bad).

 3. It may take 12 hours to index, but you could experiment with a smaller
    subset. You say you know that the noresults one should return documents,
    what proof do you have? If there's a single document that you know
    should match this, just index it and a few others and you should be
    able to make many runs until you get to the bottom of this...


I could, but I always thought I had to fully re-index after updating
schema.xml. If I update only a few documents, will that take the changes
into account without breaking the rest?


 And obviously your stemming is happening on the query, are you sure it's
 happening at index time too?


Since you did not get the screenshots, you will find attached the full
output of the analysis for a phrase that works and for another that does
not.

Thanks for your support

Best Regards,

--
Jérôme


analysis-noresults.html.gz
Description: GNU Zip compressed data


analysis-withresults.html.gz
Description: GNU Zip compressed data