Re: delta index produces multiple results?

2009-01-15 Thread Chris Hostetter

: Full index is working fine, in schema.xml I implemented a uniqueKey field
: (which is of the type 'text').

using text as the fieldtype for a uniqueKey is almost never a good idea.  
it could easily explain the behavior you are seeing.

DataImportHandler (and all of the update handlers) relies on the 
underlying UpdateProcessor to delete docs with identical uniqueKeys when 
you update an existing document ... if the uniqueKey field has an 
analyzer that produces multiple tokens (TextField frequently does) then 
the behavior becomes undefined.

stick with something like StrField or IntField for your uniqueKey field ... 
or if you must use TextField, make sure you are using the KeywordTokenizer.
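
As a rough illustration of the two options above, a schema.xml fragment 
might look like the following. This is only a sketch: the schema name, the 
type names "string" and "text_key", and the field name "id" are 
placeholders rather than anything from the original poster's schema.

```xml
<schema name="example" version="1.1">
  <types>
    <!-- StrField is not analyzed: the whole value is indexed as one term -->
    <fieldType name="string" class="solr.StrField"
               sortMissingLast="true" omitNorms="true"/>

    <!-- Only if TextField is really required: KeywordTokenizer emits the
         entire input as a single token, so the key is not split up -->
    <fieldType name="text_key" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>
  </types>

  <fields>
    <!-- "id" is a placeholder name for the uniqueKey field -->
    <field name="id" type="string" indexed="true" stored="true"
           required="true"/>
  </fields>

  <uniqueKey>id</uniqueKey>
</schema>
```

Documents already indexed with the old field type keep their old terms, so 
a full re-import after changing the type is the safe assumption.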

if changing this still causes problems, then we'll need to see your 
schema.xml, your data-config.xml, and the output of doing a search 
where you get some duplications like this to help figure out what else 
might be going wrong.


-Hoss



delta index produces multiple results?

2009-01-06 Thread Christoph Notarp

Hi,

I use the DIH with an RDBMS to index a large MySQL database with 
about 7 million entries.
Full index is working fine, in schema.xml I implemented a uniqueKey  
field (which is of the type 'text').


I run queries with the dismax query handler, and get my results as 
a PHP array.


Now, since the database entries change every second, I use the delta 
query property to
a) delete documents from the index that have been deleted in the 
database (there's a table for deleted items) and
b) update documents in the index that have changed since the last 
index run (there's a last_modified column in a table for that).


From my understanding, when I start a delta-import, the DIH checks  
the deletedPkQuery first and deletes the documents that should be  
deleted (identified by the uniqueKey-field?).
Seems to work - the catalina.out says INFO: deleted from document to  
Solr: 1851010 for example.
Next thing would be the deltaQuery. This seems to work, too - when  
finished, a query returns the new database entries.
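
For reference, a delta setup of the kind described above might look 
roughly like the following data-config.xml fragment. This is a sketch, not 
the configuration actually in use: the table and column names (item, 
deleted_item, id, deleted_at), the JDBC URL, and the credentials are 
placeholders. Only the last_modified column and the existence of a table 
for deleted items come from the description. It also assumes a DIH version 
that supports the deltaImportQuery attribute.

```xml
<dataConfig>
  <!-- placeholder connection settings -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/exampledb"
              user="db_user" password="db_password"/>
  <document>
    <!-- query:            used by full-import
         deltaQuery:       finds primary keys changed since the last run
         deltaImportQuery: fetches the full row for each changed key
         deletedPkQuery:   finds primary keys to delete from the index -->
    <entity name="item" pk="id"
            query="SELECT * FROM item"
            deltaQuery="SELECT id FROM item
                        WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT * FROM item
                              WHERE id = '${dataimporter.delta.id}'"
            deletedPkQuery="SELECT id FROM deleted_item
                            WHERE deleted_at &gt; '${dataimporter.last_index_time}'">
      <!-- the id column is assumed to map to the schema's uniqueKey field -->
      <field column="id" name="id"/>
    </entity>
  </document>
</dataConfig>
```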

But (and here comes the problem):
The dataimport status always says Added / Changed x-hundred  
documents, deleted 0 documents - no deletes?
Every time I change an item in the database and do a delta-import 
after that, my next query will return that item *twice*.
After the next change and next delta-import solr will return *three*  
result documents, and so on.
As I mentioned before, I get my search results as an array, consisting  
of many arrays (= solr documents) with the fields I set in schema.xml.
After changing some documents and delta-indexing them, I get lots of  
identical arrays (even the uniqueKey-field is absolutely identical).


I have read somewhere in the wiki, that an update is a delete of the  
old document plus a new document.
I guess the problem could be that something fails in the delete 
process, but I don't have a clue why.


Any ideas?

Thanks in advance
Chris