Re: Documents disappearing

2010-03-09 Thread Chris Hostetter

: A quick check did show me a couple of duplicates, but if I understand
: correctly, even if two different processes send the same document, the last
: one should update the previous one. If I send the same document 10 times, in
: the end, it should only be in my index once, no?

it should yes ... i didn't say i could explain your problem, i'm just
trying to speculate about things that might give us insight into figuring
out if/where a bug exists.

the only thing i can possibly think of that would cause a situation like 
this (where the number of documents decreases w/o any deletes happening) 
is if some of the add commands use overwrite=false and some use 
overwrite=true ... in that 
situation, you might get 10 docs added with the same uniqueKey 
value using overwrite=false and so you'll have 10 docs in your index.  
then you might index one more doc with the same uniqueKey value, but this 
time using overwrite=true and that one document will overwrite all 10 of 
the previous documents, causing your doc count to decrease from 10 to 1.
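
The scenario described above can be mimicked with a toy in-memory model. This is only a sketch, not Solr code: `ToyIndex` and its `add` method are hypothetical names, and the model simply assumes that an add with overwrite enabled removes every existing document with the same uniqueKey value, while an add with overwrite disabled just appends.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for an index keyed on a uniqueKey field."""

    def __init__(self, unique_key="id"):
        self.unique_key = unique_key
        self.docs = []

    def add(self, doc, overwrite=True):
        # ASSUMPTION: overwrite=True drops every existing doc with the same key;
        # overwrite=False appends without checking for existing docs.
        if overwrite:
            key = doc[self.unique_key]
            self.docs = [d for d in self.docs if d[self.unique_key] != key]
        self.docs.append(doc)

    def num_docs(self):
        return len(self.docs)


index = ToyIndex()

# Ten adds with the same uniqueKey value and overwrite disabled.
for rev in range(10):
    index.add({"id": "doc-1", "rev": rev}, overwrite=False)
count_before = index.num_docs()

# One more add with the same key, this time with overwrite enabled.
index.add({"id": "doc-1", "rev": 10}, overwrite=True)
count_after = index.num_docs()

print(count_before, count_after)
```

In this model the count goes down even though no delete command is ever issued, which is the pattern being speculated about.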

But nothing in your description of how you are using Solr implies that
you were doing this, hence my question of what exactly your indexing code
looks like.

My best guess is that maybe the deduplication UpdateProcessors have a bug
in them, but w/o a reproducible test case demonstrating the problem it
will be nearly impossible to even know where (or if that's actually the
problem at all)



-Hoss



Re: Documents disappearing

2010-03-05 Thread Pascal Dimassimo

Hi,

hossman wrote:
 
 : We index using 4 processes that read from a queue of documents. Each process
 : sends one document at a time to the /update handler.
 
 Hmmm.. then you should have a message from the LogUpdateProcessorFactory 
 for every individual add command that was received ... did you crunch 
 those to see if anything odd popped up (ie: duplicated IDs)
 
 what did the start commit log messages look like?
 
 (FWIW: I have no hunches as to what caused that behavior, i'm just 
 scrounging for more data)
 

A quick check did show me a couple of duplicates, but if I understand
correctly, even if two different processes send the same document, the last
one should update the previous one. If I send the same document 10 times, in
the end, it should only be in my index once, no?

The start commit message is always:
start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)


hossman wrote:
 
 : Yes, I double-checked that no deletes occurred. Since that indexing run, I
 : have re-indexed the same set of documents twice and we always end up with
 : 7725 documents, but the log did not show the ~1 documents count that we saw
 : the first time. The difference between the first run and the others is that
 : the first time, indexing lasted a couple of hours because the documents
 : were not always accessible in our document queue. The other
 
 Hmmm... what exactly does your indexing code do when the documents aren't 
 available?  ... and what happens if you forcibly commit in the middle of 
 reindexing (to see some of those counts again)
 

If no document is available, the threads sleep. If a commit is sent
manually during re-indexing, it just commits what has been sent to the
index so far.

I will redo the test with the same documents and under the same conditions as
our first indexing run to see whether the counts are the same again.

Again, thanks a lot for your help.


-- 
View this message in context: 
http://old.nabble.com/Documents-disappearing-tp27659047p27794641.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Documents disappearing

2010-03-04 Thread Chris Hostetter
 
: We index using 4 processes that read from a queue of documents. Each process
: sends one document at a time to the /update handler.

Hmmm.. then you should have a message from the LogUpdateProcessorFactory
for every individual add command that was received ... did you crunch
those to see if anything odd popped up (ie: duplicated IDs)
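
One way to crunch those messages is to count how often each ID shows up in the add log lines. The sketch below is illustrative only: it assumes each add is logged with a fragment of the form `{add=[some-id]}`, which should be checked against the actual log before relying on it, and the sample lines and the names `count_added_ids` and `duplicated_ids` are made up for the example.

```python
import re
from collections import Counter

# ASSUMPTION: every add command is logged with a fragment like "{add=[id1, id2]}".
# Adjust the pattern to whatever the real update log lines look like.
ADD_PATTERN = re.compile(r"\{add=\[([^\]]*)\]")


def count_added_ids(log_lines):
    """Count how many times each document ID appears in add log lines."""
    counts = Counter()
    for line in log_lines:
        match = ADD_PATTERN.search(line)
        if not match:
            continue  # not an add message (e.g. a commit line)
        for doc_id in match.group(1).split(","):
            doc_id = doc_id.strip()
            if doc_id:
                counts[doc_id] += 1
    return counts


def duplicated_ids(log_lines):
    """Return only the IDs that were added more than once."""
    return {doc_id: n for doc_id, n in count_added_ids(log_lines).items() if n > 1}


# Made-up sample lines, purely for illustration.
sample = [
    "INFO: {add=[doc-1]} 0 3",
    "INFO: {add=[doc-2]} 0 2",
    "INFO: {add=[doc-1]} 0 4",
    "INFO: start commit(optimize=false)",
]
dupes = duplicated_ids(sample)
print(dupes)
```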

what did the start commit log messages look like?

(FWIW: I have no hunches as to what caused that behavior, i'm just 
scrounging for more data)

: Yes, I double-checked that no deletes occurred. Since that indexing run, I
: have re-indexed the same set of documents twice and we always end up with
: 7725 documents, but the log did not show the ~1 documents count that we saw
: the first time. The difference between the first run and the others is that
: the first time, indexing lasted a couple of hours because the documents
: were not always accessible in our document queue. The other

Hmmm... what exactly does your indexing code do when the documents aren't
available?  ... and what happens if you forcibly commit in the middle of
reindexing (to see some of those counts again)

: About the newSearcher warming query, it is a typo in the config. It should
: have been 'qt'. Thanks for this one!

Even if you change wt to qt that won't make the query make sense (q=*:* 
isn't a very useful query string when using qt=dismax)


-Hoss



Re: Documents disappearing

2010-02-24 Thread Pascal Dimassimo

Hoss,

Thanks for your answers. You are absolutely right, I should have provided
you with more details.

We index using 4 processes that read from a queue of documents. Each process
sends one document at a time to the /update handler.

Yes, I double-checked that no deletes occurred. Since that indexing run, I
have re-indexed the same set of documents twice and we always end up with
7725 documents, but the log did not show the ~1 documents count that we saw
the first time. The difference between the first run and the others is that
the first time, indexing lasted a couple of hours because the documents
were not always accessible in our document queue. The other times, the
documents were all available, so it took around 20 minutes to re-index all
documents. So there was no time for an auto-commit to happen during the other
runs, and the log never shows the newSearcher warming query that I use as a
document count.

About the newSearcher warming query, it is a typo in the config. It should
have been 'qt'. Thanks for this one!

In my schema.xml, I have defined the id and signature fields like this:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="signature" type="string" indexed="true" stored="true"/>
...
<uniqueKey>id</uniqueKey>
<defaultSearchField>fulltext</defaultSearchField>


And here is our solrconfig.xml:
<?xml version="1.0" encoding="UTF-8" ?>

<config>

  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
    <lockType>single</lockType>
  </indexDefaults>

  <mainIndex>
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- Perform a <commit/> automatically under certain conditions:
         maxDocs - number of updates since last commit is greater than this
         maxTime - oldest uncommitted update (in ms) is this long ago
    -->
    <autoCommit>
      <maxDocs>1</maxDocs>
      <maxTime>180</maxTime>
    </autoCommit>
  </updateHandler>


  <query>
    <maxBooleanClauses>1024</maxBooleanClauses>

    <filterCache
      class="solr.FastLRUCache"
      size="1048576"
      initialSize="4096"
      autowarmCount="1024"/>

    <queryResultCache
      class="solr.LRUCache"
      size="16384"
      initialSize="4096"
      autowarmCount="128"/>

    <documentCache
      class="solr.FastLRUCache"
      size="1048576"
      initialSize="512"
      autowarmCount="0"/>

    <enableLazyFieldLoading>true</enableLazyFieldLoading>
    <queryResultWindowSize>50</queryResultWindowSize>
    <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
    <HashDocSet maxSize="3000" loadFactor="0.75"/>

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">original_date desc</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="wt">dismax</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.field">source</str>
          <str name="facet.field">author</str>
          <str name="facet.field">type</str>
          <str name="facet.field">site</str>
        </lst>
      </arr>
    </listener>

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">original_date desc</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="wt">dismax</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.field">source</str>
          <str name="facet.field">author</str>
          <str name="facet.field">type</str>
          <str name="facet.field">site</str>
        </lst>
      </arr>
    </listener>

    <useColdSearcher>false</useColdSearcher>
    <maxWarmingSearchers>2</maxWarmingSearchers>
  </query>

  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
    <httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
    </httpCaching>
  </requestDispatcher>

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
 

Re: Documents disappearing

2010-02-23 Thread Chris Hostetter

: I have encountered a situation that I can't explain. We are indexing documents
: that are often duplicates so we activated deduplication like this:

FWIW: w/o providing us more info about what your schema looks like, and
how you are indexing documents, all we can do is speculate about some of
the possible causes of your problems -- for all we know you don't have
your uniqueKey configured properly, or have something in DIH configured to
do deletes on delta imports, etc...  We need all the facts to make
informed suggestions.

: What I can't explain is that when I look at the document count in the log,
: I see documents disappearing.
: 
: 11:24:23 INFO  - [myindex] webapp=null path=null
: params={event=newSearcher&q=*:*&wt=dismax} hits=0 status=0 QTime=0

1) it looks like you only included the newSearcher related warming query
log messages in your email ... i assume you double checked that there were
no delete messages logged by the LogUpdateProcessor ?

2) that's a fairly nonsensical warming query ... do you really have a
queryResponseWriter registered with the name dismax? (it's typically used
as either a RequestHandler (qt) or QParser (defType)) ... w/o knowing what
your default requestHandler declaration looks like, it's totally possible
that the number you are seeing has nothing to do with the total number of
docs in your index, and instead just indicates how many docs match the
literal string *:* in your default search field (or some set of query
fields if you are using dismax as the default QParser) which can
certainly change as you update existing documents..
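
To make the distinction between these parameters concrete, here is a small sketch that only builds query URLs and sends nothing. The base URL and the helper name `build_query_url` are assumptions for illustration. The point is that `wt` selects the response format, `defType` selects the query parser, and a match-all `q=*:*` with `rows=0` under the default Lucene query parser is the usual way to get a total document count.

```python
from urllib.parse import urlencode, parse_qs

# ASSUMPTION: hypothetical host, port and path; substitute the real ones.
SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(base_url, **params):
    """Return base_url with the given parameters URL-encoded as a query string."""
    return base_url + "?" + urlencode(params)


# wt picks the response writer (output format), not the query parser.
# With the default parser, q=*:* matches every document; rows=0 returns
# only the hit count and no document bodies.
count_all_url = build_query_url(SELECT_URL, q="*:*", rows=0, wt="xml")

# defType picks the query parser. With dismax, the q value is treated as
# plain user text searched against the configured query fields.
dismax_url = build_query_url(SELECT_URL, q="solr", defType="dismax", rows=0)

print(count_all_url)
print(dismax_url)
```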

As i said: full configs would make it a lot easier to help clear up what 
you are seeing.



-Hoss



RE: Documents disappearing

2010-02-19 Thread Ankit Bhatnagar
Try inspecting your index with luke


Ankit


-Original Message-
From: Pascal Dimassimo [mailto:thesuper...@hotmail.com] 
Sent: Friday, February 19, 2010 2:22 PM
To: solr-user@lucene.apache.org
Subject: Documents disappearing


Hi,

I have encountered a situation that I can't explain. We are indexing documents
that are often duplicates so we activated deduplication like this:

<processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
  <bool name="enabled">true</bool>
  <bool name="overwriteDupes">true</bool>
  <str name="signatureField">signature</str>
  <str name="fields">title,text</str>
  <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
</processor>
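
As a way to reason about what a configuration like this could do to document counts, here is a toy model. It is not Solr's implementation and makes no claim about the cause of the behavior reported in this thread: `ToyDedupIndex` is a hypothetical class, MD5 stands in for the configured signature class, and the model assumes that an add replaces any existing document sharing either its id or its signature.

```python
import hashlib


def signature(doc, fields=("title", "text")):
    """Hash the configured fields. ASSUMPTION: MD5 is only a stand-in here."""
    joined = "\x00".join(str(doc.get(name, "")) for name in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


class ToyDedupIndex:
    """Hypothetical model: both id and signature are treated as unique."""

    def __init__(self):
        self.docs = []

    def add(self, doc):
        doc = dict(doc, signature=signature(doc))
        # ASSUMPTION: drop any existing doc that shares the id OR the signature.
        self.docs = [
            d for d in self.docs
            if d["id"] != doc["id"] and d["signature"] != doc["signature"]
        ]
        self.docs.append(doc)

    def num_docs(self):
        return len(self.docs)


index = ToyDedupIndex()
index.add({"id": "1", "title": "A", "text": "first"})
index.add({"id": "2", "title": "B", "text": "second"})
count_before = index.num_docs()

# Same id as the first doc, same content (and so signature) as the second.
index.add({"id": "1", "title": "B", "text": "second"})
count_after = index.num_docs()

print(count_before, count_after)
```

Under these assumptions a single add can replace two existing documents, so the count can fall without any delete command being sent.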

What I can't explain is that when I look at the document count in the log,
I see documents disappearing.

11:24:23 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=0 status=0 QTime=0
14:04:24 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=4065 status=0 QTime=10
14:17:07 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=6499 status=0 QTime=42
14:25:42 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7629 status=0 QTime=1
14:47:12 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10140 status=0 QTime=12
15:17:22 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10861 status=0 QTime=13
15:47:31 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=9852 status=0 QTime=19
16:17:42 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=8112 status=0 QTime=13
16:38:17 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=10
16:39:10 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=1
16:47:40 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=46
16:51:24 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=74
17:02:13 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=102
17:17:41 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=8
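
The drops are easier to see when the hit counts are pulled out of the log and compared pairwise. The following is a minimal sketch: the function name `find_drops` is made up, the regular expression assumes each entry sits on one line starting with an HH:MM:SS timestamp and containing `hits=N`, and the sample lines reuse timestamps and hit counts from the log above with the params part omitted for brevity.

```python
import re

# ASSUMPTION: one log entry per line, starting with HH:MM:SS and containing hits=N.
LINE_PATTERN = re.compile(r"^(\d{2}:\d{2}:\d{2}).*?\bhits=(\d+)")


def find_drops(log_lines):
    """Return (previous_time, time, decrease) for every decrease in hit count."""
    readings = []
    for line in log_lines:
        match = LINE_PATTERN.search(line)
        if match:
            readings.append((match.group(1), int(match.group(2))))

    drops = []
    for (prev_time, prev_hits), (time, hits) in zip(readings, readings[1:]):
        if hits < prev_hits:
            drops.append((prev_time, time, prev_hits - hits))
    return drops


log_lines = [
    "14:47:12 INFO  - [myindex] webapp=null path=null hits=10140 status=0 QTime=12",
    "15:17:22 INFO  - [myindex] webapp=null path=null hits=10861 status=0 QTime=13",
    "15:47:31 INFO  - [myindex] webapp=null path=null hits=9852 status=0 QTime=19",
    "16:17:42 INFO  - [myindex] webapp=null path=null hits=8112 status=0 QTime=13",
    "16:38:17 INFO  - [myindex] webapp=null path=null hits=7725 status=0 QTime=10",
    "16:39:10 INFO  - [myindex] webapp=null path=null hits=7725 status=0 QTime=1",
]
drops = find_drops(log_lines)
for prev_time, time, decrease in drops:
    print(f"{prev_time} -> {time}: count fell by {decrease}")
```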

11:24 was the time at which Solr was started that day. Around 13:30, we
started indexing.

At some point during indexing, I noticed that a batch of documents was
resent (i.e., documents with the same id field were sent again to the index).
And according to the log, NO delete was sent to Solr.

I understand that if I send duplicates (either documents with the same id or
with the same signature), the count of documents should stay the same. But
how can we explain that it is decreasing? What are the possible causes of this
behavior?

Thanks! 
-- 
View this message in context: 
http://old.nabble.com/Documents-disappearing-tp27659047p27659047.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Documents disappearing

2010-02-19 Thread Pascal Dimassimo

Using LukeRequestHandler, I see:

<int name="numDocs">7725</int>
<int name="maxDoc">28099</int>
<int name="numTerms">758826</int>
<long name="version">1266355690710</long>
<bool name="optimized">false</bool>
<bool name="current">true</bool>
<bool name="hasDeletions">true</bool>
<str name="directory">org.apache.lucene.store.NIOFSDirectory:org.apache.lucene.store.NIOFSDirectory@/opt/solr/myindex/data/index</str>

I will copy the index to my local machine so I can open it with luke. Should
I look for something specific?

Thanks!



-- 
View this message in context: 
http://old.nabble.com/Documents-disappearing-tp27659047p27660077.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Documents disappearing

2010-02-19 Thread Otis Gospodnetic
Pascal,

Look at that difference between numDocs and maxDoc.  That delta represents 
deleted docs.  Maybe there is something deleting your docs after all!
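
For the numbers reported by the LukeRequestHandler, the delta works out as below. The helper name `deleted_doc_count` is just for illustration. One caveat worth keeping in mind: in Lucene, overwriting a document is a delete of the old version followed by an add, and the old version keeps counting toward maxDoc until segments are merged, so the delta alone does not prove that explicit delete commands were issued.

```python
def deleted_doc_count(num_docs, max_doc):
    """Number of documents marked deleted but not yet merged away."""
    if max_doc < num_docs:
        raise ValueError("max_doc cannot be smaller than num_docs")
    return max_doc - num_docs


# Values taken from the Luke output quoted in this thread.
deleted = deleted_doc_count(num_docs=7725, max_doc=28099)
print(deleted)
```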

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/


