Re: Documents disappearing
: A quick check did show me a couple of duplicates, but if I understand
: correctly, even if two different processes send the same document, the last
: one should update the previous. If I send the same document 10 times, in
: the end, it should only be in my index once, no?

It should, yes ... I didn't say I could explain your problem, I'm just trying to speculate about things that might give us insight into figuring out if/where a bug exists.

The only thing I can possibly think of that would cause a situation like this (where the number of documents decreases w/o any deletes happening) is if some of the add commands use overwrite=false and some use overwrite=true ... in that situation, you might get 10 docs added with the same uniqueKey value using overwrite=false, and so you'll have 10 docs in your index. Then you might index one more doc with the same uniqueKey value, but this time using overwrite=true, and that one document will overwrite all 10 of the previous documents, causing your doc count to decrease from 10 to 1.

But nothing in your description of how you are using Solr implies that you were doing this, hence my question of what exactly your indexing code looks like.

My best guess is that maybe the deduplication UpdateProcessors have a bug in them, but w/o a reproducible test case demonstrating the problem it will be nearly impossible to even know where (or if that's actually the problem at all).

-Hoss
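The overwrite=false / overwrite=true scenario described above can be sketched with a toy model. This is only an illustration of the counting argument, not Solr's implementation: the ToyIndex class, its add method, and the "doc-1" key are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Toy stand-in for an index (hypothetical, NOT Solr's real code)."""
    docs: list = field(default_factory=list)

    def add(self, doc: dict, overwrite: bool = True) -> None:
        if overwrite:
            # Drop every existing doc that shares this uniqueKey value.
            self.docs = [d for d in self.docs if d["id"] != doc["id"]]
        # With overwrite=False the doc is simply appended, duplicates and all.
        self.docs.append(doc)

    def num_docs(self) -> int:
        return len(self.docs)


def simulate_mixed_overwrite() -> list:
    """Return the doc count after ten non-overwriting adds, then one overwriting add."""
    index = ToyIndex()
    counts = []
    for rev in range(10):
        index.add({"id": "doc-1", "rev": rev}, overwrite=False)
    counts.append(index.num_docs())
    index.add({"id": "doc-1", "rev": 10}, overwrite=True)
    counts.append(index.num_docs())
    return counts


if __name__ == "__main__":
    print(simulate_mixed_overwrite())  # [10, 1]
```

Under these assumptions the count drops from 10 to 1 without a single explicit delete, which is the only point the sketch is meant to make.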
Re: Documents disappearing
Hi,

hossman wrote:
: : We index using 4 processes that read from a queue of documents. Each process
: : sends one document at a time to the /update handler.
:
: Hmmm.. then you should have a message from the LogUpdateProcessorFactory for
: every individual add command that was received ... did you crunch those to
: see if anything odd popped up (ie: duplicated IDs)?
:
: What did the "start commit" log messages look like?
:
: (FWIW: I have no hunches as to what caused that behavior, I'm just
: scrounging for more data)

A quick check did show me a couple of duplicates, but if I understand correctly, even if two different processes send the same document, the last one should update the previous. If I send the same document 10 times, in the end, it should only be in my index once, no?

The start commit message is always:

start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)

hossman wrote:
: : Yes, I double checked that no delete occurred. Since that indexation, I
: : re-indexed the same set of documents twice and we always end up with 7725
: : documents, but it did not show that ~1 documents count that we saw the
: : first time. But the difference between the first indexation and the others
: : was that the first time, the indexation lasted a couple of hours because the
: : documents were not always accessible in our document queue. The others
:
: Hmmm... what exactly does your indexing code do when the documents aren't
: available? ... and what happens if you forcibly commit in the middle of
: reindexing (to see some of those counts again)?

If no document is available, the threads sleep. If a commit is sent manually during the re-indexation, it just commits what has been sent to the index so far.

I will redo the test with the same documents and in the same conditions as in our first indexation to see if the counts will be the same again.

Again, thanks a lot for your help.
-- View this message in context: http://old.nabble.com/Documents-disappearing-tp27659047p27794641.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Documents disappearing
: We index using 4 processes that read from a queue of documents. Each process
: sends one document at a time to the /update handler.

Hmmm.. then you should have a message from the LogUpdateProcessorFactory for every individual add command that was received ... did you crunch those to see if anything odd popped up (ie: duplicated IDs)?

What did the "start commit" log messages look like?

(FWIW: I have no hunches as to what caused that behavior, I'm just scrounging for more data)

: Yes, I double checked that no delete occurred. Since that indexation, I
: re-indexed the same set of documents twice and we always end up with 7725
: documents, but it did not show that ~1 documents count that we saw the
: first time. But the difference between the first indexation and the others
: was that the first time, the indexation lasted a couple of hours because the
: documents were not always accessible in our document queue. The others

Hmmm... what exactly does your indexing code do when the documents aren't available? ... and what happens if you forcibly commit in the middle of reindexing (to see some of those counts again)?

: About the newSearcher warming query, it is a typo in the config. It should
: have been 'qt'. Thanks for this one!

Even if you change wt to qt that won't make the query make sense (q=*:* isn't a very useful query string when using qt=dismax).

-Hoss
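"Crunching" the add messages for duplicated IDs could look roughly like the sketch below. It assumes each add is logged with a fragment of the form {add=[id1, id2]}; that format, the function names, and the sample lines used to try it out are assumptions for illustration and should be checked against the actual log before relying on the result.

```python
import re
from collections import Counter

# ASSUMPTION: add commands appear in the log as "{add=[id1, id2]}".
ADD_PATTERN = re.compile(r"\{add=\[([^\]]*)\]")


def count_added_ids(log_lines):
    """Count how many times each document id shows up in add messages."""
    counts = Counter()
    for line in log_lines:
        match = ADD_PATTERN.search(line)
        if not match:
            continue  # not an add message (e.g. a commit line)
        for doc_id in match.group(1).split(","):
            doc_id = doc_id.strip()
            if doc_id:
                counts[doc_id] += 1
    return counts


def duplicated_ids(log_lines):
    """Return only the ids that were added more than once."""
    return {doc_id: n for doc_id, n in count_added_ids(log_lines).items() if n > 1}


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the real log.
    sample = [
        "INFO - {add=[a]} 0 3",
        "INFO - {add=[b, c]} 0 4",
        "INFO - {add=[a]} 0 2",
        "INFO - start commit(optimize=false)",
    ]
    print(duplicated_ids(sample))
```

Feeding it the real log (for example via open(path) as the log_lines iterable) would list every id that was sent more than once, which is the "anything odd" being asked about.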
Re: Documents disappearing
Hoss,

Thanks for your answers. You are absolutely right, I should have provided you more details.

We index using 4 processes that read from a queue of documents. Each process sends one document at a time to the /update handler.

Yes, I double checked that no delete occurred. Since that indexation, I re-indexed the same set of documents twice and we always end up with 7725 documents, but it did not show that ~1 documents count that we saw the first time. But the difference between the first indexation and the others was that the first time, the indexation lasted a couple of hours because the documents were not always accessible in our document queue. The other times, the documents were all available so it took around 20 minutes to re-index all documents. So there was no time for an auto-commit to happen during the other indexations, and the log never shows the newSearcher warming query that I use as a document count.

About the newSearcher warming query, it is a typo in the config. It should have been 'qt'. Thanks for this one!

In my schema.xml, I have defined the id and signature fields like this:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="signature" type="string" indexed="true" stored="true"/>
...
<uniqueKey>id</uniqueKey>
<defaultSearchField>fulltext</defaultSearchField>

And here is our solrconfig.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
    <lockType>single</lockType>
  </indexDefaults>

  <mainIndex>
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- Perform a <commit/> automatically under certain conditions:
         maxDocs - number of updates since last commit is greater than this
         maxTime - oldest uncommitted update (in ms) is this long ago
    -->
    <autoCommit>
      <maxDocs>1</maxDocs>
      <maxTime>180</maxTime>
    </autoCommit>
  </updateHandler>

  <query>
    <maxBooleanClauses>1024</maxBooleanClauses>
    <filterCache class="solr.FastLRUCache" size="1048576" initialSize="4096" autowarmCount="1024"/>
    <queryResultCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="128"/>
    <documentCache class="solr.FastLRUCache" size="1048576" initialSize="512" autowarmCount="0"/>
    <enableLazyFieldLoading>true</enableLazyFieldLoading>
    <queryResultWindowSize>50</queryResultWindowSize>
    <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
    <HashDocSet maxSize="3000" loadFactor="0.75"/>

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">original_date desc</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="wt">dismax</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.field">source</str>
          <str name="facet.field">author</str>
          <str name="facet.field">type</str>
          <str name="facet.field">site</str>
        </lst>
      </arr>
    </listener>

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">original_date desc</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="wt">dismax</str>
        </lst>
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.field">source</str>
          <str name="facet.field">author</str>
          <str name="facet.field">type</str>
          <str name="facet.field">site</str>
        </lst>
      </arr>
    </listener>

    <useColdSearcher>false</useColdSearcher>
    <maxWarmingSearchers>2</maxWarmingSearchers>
  </query>

  <requestDispatcher handleSelect="true">
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
    <httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
    </httpCaching>
  </requestDispatcher>

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
Re: Documents disappearing
: I have encountered a situation that I can't explain. We are indexing documents
: that are often duplicates so we activated deduplication like this:

FWIW: w/o providing us more info about what your schema looks like, and how you are indexing documents, all we can do is speculate about some of the possible causes of your problems -- for all we know you don't have your uniqueKey configured properly, or have something in DIH configured to do deletes on delta imports, etc... We need all the facts to make informed suggestions.

: What I can't explain is that when I look at the documents count in the log,
: I see documents disappearing.
:
: 11:24:23 INFO - [myindex] webapp=null path=null
: params={event=newSearcher&q=*:*&wt=dismax} hits=0 status=0 QTime=0

1) It looks like you only included the newSearcher related warming query log messages in your email ... I assume you double checked that there were no delete messages logged by the LogUpdateProcessor?

2) That's a fairly nonsensical warming query ... do you really have a queryResponseWriter registered with the name dismax? (dismax is typically used as either a RequestHandler (qt) or a QParser (defType).) W/o knowing what your default requestHandler declaration looks like, it's totally possible that the number you are seeing has nothing to do with the total number of docs in your index, and instead just indicates how many docs match the literal string *:* in your default search field (or some set of query fields if you are using dismax as the default QParser), which can certainly change as you update existing documents.

As I said: full configs would make it a lot easier to help clear up what you are seeing.

-Hoss
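Since the hits value of a warming query depends on which handler and query parser end up serving it, one way to get an unambiguous document count is to ask for it explicitly. The sketch below only builds such a request and reads numFound from a JSON response. The base URL is hypothetical, and it assumes the /select handler honours defType=lucene and wt=json; it is a sketch under those assumptions, not a statement about this particular setup.

```python
import json
from urllib.parse import urlencode


def count_query_url(base_url):
    """Build a match-all count query: no rows returned, JSON response."""
    params = {
        "q": "*:*",           # match-all in the Lucene query syntax
        "defType": "lucene",  # ASSUMPTION: handler honours defType
        "rows": 0,            # only the total is needed, not the docs
        "wt": "json",         # wt selects the response writer
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def num_found(response_body):
    """Extract the total hit count from a JSON search response body."""
    return json.loads(response_body)["response"]["numFound"]


if __name__ == "__main__":
    # Hypothetical core URL; adjust to the real deployment.
    print(count_query_url("http://localhost:8983/solr/myindex"))
    # Hand-written response body in the usual JSON shape, for illustration.
    print(num_found('{"response": {"numFound": 7725, "start": 0, "docs": []}}'))
```

The response body could be fetched with urllib.request.urlopen; that call is left out here so the sketch runs without a Solr instance.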
RE: Documents disappearing
Try inspecting your index with luke

Ankit

-----Original Message-----
From: Pascal Dimassimo [mailto:thesuper...@hotmail.com]
Sent: Friday, February 19, 2010 2:22 PM
To: solr-user@lucene.apache.org
Subject: Documents disappearing

Hi,

I have encountered a situation that I can't explain. We are indexing documents that are often duplicates so we activated deduplication like this:

<processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
  <bool name="enabled">true</bool>
  <bool name="overwriteDupes">true</bool>
  <str name="signatureField">signature</str>
  <str name="fields">title,text</str>
  <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
</processor>

What I can't explain is that when I look at the documents count in the log, I see documents disappearing.

11:24:23 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=0 status=0 QTime=0
14:04:24 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=4065 status=0 QTime=10
14:17:07 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=6499 status=0 QTime=42
14:25:42 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7629 status=0 QTime=1
14:47:12 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10140 status=0 QTime=12
15:17:22 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10861 status=0 QTime=13
15:47:31 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=9852 status=0 QTime=19
16:17:42 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=8112 status=0 QTime=13
16:38:17 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=10
16:39:10 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=1
16:47:40 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=46
16:51:24 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=74
17:02:13 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=102
17:17:41 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=8

11:24 was the time at which Solr was started that day. Around 13:30, we started the indexation. At some point during the indexation, I noticed that a batch of documents was resent (i.e., documents with the same id field were sent again to the index). And according to the log, NO delete was sent to Solr.

I understand that if I send duplicates (either documents with the same id or with the same signature), the count of documents should stay the same. But how can we explain that it is decreasing? What are the possible causes of this behavior?

Thanks!

--
View this message in context: http://old.nabble.com/Documents-disappearing-tp27659047p27659047.html
Sent from the Solr - User mailing list archive at Nabble.com.
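To see exactly where the count goes down, the hits values can be pulled out of the newSearcher lines and compared pairwise. The sketch below uses a few of the lines quoted in this message as sample input; the function names are made up for the example, and the only thing it relies on is a leading HH:MM:SS timestamp and a hits=N token.

```python
import re

# Leading timestamp, then anything, then a "hits=N" token.
LINE_PATTERN = re.compile(r"^(\d{2}:\d{2}:\d{2}) .*\bhits=(\d+)\b")

SAMPLE_LINES = [
    "14:47:12 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10140 status=0 QTime=12",
    "15:17:22 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10861 status=0 QTime=13",
    "15:47:31 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=9852 status=0 QTime=19",
    "16:17:42 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=8112 status=0 QTime=13",
    "16:38:17 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=10",
]


def hit_counts(lines):
    """Return (time, hits) pairs for every line that carries a hits value."""
    counts = []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            counts.append((match.group(1), int(match.group(2))))
    return counts


def drops(counts):
    """Return (time, previous_hits, hits) wherever hits went down."""
    result = []
    for (_, prev_hits), (time, hits) in zip(counts, counts[1:]):
        if hits < prev_hits:
            result.append((time, prev_hits, hits))
    return result


if __name__ == "__main__":
    for time, before, after in drops(hit_counts(SAMPLE_LINES)):
        print(f"{time}: {before} -> {after} ({after - before})")
```

On these sample lines the decreases show up at 15:47:31, 16:17:42 and 16:38:17, i.e. at three consecutive newSearcher events rather than in one step.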
RE: Documents disappearing
Using LukeRequestHandler, I see:

<int name="numDocs">7725</int>
<int name="maxDoc">28099</int>
<int name="numTerms">758826</int>
<long name="version">1266355690710</long>
<bool name="optimized">false</bool>
<bool name="current">true</bool>
<bool name="hasDeletions">true</bool>
<str name="directory">org.apache.lucene.store.NIOFSDirectory:org.apache.lucene.store.NIOFSDirectory@/opt/solr/myindex/data/index</str>

I will copy the index to my local machine so I can open it with luke. Should I look for something specific?

Thanks!

ANKITBHATNAGAR wrote:
: Try inspecting your index with luke
:
: Ankit

--
View this message in context: http://old.nabble.com/Documents-disappearing-tp27659047p27660077.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Documents disappearing
Pascal,

Look at that difference between numDocs and maxDoc. That delta represents deleted docs. Maybe there is something deleting your docs after all!

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/

----- Original Message ----
From: Pascal Dimassimo thesuper...@hotmail.com
To: solr-user@lucene.apache.org
Sent: Fri, February 19, 2010 3:50:26 PM
Subject: RE: Documents disappearing

: Using LukeRequestHandler, I see:
:
: numDocs 7725
: maxDoc 28099
: numTerms 758826
: version 1266355690710
: optimized false
: current true
: hasDeletions true
: directory org.apache.lucene.store.NIOFSDirectory:org.apache.lucene.store.NIOFSDirectory@/opt/solr/myindex/data/index
:
: I will copy the index to my local machine so I can open it with luke.
: Should I look for something specific?
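The delta mentioned above is simple arithmetic on the two Luke values reported earlier in the thread. A minimal sketch, with a made-up function name:

```python
def deleted_doc_count(num_docs, max_doc):
    """Number of document slots marked deleted: maxDoc minus numDocs."""
    if max_doc < num_docs:
        raise ValueError("maxDoc cannot be smaller than numDocs")
    return max_doc - num_docs


# Values reported by LukeRequestHandler in this thread.
LUKE_STATS = {"numDocs": 7725, "maxDoc": 28099}

if __name__ == "__main__":
    deleted = deleted_doc_count(LUKE_STATS["numDocs"], LUKE_STATS["maxDoc"])
    print(deleted)  # 20374
```

One caveat when reading this number: in Lucene, replacing a document is carried out as a delete followed by an add, so overwritten documents also count toward this delta until their segments are merged. The figure therefore shows that deletions happened, but by itself it does not say whether they came from explicit delete commands or from overwriting adds.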