Re: Documents disappearing
: A quick check did show me a couple of duplicates, but if I understand
: correctly, even if two different process send the same document, the last
: one should update the previous. If I send the same documents 10 times, in
: the end, it should only be in my index once, no?

it should, yes ... i didn't say i could explain your problem, i'm just trying to speculate about things that might give us insight into figuring out if/where a bug exists.

the only thing i can possibly think of that would cause a situation like this (where the number of documents decreases w/o any deletes happening) is if some of the "add" commands use overwrite="false" and some use overwrite="true" ... in that situation, you might get 10 docs added with the same uniqueKey value using overwrite="false" and so you'll have 10 docs in your index. then you might index one more doc with the same uniqueKey value, but this time using overwrite="true", and that one document will overwrite all 10 of the previous documents, causing your doc count to decrease from 10 to 1.

But nothing in your description of how you are using Solr implies that you were doing this, hence my question of what exactly your indexing code looks like.

My best guess is that maybe the deduplication UpdateProcessors have a bug in them, but w/o a reproducible test case demonstrating the problem it will be nearly impossible to even know where (or if that's actually the problem at all)

-Hoss
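
The overwrite scenario described above can be sketched with a toy model. This is not Solr code: ToyIndex and its methods are hypothetical names, and the only behavior modeled is the one stated in the message (an add with overwrite="false" keeps existing docs with the same uniqueKey, an add with overwrite="true" replaces all of them).

```python
class ToyIndex:
    """Toy model of an index keyed by a uniqueKey value (hypothetical, not Solr)."""

    def __init__(self):
        self.docs = []  # list of (unique_key, payload) tuples

    def add(self, unique_key, payload, overwrite=True):
        if overwrite:
            # Remove every existing doc that shares this key before adding.
            self.docs = [d for d in self.docs if d[0] != unique_key]
        self.docs.append((unique_key, payload))

    def num_docs(self):
        return len(self.docs)


index = ToyIndex()

# Ten adds with the same key and overwrite disabled: all ten copies are kept.
for i in range(10):
    index.add("doc-1", f"version {i}", overwrite=False)
count_before = index.num_docs()

# One more add with overwrite enabled replaces all ten earlier copies.
index.add("doc-1", "final version", overwrite=True)
count_after = index.num_docs()

print(count_before, count_after)
```

In this model the count drops from 10 to 1 without any explicit delete, which is the only mechanism the message proposes; whether anything like it happened in the index under discussion is not established in the thread.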
Re: Documents disappearing
Hi,

hossman wrote:
>
> : We index using 4 processes that read from a queue of documents. Each process
> : send one document at a time to the /update handler.
>
> Hmmm.. then you should have a message from the LogUpdateProcessorFactory
> for every individual "add" command that was received ... did you crunch
> those to see if anything odd popped up (ie: duplicated IDs)
>
> what did the "start commit" log messages look like?
>
> (FWIW: I have no hunches as to what caused that behavior, i'm just
> scrounging for more data)
>

A quick check did show me a couple of duplicates, but if I understand correctly, even if two different processes send the same document, the last one should update the previous one. If I send the same document 10 times, in the end it should only be in my index once, no?

The "start commit" message is always:

start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)

hossman wrote:
>
> : Yes, I double checked that no delete occur. Since that indexation, I
> : re-index the same set of documents twice and we always end up with 7725
> : documents, but it did not show that ~1 documents count that we saw the
> : first time. But the difference between the first indexation and the others
> : was that the first time, the indexation last a couple of hours because the
> : documents were not always accessible in our document queue. The others
>
> Hmmm... what exactly does your indexing code do when the documents aren't
> available? ... and what happens if you forcibly commit in the middle of
> reindexing (to see some of those counts again)
>

If no document is available, the threads sleep. If a commit is sent manually during the re-indexation, it just commits what has been sent to the index so far.

I will redo the test with the same documents and in the same conditions as in our first indexation to see if the counts will be the same again.

Again, thanks a lot for your help.
-- View this message in context: http://old.nabble.com/Documents-disappearing-tp27659047p27794641.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Documents disappearing
: We index using 4 processes that read from a queue of documents. Each process
: send one document at a time to the /update handler.

Hmmm.. then you should have a message from the LogUpdateProcessorFactory for every individual "add" command that was received ... did you crunch those to see if anything odd popped up (ie: duplicated IDs)

what did the "start commit" log messages look like?

(FWIW: I have no hunches as to what caused that behavior, i'm just scrounging for more data)

: Yes, I double checked that no delete occur. Since that indexation, I
: re-index the same set of documents twice and we always end up with 7725
: documents, but it did not show that ~1 documents count that we saw the
: first time. But the difference between the first indexation and the others
: was that the first time, the indexation last a couple of hours because the
: documents were not always accessible in our document queue. The others

Hmmm... what exactly does your indexing code do when the documents aren't available? ... and what happens if you forcibly commit in the middle of reindexing (to see some of those counts again)

: About the newSearcher warming query, it is a typo in the config. It should
: have been 'qt'. Thanks for this one!

Even if you change wt to qt, that won't make the query make sense (q=*:* isn't a very useful query string when using qt=dismax)

-Hoss
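
"Crunching" the add messages for duplicated IDs could look roughly like the sketch below. The log format is an assumption: it supposes each add is logged with the document ids inside an add=[...] token. The sample lines are made up for illustration and are not real Solr output, so the pattern would need adjusting to the actual log.

```python
import re
from collections import Counter

# ASSUMPTION: each logged add carries its ids in an "add=[id1, id2]" token.
ADD_PATTERN = re.compile(r"add=\[([^\]]*)\]")


def count_added_ids(log_lines):
    """Count how many times each document id appears in add messages."""
    counts = Counter()
    for line in log_lines:
        for match in ADD_PATTERN.finditer(line):
            for doc_id in match.group(1).split(","):
                doc_id = doc_id.strip()
                if doc_id:
                    counts[doc_id] += 1
    return counts


def duplicated_ids(log_lines):
    """Return only the ids that were added more than once."""
    return {doc_id: n for doc_id, n in count_added_ids(log_lines).items() if n > 1}


# Hypothetical sample lines in the assumed format (not taken from a real log).
sample_log = [
    "INFO - {add=[doc-1]} 0 3",
    "INFO - {add=[doc-2]} 0 2",
    "INFO - {add=[doc-1]} 0 4",
]

duplicates = duplicated_ids(sample_log)
print(duplicates)
```

Run against the full log of the first indexation, the ids reported here would be the ones worth comparing against the times at which the hit count dropped.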
Re: Documents disappearing
Hoss,

Thanks for your answers. You are absolutely right, I should have provided you more details.

We index using 4 processes that read from a queue of documents. Each process sends one document at a time to the /update handler.

Yes, I double-checked that no delete occurred. Since that indexation, I have re-indexed the same set of documents twice and we always end up with 7725 documents, but it did not show that ~1 documents count that we saw the first time. But the difference between the first indexation and the others was that the first time, the indexation lasted a couple of hours because the documents were not always accessible in our document queue. The other times, the documents were all available, so it took around 20 minutes to re-index all documents. So there was no time for an auto-commit to happen during the other indexations, and the log never shows the newSearcher warming query that I use as a document count.

About the newSearcher warming query, it is a typo in the config. It should have been 'qt'. Thanks for this one!

In my schema.xml, I have defined the id and signature fields like this:

...
id fulltext

And here is our solrconfig.xml:

${solr.abortOnConfigurationError:true} false 10 32 2147483647 1 1000 1 single false 32 10 2147483647 1 false 1 180 1024 true 50 200 *:* original_date desc *:* dismax *:* true source author type site *:* original_date desc *:* dismax *:* true source author type site false 2 explicit true 5 true true spellcheck dismax explicit 0.2 fulltext^1.2 title^2.3 text^1.2 fulltext^1.2 title^1.8 text^1.2 recip(rord(original_date),1,1,1)^100 original_date:[NOW-10DAY TO *]^2 id,title,text,author,original_date,source,section 2<100% 3<-1 4<-2 8<60% 100 *:* text features name 0 name regex true 5 true true spellcheck facetcleaner docreader queryelevation didyoumean likethis textSpell default spellchecker ./spellchecker1 standard solrpingquery all explicit true 100 70 0.5 [-\w ,/\n\"']{20,200} 5 solr ${enable.master:false} startup commit optimize ${enable.slave:false} ${slave.master.url} ${slave.poll.interval} true false signature title,text org.apache.solr.update.processor.Lookup3Signature

Again, thanks for your help!
Re: Documents disappearing
: I have encounter a situation that I can't explain. We are indexing documents
: that are often duplicates so we activated deduplication like this:

FWIW: w/o providing us more info about what your schema looks like, and how you are indexing documents, all we can do is speculate about some of the possible causes of your problems -- for all we know you don't have your uniqueKey configured properly, or have something in DIH configured to do deletes on delta imports, etc... We need all the facts to make informed suggestions.

: What I can't explain is that when I look at the documents count in the log,
: I see documents disappearing.
:
: 11:24:23 INFO - [myindex] webapp=null path=null
: params={event=newSearcher&q=*:*&wt=dismax} hits=0 status=0 QTime=0

1) it looks like you only included the "newSearcher" related warming query log messages in your email ... i assume you double checked that there were no "delete" messages logged by the LogUpdateProcessor ?

2) that's a fairly nonsensical warming query ... do you really have a queryResponseWriter registered with the name "dismax"? (it's typically used as either a RequestHandler (qt) or QParser (defType)) ... w/o knowing what your default requestHandler declaration looks like, it's totally possible that the number you are seeing has nothing to do with the total number of docs in your index, and instead just indicates how many docs match the literal string "*:*" in your default search field (or some set of query fields if you are using dismax as the default QParser), which can certainly change as you update existing documents.

As i said: full configs would make it a lot easier to help clear up what you are seeing.

-Hoss
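
One way to take the guesswork out of the count is to issue a match-all query that names its query parser explicitly, instead of relying on the handler defaults. The sketch below only builds the request URL. The host, port, and /select path are assumptions for illustration, and whether defType is honored depends on the Solr version and handler configuration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_count_url(base_url):
    """Build a match-all query URL that asks for zero rows (count only)."""
    params = {
        "q": "*:*",           # match every document
        "defType": "lucene",  # ASSUMPTION: forces the standard query parser
        "rows": 0,            # no documents returned, only the hit count
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical base URL; replace with the real core location.
url = build_count_url("http://localhost:8983/solr/myindex")
parsed = parse_qs(urlsplit(url).query)

print(url)
```

The hit count returned by such a request should not depend on the default search field or on dismax query fields, which is the ambiguity raised in point 2.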
Re: Documents disappearing
Pascal,

Look at that difference between numDocs and maxDoc. That delta represents deleted docs. Maybe there is something deleting your docs after all!

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/

----- Original Message -----
> From: Pascal Dimassimo
> To: solr-user@lucene.apache.org
> Sent: Fri, February 19, 2010 3:50:26 PM
> Subject: RE: Documents disappearing
>
> Using LukeRequestHandler, I see:
>
> 7725
> 28099
> 758826
> 1266355690710
> false
> true
> true
> org.apache.lucene.store.NIOFSDirectory:org.apache.lucene.store.NIOFSDirectory@/opt/solr/myindex/data/index
>
> I will copy the index to my local machine so I can open it with luke. Should
> I look for something specific?
>
> Thanks!
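
The delta Otis refers to is simple arithmetic. The sketch below assumes the first two values in the Luke output are numDocs and maxDoc, which is how his reply reads them; the helper name is hypothetical.

```python
def deleted_doc_count(num_docs, max_doc):
    """Number of documents marked deleted but not yet merged away."""
    if max_doc < num_docs:
        raise ValueError("maxDoc cannot be smaller than numDocs")
    return max_doc - num_docs


num_docs = 7725   # first value in the Luke output (assumed to be numDocs)
max_doc = 28099   # second value in the Luke output (assumed to be maxDoc)

deleted = deleted_doc_count(num_docs, max_doc)
print(deleted)
```

Note that replacing a document that has the same uniqueKey also marks the old copy as deleted until segments are merged, so a large delta does not by itself prove that explicit delete commands were sent.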
RE: Documents disappearing
Using LukeRequestHandler, I see:

7725
28099
758826
1266355690710
false
true
true
org.apache.lucene.store.NIOFSDirectory:org.apache.lucene.store.NIOFSDirectory@/opt/solr/myindex/data/index

I will copy the index to my local machine so I can open it with Luke. Should I look for something specific?

Thanks!

ANKITBHATNAGAR wrote:
>
> Try inspecting your index with luke
>
> Ankit
RE: Documents disappearing
Try inspecting your index with luke

Ankit

-----Original Message-----
From: Pascal Dimassimo [mailto:thesuper...@hotmail.com]
Sent: Friday, February 19, 2010 2:22 PM
To: solr-user@lucene.apache.org
Subject: Documents disappearing
Documents disappearing
Hi,

I have encountered a situation that I can't explain. We are indexing documents that are often duplicates, so we activated deduplication like this:

true
true
signature
title,text
org.apache.solr.update.processor.Lookup3Signature

What I can't explain is that when I look at the document count in the log, I see documents disappearing.

11:24:23 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=0 status=0 QTime=0
14:04:24 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=4065 status=0 QTime=10
14:17:07 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=6499 status=0 QTime=42
14:25:42 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7629 status=0 QTime=1
14:47:12 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10140 status=0 QTime=12
15:17:22 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10861 status=0 QTime=13
15:47:31 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=9852 status=0 QTime=19
16:17:42 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=8112 status=0 QTime=13
16:38:17 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=10
16:39:10 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=1
16:47:40 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=46
16:51:24 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=74
17:02:13 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=102
17:17:41 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=8

11:24 was the time at which Solr was started that day. Around 13:30, we started the indexation.

At some point during the indexation, I noticed that a batch of documents was resent (i.e., documents with the same id field were sent again to the index). And according to the log, NO delete was sent to Solr.

I understand that if I send duplicates (either documents with the same id or with the same signature), the count of documents should stay the same. But how can we explain that it is decreasing? What are the possible causes of this behavior?

Thanks!
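
To pin down when the count went down, the hits values can be pulled out of the newSearcher lines and compared pairwise. The sketch below uses a few of the log lines quoted above; the function name and the regular expression are illustrative choices, not part of any Solr tooling.

```python
import re

# Timestamp at the start of the line, hit count after "hits=".
HITS_PATTERN = re.compile(r"^(\d{2}:\d{2}:\d{2}) .*\bhits=(\d+)")


def find_drops(log_lines):
    """Return (timestamp, decrease) for every line whose hits fell below the previous one."""
    samples = []
    for line in log_lines:
        match = HITS_PATTERN.match(line)
        if match:
            samples.append((match.group(1), int(match.group(2))))

    drops = []
    for (_, prev_hits), (timestamp, hits) in zip(samples, samples[1:]):
        if hits < prev_hits:
            drops.append((timestamp, prev_hits - hits))
    return drops


# Lines copied from the log excerpt in the message above.
log_lines = [
    "14:47:12 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10140 status=0 QTime=12",
    "15:17:22 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10861 status=0 QTime=13",
    "15:47:31 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=9852 status=0 QTime=19",
    "16:17:42 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=8112 status=0 QTime=13",
    "16:38:17 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=10",
    "16:39:10 INFO - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=1",
]

drops = find_drops(log_lines)
print(drops)
```

The three windows it reports (ending 15:47:31, 16:17:42 and 16:38:17) are the periods in which to look for resent ids in the update log.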