Re: Odd pattern of segment creation - branch_4x bug?
On 12/3/2012 9:46 PM, Yonik Seeley wrote:
> On Mon, Dec 3, 2012 at 11:36 PM, Shawn Heisey s...@elyograg.org wrote:
>> <updateHandler class="solr.DirectUpdateHandler2">
>>   <autoCommit>
>>     <maxDocs>65536</maxDocs>
>>     <maxTime>30</maxTime>
>>   </autoCommit>
>>   <updateLog />
>> </updateHandler>
>
> Yeah, seems like that should just generate hard commits. Are you using delete-by-query at all? The current implementation can do an NRT reopen as part of its work.

When I do deletes, they are done exclusively with deleteByQuery in SolrJ, using a query like did:(1000 OR 1001 OR 1002 OR 1003). Before doing the delete, I run a search with the same query and only proceed with the delete on that shard if numFound is greater than zero.

I don't know if this is necessary information, but no deletes are happening while I do my full-import. Nothing else happens to the index at all while DIH is doing its thing.

This is new experimentation with autocommit. With autocommit enabled, the commit at the end of an import is much faster. When autocommit is turned off, several minutes pass between the last document add and the handler returning to idle status. I like this new faster commit, but if I can't get an autocommit to produce only one segment, I may go back to the old way, to reduce the I/O impact of the more frequent merges.

Thanks,
Shawn

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
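The search-then-delete guard described in this message can be sketched roughly as follows. This is a minimal illustration in Python rather than SolrJ, and it is not the code used in the thread: `build_id_query`, `delete_if_present`, and the two callables passed in are hypothetical names standing in for the real search (numFound) and deleteByQuery calls.

```python
from typing import Callable, Iterable


def build_id_query(field: str, ids: Iterable[int]) -> str:
    """Build a query string such as did:(1000 OR 1001 OR 1002)."""
    id_list = [str(i) for i in ids]
    if not id_list:
        raise ValueError("at least one id is required")
    return f"{field}:({' OR '.join(id_list)})"


def delete_if_present(query: str,
                      count_matches: Callable[[str], int],
                      delete_by_query: Callable[[str], None]) -> bool:
    """Issue the delete only when the shard holds matching documents.

    count_matches and delete_by_query are hypothetical hooks; in a real
    client they would wrap the search request and the deleteByQuery call.
    Returns True when a delete was sent.
    """
    if count_matches(query) > 0:
        delete_by_query(query)
        return True
    return False


if __name__ == "__main__":
    # Stand-in for one shard: a set of did values instead of a Solr core.
    shard = {1001, 1003, 2000}
    wanted = [1000, 1001, 1002, 1003]
    query = build_id_query("did", wanted)

    def count_matches(_query: str) -> int:
        return len(shard.intersection(wanted))

    def delete_by_query(_query: str) -> None:
        shard.difference_update(wanted)

    print(query, delete_if_present(query, count_matches, delete_by_query))
```

Skipping the delete on shards with no matches means those shards never receive a delete-by-query request at all, which is relevant here because the reply quoted above notes that delete-by-query can trigger an NRT reopen.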
Odd pattern of segment creation - branch_4x bug?
I'm using Solr compiled from a branch_4x checkout:

  solr-impl 4.1-SNAPSHOT 1416639M - ncindex - 2012-12-03 12:54:38

I've noticed something really odd happening during a DIH full-import of millions of documents, and I'm wondering if it's a bug. The config bits that I think may be relevant are below. If you'd like more information, please let me know what you need and whether I should turn on settings like infoStream and do another import:

  Autocommit is set to maxDocs 65536 and maxTime 30.
  ramBufferSizeMB is 100.
  updateLog is enabled, no options.

What's happening is that whenever it hits maxDocs, I get 2 segments, one of them significantly smaller than the other. Rarely, it creates 3 segments! I know it's not a ramBuffer problem, because initially the exact same thing was happening with maxDocs at 10 and a 32MB ramBuffer. I raised the ramBuffer and lowered the maxDocs. It takes significantly less than 5 minutes for maxDocs documents to get indexed, so the maxTime value should not be a factor.

Sometimes the last segment is incomplete until the next autocommit, consisting only of files like the following. On the next autocommit, the incomplete segment is completed.

  -rw-r--r-- 1 ncindex ncindex     411 Dec 3 14:22 _fu.si
  -rw-r--r-- 1 ncindex ncindex   55966 Dec 3 14:22 _fu_Lucene41_0.tip
  -rw-r--r-- 1 ncindex ncindex 1983125 Dec 3 14:22 _fu_Lucene41_0.tim
  -rw-r--r-- 1 ncindex ncindex 1720492 Dec 3 14:22 _fu_Lucene41_0.pos
  -rw-r--r-- 1 ncindex ncindex 1384931 Dec 3 14:22 _fu_Lucene41_0.doc

Sometimes the last segment does get written completely before the next autocommit. I have no idea what makes things happen differently sometimes:

  -rw-r--r-- 1 ncindex ncindex  144497 Dec 3 14:16 _fq.tvx
  -rw-r--r-- 1 ncindex ncindex 6106209 Dec 3 14:16 _fq.tvf
  -rw-r--r-- 1 ncindex ncindex   18090 Dec 3 14:16 _fq.tvd
  -rw-r--r-- 1 ncindex ncindex     411 Dec 3 14:16 _fq.si
  -rw-r--r-- 1 ncindex ncindex   67683 Dec 3 14:16 _fq_Lucene41_0.tip
  -rw-r--r-- 1 ncindex ncindex 2431846 Dec 3 14:16 _fq_Lucene41_0.tim
  -rw-r--r-- 1 ncindex ncindex 2412246 Dec 3 14:16 _fq_Lucene41_0.pos
  -rw-r--r-- 1 ncindex ncindex 1834286 Dec 3 14:16 _fq_Lucene41_0.doc
  -rw-r--r-- 1 ncindex ncindex    1152 Dec 3 14:16 _fq.fdx
  -rw-r--r-- 1 ncindex ncindex 2518453 Dec 3 14:16 _fq.fdt

The segments alternate in size: every other segment is at least ten times as large as the ones next to it, and the large segment is written first. Both of the listings above are from small segments. Here's an example of a large segment:

  -rw-r--r-- 1 ncindex ncindex 11289877 Dec 3 14:21 _ft.fdt
  -rw-r--r-- 1 ncindex ncindex     7757 Dec 3 14:21 _ft.fdx
  -rw-r--r-- 1 ncindex ncindex     3114 Dec 3 14:21 _ft.fnm
  -rw-r--r-- 1 ncindex ncindex  8304619 Dec 3 14:21 _ft_Lucene41_0.doc
  -rw-r--r-- 1 ncindex ncindex  9054058 Dec 3 14:21 _ft_Lucene41_0.pos
  -rw-r--r-- 1 ncindex ncindex  9666900 Dec 3 14:21 _ft_Lucene41_0.tim
  -rw-r--r-- 1 ncindex ncindex   244322 Dec 3 14:21 _ft_Lucene41_0.tip
  -rw-r--r-- 1 ncindex ncindex      115 Dec 3 14:21 _ft_nrm.cfe
  -rw-r--r-- 1 ncindex ncindex   170365 Dec 3 14:21 _ft_nrm.cfs
  -rw-r--r-- 1 ncindex ncindex      411 Dec 3 14:21 _ft.si
  -rw-r--r-- 1 ncindex ncindex   113554 Dec 3 14:21 _ft.tvd
  -rw-r--r-- 1 ncindex ncindex 23374630 Dec 3 14:21 _ft.tvf
  -rw-r--r-- 1 ncindex ncindex   908209 Dec 3 14:21 _ft.tvx

Thanks,
Shawn
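To compare segment sizes without eyeballing listings like the ones in this message, the per-file sizes can be summed per segment. The following is a small sketch, assuming `ls -l` style lines with the size in the fifth column and the file name last; `segment_sizes` and the `SEGMENT_FILE` pattern are illustrative names, not part of Solr or Lucene.

```python
import re
from collections import defaultdict
from typing import Dict

# Captures the segment name (for example _fq) from Lucene file names such
# as _fq.fdt, _fq_Lucene41_0.tim, or _ft_nrm.cfs.
SEGMENT_FILE = re.compile(r"^(_[0-9a-z]+)[._]")


def segment_sizes(listing: str) -> Dict[str, int]:
    """Sum on-disk bytes per segment from `ls -l` style output."""
    totals: Dict[str, int] = defaultdict(int)
    for line in listing.splitlines():
        fields = line.split()
        # Expected columns: mode, links, owner, group, size, month, day,
        # time, name. Anything else is skipped.
        if len(fields) < 9 or not fields[4].isdigit():
            continue
        match = SEGMENT_FILE.match(fields[-1])
        if match:
            totals[match.group(1)] += int(fields[4])
    return dict(totals)


if __name__ == "__main__":
    sample = """\
-rw-r--r-- 1 ncindex ncindex     411 Dec 3 14:22 _fu.si
-rw-r--r-- 1 ncindex ncindex   55966 Dec 3 14:22 _fu_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex    7757 Dec 3 14:21 _ft.fdx
-rw-r--r-- 1 ncindex ncindex     115 Dec 3 14:21 _ft_nrm.cfe
"""
    for segment, size in sorted(segment_sizes(sample).items()):
        print(segment, size)
```

Note that the totals only reflect files that are visible on disk at the time of the listing, so a segment whose files have not all been written yet will look smaller than it eventually is.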
Re: Odd pattern of segment creation - branch_4x bug?
On 12/3/2012 2:39 PM, Shawn Heisey wrote:
> What's happening is that whenever it hits maxDocs, I get 2 segments, one of them significantly smaller than the other. Rarely, it creates 3 segments! I know it's not a ramBuffer problem, because initially the exact same thing was happening with maxDocs at 10 and a 32MB ramBuffer. I raised the ramBuffer and lowered the maxDocs. It takes significantly less than 5 minutes for maxDocs documents to get indexed, so the maxTime value should not be a factor.

See the previous full message for the details referenced below.

Looking at those listings again, I can see that the _fq segment wasn't complete either: the fnm, _nrm.cfe, and _nrm.cfs files were missing. It looks like both of the segments that get created by each autocommit are incomplete. The full-import I wrote about earlier is still going; here are the last two segments:

  -rw-r--r-- 1 ncindex ncindex  3505648 Dec 3 19:41 _nd.fdt
  -rw-r--r-- 1 ncindex ncindex     2017 Dec 3 19:41 _nd.fdx
  -rw-r--r-- 1 ncindex ncindex  2346930 Dec 3 19:41 _nd_Lucene41_0.doc
  -rw-r--r-- 1 ncindex ncindex  3066874 Dec 3 19:41 _nd_Lucene41_0.pos
  -rw-r--r-- 1 ncindex ncindex  3690545 Dec 3 19:41 _nd_Lucene41_0.tim
  -rw-r--r-- 1 ncindex ncindex    64304 Dec 3 19:41 _nd_Lucene41_0.tip
  -rw-r--r-- 1 ncindex ncindex      411 Dec 3 19:41 _nd.si
  -rw-r--r-- 1 ncindex ncindex    22002 Dec 3 19:41 _nd.tvd
  -rw-r--r-- 1 ncindex ncindex  6853272 Dec 3 19:41 _nd.tvf
  -rw-r--r-- 1 ncindex ncindex   175793 Dec 3 19:41 _nd.tvx
  -rw-r--r-- 1 ncindex ncindex  8814592 Dec 3 19:43 _ne_Lucene41_0.doc
  -rw-r--r-- 1 ncindex ncindex 11911168 Dec 3 19:43 _ne_Lucene41_0.pos
  -rw-r--r-- 1 ncindex ncindex  8830976 Dec 3 19:43 _ne_Lucene41_0.tim
  -rw-r--r-- 1 ncindex ncindex   204396 Dec 3 19:43 _ne_Lucene41_0.tip

As you can see, _nd is missing the fnm and nrm files and _ne is missing lots of files. When the next autocommit happened, _nf and _ng were created, and both of the segments listed above were completed.

I still need to do some additional testing, but I am pretty sure that when autocommit is turned off, all segments are very uniform in size and are created one at a time. I will also try with and without updateLog.

Thanks,
Shawn
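The "which files are missing" check done by hand in this message can be automated along these lines. This is only a sketch: it assumes that one segment known to be complete (here _ft, from the earlier message) can serve as the reference for which file suffixes to expect, which holds only while the codec and schema stay the same. `suffixes_by_segment` and `missing_suffixes` are hypothetical helper names.

```python
import re
from typing import Dict, Iterable, List, Set

# Splits a Lucene file name into segment name and suffix, for example
# "_nd_Lucene41_0.doc" -> ("_nd", "_Lucene41_0.doc").
SEGMENT_FILE = re.compile(r"^(_[0-9a-z]+)([._].+)$")


def suffixes_by_segment(names: Iterable[str]) -> Dict[str, Set[str]]:
    """Group the file suffixes present on disk by segment name."""
    found: Dict[str, Set[str]] = {}
    for name in names:
        match = SEGMENT_FILE.match(name)
        if match:
            found.setdefault(match.group(1), set()).add(match.group(2))
    return found


def missing_suffixes(names: Iterable[str],
                     reference_segment: str) -> Dict[str, List[str]]:
    """Report, per segment, the suffixes the reference has but it lacks."""
    by_segment = suffixes_by_segment(names)
    expected = by_segment[reference_segment]
    report: Dict[str, List[str]] = {}
    for segment, present in by_segment.items():
        absent = expected - present
        if absent:
            report[segment] = sorted(absent)
    return report


if __name__ == "__main__":
    complete = [".fdt", ".fdx", ".fnm", "_Lucene41_0.doc", "_Lucene41_0.pos",
                "_Lucene41_0.tim", "_Lucene41_0.tip", "_nrm.cfe", "_nrm.cfs",
                ".si", ".tvd", ".tvf", ".tvx"]
    names = ["_ft" + suffix for suffix in complete]
    names += ["_ne_Lucene41_0.doc", "_ne_Lucene41_0.pos",
              "_ne_Lucene41_0.tim", "_ne_Lucene41_0.tip"]
    for segment, absent in missing_suffixes(names, "_ft").items():
        print(segment, "is missing", ", ".join(absent))
```

In practice the `names` list would come from `os.listdir` on the index directory; a file that is absent from disk is not necessarily absent from the index, since it may still be held in memory.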
Re: Odd pattern of segment creation - branch_4x bug?
On Mon, Dec 3, 2012 at 9:59 PM, Shawn Heisey s...@elyograg.org wrote:
> As you can see, _nd is missing the fnm and nrm files and _ne is missing lots of files. When the next autocommit happened, _nf and _ng were created, and both of the segments listed above were completed.

Ah, I think you're just seeing side-effects from soft commits (AKA NRT-reopens) in conjunction with NRTCachingDirectory?

A soft commit simply flushes a segment to disk (but doesn't write a new segments file to reference those files, and hence can avoid fsyncing them), and then opens a new reader. NRTCachingDirectory is a write-back cache that caches small files in memory, and that's probably why you don't see all of the files on disk.

-Yonik
http://lucidworks.com
Re: Odd pattern of segment creation - branch_4x bug?
On 12/3/2012 8:44 PM, Yonik Seeley wrote:
> On Mon, Dec 3, 2012 at 9:59 PM, Shawn Heisey s...@elyograg.org wrote:
>> As you can see, _nd is missing the fnm and nrm files and _ne is missing lots of files. When the next autocommit happened, _nf and _ng were created, and both of the segments listed above were completed.
>
> Ah, I think you're just seeing side-effects from soft commits (AKA NRT-reopens) in conjunction with NRTCachingDirectory?
>
> A soft commit simply flushes a segment to disk (but doesn't write a new segments file to reference those files, and hence can avoid fsyncing them), and then opens a new reader. NRTCachingDirectory is a write-back cache that caches small files in memory, and that's probably why you don't see all of the files on disk.

My main concern in bringing this up wasn't the missing files. It was the fact that every autocommit generates two segments instead of one, which makes merges happen more frequently. Is that expected behavior?

It was my understanding that autocommit always did a hard commit. Is that still the case, and am I just seeing something *like* a soft commit courtesy of NRTCachingDirectory, or do I need a different config to ensure they are hard commits? Here's my full updateHandler config:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>65536</maxDocs>
      <maxTime>30</maxTime>
    </autoCommit>
    <updateLog />
  </updateHandler>

Thanks,
Shawn
Re: Odd pattern of segment creation - branch_4x bug?
On Mon, Dec 3, 2012 at 11:36 PM, Shawn Heisey s...@elyograg.org wrote:
> <updateHandler class="solr.DirectUpdateHandler2">
>   <autoCommit>
>     <maxDocs>65536</maxDocs>
>     <maxTime>30</maxTime>
>   </autoCommit>
>   <updateLog />
> </updateHandler>

Yeah, seems like that should just generate hard commits. Are you using delete-by-query at all? The current implementation can do an NRT reopen as part of its work.

-Yonik
http://lucidimagination.com