Re: Odd pattern of segment creation - branch_4x bug?

2012-12-04 Thread Shawn Heisey

On 12/3/2012 9:46 PM, Yonik Seeley wrote:

On Mon, Dec 3, 2012 at 11:36 PM, Shawn Heisey s...@elyograg.org wrote:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>65536</maxDocs>
    <maxTime>30</maxTime>
  </autoCommit>
  <updateLog />
</updateHandler>

Yeah, seems like that just should generate hard commits.
Are you using delete-by-queries at all?  The current implementation can
do an NRT reopen as part of its implementation.


When I do deletes, it's done exclusively using deleteByQuery in SolrJ, 
with something like did:(1000 OR 1001 OR 1002 OR 1003) as the query.  
Before doing the delete, I do a search using the same query and only 
proceed to do the delete on that shard if numFound from the query is 
more than zero.
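
The check-then-delete pattern described above can be sketched as follows. This is only an illustration in Python, not the SolrJ code actually in use: `build_did_query`, `delete_if_present`, and the `FakeShard` stand-in (with its `count` and `delete_by_query` methods) are hypothetical names invented for the example.

```python
def build_did_query(ids):
    """Build a query such as did:(1000 OR 1001) from a list of ids."""
    if not ids:
        raise ValueError("at least one id is required")
    return "did:(" + " OR ".join(str(i) for i in ids) + ")"


def delete_if_present(shard, ids):
    """Issue the delete only when the same query matches something."""
    query = build_did_query(ids)
    if shard.count(query) > 0:
        shard.delete_by_query(query)
        return True
    return False


class FakeShard:
    """Hypothetical in-memory stand-in for a shard client (not SolrJ)."""

    def __init__(self, dids):
        self.dids = set(dids)
        self.deletes = []  # queries that were actually sent as deletes

    def count(self, query):
        # Parse the ids back out of did:(a OR b OR c) and count matches.
        wanted = {int(t) for t in query[len("did:("):-1].split(" OR ")}
        return len(self.dids & wanted)

    def delete_by_query(self, query):
        self.deletes.append(query)


if __name__ == "__main__":
    shard = FakeShard([1001, 2000])
    print(delete_if_present(shard, [1000, 1001, 1002, 1003]))
    print(shard.deletes)
```

The point of the guard is that a shard with no matching documents never receives a delete-by-query request at all.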


I don't know if this is necessary information, but no deletes are 
happening while I do my full-import.  Nothing else happens to the index 
at all while DIH is doing its thing.


This is new experimentation with autocommit.  By using autocommit, the 
commit at the end of an import is much faster. When autocommit is turned 
off, it is several minutes between the last document add and the handler 
returning to idle status.  I like this new faster commit, but if I 
can't get an autocommit to only produce one segment, I may go back to 
the old way, to reduce the I/O impact from the more frequent merges.


Thanks,
Shawn


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Odd pattern of segment creation - branch_4x bug?

2012-12-03 Thread Shawn Heisey

I'm using Solr compiled from a branch_4x checkout.

solr-impl 4.1-SNAPSHOT 1416639M - ncindex - 2012-12-03 12:54:38

I've noticed something really odd happening during DIH full-import of 
millions of documents, and I'm wondering if it's a bug.  Config bits that 
I think may be relevant are below.  If you'd like more information, 
please let me know what you'd like and whether I need to turn on 
settings like infoStream and do another import:


Autocommit is set to maxDocs 65536 docs and maxTime 30.
ramBufferSizeMB is 100.
updateLog is enabled, no options.

What's happening is that whenever it hits maxDocs, I get 2 segment 
files, one of them significantly smaller than the other.  Rarely, it 
creates 3 segments!  I know it's not a ramBuffer problem, because 
initially the exact same thing was happening with maxDocs at 10 and 
a 32MB ramBuffer.  I raised the ramBuffer and lowered the maxDocs.  It 
takes significantly less than 5 minutes for maxDocs documents to get 
indexed, so the maxTime value should not be a factor.


Sometimes the last segment is incomplete until the next autocommit, 
consisting only of files like the following.  On the next autocommit, 
the incomplete segment is completed.


-rw-r--r-- 1 ncindex ncindex   411 Dec  3 14:22 _fu.si
-rw-r--r-- 1 ncindex ncindex 55966 Dec  3 14:22 _fu_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex   1983125 Dec  3 14:22 _fu_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex   1720492 Dec  3 14:22 _fu_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex   1384931 Dec  3 14:22 _fu_Lucene41_0.doc

Sometimes the last segment does get written completely before the next 
autocommit.  I have no idea what makes things happen differently sometimes:


-rw-r--r-- 1 ncindex ncindex  144497 Dec  3 14:16 _fq.tvx
-rw-r--r-- 1 ncindex ncindex 6106209 Dec  3 14:16 _fq.tvf
-rw-r--r-- 1 ncindex ncindex   18090 Dec  3 14:16 _fq.tvd
-rw-r--r-- 1 ncindex ncindex     411 Dec  3 14:16 _fq.si
-rw-r--r-- 1 ncindex ncindex   67683 Dec  3 14:16 _fq_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex 2431846 Dec  3 14:16 _fq_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex 2412246 Dec  3 14:16 _fq_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex 1834286 Dec  3 14:16 _fq_Lucene41_0.doc
-rw-r--r-- 1 ncindex ncindex    1152 Dec  3 14:16 _fq.fdx
-rw-r--r-- 1 ncindex ncindex 2518453 Dec  3 14:16 _fq.fdt

Every other segment is at least ten times as large as the others. It 
writes the large segment first.  Here's an example of a large segment.  
Both of the segment listings above are from small segments:


-rw-r--r-- 1 ncindex ncindex 11289877 Dec  3 14:21 _ft.fdt
-rw-r--r-- 1 ncindex ncindex 7757 Dec  3 14:21 _ft.fdx
-rw-r--r-- 1 ncindex ncindex 3114 Dec  3 14:21 _ft.fnm
-rw-r--r-- 1 ncindex ncindex  8304619 Dec  3 14:21 _ft_Lucene41_0.doc
-rw-r--r-- 1 ncindex ncindex  9054058 Dec  3 14:21 _ft_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex  9666900 Dec  3 14:21 _ft_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex   244322 Dec  3 14:21 _ft_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex  115 Dec  3 14:21 _ft_nrm.cfe
-rw-r--r-- 1 ncindex ncindex   170365 Dec  3 14:21 _ft_nrm.cfs
-rw-r--r-- 1 ncindex ncindex  411 Dec  3 14:21 _ft.si
-rw-r--r-- 1 ncindex ncindex   113554 Dec  3 14:21 _ft.tvd
-rw-r--r-- 1 ncindex ncindex 23374630 Dec  3 14:21 _ft.tvf
-rw-r--r-- 1 ncindex ncindex   908209 Dec  3 14:21 _ft.tvx
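
To compare segment sizes without adding them up by hand, a small script can total the `ls -l` sizes per segment. The sketch below is an illustration only: `segment_sizes` is a hypothetical helper, and it assumes the usual nine-column `ls -l` layout and that every file belonging to a segment starts with the segment name followed by `.` or `_`.

```python
import re
from collections import defaultdict

# Captures the segment prefix, e.g. "_ft" in "_ft.fdt" or "_ft_Lucene41_0.doc".
SEGMENT_RE = re.compile(r"^(_[0-9a-z]+)[._]")


def segment_sizes(ls_output):
    """Sum file sizes per segment from `ls -l` style lines."""
    totals = defaultdict(int)
    for line in ls_output.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue  # not a regular nine-column `ls -l` entry
        size, name = fields[4], fields[-1]
        match = SEGMENT_RE.match(name)
        if match and size.isdigit():
            totals[match.group(1)] += int(size)
    return dict(totals)


# A few lines taken from the listings in this message.
listing = """\
-rw-r--r-- 1 ncindex ncindex 411 Dec  3 14:22 _fu.si
-rw-r--r-- 1 ncindex ncindex 55966 Dec  3 14:22 _fu_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex 11289877 Dec  3 14:21 _ft.fdt
-rw-r--r-- 1 ncindex ncindex 7757 Dec  3 14:21 _ft.fdx
"""

if __name__ == "__main__":
    for segment, total in sorted(segment_sizes(listing).items()):
        print(segment, total)
```

Feeding it the full output of `ls -l` on the index directory would give one total per segment, which makes the alternating large and small segments easy to spot.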

Thanks,
Shawn





Re: Odd pattern of segment creation - branch_4x bug?

2012-12-03 Thread Shawn Heisey

On 12/3/2012 2:39 PM, Shawn Heisey wrote:
What's happening is that whenever it hits maxDocs, I get 2 segment 
files, one of them significantly smaller than the other. Rarely, it 
creates 3 segments!  I know it's not a ramBuffer problem, because 
initially the exact same thing was happening with maxDocs at 10 
and a 32MB ramBuffer.  I raised the ramBuffer and lowered the 
maxDocs.  It takes significantly less than 5 minutes for maxDocs 
documents to get indexed, so the maxTime value should not be a factor.


See the previous full message for details referenced below.

Looking at those listings again, I can see that the _fq segment wasn't 
complete either - the fnm, _nrm.cfe, and _nrm.cfs files were missing.  
It looks like both of the segments that get created by each autocommit 
are incomplete.  The full-import I wrote about earlier is still going, 
here are the last two segments:


-rw-r--r-- 1 ncindex ncindex  3505648 Dec  3 19:41 _nd.fdt
-rw-r--r-- 1 ncindex ncindex     2017 Dec  3 19:41 _nd.fdx
-rw-r--r-- 1 ncindex ncindex  2346930 Dec  3 19:41 _nd_Lucene41_0.doc
-rw-r--r-- 1 ncindex ncindex  3066874 Dec  3 19:41 _nd_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex  3690545 Dec  3 19:41 _nd_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex    64304 Dec  3 19:41 _nd_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex      411 Dec  3 19:41 _nd.si
-rw-r--r-- 1 ncindex ncindex    22002 Dec  3 19:41 _nd.tvd
-rw-r--r-- 1 ncindex ncindex  6853272 Dec  3 19:41 _nd.tvf
-rw-r--r-- 1 ncindex ncindex   175793 Dec  3 19:41 _nd.tvx
-rw-r--r-- 1 ncindex ncindex  8814592 Dec  3 19:43 _ne_Lucene41_0.doc
-rw-r--r-- 1 ncindex ncindex 11911168 Dec  3 19:43 _ne_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex  8830976 Dec  3 19:43 _ne_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex   204396 Dec  3 19:43 _ne_Lucene41_0.tip

As you can see, _nd is missing the fnm and nrm files and _ne is missing 
lots of files.  When the next autocommit happened, _nf and _ng were 
created, and both of the segments listed above were completed.


I still need to do some additional testing, but I am pretty sure that 
when autocommit is turned off, all segments are very uniform in size and 
only get created one at a time.  I will also try with and without updateLog.


Thanks,
Shawn





Re: Odd pattern of segment creation - branch_4x bug?

2012-12-03 Thread Yonik Seeley
On Mon, Dec 3, 2012 at 9:59 PM, Shawn Heisey s...@elyograg.org wrote:
 As you can see, _nd is missing the fnm and nrm files and _ne is missing lots
 of files.  When the next autocommit happened, _nf and _ng were created, and
 both of the segments listed above were completed.

Ah, I think you're just seeing side-effects from soft commits (AKA
NRT-reopens) in conjunction with NRTCachingDirectory?
A soft commit simply flushes a segment to disk (but doesn't write a
new segments file to reference those files and hence can avoid
fsyncing those files), and then opens a new reader.  The
NRTCachingDirectory is a write-back cache that caches small files in
memory (and that's prob why you don't see all of the files on disk).

-Yonik
http://lucidworks.com
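
The write-back behaviour described in the message above can be pictured with a toy model. This is not Lucene's NRTCachingDirectory and does not reproduce its real caching policy: the class name, the per-file size threshold, and the `sync` method are all assumptions made purely to show why a directory listing can lack files that the index already knows about.

```python
class ToyWriteBackDirectory:
    """Toy model: small files stay in RAM until sync() is called."""

    def __init__(self, max_cached_bytes):
        self.max_cached_bytes = max_cached_bytes  # hypothetical threshold
        self.in_memory = {}  # name -> size, not visible to `ls`
        self.on_disk = {}    # name -> size, visible to `ls`

    def write(self, name, size):
        # Small files are cached; large ones go straight to disk.
        if size <= self.max_cached_bytes:
            self.in_memory[name] = size
        else:
            self.on_disk[name] = size

    def sync(self):
        # Stands in for a hard commit: everything cached is written out.
        self.on_disk.update(self.in_memory)
        self.in_memory.clear()

    def list_disk(self):
        return sorted(self.on_disk)


if __name__ == "__main__":
    directory = ToyWriteBackDirectory(max_cached_bytes=4096)
    directory.write("_x.fnm", 3114)     # hypothetical small file
    directory.write("_x.fdt", 3505648)  # hypothetical large file
    print(directory.list_disk())        # before the commit
    directory.sync()
    print(directory.list_disk())        # after the commit
```

Before `sync()` only the large file shows up in the listing; afterwards both do, which matches the observation that the incomplete segments are completed at the next commit.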




Re: Odd pattern of segment creation - branch_4x bug?

2012-12-03 Thread Shawn Heisey

On 12/3/2012 8:44 PM, Yonik Seeley wrote:

On Mon, Dec 3, 2012 at 9:59 PM, Shawn Heisey s...@elyograg.org wrote:

As you can see, _nd is missing the fnm and nrm files and _ne is missing lots
of files.  When the next autocommit happened, _nf and _ng were created, and
both of the segments listed above were completed.

Ah, I think you're just seeing side-effects from soft commits (AKA
NRT-reopens) in conjunction with NRTCachingDirectory?
A soft commit simply flushes a segment to disk (but doesn't write a
new segments file to reference those files and hence can avoid
fsyncing those files), and then opens a new reader.  The
NRTCachingDirectory is a write-back cache that caches small files in
memory (and that's prob why you don't see all of the files on disk).


My main concern in bringing this up wasn't the missing files, it was the 
fact that every autocommit generates two segments instead of one, which 
makes merges happen more frequently.  Is that expected behavior?


It was my understanding that autocommit always did a hard commit.  Is 
that still the case, and I am just seeing something *like* a soft commit 
courtesy of NRTCachingDirectory, or do I need a different config to 
ensure they are hard commits?  Here's my full updateHandler config:


<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>65536</maxDocs>
    <maxTime>30</maxTime>
  </autoCommit>
  <updateLog />
</updateHandler>

Thanks,
Shawn





Re: Odd pattern of segment creation - branch_4x bug?

2012-12-03 Thread Yonik Seeley
On Mon, Dec 3, 2012 at 11:36 PM, Shawn Heisey s...@elyograg.org wrote:
 <updateHandler class="solr.DirectUpdateHandler2">
   <autoCommit>
     <maxDocs>65536</maxDocs>
     <maxTime>30</maxTime>
   </autoCommit>
   <updateLog />
 </updateHandler>

Yeah, seems like that just should generate hard commits.
Are you using delete-by-queries at all?  The current implementation can
do an NRT reopen as part of its implementation.

-Yonik
http://lucidimagination.com
