Re: IndexWriter.close() performance issue

Mark Kristensson Wed, 17 Nov 2010 13:58:29 -0800

Sure,

There is only one stack trace (that seems to be how the output for this tool 
works) for java.lang.String.intern:


TRACE 300165:
        java.lang.String.intern(<Unknown Source>:Unknown line)
        
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:74)
        org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
        org.apache.lucene.index.FieldInfos.read(FieldInfos.java:353)
        org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
        
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
        org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
        
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
        
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
        
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
        org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
        com.smartsheet.Main.main(<Unknown Source>:Unknown line)

But there are several for org.apache.lucene.util.SimpleStringInterner.intern 
and org.apache.lucene.util.StringHelper.intern:


TRACE 300252:
        
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:61)
        org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
        org.apache.lucene.index.FieldInfos.read(FieldInfos.java:353)
        org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
        
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
        org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
        
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
        
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
        
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
        org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
        com.smartsheet.Main.main(<Unknown Source>:Unknown line)
TRACE 300339:
        
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:75)
        org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
        org.apache.lucene.index.FieldInfos.read(FieldInfos.java:353)
        org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
        
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
        org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
        
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
        
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
        
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
        org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
        com.smartsheet.Main.main(<Unknown Source>:Unknown line)
TRACE 300262:
        org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
        org.apache.lucene.index.FieldInfos.addInternal(FieldInfos.java:249)
        org.apache.lucene.index.FieldInfos.read(FieldInfos.java:363)
        org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
        
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
        org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
        
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
        
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
        
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
        org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
        com.smartsheet.Main.main(<Unknown Source>:Unknown line)
TRACE 300344:
        
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:78)
        org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
        org.apache.lucene.index.FieldInfos.read(FieldInfos.java:353)
        org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
        
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
        org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
        
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
        
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
        
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
        org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
        com.smartsheet.Main.main(<Unknown Source>:Unknown line)
TRACE 300206:
        
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:54)
        org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
        org.apache.lucene.index.FieldInfos.read(FieldInfos.java:353)
        org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
        
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
        org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
        
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
        
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
        
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
        org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
        com.smartsheet.Main.main(<Unknown Source>:Unknown line)
TRACE 300211:
        
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:51)
        org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
        org.apache.lucene.index.FieldInfos.read(FieldInfos.java:353)
        org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
        
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
        org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
        org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
        
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
        
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
        
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
        org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
        org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
        com.smartsheet.Main.main(<Unknown Source>:Unknown line)

-Mark



On Nov 17, 2010, at 1:51 PM, Michael McCandless wrote:

> Lucene interns field names... since you have a truly enormous number
> of unique fields it's expected intern will be called alot.
> 
> But that said it's odd that it's this costly.
> 
> Can you post the stack traces that call intern?
> 
> Mike
> 
> On Fri, Nov 5, 2010 at 1:53 PM, Michael McCandless
> <[email protected]> wrote:
>> Hmm...
>> 
>> So, I was going on this output from your CheckIndex:
>> 
>>   test: field norms.........OK [296713 fields]
>> 
>> But in fact I just looked and that number is bogus -- it's always
>> equal to total number of fields, not number of fields with norms
>> enabled.  I'll open an issue to fix this, but in the meantime can you
>> apply this patch to your CheckIndex and run it again?
>> 
>> Index: src/java/org/apache/lucene/index/CheckIndex.java
>> ===================================================================
>> --- src/java/org/apache/lucene/index/CheckIndex.java    (revision 1031678)
>> +++ src/java/org/apache/lucene/index/CheckIndex.java    (working copy)
>> @@ -570,8 +570,10 @@
>>       }
>>       final byte[] b = new byte[reader.maxDoc()];
>>       for (final String fieldName : fieldNames) {
>> -        reader.norms(fieldName, b, 0);
>> -        ++status.totFields;
>> +        if (reader.hasNorms(fieldName)) {
>> +          reader.norms(fieldName, b, 0);
>> +          ++status.totFields;
>> +        }
>>       }
>> 
>>       msg("OK [" + status.totFields + " fields]");
>> 
>> So if in fact you have already disabled norms then something else is
>> the source of the sudden slowness.  Though, such a huge number of
>> unique field names is not an area of Lucene that's very well tested...
>> perhaps there's something silly somewhere.  Maybe you can try
>> profiling just the init of your IndexReader?  (Eg, run java with
>> -agentlib:hprof=cpu=samples,depth=16,interval=1).
>> 
>> Yes, both Index.NOT_ANALYZED_NO_NORMS and Index.NO will disable norms
>> as long as no document in the index ever had norms on (yes it does
>> "infect" heh).
>> 
>> Mike
>> 
>> On Fri, Nov 5, 2010 at 1:37 PM, Mark Kristensson
>> <[email protected]> wrote:
>>> While most of our Lucene indexes are used for more traditional searching, 
>>> this index in particular is used more like a reporting repository. Thus, we 
>>> really do need to have that many fields indexed and they do need to be 
>>> broken out into separate fields. There may be another way to structure the 
>>> index to reduce the number of fields, but I'm hoping we can optimize the 
>>> current design and avoid (yet another) index redesign.
>>> 
>>> I'll look into the tweaking the merge policy, but I'm more interested in 
>>> disabling norms because scoring really doesn't matter for us. Basically, we 
>>> need nothing more than a binary answer from Lucene: either a record meets 
>>> the provided criteria (which can be a rather complex boolean query with 
>>> many subqueries) or it doesn't. If the record does match, then we get the 
>>> IDs from lucene and run off to get the live data from our primary data 
>>> store and sort it (in Java) based upon criteria provided by the user, not 
>>> by score.
>>> 
>>> After our initial design mushroomed in size, we redesigned and now (I 
>>> thought) do not have norms on any of the fields in this index. So, I'm 
>>> wondering if there was something in the results from the CheckIndex that I 
>>> provided which indicates to you that we may have norms still enabled? I 
>>> know that if you have norms on any one document's field, then any other 
>>> document with that same field will get "infected" with norms as well.
>>> 
>>> My understanding is that any field that uses the constants  
>>> Index.NOT_ANALYZED_NO_NORMS or  Index.NO will not  have norms on it, 
>>> regardless of whether or not the field is stored. Is that not correct?
>>> 
>>> Thanks,
>>> Mark
>>> 
>>> 
>>> 
>>> On Nov 4, 2010, at 2:56 AM, Michael McCandless wrote:
>>> 
>>>> Likely what happened is you had a bunch of smaller segments, and then
>>>> suddenly they got merged into that one big segment (_aiaz) in your
>>>> index.
>>>> 
>>>> The representation for norms in particular is not sparse, so this
>>>> means the size of the norms file for a given segment will be
>>>> number-of-unique-indexed-fields X number-of-documents.
>>>> 
>>>> So this count grows quadratically on merge.
>>>> 
>>>> Do these fields really need to be indexed?   If so, it'd be better to
>>>> use a single field for all users for the indexable text if you can.
>>>> 
>>>> Failing that, a simple workaround is to set the maxMergeMB/Docs on the
>>>> merge policy; this'd prevent big segments from being produced.
>>>> Disabling norms should also workaround this, though that will affect
>>>> hit scores...
>>>> 
>>>> Mike
>>>> 
>>>> On Wed, Nov 3, 2010 at 7:37 PM, Mark Kristensson
>>>> <[email protected]> wrote:
>>>>> Yes, we do have a large number of unique field names in that index, 
>>>>> because they are driven by user named fields in our application (with 
>>>>> some cleaning to remove illegal chars).
>>>>> 
>>>>> This slowness problem has appeared very suddenly in the last couple of 
>>>>> weeks and the number of unique field names has not spiked in the last few 
>>>>> weeks. Have we crept over some threshold with our linear growth in the 
>>>>> number of unique field names? Perhaps there is a limit driven by the 
>>>>> amount of RAM in the machine that we are violating? Are there any 
>>>>> guidelines for the maximum number, or suggested number, of unique fields 
>>>>> names in an index or segment? Any suggestions for potentially mitigating 
>>>>> the problem?
>>>>> 
>>>>> Thanks,
>>>>> Mark
>>>>> 
>>>>> 
>>>>> On Nov 3, 2010, at 2:02 PM, Michael McCandless wrote:
>>>>> 
>>>>>> On Wed, Nov 3, 2010 at 4:27 PM, Mark Kristensson
>>>>>> <[email protected]> wrote:
>>>>>>> 
>>>>>>> I've run checkIndex against the index and the results are below. That 
>>>>>>> net is that it's telling me nothing is wrong with the index.
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>>> I did not have any instrumentation around the opening of the 
>>>>>>> IndexSearcher (we don't use an IndexReader), just around the actual 
>>>>>>> query execution so I had to add some additional logging. What I found 
>>>>>>> surprised me, opening a search against this index takes the same 6 to 8 
>>>>>>> seconds that closing the indexWriter takes.
>>>>>> 
>>>>>> IndexWriter opens a SegmentReader for each segment in the index, to
>>>>>> apply deletions, so I think this is the source of the slowness.
>>>>>> 
>>>>>> From the CheckIndex output, it looks like you have many (296,713)
>>>>>> unique fields names on that one large segment -- does that sound
>>>>>> right?  I suspect such a very high field count is the source of the
>>>>>> slowness...
>>>>>> 
>>>>>> Mike
>>>>>> 
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>> 
>>>>> 
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>> 
>>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>> 
>>> 
>>> 
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: IndexWriter.close() performance issue

Reply via email to