Sure,
There is only one stack trace (that seems to be how the output for this tool
works) for java.lang.String.intern:
TRACE 300165:
java.lang.String.intern(<Unknown Source>:Unknown line)
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:74)
org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
org.apache.lucene.index.FieldInfos.read(FieldInfos.java:353)
org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
com.smartsheet.Main.main(<Unknown Source>:Unknown line)
But there are several for org.apache.lucene.util.SimpleStringInterner.intern
and org.apache.lucene.util.StringHelper.intern:
TRACE 300252:
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:61)
org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
org.apache.lucene.index.FieldInfos.read(FieldInfos.java:353)
org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
com.smartsheet.Main.main(<Unknown Source>:Unknown line)
TRACE 300339:
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:75)
org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
org.apache.lucene.index.FieldInfos.read(FieldInfos.java:353)
org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
com.smartsheet.Main.main(<Unknown Source>:Unknown line)
TRACE 300262:
org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
org.apache.lucene.index.FieldInfos.addInternal(FieldInfos.java:249)
org.apache.lucene.index.FieldInfos.read(FieldInfos.java:363)
org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
com.smartsheet.Main.main(<Unknown Source>:Unknown line)
TRACE 300344:
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:78)
org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
org.apache.lucene.index.FieldInfos.read(FieldInfos.java:353)
org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
com.smartsheet.Main.main(<Unknown Source>:Unknown line)
TRACE 300206:
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:54)
org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
org.apache.lucene.index.FieldInfos.read(FieldInfos.java:353)
org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
com.smartsheet.Main.main(<Unknown Source>:Unknown line)
TRACE 300211:
org.apache.lucene.util.SimpleStringInterner.intern(SimpleStringInterner.java:51)
org.apache.lucene.util.StringHelper.intern(StringHelper.java:34)
org.apache.lucene.index.FieldInfos.read(FieldInfos.java:353)
org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:116)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:576)
org.apache.lucene.index.SegmentReader.get(SegmentReader.java:554)
org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:105)
org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:75)
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
org.apache.lucene.index.IndexReader.open(IndexReader.java:188)
com.smartsheet.Main.main(<Unknown Source>:Unknown line)
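
For anyone reading along: in the first trace, SimpleStringInterner.intern calls java.lang.String.intern, which is what you would expect from a small cache sitting in front of String.intern. Below is a rough sketch of that idea, not Lucene's actual source. The class name, the one-slot-per-bucket design, and the table size are all assumptions made for illustration. It shows why an index where nearly every field name is unique would fall through to String.intern on almost every call.

```java
/** Hypothetical illustration only: a fixed-size cache in front of String.intern(). */
public class CachedInternerSketch {
    private final String[] cache; // one slot per hash bucket (assumed design)
    private final int mask;       // table size minus one, used to pick a slot
    private long misses = 0;      // number of calls that fell through to String.intern()

    public CachedInternerSketch(int sizePowerOfTwo) {
        if (Integer.bitCount(sizePowerOfTwo) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        cache = new String[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    public String intern(String s) {
        int slot = s.hashCode() & mask;
        String cached = cache[slot];
        if (cached != null && cached.equals(s)) {
            return cached; // cache hit: String.intern() is not called
        }
        misses++;
        String canonical = s.intern(); // cache miss: pay for the JVM-wide intern table
        cache[slot] = canonical;       // overwrite whatever was in the slot
        return canonical;
    }

    public long misses() {
        return misses;
    }

    public static void main(String[] args) {
        CachedInternerSketch interner = new CachedInternerSketch(1024);
        int uniqueNames = 296713; // unique field count reported by CheckIndex in this thread
        for (int i = 0; i < uniqueNames; i++) {
            interner.intern("field_" + i); // made-up field names
        }
        // Every name is distinct and seen once, so every call is a miss.
        System.out.println("String.intern() calls: " + interner.misses());
    }
}
```

A cache like this only helps when the same names come back repeatedly. Reading a segment's field list once, with hundreds of thousands of distinct names, gives it nothing to hit.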
-Mark
On Nov 17, 2010, at 1:51 PM, Michael McCandless wrote:
> Lucene interns field names... since you have a truly enormous number
> of unique fields it's expected intern will be called a lot.
>
> But that said it's odd that it's this costly.
>
> Can you post the stack traces that call intern?
>
> Mike
>
> On Fri, Nov 5, 2010 at 1:53 PM, Michael McCandless
> <[email protected]> wrote:
>> Hmm...
>>
>> So, I was going on this output from your CheckIndex:
>>
>> test: field norms.........OK [296713 fields]
>>
>> But in fact I just looked and that number is bogus -- it's always
>> equal to total number of fields, not number of fields with norms
>> enabled. I'll open an issue to fix this, but in the meantime can you
>> apply this patch to your CheckIndex and run it again?
>>
>> Index: src/java/org/apache/lucene/index/CheckIndex.java
>> ===================================================================
>> --- src/java/org/apache/lucene/index/CheckIndex.java (revision 1031678)
>> +++ src/java/org/apache/lucene/index/CheckIndex.java (working copy)
>> @@ -570,8 +570,10 @@
>>          }
>>          final byte[] b = new byte[reader.maxDoc()];
>>          for (final String fieldName : fieldNames) {
>> -          reader.norms(fieldName, b, 0);
>> -          ++status.totFields;
>> +          if (reader.hasNorms(fieldName)) {
>> +            reader.norms(fieldName, b, 0);
>> +            ++status.totFields;
>> +          }
>>          }
>>
>>          msg("OK [" + status.totFields + " fields]");
>>
>> So if in fact you have already disabled norms then something else is
>> the source of the sudden slowness. Though, such a huge number of
>> unique field names is not an area of Lucene that's very well tested...
>> perhaps there's something silly somewhere. Maybe you can try
>> profiling just the init of your IndexReader? (Eg, run java with
>> -agentlib:hprof=cpu=samples,depth=16,interval=1).
>>
>> Yes, both Index.NOT_ANALYZED_NO_NORMS and Index.NO will disable norms
>> as long as no document in the index ever had norms on (yes it does
>> "infect" heh).
>>
>> Mike
>>
>> On Fri, Nov 5, 2010 at 1:37 PM, Mark Kristensson
>> <[email protected]> wrote:
>>> While most of our Lucene indexes are used for more traditional searching,
>>> this index in particular is used more like a reporting repository. Thus, we
>>> really do need to have that many fields indexed and they do need to be
>>> broken out into separate fields. There may be another way to structure the
>>> index to reduce the number of fields, but I'm hoping we can optimize the
>>> current design and avoid (yet another) index redesign.
>>>
>>> I'll look into tweaking the merge policy, but I'm more interested in
>>> disabling norms because scoring really doesn't matter for us. Basically, we
>>> need nothing more than a binary answer from Lucene: either a record meets
>>> the provided criteria (which can be a rather complex boolean query with
>>> many subqueries) or it doesn't. If the record does match, then we get the
>>> IDs from lucene and run off to get the live data from our primary data
>>> store and sort it (in Java) based upon criteria provided by the user, not
>>> by score.
>>>
>>> After our initial design mushroomed in size, we redesigned and now (I
>>> thought) do not have norms on any of the fields in this index. So, I'm
>>> wondering if there was something in the results from the CheckIndex that I
>>> provided which indicates to you that we may have norms still enabled? I
>>> know that if you have norms on any one document's field, then any other
>>> document with that same field will get "infected" with norms as well.
>>>
>>> My understanding is that any field that uses the constants
>>> Index.NOT_ANALYZED_NO_NORMS or Index.NO will not have norms on it,
>>> regardless of whether or not the field is stored. Is that not correct?
>>>
>>> Thanks,
>>> Mark
>>>
>>>
>>>
>>> On Nov 4, 2010, at 2:56 AM, Michael McCandless wrote:
>>>
>>>> Likely what happened is you had a bunch of smaller segments, and then
>>>> suddenly they got merged into that one big segment (_aiaz) in your
>>>> index.
>>>>
>>>> The representation for norms in particular is not sparse, so this
>>>> means the size of the norms file for a given segment will be
>>>> number-of-unique-indexed-fields X number-of-documents.
>>>>
>>>> So this count grows quadratically on merge.
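
To put numbers on the formula above, here is a small arithmetic sketch. It assumes one norm byte per document per field, which matches the byte[reader.maxDoc()] buffer per field in the CheckIndex patch elsewhere in this thread. The document counts and the "disjoint field names" merge scenario are hypothetical; only the 296,713 field count comes from the CheckIndex output.

```java
/** Hypothetical arithmetic sketch of non-sparse norms storage; not Lucene code. */
public class NormsSizeSketch {

    /** Assumes one norm byte per document for every field that has norms. */
    public static long normsBytes(long fieldsWithNorms, long docs) {
        return fieldsWithNorms * docs;
    }

    public static void main(String[] args) {
        long fields = 296713L; // unique fields reported by CheckIndex in this thread
        long docs = 100000L;   // hypothetical document count for one segment
        System.out.println("norms bytes for one segment: " + normsBytes(fields, docs));

        // Two segments with disjoint field names, F fields and D docs each.
        long f = 1000L;
        long d = 50000L;
        long beforeMerge = 2 * normsBytes(f, d);     // 2 * F * D
        long afterMerge = normsBytes(2 * f, 2 * d);  // (2F) * (2D) = 4 * F * D
        System.out.println("before merge: " + beforeMerge);
        System.out.println("after merge:  " + afterMerge);
    }
}
```

In this scenario, merging doubles the total rather than keeping it constant, because every document in the merged segment now carries a byte for every field from both sources.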
>>>>
>>>> Do these fields really need to be indexed? If so, it'd be better to
>>>> use a single field for all users for the indexable text if you can.
>>>>
>>>> Failing that, a simple workaround is to set the maxMergeMB/Docs on the
>>>> merge policy; this'd prevent big segments from being produced.
>>>> Disabling norms should also work around this, though that will affect
>>>> hit scores...
>>>>
>>>> Mike
>>>>
>>>> On Wed, Nov 3, 2010 at 7:37 PM, Mark Kristensson
>>>> <[email protected]> wrote:
>>>>> Yes, we do have a large number of unique field names in that index,
>>>>> because they are driven by user named fields in our application (with
>>>>> some cleaning to remove illegal chars).
>>>>>
>>>>> This slowness problem has appeared very suddenly in the last couple of
>>>>> weeks and the number of unique field names has not spiked in the last few
>>>>> weeks. Have we crept over some threshold with our linear growth in the
>>>>> number of unique field names? Perhaps there is a limit driven by the
>>>>> amount of RAM in the machine that we are violating? Are there any
>>>>> guidelines for the maximum number, or suggested number, of unique field
>>>>> names in an index or segment? Any suggestions for potentially mitigating
>>>>> the problem?
>>>>>
>>>>> Thanks,
>>>>> Mark
>>>>>
>>>>>
>>>>> On Nov 3, 2010, at 2:02 PM, Michael McCandless wrote:
>>>>>
>>>>>> On Wed, Nov 3, 2010 at 4:27 PM, Mark Kristensson
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> I've run CheckIndex against the index and the results are below. The
>>>>>>> net is that it's telling me nothing is wrong with the index.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>> I did not have any instrumentation around the opening of the
>>>>>>> IndexSearcher (we don't use an IndexReader), just around the actual
>>>>>>> query execution, so I had to add some additional logging. What I found
>>>>>>> surprised me: opening a searcher against this index takes the same 6 to 8
>>>>>>> seconds that closing the IndexWriter takes.
>>>>>>
>>>>>> IndexWriter opens a SegmentReader for each segment in the index, to
>>>>>> apply deletions, so I think this is the source of the slowness.
>>>>>>
>>>>>> From the CheckIndex output, it looks like you have many (296,713)
>>>>>> unique field names on that one large segment -- does that sound
>>>>>> right? I suspect such a very high field count is the source of the
>>>>>> slowness...
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>