Hi all,

I'm still struggling with indexing performance. I've now moved the indexer
to a different machine, which is faster and less busy.

The new machine is a 64-bit RedHat box with 8 GB RAM, running JDK 1.6 and
Tomcat 6.0.18 with these settings (among others):
-server -Xms1G -Xmx7G

Currently, I've set mergeFactor to 1000 and ramBufferSizeMB to 256.
It has processed roughly 70k documents in half an hour so far, which means
at least 1.5 hours for 200k - just as fast/slow as before, on the weaker machine.
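
(For reference, these are the two elements I changed, in both the
<indexDefaults> and <mainIndex> sections of solrconfig.xml - just a sketch
of the relevant part, the surrounding elements are unchanged:)

  <indexDefaults>
    <!-- how many segments of roughly equal size may pile up before they are merged -->
    <mergeFactor>1000</mergeFactor>
    <!-- how much RAM Lucene may buffer before flushing a new segment to disk -->
    <ramBufferSizeMB>256</ramBufferSizeMB>
  </indexDefaults>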

The machine is not swapping. It is only using 13% of the memory.
iostat gives me:
 iostat
Linux 2.6.9-67.ELsmp      08/03/2009

avg-cpu:  %user   %nice    %sys %iowait   %idle
           1.23    0.00    0.03    0.03   98.71

Basically, it is doing very little? *scratch*

The source database is responding as fast as ever. (I checked that from my
own machine; from the Linux box itself I only did a ping to the db server.)

Any help, any hint on where to look would be greatly appreciated.


Thanks!
Chantal


Chantal Ackermann wrote:
Hi again!

Thanks for the answer, Grant.

 > It could very well be the case that you aren't seeing any merges with
 > only 20K docs.  Ultimately, if you really want to, you can look in
 > your data.dir and count the files.  If you have indexed a lot and have
 > an MF of 100 and haven't done an optimize, you will see a lot more
 > index files.

Do you mean that 20k documents is not enough to see the effect of those settings?
I chose the smaller data set so that the indexing run can complete without
taking too long.
If it were faster to begin with, I could of course use a larger data set.
I still can't believe that 11 minutes is normal (I haven't managed to make
it run either faster or slower than that; the duration is very stable).

It "feels kinda" slow to me...
From your experience, what duration would you expect for an index with:
- 21 fields, some using a text type with 6 filters
- database access via DataImportHandler, with a root query taking (far)
less than 20ms
- 2 transformers
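
(For a clearer picture, the data-config is structured roughly like this -
a simplified sketch with made-up entity/column names; the script function
itself is defined in a <script> block that I've left out:)

  <document>
    <entity name="doc" pk="ID"
            query="SELECT ID, TITLE FROM MAIN_TABLE"
            transformer="RegexTransformer,script:cleanTitle">
      <!-- RegexTransformer: collapse whitespace in the title -->
      <field column="title" sourceColName="TITLE" regex="\s+" replaceWith=" "/>
      <!-- child entity: my custom processor turns the multi-row result set into one map -->
      <entity name="props" processor="MyMultiRowEntityProcessor"
              query="SELECT NAME, VALUE FROM PROPS WHERE DOC_ID='${doc.ID}'"/>
    </entity>
  </document>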

If I knew that indexing time should be shorter than that, at least, I
would know that something is definitely wrong with what I am doing or
with the environment I am using.

 > Likely, but not guaranteed.  Typically, larger merge factors are good
 > for batch indexing, but a lot of that has changed with Lucene's new
 > background merger, such that I don't know if it matters as much anymore.

OK. I also read a posting that basically said the default parameters are
fine and one shouldn't mess around with them.

The thing is that our current search setup uses Lucene directly, and that
indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
fields are different and the whole setup is different, but it will be hard
to advertise a new implementation/setup where indexing is three times
slower - unless I can give some reasons why that is.
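
(Those two knobs exist in solrconfig.xml as well, so one experiment would be
to mirror the old indexer's values - something like this in <indexDefaults>,
if I read the config reference correctly; maxBufferedDocs and ramBufferSizeMB
are alternative flush triggers, whichever limit is reached first:)

  <indexDefaults>
    <mergeFactor>500</mergeFactor>
    <maxBufferedDocs>7500</maxBufferedDocs>
  </indexDefaults>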

The full index should be fairly fast because the backing data is updated
every few hours. I want to put an incremental/partial update in place as the
main process, but a full reindex might still have to be done at times, if the
data has changed completely or the schema has to be changed/extended.
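
(The incremental update would presumably be DIH's delta import, along these
lines - a rough sketch with made-up table/column names, based on the wiki
example; the exact variable names depend on the Solr version:)

  <entity name="doc" pk="ID" query="SELECT * FROM MAIN_TABLE"
          deltaQuery="SELECT ID FROM MAIN_TABLE WHERE LAST_MODIFIED > '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT * FROM MAIN_TABLE WHERE ID='${dataimporter.delta.ID}'">
    <!-- fields and child entity as in the full-import config -->
  </entity>

The handler would then be called with command=delta-import instead of
full-import.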

 > No, those are separate things.  The ramBufferSizeMB (although, I like
 > the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
 > Lucene holds in memory before it has to flush.  MF controls how many
 > segments are on disk

Alas, the rum! I had that typo on the command line before. That's my
subconscious telling me what I should do when I get home tonight...

So increasing ramBufferSizeMB should lead to higher memory usage,
shouldn't it? I'm not seeing that. :-(

I'll try once more with MF 10 and a higher rum... well, you know... ;-)

Cheers,
Chantal

Grant Ingersoll wrote:
On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:

> Dear all,
>
> I want to find out which settings give the best full index
> performance for my setup.
> Therefore, I have been running a small index (less than 20k
> documents) with a mergeFactor of 10 and 100.
> In both cases, indexing took about 11.5 min:
>
> mergeFactor: 10
> <str name="Time taken ">0:11:46.792</str>
> mergeFactor: 100
> /admin/cores?action=RELOAD
> <str name="Time taken ">0:11:44.441</str>
> Tomcat restart
> <str name="Time taken ">0:11:34.143</str>
>
> This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
> always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
> ATA disk).


> Now, I have three questions:
>
> 1. How can I check which mergeFactor is really being used? The
> solrconfig.xml that is displayed in the admin application is the up-
> to-date view on the file system. I tested that. But it's not
> necessarily what the current SOLR core is using, isn't it?
> Is there a way to check on the actually used mergeFactor (while the
> index is running)?
It could very well be the case that you aren't seeing any merges with
only 20K docs.  Ultimately, if you really want to, you can look in
your data.dir and count the files.  If you have indexed a lot and have
an MF of 100 and haven't done an optimize, you will see a lot more
index files.


> 2. I changed the mergeFactor in both available settings (default and
> main index) in the solrconfig.xml file of the core I am reindexing.
> That is the correct place? Should a change in performance be
> noticeable when increasing from 10 to 100? Or is the change not
> perceivable if the requests for data are taking far longer than all
> the indexing itself?
Likely, but not guaranteed.  Typically, larger merge factors are good
for batch indexing, but a lot of that has changed with Lucene's new
background merger, such that I don't know if it matters as much anymore.


> 3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?
> (Or some other setting?)
No, those are separate things.  The ramBufferSizeMB (although, I like
the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
Lucene holds in memory before it has to flush.  MF controls how many
segments are on disk

> (I am still trying to get profiling information on how much
> application time is eaten up by db connection/requests/processing.
> The root entity query is about (average) 20ms. The child entity
> query is less than 10ms.
> I have my custom entity processor running on the child entity that
> populates the map using a multi-row result set. I have also attached
> one regex and one script transformer.)
>
> Thank you for any tips!
> Chantal



> --
> Chantal Ackermann
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search



