Ah, OK. For very slowly changing indexes, optimize can make sense.

Do note, though, that if you incrementally index after the full build, and
especially if you update documents, you're laying a trap for the future. Let's
say you optimize down to a single segment. The default TieredMergePolicy
tries to merge similarly sized segments. But now you have one huge segment,
and docs will be marked as deleted from that segment but not cleaned up
until that segment is merged, which won't happen for a long time since it
is (I'm assuming) so much bigger than the segments the incremental indexing
will create.

Now, the percentage of deleted documents weighs quite heavily in the decision
of which segments to merge, so it might not matter. It's just something to
be aware of. Certainly benchmarking is in order, as you indicated.
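
If the trap does bite, you can nudge TieredMergePolicy toward reclaiming
deletes sooner. A rough, untested sketch for this vintage of Lucene (the
setters and constructor signatures vary by version, and 3.0 is just an
illustrative value; the default weight is 2.0):

    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;

    // Make segments with many deleted docs more attractive merge
    // targets than their raw size alone would suggest.
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setReclaimDeletesWeight(3.0);          // default is 2.0
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer); // your analyzer
    iwc.setMergePolicy(tmp);

There's also IndexWriter.forceMergeDeletes(), but be aware that it can be
nearly as expensive as optimizing again.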

The Lucene-level IndexWriter.forceMerge method seems to be what you need,
although if you're working over HDFS I'm in unfamiliar territory. The
constructors for IndexWriter take a Directory, and HdfsDirectory
extends BaseDirectory, which extends Directory, so if you can set up
an HdfsDirectory it should "just work". I haven't personally tried it, though.
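
Something along these lines is what I have in mind (untested; the
HdfsDirectory constructor args and the HDFS path are guesses, so check
the javadocs for your Solr version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.solr.store.hdfs.HdfsDirectory;

    // Open the index that lives on HDFS and force-merge it down to
    // one segment. No analysis happens during a pure merge, so a
    // null analyzer is fine here.
    Configuration conf = new Configuration();
    try (HdfsDirectory dir =
             new HdfsDirectory(new Path("hdfs://namenode/path/to/index"), conf)) {
      IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(null));
      writer.forceMerge(1); // maxNumSegments = 1, i.e. "optimize"
      writer.close();
    }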

I saw something recently where optimization helped considerably in a
sharded situation where the rows parameter was 400 (10 shards). My
belief is that the first pass of the distributed search was being
slowed by disk seeks across multiple smaller segments. I'm waiting for
SOLR-6810, which should help with that problem. Don't know if it
applies to your situation or not, though.

HTH,
Erick


On Mon, Jun 15, 2015 at 8:30 PM, Shenghua(Daniel) Wan
<wansheng...@gmail.com> wrote:
> Hi, Erick,
> First, thanks for sharing your ideas. I am giving more context here
> accordingly.
>
> 1. Why optimize? I have done some experiments to compare query response
> times, and there is some difference. In addition, the searcher will be
> customer-facing, so I think any performance boost will be worthwhile
> unless indexing becomes much more frequent. However, more benchmarking
> will be necessary to quantify the margin.
>
> 2. Why embedded solr server? I adopted the idea from Mark Miller's
> map-reduce indexing and built on top of his original contribution to Solr.
> It launches an embedded solr server at the end of the reducer stage:
> basically, a solr "instance" is brought up and fed documents, so an index
> is generated at each reducer. The indexes are then merged, and optimized
> if desired.
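>
> At the Lucene level the merge step is essentially IndexWriter.addIndexes;
> a rough sketch of what happens (names are illustrative, not the actual
> map-reduce code):
>
>     import java.io.IOException;
>     import org.apache.lucene.index.IndexWriter;
>     import org.apache.lucene.index.IndexWriterConfig;
>     import org.apache.lucene.store.Directory;
>
>     // Merge the per-reducer indexes into one, optionally optimizing.
>     void mergeShards(Directory merged, Directory[] shards, boolean optimize)
>         throws IOException {
>       IndexWriter writer = new IndexWriter(merged, new IndexWriterConfig(null));
>       writer.addIndexes(shards); // merges the source indexes in
>       if (optimize) {
>         writer.forceMerge(1);    // the "optimize" step
>       }
>       writer.close();
>     }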
>
> Thanks.
>
> On Mon, Jun 15, 2015 at 5:06 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> The first question is why you're optimizing at all. It's not recommended
>> unless you can demonstrate that an optimized index gives you enough
>> of a performance boost to be worth the effort.
>>
>> And why are you using embedded solr server? That's kind of unusual,
>> so I wonder if you've gone down a wrong path somewhere. In other
>> words, this feels like an XY problem: you're asking about a specific
>> task without explaining the problem you're trying to solve, and there
>> may be better alternatives.
>>
>> Best,
>> Erick
>>
>> On Mon, Jun 15, 2015 at 4:56 PM, Shenghua(Daniel) Wan
>> <wansheng...@gmail.com> wrote:
>> > Hi,
>> > Do you have any suggestions for improving the performance of merging
>> > and optimizing an index?
>> > I have been using embedded solr server to merge and optimize the index.
>> > I am looking for the right parameters to tune. My use case has about
>> > 300 fields plus 250 copyFields, and moderate doc size (about 65 KB per
>> > doc on average).
>> >
>> > https://wiki.apache.org/solr/MergingSolrIndexes does not help much.
>> >
>> > Thanks a lot for any ideas and suggestions.
>> >
>> > --
>> >
>> > Regards,
>> > Shenghua (Daniel) Wan
>>
>
>
>
> --
>
> Regards,
> Shenghua (Daniel) Wan
