According to the profiler output, a significant amount of CPU time is
being spent in JSON parsing, but your previous email said that you use
SolrJ. SolrJ sends documents to Solr in the javabin binary format and
never uses JSON, so there is definitely some other indexing process
that you have not accounted for.
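
If you want to rule out the client side, you can force javabin
explicitly. A minimal sketch, assuming SolrJ 6.x (the URL is a
placeholder):

    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    // Force the javabin wire format for update requests.
    // (CloudSolrClient already configures this by default.)
    HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection")
            .build();
    client.setRequestWriter(new BinaryRequestWriter());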

On Tue, Mar 14, 2017 at 12:31 AM, Mahmoud Almokadem
<prog.mahm...@gmail.com> wrote:
> Thanks Erick,
>
> I've commented out the SolrClient.add(doclist) line and get 5500+ docs per
> second from a single producer.
>
> Regarding more shards, do you mean using 2 nodes with 8 shards per node, so
> we get 16 shards on the same 2 nodes, or spreading the shards over more nodes?
>
> I'm using Solr 6.4.1 with ZooKeeper on the same nodes.
>
> Here's what I got from the Sematext profiler:
>
> 51%  Thread.java:745  java.lang.Thread#run
> 42%  QueuedThreadPool.java:589  org.eclipse.jetty.util.thread.QueuedThreadPool$2#run
>      (29 calls collapsed)
> 43%  UpdateRequestHandler.java:97  org.apache.solr.handler.UpdateRequestHandler$1#load
> 30%  JsonLoader.java:78  org.apache.solr.handler.loader.JsonLoader#load
> 30%  JsonLoader.java:115  org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader#load
> 13%  JavabinLoader.java:54  org.apache.solr.handler.loader.JavabinLoader#load
>  9%  ThreadPoolExecutor.java:617  java.util.concurrent.ThreadPoolExecutor$Worker#run
>  9%  ThreadPoolExecutor.java:1142  java.util.concurrent.ThreadPoolExecutor#runWorker
> 33%  ConcurrentMergeScheduler.java:626  org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread#run
> 33%  ConcurrentMergeScheduler.java:588  org.apache.lucene.index.ConcurrentMergeScheduler#doMerge
> 33%  SolrIndexWriter.java:233  org.apache.solr.update.SolrIndexWriter#merge
> 33%  IndexWriter.java:3920  org.apache.lucene.index.IndexWriter#merge
> 33%  IndexWriter.java:4343  org.apache.lucene.index.IndexWriter#mergeMiddle
> 20%  SegmentMerger.java:101  org.apache.lucene.index.SegmentMerger#merge
> 11%  SegmentMerger.java:89  org.apache.lucene.index.SegmentMerger#merge
>  2%  SegmentMerger.java:144  org.apache.lucene.index.SegmentMerger#merge
>
>
> On Mon, Mar 13, 2017 at 5:12 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Note that 70,000 docs/second pretty much guarantees that there are
>> multiple shards. Lots of shards.
>>
>> But since you're using SolrJ, the very first thing I'd try would be
>> to comment out the SolrClient.add(doclist) call so you're doing
>> everything _except_ sending the docs to Solr. That'll tell you whether
>> there's any bottleneck in getting the docs from the system of record.
>> The fact that you're pegging the CPUs suggests you are feeding Solr
>> as fast as Solr can go, so this is just a sanity check. But it's
>> simple and fast.
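>>
>> Something like this, roughly (a hypothetical sketch; "batches" and
>> "client" stand in for whatever your indexer actually uses):
>>
>>     import java.util.List;
>>     import org.apache.solr.common.SolrInputDocument;
>>
>>     // Run the full pipeline but skip the send, so the measured
>>     // rate is the rate of the system of record alone.
>>     long start = System.nanoTime();
>>     long count = 0;
>>     for (List<SolrInputDocument> batch : batches) {
>>         // client.add(batch);  // commented out for the sanity check
>>         count += batch.size();
>>     }
>>     double secs = (System.nanoTime() - start) / 1e9;
>>     System.out.printf("source rate: %.0f docs/sec%n", count / secs);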
>>
>> As far as what on Solr could be the bottleneck, there's no real way
>> to know without profiling. But 300+ fields per doc probably just
>> means you're doing a lot of processing; I'm not particularly hopeful
>> you'll be able to speed things up without either more shards or a
>> simpler schema.
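>>
>> If you do add shards, the Collections API can create a wider
>> collection. A rough SolrJ sketch (assuming the 6.x
>> CollectionAdminRequest factory methods; the collection, configset,
>> and ZooKeeper names are placeholders):
>>
>>     import org.apache.solr.client.solrj.impl.CloudSolrClient;
>>     import org.apache.solr.client.solrj.request.CollectionAdminRequest;
>>
>>     // Hypothetical: 16 shards on 2 nodes = 8 shards per node.
>>     CloudSolrClient client =
>>         new CloudSolrClient.Builder().withZkHost("zk1:2181").build();
>>     CollectionAdminRequest
>>         .createCollection("mycollection16", "myconfig", 16, 1)
>>         .setMaxShardsPerNode(8)
>>         .process(client);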
>>
>> Best,
>> Erick
>>
>> On Mon, Mar 13, 2017 at 6:58 AM, Mahmoud Almokadem
>> <prog.mahm...@gmail.com> wrote:
>> > Hi great community,
>> >
>> > I have a SolrCloud cluster with the following configuration:
>> >
>> >    - 2 nodes (r3.2xlarge, 61GB RAM)
>> >    - 4 shards
>> >    - The producer can produce 13,000+ docs per second
>> >    - The schema contains 300+ fields and the document size is about 3KB
>> >    - Using SolrJ and CloudSolrClient, each batch sent to Solr contains
>> >      500 docs (see the sketch after this list)
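>> >
>> > The batching loop looks roughly like this (a simplified sketch;
>> > SourceRecord, the producer, and the ZooKeeper host are placeholders):
>> >
>> >     import java.util.ArrayList;
>> >     import java.util.List;
>> >     import org.apache.solr.client.solrj.impl.CloudSolrClient;
>> >     import org.apache.solr.common.SolrInputDocument;
>> >
>> >     CloudSolrClient client =
>> >         new CloudSolrClient.Builder().withZkHost("zk1:2181").build();
>> >     client.setDefaultCollection("mycollection");
>> >
>> >     List<SolrInputDocument> batch = new ArrayList<>(500);
>> >     for (SourceRecord rec : producer) {   // hypothetical source
>> >         SolrInputDocument doc = new SolrInputDocument();
>> >         doc.addField("id", rec.getId());
>> >         // ... the other 300+ fields ...
>> >         batch.add(doc);
>> >         if (batch.size() == 500) {
>> >             client.add(batch);            // send one 500-doc batch
>> >             batch.clear();
>> >         }
>> >     }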
>> >
>> > When I start my bulk indexer program, CPU utilization is 100% on each
>> > server, but the indexing rate is only about 1,500 docs per second.
>> >
>> > I know that some Solr benchmarks have reached 70,000+ docs per second.
>> >
>> > The question: what is the best way to determine the bottleneck in Solr's
>> > indexing rate?
>> >
>> > Thanks,
>> > Mahmoud
>>



-- 
Regards,
Shalin Shekhar Mangar.
