On 4/28/2017 10:09 AM, Jeff Wartes wrote:
> tldr: Recently, I tried moving an existing solrcloud configuration from a 
> local datacenter to EC2. Performance was roughly 1/10th what I’d expected, 
> until I applied a bunch of linux tweaks.

How very strange.  I knew virtualization would have overhead, possibly
even measurable overhead, but that's insane.  Running on bare metal is
always better if you can do it.  I would be curious what would happen on
your original install if you applied similar tuning to that.  Would you
see a speedup there?

> Interestingly, a coworker playing with a ElasticSearch (ES 5.x, so a much 
> more recent release) alternate implementation of the same index was not 
> seeing this high-system-time behavior on EC2, and was getting throughput 
> consistent with our general expectations.

That's even weirder.  ES 5.x will likely be using Points field types for
numeric fields, and although those are faster than what Solr currently
uses, I doubt it could explain that difference.  The implication here is
that the ES systems are running with stock EC2 settings, not the tuned
settings ... but I'd like you to confirm that.  Same Java version as
with Solr?  IMHO, Java itself is more likely to cause issues like you
saw than Solr.

> I’m writing this for a few reasons:
>
> 1.       The performance difference was so crazy I really feel like this 
> should really be broader knowledge.

Definitely agree!  I would be very interested in learning which of the
tunables you changed were major contributors to the improvement.  If it
turns out that Solr's code is sub-optimal in some way, maybe we can fix it.

> 2.       If anyone is aware of anything that changed in Lucene between 5.4 
> and 6.x that could explain why Elasticsearch wasn’t suffering from this? If 
> it’s the clocksource that’s the issue, there’s an implication that Solr was 
> using tons more system calls like gettimeofday that the EC2 (xen) hypervisor 
> doesn’t allow in userspace.

I had not considered the performance regression in 6.4.0 and 6.4.1 that
Erick mentioned.  Were you still running Solr 5.4, or was it a 6.x version?
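
On the clocksource theory, a quick comparison between the original
hardware and the EC2 instance might help.  The following is a minimal
Python sketch, not anything taken from Solr or from the original report.
It reads the kernel's active clocksource from the usual Linux sysfs path
and measures the average cost of a wall-clock read.  The function names
and the call count are illustrative assumptions, and the timing includes
Python loop overhead, so only the relative difference between machines
is meaningful.

```python
import time
from pathlib import Path

# Usual sysfs location of the active clocksource on Linux.
CLOCKSOURCE_PATH = "/sys/devices/system/clocksource/clocksource0/current_clocksource"


def current_clocksource(path=CLOCKSOURCE_PATH):
    """Return the active clocksource name (e.g. 'tsc' or 'xen'), or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def avg_time_call_ns(calls=200_000):
    """Average cost in nanoseconds of one time.time() call, loop overhead included."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        time.time()
    elapsed = time.perf_counter_ns() - start
    return elapsed / calls


if __name__ == "__main__":
    print("clocksource:", current_clocksource())
    print("avg time.time() cost: %.0f ns" % avg_time_call_ns())
```

If the per-call cost is much higher on the EC2 instance than on the
original hardware, that would be consistent with time reads going
through real system calls there.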

=============

Specific thoughts on the tuning:

The noatime option is very good to use.  I also use nodiratime on my
systems.  Turning off access-time updates can have *massive* impacts on
disk performance.  If these options are the source of the speedup, then
the machine doesn't have enough spare memory.
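
To confirm which access-time options are actually in effect on each
machine, something like the following could be used.  This is a minimal
Python sketch that assumes the standard Linux /proc/mounts layout
(device, mount point, filesystem type, options).  The function name
atime_options is hypothetical.

```python
def atime_options(mounts_text):
    """Map each mount point to the atime-related options it is mounted with."""
    wanted = {"noatime", "nodiratime", "relatime", "strictatime"}
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        result[mount_point] = sorted(wanted.intersection(options))
    return result


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for mount_point, opts in atime_options(f.read()).items():
            print(mount_point, opts or "(no atime option listed)")
```

Running it on both the original install and the EC2 instance would show
whether the two were mounted differently to begin with.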

I'd be wary of the "nobarrier" mount option.  If the underlying storage
has battery-backed write caches, or is SSD without write caching, it
wouldn't be a problem.  Here's info about the "discard" mount option.  I
don't know whether it applies to your Amazon storage:

       discard/nodiscard
              Controls whether ext4 should issue discard/TRIM commands to
              the underlying block device when blocks are freed.  This is
              useful for SSD devices and sparse/thinly-provisioned LUNs,
              but it is off by default until sufficient testing has been
              done.

The network tunables would have more of an effect in a distributed
environment like EC2 than they would on a LAN.

Thanks,
Shawn
