It's also very important to consider the type of EC2 instance you are
using...

We settled on the r4.2xlarge.  The R series is the memory-optimized
("high-memory") family.

Which instance type did you end up using?
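
If it's easier to check from inside the VM, the EC2 instance metadata
service will report the type (assuming the standard metadata endpoint is
reachable):

    # prints the instance type, e.g. "r4.2xlarge"
    curl -s http://169.254.169.254/latest/meta-data/instance-type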

On Mon, May 1, 2017 at 8:22 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 4/28/2017 10:09 AM, Jeff Wartes wrote:
> > tldr: Recently, I tried moving an existing SolrCloud configuration from
> > a local datacenter to EC2. Performance was roughly 1/10th what I’d
> > expected, until I applied a bunch of Linux tweaks.
>
> How very strange.  I knew virtualization would have overhead, possibly
> even measurable overhead, but that's insane.  Running on bare metal is
> always better if you can do it.  I would be curious what would happen on
> your original install if you applied similar tuning to that.  Would you
> see a speedup there?
>
> > Interestingly, a coworker playing with an Elasticsearch (ES 5.x, so a
> > much more recent release) alternate implementation of the same index was
> > not seeing this high-system-time behavior on EC2, and was getting
> > throughput consistent with our general expectations.
>
> That's even weirder.  ES 5.x will likely be using Points field types for
> numeric fields, and although those are faster than what Solr currently
> uses, I doubt it could explain that difference.  The implication here is
> that the ES systems are running with stock EC2 settings, not the tuned
> settings ... but I'd like you to confirm that.  Same Java version as
> with Solr?  IMHO, Java itself is more likely to cause issues like you
> saw than Solr.
>
> > I’m writing this for a few reasons:
> >
> > 1.  The performance difference was so crazy I really feel like this
> > should be broader knowledge.
>
> Definitely agree!  I would be very interested in learning which of the
> tunables you changed were major contributors to the improvement.  If it
> turns out that Solr's code is sub-optimal in some way, maybe we can fix it.
>
> > 2.  Is anyone aware of anything that changed in Lucene between 5.4 and
> > 6.x that could explain why Elasticsearch wasn’t suffering from this? If
> > it’s the clocksource that’s the issue, the implication is that Solr was
> > making many more system calls like gettimeofday, which the EC2 (Xen)
> > hypervisor doesn’t allow to be serviced in userspace.
>
> I had not considered the performance regression in 6.4.0 and 6.4.1 that
> Erick mentioned.  Were you still running Solr 5.4, or was it a 6.x version?
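>
> On the clocksource theory, it is easy to see what the guest is using and
> to try tsc instead (a quick check only; whether tsc is reliable depends
> on the instance type):
>
>     # show the current and available clock sources (Xen guests often default to "xen")
>     cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>     cat /sys/devices/system/clocksource/clocksource0/available_clocksource
>
>     # switch the running system to tsc, assuming tsc is listed as available
>     echo tsc | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource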
>
> =============
>
> Specific thoughts on the tuning:
>
> The noatime option is very good to use.  I also use nodiratime on my
> systems.  Turning off atime and diratime updates can have *massive*
> impacts on disk performance.  If these are the source of the speedup,
> then the machine doesn't have enough spare memory.
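>
> For illustration, an fstab entry with both options might look like the
> line below (the device and mount point are placeholders; adjust them for
> your setup):
>
>     # Solr data volume with atime and diratime updates disabled
>     /dev/xvdf   /var/solr/data   ext4   defaults,noatime,nodiratime   0 2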
>
> I'd be wary of the "nobarrier" mount option.  If the underlying storage
> has battery-backed write caches, or is SSD without write caching, it
> wouldn't be a problem.  Here's info about the "discard" mount option; I
> don't know whether it applies to your Amazon storage:
>
>        discard/nodiscard
>               Controls whether ext4 should issue discard/TRIM commands
>               to the underlying block device when blocks are freed.
>               This is useful for SSD devices and sparse/thinly-
>               provisioned LUNs, but it is off by default until
>               sufficient testing has been done.
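>
> If you want TRIM without the discard mount option, running fstrim
> periodically is the usual alternative (just a sketch; the mount point is
> the same placeholder as above, and it assumes the device actually passes
> discard through):
>
>     # trim free blocks once; typically scheduled weekly via cron or a systemd timer
>     sudo fstrim -v /var/solr/data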
>
> The network tunables would have more of an effect in a distributed
> environment like EC2 than they would on a LAN.
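>
> For reference, these are the kinds of TCP buffer settings people usually
> mean by "network tunables" (illustrative values only, not necessarily the
> ones from the original post):
>
>     # raise the socket buffer ceilings for higher-latency links
>     sysctl -w net.core.rmem_max=16777216
>     sysctl -w net.core.wmem_max=16777216
>     # min/default/max TCP buffer sizes, in bytes
>     sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
>     sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"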
>
> Thanks,
> Shawn
>
>
