Also on EFS performance: EFS is mounted over NFS, and the one time I accidentally ran Solr with its indexes on an NFS-mounted volume, it was 100X slower than local disk. It looks like they've improved that to only about 10X slower than a local EBS volume.
So get off of EFS. Use local GP3 volumes.
https://repost.aws/questions/QUqyZD98d0TbiluqPBW_zALw/how-to-get-comparable-performance-to-gp2-gp3-on-efs

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/ (my blog)

> On Feb 28, 2024, at 11:18 PM, Gus Heck <[email protected]> wrote:
>
> Ah, sorry, my eyes flew past the long, hard-to-read link straight to the pretty table.
>
> Yeah, so a 10000-row grouping query is not a good idea. If you paginated it with cursorMark, you would want to play around with trading off the number of requests against the size of each request. Very likely the optimal size is a lot less than 10000 as long as the looping code isn't crazy inefficient, but it might be as high as 100 or even 500. There is no way to know for any particular system and query other than testing it.
>
> As for how EFS could change its performance on you, check out the references to "bursting credits" here:
> https://docs.aws.amazon.com/efs/latest/ug/performance.html
>
> On Wed, Feb 28, 2024 at 10:55 PM Beale, Jim (US-KOP) <[email protected]> wrote:
>
>> I did send the query. Here it is:
>>
>> http://samisolrcld.aws01.hibu.int:8983/solr/calls/select?indent=true&q.op=OR&fl=business_id,call_id,call_date,call_callerno,caller_name,dialog_merged&q=business_id%3A7016655681%20AND%20call_day:[20230101%20TO%2020240101}&group=true&group.field=call_callerno&sort=call_date%20desc&rows=10000&group.main=true
>>
>> I suppose all the indexes together are about 150 GB, so you are close.
>>
>> I set the limit to 10,000 or 5,000 for these tests. Setting the limit at 10 or 50 would mean 1000-2000 requests. That seems like an awful lot to me.
>>
>> That is interesting about the export. I will look into other types of data collection.
>>
>> Also, there is no quota on the EFS. It is apparently encrypted both ways. But if it is fast right after a restart, rebooting Solr shouldn't change how it accesses the disk.
>>
>> Jim Beale
>> Lead Software Engineer
>> hibu.com
>> 2201 Renaissance Boulevard, King of Prussia, PA, 19406
>> Office: 610-879-3864
>> Mobile: 610-220-3067
>>
>> -----Original Message-----
>> From: Gus Heck <[email protected]>
>> Sent: Wednesday, February 28, 2024 9:22 PM
>> To: [email protected]
>> Subject: Re: [EXTERNAL] Re: Is this list alive? I need help
>>
>> Your description leads me to believe that at worst you have ~20M docs in one index. If the average doc size is 5k or so, that sounds like about 100 GB. This is smallish, and across 3 machines it ought to be fine. Your time 1 values are very slow to begin with. Unfortunately you didn't send us the query, only the code that generates the query. A key bit not shown is what value you are passing in for limit (which is then set for rows). It *should* be something like 10 or 25 or 50. It should NOT be 1000 or 99999, etc. But the fact that you have hardcoded the start to zero makes me think you are not paging and you are doing something in the "NOT" realm.
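For illustration, here is a minimal sketch of the cursorMark paging described above, in the same Node/axios style as the getCalls() function later in the thread. The host, collection, and field names are taken from the thread; the page size of 500 is an arbitrary starting point to tune, and the "id" tie-breaker assumes the schema's uniqueKey is named id. Note that cursorMark cannot be combined with result grouping, so the group-by-caller step would need to happen client-side or via streaming expressions.

const axios = require("axios");

// Minimal cursorMark paging sketch. Page size, host, and field names follow
// the thread; the uniqueKey is assumed to be "id" -- use whatever the schema
// actually defines. Smaller/larger pages trade request count vs. request size.
async function getCallsWithCursor(businessId, pageSize = 500) {
  const url = "http://samisolrcld.aws01.hibu.int:8983/solr/calls/select";
  let cursor = "*";          // "*" starts a new cursor
  const docs = [];

  while (true) {
    const rsp = await axios.get(url, {
      params: {
        q: `business_id:${businessId} AND call_day:[20230101 TO 20240101}`,
        fl: "business_id,call_id,call_date,call_callerno,caller_name,dialog_merged",
        rows: pageSize,
        // cursorMark requires the sort to end with a tie-break on the uniqueKey
        sort: "call_date desc, id asc",
        cursorMark: cursor,
      },
    });
    docs.push(...rsp.data.response.docs);
    const next = rsp.data.nextCursorMark;
    if (next === cursor) break;   // cursor stopped advancing: no more results
    cursor = next;
  }
  return docs;
}

Whether 100, 500, or something else is the right rows value is exactly the request-count versus request-size trade-off mentioned above, and only testing will tell.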
>> If you are trying to export ALL matches to a query, you'd be better off using /export rather than /select (it requires docValues for all fields involved), or if you don't have docValues, use the cursorMark feature to iteratively fetch pages of data.
>>
>> If you say rows=10000, then each node sends back 10000, the coordinator sorts all 30000, and then sends the top 10000 to the client.
>>
>> Note that the grouping feature you are using can be heavy too. To do that in an /export context you would probably have to use streaming expressions, and even there you would have to design carefully to avoid trying to hold large fractions of the index in memory while you formed groups.
>>
>> As for the change in speed, I'm still betting on some sort of quota for your EFS access (R5 instances have fixed CPU availability, so that's not it). However, it's worth looking at your GC logs in case your (probable) large queries are getting you into trouble with memory/GC. As with any performance troubleshooting, you'll want to have eyes on the CPU load, disk IO bytes, disk IOPS, and network bandwidth.
>>
>> Oh, one more thing that comes to mind: make sure you don't configure ANY swap drive on your server. If the OS starts trying to put Solr's cached memory on a swap disk, the query times just go in the trash instantly. In most cases (YMMV) you would MUCH rather crash the server than have it start using swap, because then you know you need a bigger server, rather than silently serving dog-slow results while you limp along.
>>
>> -Gus
>>
>> On Wed, Feb 28, 2024 at 4:09 PM Beale, Jim (US-KOP) <[email protected]> wrote:
>>
>>> Here is the performance for this query on these nodes. You saw the code in a previous email.
>>>
>>> http://samisolrcld.aws01.hibu.int:8983/solr/calls/select?indent=true&q.op=OR&fl=business_id,call_id,call_date,call_callerno,caller_name,dialog_merged&q=business_id%3A7016655681%20AND%20call_day:[20230101%20TO%2020240101}&group=true&group.field=call_callerno&sort=call_date%20desc&rows=10000&group.main=true
>>>
>>> The two times given are right after a restart and the next day, or sometimes a few hours later. The only difference is how long Solr has been running. I can't understand what makes it run so slowly after a short while.
>>> Business_id    Time 1    Time 2
>>> 7016274253     11.572    23.397
>>> 7010707194     21.941    21.414
>>> 7000001491      9.516    39.051
>>> 7029931968     10.755    59.196
>>> 7014676602     14.508    14.083
>>> 7004551760     12.873    36.856
>>> 7016274253      1.792    17.415
>>> 7010707194      5.671    25.442
>>> 7000001491      6.84     36.244
>>> 7029931968      6.291    38.483
>>> 7014676602      7.643    12.584
>>> 7004551760      5.669    21.977
>>> 7029931968      8.293    36.688
>>> 7008606979     16.976    30.569
>>> 7002264530     13.862    35.113
>>> 7017281920     10.1      31.914
>>> 7000001491      8.665    35.141
>>> 7058630709     11.236    38.104
>>> 7011363889     10.977    19.72
>>> 7016319075     15.763    26.023
>>> 7053262466     10.917    48.3
>>> 7000313815      9.786    24.617
>>> 7015187150      8.312    29.485
>>> 7016381845     11.51     34.545
>>> 7016379523     10.543    29.27
>>> 7026102159      6.047    30.381
>>> 7010707194      8.298    27.069
>>> 7016508018      7.98     34.48
>>> 7016280579      5.443    26.617
>>> 7016302809      3.491    12.578
>>> 7016259866      7.723    33.462
>>> 7016390730     11.358    32.997
>>> 7013498165      8.214    26.004
>>> 7016392929      6.612    19.711
>>> 7007737612      2.198     4.19
>>> 7012687678      8.627    35.342
>>> 7016606704      5.951    21.732
>>> 7007870203      2.524    16.534
>>> 7016268227      6.296    25.651
>>> 7016405011      3.288    18.541
>>> 7016424246      9.756    31.243
>>> 7000336592      5.465    31.486
>>> 7004696397      4.713    29.528
>>> 7016279283      2.473    24.243
>>> 7016623672      6.958    35.96
>>> 7016582537      5.112    33.475
>>> 7015713947      5.162    25.972
>>> 7003530665      8.223    26.549
>>> 7012825693      7.4      16.849
>>> 7010707194      6.781    23.835
>>> 7079272278      7.793    24.686
>>>
>>> Jim Beale
>>> Lead Software Engineer
>>> hibu.com
>>> 2201 Renaissance Boulevard, King of Prussia, PA, 19406
>>> Office: 610-879-3864
>>> Mobile: 610-220-3067
>>>
>>> From: Beale, Jim (US-KOP) <[email protected]>
>>> Sent: Wednesday, February 28, 2024 3:29 PM
>>> To: [email protected]
>>> Subject: RE: [EXTERNAL] Re: Is this list alive? I need help
>>>
>>> I didn't see these responses because they were buried in my clutter folder.
>>>
>>> We have 12,541,505 docs for calls, 9,144,862 form fills, 53,838 SMS and 12,752 social leads. These are all in a single Solr 9.1 cluster of three nodes, with PROD and UAT all on a single server. As follows:
>>>
>>> The three nodes are r5.xlarge and we're not sure if those are large enough. The documents are not huge, from 1K to 25K each.
>>> samisolrcld.aws01.hibu.int is a load balancer.
>>>
>>> The request is:
>>>
>>> async function getCalls(businessId, limit) {
>>>     const config = {
>>>         method: 'GET',
>>>         url: 'http://samisolrcld.aws01.hibu.int:8983/solr/calls/select',
>>>         params: {
>>>             q: `business_id:${businessId} AND call_day:[20230101 TO 20240101}`,
>>>             fl: "business_id, call_id, call_day, call_date, dialog_merged, call_callerno, call_duration, call_status, caller_name, caller_address, caller_state, caller_city, caller_zip",
>>>             rows: limit,
>>>             start: 0,
>>>             group: true,
>>>             "group.main": true,
>>>             "group.field": "call_callerno",
>>>             sort: "call_day desc"
>>>         }
>>>     };
>>>
>>>     let rval = [];
>>>     while (true) {
>>>         try {
>>>             const rsp = await axios(config);
>>>             if (rsp.data && rsp.data.response) {
>>>                 let docs = rsp.data.response.docs;
>>>                 if (docs.length == 0) break;
>>>                 config.params.start += limit;
>>>                 rval = rval.concat(docs);
>>>             } else {
>>>                 break;   // no usable response body; stop rather than loop forever
>>>             }
>>>         } catch (err) {
>>>             console.log("Error: " + err.message);
>>>             break;       // stop on error rather than retrying the same page forever
>>>         }
>>>     }
>>>     return rval;
>>> }
>>>
>>> You wrote:
>>>
>>> Note that EFS is an encrypted file system, and stunnel is an encrypted transport, so for each disk read you are likely causing:
>>>
>>> - read raw encrypted data from disk to memory (at AWS)
>>> - decrypt the disk data in memory (at AWS)
>>> - encrypt the memory data for stunnel transport (at AWS)
>>> - send the data over the wire
>>> - decrypt the data for use by Solr (hardware you specify)
>>>
>>> That's guaranteed to be slow, and worse yet, you have no control at all over the size or loading of the hardware performing anything but the last step. You are completely at the mercy of AWS's cost/speed tradeoffs, which are unlikely to be targeting the level of performance usually desired for search disk IO.
>>>
>>> This is interesting. I can copy the data to local disk and try it from there.
>>>
>>> Jim Beale
>>> Lead Software Engineer
>>> hibu.com
>>> 2201 Renaissance Boulevard, King of Prussia, PA, 19406
>>> Office: 610-879-3864
>>> Mobile: 610-220-3067
>>>
>>> -----Original Message-----
>>> From: Gus Heck <[email protected]>
>>> Sent: Sunday, February 25, 2024 9:15 AM
>>> To: [email protected]
>>> Subject: [EXTERNAL] Re: Is this list alive? I need help
>>>
>>> Hi Jim,
>>>
>>> Welcome to the Solr user list. Not sure why you are asking about list liveliness? I don't see prior messages from you:
>>> https://lists.apache.org/[email protected]:lte=1M:jim
>>>
>>> Probably the most important thing you haven't told us is the current size of your indexes. You said 20k/day input, but at the start do you have 0 days, 1 day, 10 days, 100 days, 1000 days, or 10000 days (27y) on disk already?
>>>
>>> If you are starting from zero, then there is likely a 20x or more growth in the size of the index between the first and second measurement. Indexes do get slower with size, though you would need fantastically large documents or some sort of disk problem to explain it that way.
>>> However, maybe you do have huge documents or disk issues, since your query time at time 1 is already abysmal? Either you are creating a fantastically expensive query, or your system is badly overloaded. New systems, properly sized with moderate-sized documents, ought to be serving simple queries in tens of milliseconds.
>>>
>>> As others have said, it is *critical that you show us the entire query request*. If you are doing something like attempting to return the entire index with rows=999999, that would almost certainly explain your issues.
>>>
>>> How large are your average documents (in terms of bytes)?
>>>
>>> Also, what version of Solr?
>>>
>>> r5.xlarge only has 4 CPUs and 32 GB of memory. That's not very large (despite the name). However, since it's unclear what your total index size looks like, it might be OK.
>>>
>>> What are your IOPS constraints with EFS? Are you running out of a quota there? (Bursting mode?)
>>>
>>> Note that EFS is an encrypted file system, and stunnel is an encrypted transport, so for each disk read you are likely causing:
>>>
>>> - read raw encrypted data from disk to memory (at AWS)
>>> - decrypt the disk data in memory (at AWS)
>>> - encrypt the memory data for stunnel transport (at AWS)
>>> - send the data over the wire
>>> - decrypt the data for use by Solr (hardware you specify)
>>>
>>> That's guaranteed to be slow, and worse yet, you have no control at all over the size or loading of the hardware performing anything but the last step. You are completely at the mercy of AWS's cost/speed tradeoffs, which are unlikely to be targeting the level of performance usually desired for search disk IO.
>>>
>>> I'll also echo others and say that it's a bad idea to allow Solr instances to compete for disk IO in any way. I've seen people succeed with setups that use invisibly provisioned disks, but one typically has to run more hardware to compensate. Having a shared disk creates competition, and it also creates a single point of failure, partially invalidating the notion of running 3 servers in cloud mode for high availability. If you can't have more than one disk, then you might as well run a single node, especially at small data sizes like 20k/day. A single node on well-chosen hardware can usually serve tens of millions of normal-sized documents, which would be several years of data for you (assuming low query rates; handling high rates of course starts to require more hardware).
>>>
>>> Finally, you will want to get away from using single queries as a measurement of latency. If you care about response time, I HIGHLY suggest you watch this YouTube video on how NOT to measure latency:
>>> https://www.youtube.com/watch?v=lJ8ydIuPFeU
>>>
>>> On Fri, Feb 23, 2024 at 6:44 PM Jan Høydahl <[email protected]> wrote:
>>>
>>>> I think EFS is a terribly slow file system to use for Solr; who recommended it? :) Better to use one EBS volume per node.
>>>> Not sure if the gradually slower performance is due to EFS though. We need to know more about your setup to get a clue. What role does stunnel play here? How are you indexing the content, etc.?
>>>>
>>>> Jan
>>>>
>>>>> On Feb 23, 2024, at 19:58, Walter Underwood <[email protected]> wrote:
>>>>>
>>>>> First, a shared disk is not a good idea.
Each node should have its >>> >>>>> own >>> >>>> local disk. Solr makes heavy use of the disk. >>> >>>>> >>> >>>>> If the indexes are shared, I’m surprised it works at all. Solr is >>> >>>>> not >>> >>>> designed to share indexes. >>> >>>>> >>> >>>>> Please share the full query string. >>> >>>>> >>> >>>>> wunder >>> >>>>> Walter Underwood >>> >>>>> [email protected] >>> >>>>> http://observer.wunderwood.org/ (my blog) >>> >>>>> >>> >>>>>> On Feb 23, 2024, at 10:01 AM, Beale, Jim (US-KOP) >>> >>>> <[email protected]> wrote: >>> >>>>>> >>> >>>>>> I have a Solrcloud installation of three servers on three >>>>>> r5.xlarge >>> >>>>>> EC2 >>> >>>> with a shared disk drive using EFS and stunnel. >>> >>>>>> >>> >>>>>> I have documents coming in about 20000 per day and I am trying to >>> >>>> perform indexing along with some regular queries and some special >>> >>>> queries for some new functionality. >>> >>>>>> >>> >>>>>> When I just restart Solr, these queries run very fast but over >>>>>> time >>> >>>> become slower and slower. >>> >>>>>> >>> >>>>>> This is typical for the numbers. At time1, the request only took >>> >>>>>> 2.16 >>> >>>> sec but over night the response took 18.137 sec. That is just typical. >>> >>>>>> >>> >>>>>> businessId, all count, reduced count, time1, time2 >>> >>>>>> 7016274253,8433,4769,2.162,18.137 >>> >>>>>> >>> >>>>>> The same query is so far different. Overnight the Solr servers >>>>>> slow >>> >>>> down and give terrible response. I don’t even know if this list is >> alive. >>> >>>>>> >>> >>>>>> >>> >>>>>> Jim Beale >>> >>>>>> Lead Software Engineer >>> >>>>>> hibu.com >>> >>>>>> 2201 Renaissance Boulevard, King of Prussia, PA, 19406 >>> >>>>>> Office: 610-879-3864 >>> >>>>>> Mobile: 610-220-3067 >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> The information contained in this email message, including any >>> >>>> attachments, is intended solely for use by the individual or entity >>> >>>> named above and may be confidential. If the reader of this message >>>> is >>> >>>> not the intended recipient, you are hereby notified that you must >>>> not >>> >>>> read, use, disclose, distribute or copy any part of this >>> >>>> communication. If you have received this communication in error, >>> >>>> please immediately notify me by email and destroy the original >>>> message, >>> including any attachments. Thank you. >>> >>>> **Hibu IT Code:1414593000000** >>> >>>>> >>> >>>> >>> >>>> >>> >>> >>> >>> -- >>> >>> http://www.needhamsoftware.com (work) >>> >>> https://a.co/d/b2sZLD9 (my fantasy fiction book) >>> >> >> >> -- >> http://www.needhamsoftware.com (work) >> https://a.co/d/b2sZLD9 (my fantasy fiction book) >> > > > -- > http://www.needhamsoftware.com (work) > https://a.co/d/b2sZLD9 (my fantasy fiction book)
