Re: [lustre-discuss] CentOS / LTS plans
Hi Andrew,

Yes, you've got the plan of record correct - any future 2.12.x releases will stick with RHEL 7.x servers, and when the LTS branch shifts to something newer, that will include a jump to RHEL 8 servers. As an aside, note that people do maintain patch series for server support on other distributions (SLES/Ubuntu), particularly for the more current releases, but the bulk of the testing is focused on RHEL/CentOS servers.

There will definitely be plenty of notice before we make the shift, and at the LWG meeting earlier today we discussed including a question in the community survey in the new year to poll people's opinions on this topic.

Peter

On 2020-12-10, 7:20 PM, "lustre-discuss on behalf of Andrew Elwell" wrote:

    Hi All,

    I'm guessing most of you have heard of the recent roadmap for CentOS (discussion of which isn't on topic for this list), but can we have a vague indication (happy for it to be at the "at this point we're thinking about X, but we haven't really decided" level) of what the plans for the upcoming releases are likely to be?

    Thanks for the 2.12.6 update the other day - that's on this afternoon's plan to get onto our testbed - and I see from Peter's mail that 2.12.7 will be the next LTS release. Will this likely be using RHEL 7.x for the servers again? Are the remaining 2.12.x LTS releases likely to stick with RHEL 7 for the servers? Is the "next big branch" LTS release (whatever that may be) likely to be based on RHEL 8 for the servers?

    Many thanks

    Andrew
    (who's trying to work out what licence purchases we're likely to need to include in storage plans)
[lustre-discuss] CentOS / LTS plans
Hi All,

I'm guessing most of you have heard of the recent roadmap for CentOS (discussion of which isn't on topic for this list), but can we have a vague indication (happy for it to be at the "at this point we're thinking about X, but we haven't really decided" level) of what the plans for the upcoming releases are likely to be?

Thanks for the 2.12.6 update the other day - that's on this afternoon's plan to get onto our testbed - and I see from Peter's mail that 2.12.7 will be the next LTS release. Will this likely be using RHEL 7.x for the servers again? Are the remaining 2.12.x LTS releases likely to stick with RHEL 7 for the servers? Is the "next big branch" LTS release (whatever that may be) likely to be based on RHEL 8 for the servers?

Many thanks

Andrew
(who's trying to work out what licence purchases we're likely to need to include in storage plans)
Re: [lustre-discuss] Metrics Gathering into ELK stack
Sid Young writes:

> Are there any reliable/solid Lustre-specific metrics tools that can
> push data to ELK, or can generate JSON strings of metrics I can push
> into more bespoke monitoring solutions?
>
> I am more interested in I/O metrics from the Lustre side of things, as
> I can already gather disk/CPU/memory metrics with Metricbeat as needed
> in the legacy HPC.

Lustre Job Stats [0] may provide some or all of what you are looking for. The job stats data are output in YAML format, which is fairly easy to transform into inputs for Elasticsearch (or, more generally, into JSON for other systems).

In our case we used Python. The YAML inputs are imported as Python dictionaries, which can then be used directly as input data to the Elasticsearch Python module. We happen to add some additional entries to the dictionary objects before submitting them to Elasticsearch. We also find it useful to clear the job stats after each read to simplify analysis.

I do not have any publicly-available code to share, but I think the Python implementation is not overly complex.

[0] https://doc.lustre.org/lustre_manual.xhtml#dbdoclet.jobstats

--
Nathan Smith
Research Systems Engineer
Advanced Computing Center
Oregon Health & Science University
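A minimal sketch of the pipeline Nathan describes (not his actual code): read the job stats via lctl, parse the YAML into dictionaries, add site metadata, index into Elasticsearch, then clear the stats. The target name, Elasticsearch endpoint, index name, and extra metadata field are illustrative assumptions, and the elasticsearch-py 8.x client API is assumed:

    # Sketch only: parse Lustre job stats (YAML) and index into Elasticsearch.
    # Assumes elasticsearch-py 8.x; target/endpoint/index names are made up.
    import subprocess

    import yaml
    from elasticsearch import Elasticsearch

    PARAM = "obdfilter.lustre-OST0000.job_stats"   # hypothetical OST target
    es = Elasticsearch("http://localhost:9200")    # assumed endpoint

    def collect_job_stats():
        # 'lctl get_param -n' prints only the YAML body of the parameter
        out = subprocess.run(["lctl", "get_param", "-n", PARAM],
                             capture_output=True, text=True, check=True)
        doc = yaml.safe_load(out.stdout) or {}
        return doc.get("job_stats") or []

    for record in collect_job_stats():
        # Each record is a dict (job_id, snapshot_time, per-op stats, ...);
        # add extra entries before indexing, as described above.
        record["cluster"] = "hpc-prod"             # illustrative extra entry
        es.index(index="lustre-jobstats", document=record)

    # Clearing after each read simplifies the next analysis pass
    subprocess.run(["lctl", "set_param", f"{PARAM}=clear"], check=True)

On an MDS the equivalent parameter would be mdt.*.job_stats; with multiple targets you would iterate per target rather than globbing, since the concatenated YAML bodies are not a single document.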
Re: [lustre-discuss] lustre babes
Great - we're always happy to have young blood in the Lustre development community!

You're not the first person to report this, but you are the first one to post to the mailing lists about it. The comment is not a mistake but was intended as a joke.

On 2020-12-10, 6:40 AM, "lustre-discuss on behalf of Steve Thompson" wrote:

    At the foot of entries in the Whamcloud Code Review site, it says:

    "You must be at least 18 months old to use the Whamcloud Code Review site."

    My grand-daughter is 3 years old now and is quite bright; can she begin a code review once she has finished with Quantum Mechanics?

    Steve

    --
    Steve Thompson                 E-mail:      smt AT vgersoft DOT com
    Voyager Software LLC           Web:         http://www DOT vgersoft DOT com
    3901 N Charles St              VSW Support: support AT vgersoft DOT com
    Baltimore MD 21218
      "186,282 miles per second: it's not just a good idea, it's the law"
[lustre-discuss] lustre babes
At the foot of entries in the Whamcloud Code Review site, it says:

"You must be at least 18 months old to use the Whamcloud Code Review site."

My grand-daughter is 3 years old now and is quite bright; can she begin a code review once she has finished with Quantum Mechanics?

Steve

--
Steve Thompson                 E-mail:      smt AT vgersoft DOT com
Voyager Software LLC           Web:         http://www DOT vgersoft DOT com
3901 N Charles St              VSW Support: support AT vgersoft DOT com
Baltimore MD 21218
  "186,282 miles per second: it's not just a good idea, it's the law"
Re: [lustre-discuss] Robinhood scan time
Hello,

We had very poor performance with both rbh scans and changelogs, even though our Robinhood daemon runs on a bare-metal server with a lot of RAM, multiple CPUs, and fast SSDs. Thanks to Sebastien Piechurski, we discovered that the CPU assignment of the processes was the main reason for the bad performance. To fix it:

- Lustre was assigned to the CPU in charge of the HBA
- rbh and mysqld were assigned to that same CPU

Maybe I am off topic; sorry if so...

Regards,
Hervé

----- Original Message -----
From: "Iannetti, Gabriele" <ianne...@gsi.de>
To: "Kumar, Amit"
Cc: "lustre-discuss"
Sent: Wednesday, December 9, 2020 10:49:08
Subject: Re: [lustre-discuss] Robinhood scan time

Hi Amit,

we also faced very slow full-scan performance before. As Aurélien mentioned, it is essential to investigate the processing stages within the Robinhood logs. In our setup the GET_FID stage was the bottleneck, since that stage more often showed a relatively low total number of entries processed, so increasing nb_threads_scan helped. Of course, other stages showing a relatively low total number of entries processed, e.g. DB_APPLY, can instead indicate a bottleneck on the database. So keep in mind that there are multiple layers to take into consideration for performance tuning.

For running multiple file system scan tests, you could consider doing a partial scan (with the same test data) with Robinhood instead of scanning the whole file system, which would take much more time.

I would like to share a diagram with you comparing nb_threads_scan = 64 vs. 2; 64 is the maximum we have tested so far. In the production system the number is set to 48, since more is not always better - as far as I can remember, we hit main-memory issues beyond that.

Best regards
Gabriele

From: lustre-discuss on behalf of Degremont, Aurelien
Sent: Tuesday, December 8, 2020 10:39
To: Kumar, Amit; Stephane Thiell
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Robinhood scan time

There could be lots of differences between these two systems:

- What is the backend FS type? (ZFS or ldiskfs)
- How many MDTs do you have?
- Are 2 threads enough to maximize your scan throughput? Stephane said he used 4 and 8 of them.
- What is the workload running on the MDT at the same time? Is it already overloaded by your users' jobs?

Robinhood also dumps its pipeline stats regularly in the logs. You can spot which step of the pipeline is slowing you down.

Aurélien

On 07/12/2020 20:59, "Kumar, Amit" wrote:

Hi Stephane & Aurélien,

Here are the stats I see in my logs. Below are the best and worst average speeds noted in the log, with nb_threads_scan = 2:

2020/11/03 16:51:04 [4850/3] STATS | avg. speed (effective): 618.32 entries/sec (3.23 ms/entry/thread)
2020/11/25 18:06:10 [4850/3] STATS | avg. speed (effective): 187.93 entries/sec (10.62 ms/entry/thread)

Finally, the full scan results are below:

2020/11/25 17:13:41 [4850/4] FS_Scan | Full scan of /scratch completed, 369729104 entries found (123 errors). Duration = 1964257.21s

Stephane, now I wonder what could have caused the poor scanning performance. When I kicked off my initial scan during LAD with the same number of threads (2), the scan, along with some user jobs over the following days, generated 150-200 million file open/close operations and as a result filled up my changelog sooner than I expected. I had to cancel that first scan to bring the situation under control. After I cleared the changelog, I asked Robinhood to perform a new full scan.
I am not sure whether this cancel and restart could have caused delays, with additional database lookups for the entries of the roughly 200 million files already scanned by then. Another thing you point out is that you have RAID-10 SSDs; on our end I have 3.6 TB of SSDs in RAID-5, which probably explains some of the slowness? I wasn't sure of the impact of the scan, hence I chose only 2 threads; I am guessing I could bump that up to 4 next time to see if it benefits my scan times.

Thank you,
Amit

-----Original Message-----
From: Stephane Thiell
Sent: Monday, December 7, 2020 11:43 AM
To: Degremont, Aurelien
Cc: Kumar, Amit; Russell Dekema; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Robinhood scan time

Hi Amit,

Your number is very low indeed. At our site, we're seeing ~100 million files/day during a Robinhood scan with nb_threads_scan = 4, on hardware using Intel-based CPUs:

2020/11/16 07:29:46 [126653/2] STATS | avg. speed (effective): 1207.06 entries/sec (3.31 ms/entry/thread)
2020/11/16 07:31:44 [126653/29] FS_Scan | Full scan of /oak completed, 1508197871 entries found (65 errors). Duration = 1249490.23s

In that case, our Lustre MDS and Robinhood server are all running on 2 x E5-2643 v3 CPUs @ 3.40GHz. The Robin
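For reference, the nb_threads_scan setting discussed throughout this thread lives in the Robinhood configuration file. A minimal sketch, assuming Robinhood v3 config syntax; the path and value are illustrative, not recommendations:

    # /etc/robinhood.d/scratch.conf (fragment) - illustrative values only.
    # Parallel scan threads: 2 in Amit's runs, 4 in Stephane's, up to 48 in
    # production at GSI. Tune against MDS and DB headroom, watching the
    # pipeline stats in the Robinhood log as noted above.
    FS_Scan
    {
        nb_threads_scan = 4;
    }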
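And one way to express the CPU pinning Hervé describes, assuming the daemons are systemd-managed; the unit names and CPU range are hypothetical and depend on which cores handle the HBA interrupts on your server:

    # /etc/systemd/system/robinhood.service.d/affinity.conf
    # Hypothetical range: cores on the socket that owns the HBA.
    [Service]
    CPUAffinity=0-7

    # /etc/systemd/system/mariadb.service.d/affinity.conf
    # Keep mysqld on the same cores as rbh.
    [Service]
    CPUAffinity=0-7

After adding the drop-ins, a systemctl daemon-reload plus a restart of each service applies the affinity.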