Re: [lustre-discuss] CentOS / LTS plans

2020-12-10 Thread Peter Jones
Hi Andrew

Yes, you've got the plan of record correct - any future 2.12.x releases will
stick with RHEL 7.x servers, and when the LTS branch shifts to something newer
that will include a jump to RHEL 8 servers. As an aside, note that people do
maintain patch series for server support on other distributions (SLES/Ubuntu),
particularly for the more current releases, but the bulk of the testing is
focused on RHEL/CentOS servers. There will definitely be plenty of notice
before we make the shift, and at the LWG meeting earlier today we discussed
including a question about this in the community survey in the new year to
poll people's opinions on this topic.

Peter

On 2020-12-10, 7:20 PM, "lustre-discuss on behalf of Andrew Elwell" wrote:

Hi All,

I'm guessing most of you have heard about the recent CentOS roadmap
announcement (discussion of which isn't on topic for this list), but can we
have a vague indication (I'm happy for it to be at the "at this point we're
thinking about X, but we haven't really decided" level) of what the plans for
the upcoming releases are likely to be?

Thanks for the 2.12.6 update the other day - getting it onto our testbed is on
this afternoon's plan, and I see from Peter's mail that 2.12.7 will be the next
LTS release. Will that likely be using RHEL 7.x for servers again?

Are the remaining 2.12.x LTS releases likely to stick with RHEL 7 for 
server?

Is the "next big branch" LTS release (whatever that may be) likely to
be based on RHEL 8 for server?



Many thanks

Andrew (who's trying to work out what licence purchases we're likely
to need to include in storage plans)
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] CentOS / LTS plans

2020-12-10 Thread Andrew Elwell
Hi All,

I'm guessing most of you have heard about the recent CentOS roadmap
announcement (discussion of which isn't on topic for this list), but can we
have a vague indication (I'm happy for it to be at the "at this point we're
thinking about X, but we haven't really decided" level) of what the plans for
the upcoming releases are likely to be?

Thanks for the 2.12.6 update the other day - getting it onto our testbed is on
this afternoon's plan, and I see from Peter's mail that 2.12.7 will be the next
LTS release. Will that likely be using RHEL 7.x for servers again?

Are the remaining 2.12.x LTS releases likely to stick with RHEL 7 for server?

Is the "next big branch" LTS release (whatever that may be) likely to
be based on RHEL 8 for server?



Many thanks

Andrew (who's trying to work out what licence purchases we're likely
to need to include in storage plans)
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Metrics Gathering into ELK stack

2020-12-10 Thread Nathan Smith


Sid Young writes:
> Are there any reliable/solid Lustre-specific metrics tools that can
> push data to ELK, or that can generate JSON strings of metrics I can push
> into more bespoke monitoring solutions...
>
> I am more interested in I/O metrics from the Lustre side of things, as
> I can already gather disk/CPU/memory metrics with Metricbeat as needed
> in the legacy HPC.

Lustre Job Stats [0] may provide some or all of what you are looking
for. The job stats data are output in YAML format, which is fairly easy
to transform into inputs for Elasticsearch (or, more generally, into JSON
for other systems). In our case we used Python: the YAML input is parsed
into Python dictionaries, which can then be used directly as input data to
the Elasticsearch Python module.

We happen to add some additional entries to the dictionary objects
before submitting them to Elasticsearch. We also find it useful to clear
the job stats after each read to simplify analysis. I do not have any
publicly-available code to share, but I think the Python implementation
is not overly complex.

[0] https://doc.lustre.org/lustre_manual.xhtml#dbdoclet.jobstats
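
For anyone who wants a starting point, here is a rough sketch of the approach
(not our production code). The parameter path, index name, extra fields and
Elasticsearch client call are illustrative assumptions - adjust them for your
own targets and client version:

# Rough sketch: read job_stats from one Lustre target, parse the YAML,
# and index each per-job entry into Elasticsearch.
import socket
import subprocess
import yaml                                  # PyYAML
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust for your cluster

def ship_job_stats(param):
    # e.g. "obdfilter.lustre-OST0000.job_stats" on an OSS,
    # or   "mdt.lustre-MDT0000.job_stats" on an MDS
    out = subprocess.run(["lctl", "get_param", "-n", param],
                         capture_output=True, text=True, check=True).stdout
    stats = yaml.safe_load(out) or {}
    for entry in stats.get("job_stats") or []:
        # each entry is already a dict (job_id, snapshot_time, read/write stats, ...)
        entry["target"] = param.split(".")[1]    # extra fields we tag on, as an example
        entry["host"] = socket.gethostname()
        es.index(index="lustre-jobstats", document=entry)

ship_job_stats("obdfilter.lustre-OST0000.job_stats")
# To clear the counters after each read, as mentioned above:
# subprocess.run(["lctl", "set_param",
#                 "obdfilter.lustre-OST0000.job_stats=clear"], check=True)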

-- 
Nathan Smith
Research Systems Engineer
Advanced Computing Center
Oregon Health & Science University
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre babes

2020-12-10 Thread Peter Jones
Great - we're always happy to have young blood in the Lustre development
community! You're not the first person to report this, but you are the first
to post to the mailing lists about it. The comment is not a mistake - it was
intended as a joke.

On 2020-12-10, 6:40 AM, "lustre-discuss on behalf of Steve Thompson" wrote:

At the foot of entries in the Whamcloud Code Review site, it says:

"You must be at least 18 months old to use the Whamcloud Code Review 
site."

My grand-daughter is 3 years old now and is quite bright; can she begin a 
code review once she has finished with Quantum Mechanics?

Steve
-- 

Steve Thompson E-mail:  smt AT vgersoft DOT com
Voyager Software LLC   Web: http://www DOT vgersoft DOT com
3901 N Charles St  VSW Support: support AT vgersoft DOT com
Baltimore MD 21218
   "186,282 miles per second: it's not just a good idea, it's the law"

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] lustre babes

2020-12-10 Thread Steve Thompson

At the foot of entries in the Whamcloud Code Review site, it says:

"You must be at least 18 months old to use the Whamcloud Code Review 
site."


My grand-daughter is 3 years old now and is quite bright; can she begin a 
code review once she has finished with Quantum Mechanics?


Steve
--

Steve Thompson E-mail:  smt AT vgersoft DOT com
Voyager Software LLC   Web: http://www DOT vgersoft DOT com
3901 N Charles St  VSW Support: support AT vgersoft DOT com
Baltimore MD 21218
  "186,282 miles per second: it's not just a good idea, it's the law"

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Robinhood scan time

2020-12-10 Thread Hervé Toureille
Hello,

We had very bad performance with both rbh scans and changelogs, even though
our Robinhood daemon runs on a bare-metal server with lots of RAM, multiple
CPUs and fast SSDs.

Thanks to Sebastien Piechurski we discovered that the CPU assignment of the
processes was the main reason for the poor performance.

To fix it:
- Lustre was assigned to the CPU in charge of the HBA
- rbh and mysqld were assigned to the same CPU

Maybe I am off topic - sorry if so...

Regards
Hervé



- Original Message -
From: "Iannetti, Gabriele" ianne...@gsi.de>
To: "Kumar, Amit" 
Cc: "lustre-discuss" 
Sent: Wednesday, 9 December 2020 10:49:08
Subject: Re: [lustre-discuss] Robinhood scan time

Hi Amit,

We also faced very slow full scan performance before.

As Aurélien mentioned earlier, it is essential to investigate the processing
stages within the Robinhood logs.

In our setup the GET_FID stage was the bottleneck, since that stage more often
showed a relatively low total number of entries processed.
So increasing nb_threads_scan helped.
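
For reference, nb_threads_scan lives in the FS_Scan block of the Robinhood
configuration file; a minimal sketch (the value is just an example, not a
recommendation - see further down on why more threads is not always better):

# robinhood.conf excerpt (illustrative value only)
FS_Scan
{
    nb_threads_scan = 4;
}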

Of course, other stages, e.g. DB_APPLY, with a relatively low total number of
entries processed can indicate a bottleneck on the database side.
So keep in mind that there are multiple layers to take into consideration for
performance tuning.

For running multiple file system scan tests you could consider doing a partial
scan (with the same test data) with Robinhood instead of scanning the whole
file system, which would take much more time.

I would like to share a diagram with you that shows a comparison of
nb_threads_scan = 64 vs 2.
64 is the maximum we have tested so far. In the production system the number
is set to 48, since more is not always better: as far as I can remember we
hit issues with main memory beyond that.

Best regards
Gabriele




From: lustre-discuss  on behalf of Degremont, Aurelien 
Sent: Tuesday, December 8, 2020 10:39
To: Kumar, Amit; Stephane Thiell
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Robinhood scan time

There could be lots of differences between these two systems:
- What is the backend FS type? (ZFS or ldiskfs)
- How many MDTs do you have?
- Are 2 threads enough to maximize your scan throughput? Stephane said he used
4 and 8 of them.
- What is the workload running on the MDT at the same time - is it already
overloaded by your users' jobs?

Robinhood also dumps its pipeline stats regularly in the logs. You can
spot which step of the pipeline is slowing you down.

Aurélien

On 07/12/2020 20:59, "Kumar, Amit" wrote:


Hi Stephane & Aurélien

Here are the stats that I see in my logs:

Below is the best and worst avg. speed I noted in the log, with 
nb_threads_scan=2:
2020/11/03 16:51:04 [4850/3] STATS |  avg. speed  (effective):
618.32 entries/sec (3.23 ms/entry/thread)
2020/11/25 18:06:10 [4850/3] STATS |  avg. speed  (effective):
187.93 entries/sec (10.62 ms/entry/thread)

Finally the full scan results are below:
2020/11/25 17:13:41 [4850/4] FS_Scan | Full scan of /scratch completed, 
369729104 entries found (123 errors). Duration = 1964257.21s

Stephane, now I wonder what could have caused the poor scanning performance.
Once I kicked off my initial scan during LAD with the same number of threads
(2), my scan, along with some user jobs over the following days, caused
150-200 million file open/close operations and as a result filled up my
changelog sooner than I expected. I had to cancel that first initial scan to
bring the situation under control. After I cleared the changelog, I asked
Robinhood to perform a new full scan. I am not sure if this cancel and restart
could have caused delays due to additional database lookups for the ~200
million files already scanned by then? The other thing you point out is that
you have RAID-10 SSD; on our end I have 3.6 TB of SSDs in RAID-5, which
probably explains the slowness?

I wasn't sure of the impact of the scan, hence I chose only 2 threads. I am
guessing I could bump that up to 4 next time to see if it benefits my scan
times.

Thank you,
Amit

-Original Message-
From: Stephane Thiell 
Sent: Monday, December 7, 2020 11:43 AM
To: Degremont, Aurelien 
Cc: Kumar, Amit ; Russell Dekema ; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Robinhood scan time

Hi Amit,

Your number is very low indeed.

At our site, we're seeing ~100 million files/day during a Robinhood scan
with nb_threads_scan = 4 on hardware using Intel-based CPUs:

2020/11/16 07:29:46 [126653/2] STATS |  avg. speed  (effective):   
1207.06 entries/sec (3.31 ms/entry/thread)

2020/11/16 07:31:44 [126653/29] FS_Scan | Full scan of /oak completed, 
1508197871 entries found (65 errors). Duration = 1249490.23s

In that case, our Lustre MDS and Robinhood server are all running on 2 x
CPU E5-2643 v3 @ 3.40GHz.
The Robin