[robinhood-support] Robinhood performance

Jim Silva Wed, 26 Mar 2014 11:45:48 -0700

We continue to face challenges using robinhood to scan our large Lustre
filesystems here at LLNL.  Currently, LLNL's Lustre filesystems range from 150M
files up to 750M files (and growing).  A recent robinhood update (2.5.0) forced
us to perform another initial scan of these filesystems because of a schema
change.  On our most heavily utilized filesystems we are seeing average scan
rates of only about 50-300 entries/sec.  At this rate, a scan of a filesystem
with more than 500M files will take several weeks to complete.  The robinhood
entry processor pipeline stats often suggest that there are no operations in
wait status and that all worker threads are idle.  I suspect we are being
limited by Lustre performance rather than by robinhood in this case, but I
wanted to run things by you in case there are any tuning improvements we can
benefit from.  We have also been looking into using multiple Lustre clients
(via robinhood's "partial scan" functionality), splitting up the namespace and
farming out scans using Slurm.  This would increase our client count and,
presumably, improve our metadata performance.
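
To make the namespace-splitting idea concrete, here is a minimal sketch of how
the top-level directories could be divided among several scan clients.  The
directory names are hypothetical, and the command template (a partial scan
started with "robinhood --scan=<dir> --once" under "srun") is an assumption
that should be checked against "robinhood --help" for the installed version.

```python
def partition_round_robin(dirs, n_clients):
    """Spread directory paths over n_clients buckets in round-robin order."""
    if n_clients < 1:
        raise ValueError("n_clients must be at least 1")
    buckets = [[] for _ in range(n_clients)]
    for i, path in enumerate(sorted(dirs)):
        buckets[i % n_clients].append(path)
    return buckets


def build_commands(bucket, template):
    """Fill the command template once for each directory in a bucket."""
    return [template.format(path=path) for path in bucket]


# Assumed command line, not verified against the robinhood 2.5.0 CLI.
SCAN_TEMPLATE = "srun -N1 -n1 robinhood --scan={path} --once"

if __name__ == "__main__":
    # Hypothetical top-level directories of one filesystem.
    top_level = ["/p/lscratchX/projA", "/p/lscratchX/projB",
                 "/p/lscratchX/projC", "/p/lscratchX/users"]
    for client, bucket in enumerate(partition_round_robin(top_level, 2)):
        for cmd in build_commands(bucket, SCAN_TEMPLATE):
            print(client, cmd)
```

Round-robin balances the number of directories per client, not the number of
files, so a weighting by known per-directory entry counts would probably be
needed in practice.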


Below are some high-level details of our robinhood configuration:

Robinhood hardware:
Dual Xeon E5-2670 (Sandy Bridge) w/ 384GB RAM
Database written to 8GB/s fibre-channel attached Netapp array.

Software:
RHEL6-U4
Lustre-2.4.0
robinhood-2.5.0

EntryProcessor
{
  # nbr of worker threads for processing pipeline tasks
  nb_threads = 12 ;

  # Max number of operations in the Entry Processor pipeline.
  # If the number of pending operations exceeds this limit,
  # info collectors are suspended until this count decreases
  max_pending_operations = 10000 ;

  max_batch_size = 1000 ;

  # Optionally specify a maximum thread count for each stage of the pipeline:
  # <stagename>_threads_max = <n> (0: use default)
  # STAGE_GET_FID_threads_max = 8 ;
  # STAGE_GET_INFO_DB_threads_max = 8 ;
  STAGE_GET_INFO_FS_threads_max = 8 ;
  # STAGE_REPORTING_threads_max = 8 ;
  # STAGE_DB_APPLY_threads_max = 8 ;

  # if set to FALSE, classes will only be matched
  # at policy application time (not during a scan or reading changelog)
  match_classes = FALSE;
}


FS_Scan
{ 
  # simple scan interval (fixed)
  scan_interval = 2d ;

  # min/max for adaptive scan interval:
  # the more the filesystem is full, the more frequently it is scanned.
  #min_scan_interval = 24h ;
  #max_scan_interval = 7d ;

  # number of threads used for scanning the filesystem
  nb_threads_scan = 12 ;

  # when a scan fails, this is the delay before retrying
  scan_retry_delay = 1h ;

  # timeout for operations on the filesystem
  scan_op_timeout = 1h ;
  # exit if operation timeout is reached?
  exit_on_timeout = TRUE ;
  # external command called on scan termination
  # special arguments can be specified: {cfg} = config file path,
  # {fspath} = path to managed filesystem
  #completion_command = "/path/to/my/script.sh -f {cfg} -p {fspath}" ;

  # Internal scheduler granularity (for testing end of scan, hangs, ...)
  spooler_check_interval = 1min ;

  # Memory preallocation parameters
  nb_prealloc_tasks = 256 ;
}



Recent stats of scan currently in progress:

2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS | ==================== Dumping stats at 2014/03/26 10:28:45 =====================
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS | ======== General statistics =========
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS | Daemon start time: 2014/03/20 19:40:29
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS | Started modules: scan
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS | ======== FS scan statistics =========
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS | current scan interval = 2.0d
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS | scan is running:
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS |      started at : 2014/03/20 19:40:29 (5.6d ago)
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS |      last action: 2014/03/26 10:28:45 (00s ago)
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS |      progress   : 88563083 entries scanned (225 errors)
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS |      avg. speed : 43.62 ms/entry/thread -> 183.40 entries/sec
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS |      inst. speed: 30.44 ms/entry/thread -> 262.77 entries/sec
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS | ==== EntryProcessor Pipeline Stats ===
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS | Idle threads: 12
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS | Id constraints count: 0 (hash min=0/max=0/avg=0.0)
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS | Stage              | Wait | Curr | Done |     Total | ms/op |
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS |  0: GET_FID        |    0 |    0 |    0 |    302725 | 27.93 |
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS |  1: GET_INFO_DB    |    0 |    0 |    0 |  88742078 |  0.19 |
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS |  2: GET_INFO_FS    |    0 |    0 |    0 |  88742078 |  3.66 |
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS |  3: REPORTING      |    0 |    0 |    0 |  88742078 |  0.00 |
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS |  4: PRE_APPLY      |    0 |    0 |    0 |  88742078 |  0.00 |
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS |  5: DB_APPLY       |    0 |    0 |    0 |  88742078 |  0.60 | 16.69% batched (avg batch size: 7.0)
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS |  6: CHGLOG_CLR     |    0 |    0 |    0 |         0 |  0.00 |
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS |  7: RM_OLD_ENTRIES |    0 |    0 |    0 |         0 |  0.00 |
2014/03/26 10:28:45 robinhood@locksley[90120/4]: STATS | DB ops: get=290891/ins=7249559/upd=81492519/rm=0
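
As a rough cross-check of the "several weeks" estimate, the figures in the
stats above can be plugged into a short calculation.  This is only a sketch:
it assumes the average rate stays constant for the rest of the scan, and the
filesystem totals are the round numbers quoted earlier in this message, not
measured counts.

```python
SECONDS_PER_DAY = 86400


def days_remaining(total_entries, scanned, entries_per_sec):
    """Days left to scan the remaining entries at a constant rate."""
    remaining = max(total_entries - scanned, 0)
    return remaining / entries_per_sec / SECONDS_PER_DAY


# Values copied from the stats dump.
scanned = 88_563_083   # "progress" line
avg_rate = 183.40      # "avg. speed" line, entries/sec
elapsed_days = 5.6     # "started at" line

# Rate implied by progress and elapsed time; it should be close to avg_rate.
observed_rate = scanned / (elapsed_days * SECONDS_PER_DAY)

if __name__ == "__main__":
    print(f"observed rate: {observed_rate:.1f} entries/sec")
    for total in (500e6, 750e6):  # assumed filesystem sizes
        days = days_remaining(total, scanned, avg_rate)
        print(f"{total:.0f} entries: about {days:.0f} more days")
```

Under these assumptions a 500M-entry filesystem needs roughly 26 more days on
top of the 5.6 days already spent.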

Purely from a robinhood tuning perspective, do you see any improvements we
can or should make (e.g. number of scan threads, pipeline/worker threads,
batching vs. threading)?
Let me know if there is any additional information you would like.

Thanks, in advance, for any suggestions you may have.

Regards,
Jim

=============================
Jim Silva
HPC Systems Engineer
Lawrence Livermore National Laboratory
[email protected]
