Re: [lustre-discuss] Trying to compile lustre 2.8 against kernel 4.5

2016-05-22 Thread Drokin, Oleg
Hello!

On May 22, 2016, at 2:38 PM, E.S. Rosenberg wrote:

> Internal functions have changed and as a result it is currently not 
> compiling; what is the accepted style for fixing these things?
> 
> Right now I am looking at this error:
> lustre-release/lustre/fid/lproc_fid.c: In function 
> ‘lprocfs_client_fid_space_seq_show’:
> lustre-release/lustre/fid/lproc_fid.c:542:5: error: void value not ignored as 
> it ought to be
>   rc = seq_printf(m, "["LPX64" - "LPX64"]:%x:%s\n",
> 
> which may be straightforward to fix, or may not be, depending on how 
> lprocfs_client_fid_space_seq_show is used.
> 
> seq_printf used to return 0 or -1, which would in turn be returned by this 
> function; now seq_printf does not return anything anymore.

Just open a ticket in jira.hpdd.intel.com describing the problem, either 
specifically for every problem or in general, like "support for Linux kernel 
4.5".
Then submit a patch to review.hpdd.intel.com with the fix and reference the 
ticket.
Don't forget to add the necessary compat logic to configure and elsewhere when 
needed (though I guess we can just ignore the return value from
seq_printf everywhere with no ill effects?).
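
For what it's worth, the shape of the "just ignore the return value" fix can be sketched in userspace. Everything below is an illustration: the stub struct seq_file and seq_has_overflowed() are stand-ins for the kernel's, and fid_space_show() is a hypothetical call site written in the old rc = seq_printf(...) style:

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the kernel's struct seq_file (illustration only). */
struct seq_file {
	char buf[256];
	size_t count;
	size_t size;
};

/* Post-4.3 behavior: seq_printf() returns void; overflow is checked separately. */
static void seq_printf(struct seq_file *m, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(m->buf + m->count, m->size - m->count, fmt, ap);
	va_end(ap);
	if (n > 0)
		m->count += ((size_t)n < m->size - m->count)
			    ? (size_t)n : m->size - m->count;
}

static bool seq_has_overflowed(const struct seq_file *m)
{
	return m->count >= m->size;
}

/* Old pattern:  rc = seq_printf(m, ...); return rc;
 * New pattern:  call seq_printf() for its side effect only, then return 0
 * (or consult seq_has_overflowed() if the caller really cares). */
int fid_space_show(struct seq_file *m, unsigned long long start,
		   unsigned long long end)
{
	seq_printf(m, "[%#llx - %#llx]\n", start, end);
	return seq_has_overflowed(m) ? -1 : 0;
}
```

A compat macro detected by configure (say, HAVE_VOID_SEQ_PRINTF - a name made up here) can then select between the two patterns at the real call sites.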

See https://wiki.hpdd.intel.com/display/PUB/Submitting+Changes and 
https://wiki.hpdd.intel.com/display/PUB/Patch+Landing+Process+Summary

Bye,
Oleg
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] more on lustre striping

2016-05-21 Thread Drokin, Oleg
$ nm -g lustre/liblustre/liblustre.so | grep open
00237820 W __open
00237820 W __open64
0023ba30 W __opendir
002376c0 T _sysio_open
 U fopen@@GLIBC_2.2.5
00237820 T open
00237820 W open64
0023ba30 T opendir


These are the open symbols we have in the .so;
it most certainly intercepts the open syscall
no matter whether it comes via open() or fopen().

So I suspect you just need to catch the __open* symbols, and
that will catch both open() and fopen() for you too.
At least quick googling around seems to confirm this.
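
To make the interception question concrete, here is a minimal sketch of an open() interposer, written as ordinary application code so it is easy to try (the same body would go into an LD_PRELOAD .so). The opens_seen counter is mine, purely for observation; note that on current glibc, fopen()'s internal open bypasses such interposition entirely, which is exactly the behavior John describes below. Link with -ldl on older glibc.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>
#include <sys/types.h>

int opens_seen;  /* how many open() calls went through our interposer */

/* Define our own open(); libc exports `open` as a weak symbol (see the nm
 * output above), so this definition wins for dynamically resolved calls.
 * Forward to the real libc open via dlsym(RTLD_NEXT, ...). */
int open(const char *path, int flags, ...)
{
	int (*real_open)(const char *, int, ...) =
		(int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");
	mode_t mode = 0;

	if (flags & O_CREAT) {		/* open() is variadic only with O_CREAT */
		va_list ap;
		va_start(ap, flags);
		mode = (mode_t)va_arg(ap, int);
		va_end(ap);
	}
	opens_seen++;
	return real_open(path, flags, mode);
}
```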

All interception was done via libsysio (since it reimplemented the VFS in 
userspace),
so if you need more info, perhaps you can consult Lee Ward, who is the main
author of it.
I know he used to read this list too, so he might decide to chime in.

On May 21, 2016, at 9:56 PM, John Bauer wrote:

> Oleg
> 
> I can intercept the fopen(), but that does me no good as I can't set the 
> O_LOV_DELAY_CREATE bit.  What I can not intercept is the open() downstream of 
> fopen().  If one examines the symbols in libc you will see there are no 
> unsatisfied externals relating to open, which means there is nothing for the 
> runtime linker to find concerning opens.  I will have a look at the Lustre 
> 1.8 source, but I seriously doubt that the open beneath fopen() was 
> intercepted with LD_PRELOAD.  I would love to find a way to do that.  I could 
> throw away a lot of code. Thanks,  John
> % nm -g /lib64/libc.so.6 | grep open
> 00033d70 T catopen
> 003bfb80 B _dl_open_hook
> 000b9a60 W fdopendir
> 0006b140 T fdopen@@GLIBC_2.2.5
> 000755c0 T fmemopen
> 0006ba00 W fopen64
> 0006bb60 T fopencookie@@GLIBC_2.2.5
> 0006ba00 T fopen@@GLIBC_2.2.5
> 000736f0 T freopen
> 00074b50 T freopen64
> 000ead40 T fts_open
> 0000 T iconv_open
> 0006b140 T _IO_fdopen@@GLIBC_2.2.5
> 00077220 T _IO_file_fopen@@GLIBC_2.2.5
> 00077170 T _IO_file_open
> 0006ba00 T _IO_fopen@@GLIBC_2.2.5
> 0006d1d0 T _IO_popen@@GLIBC_2.2.5
> 0006cee0 T _IO_proc_open@@GLIBC_2.2.5
> 00130b20 T __libc_dlopen_mode
> 000e7840 W open
> 000e7840 W __open
> 000ec690 T __open_2
> 000e7840 W open64
> 000e7840 W __open64
> 000ec6b0 T __open64_2
> 000e78d0 W openat
> 000e79b0 T __openat_2
> 000e78d0 W openat64
> 000e79b0 W __openat64_2
> 000f6e00 T open_by_handle_at
> 000340b0 T __open_catalog
> 000b9510 W opendir
> 000f0850 T openlog
> 00073e90 T open_memstream
> 000731b0 T open_wmemstream
> 0006d1d0 T popen@@GLIBC_2.2.5
> 0012fbd0 W posix_openpt
> 000e6460 T posix_spawn_file_actions_addopen
> %
> John
> 
> On 5/21/2016 7:33 PM, Drokin, Oleg wrote:
>> btw I find it strange that you cannot intercept fopen (and in fact 
>> intercepting every library call like that is counterproductive).
>> 
>> We used to have this "liblustre" library, that you can LD_PRELOAD into your 
>> application and it would work with Lustre even if you are not root and if 
>> Lustre is not mounted on that node
>> (and in fact even if the node is not Linux at all). That had no problems at 
>> all to intercept all sorts of opens by intercepting syscalls.
>> I wonder if you can intercept something deeper like sys_open or something 
>> like that?
>> Perhaps checkout lustre 1.8 sources (or even 2.1) and see how we did it back 
>> there?
>> 
>> On May 21, 2016, at 4:25 PM, John Bauer wrote:
>> 
>> 
>>> Oleg
>>> 
>>> So in my simple test, the second open of the file caused the layout to be 
>>> created.  Indeed, a write to the original fd did fail.
>>> That complicates things considerably.
>>> 
>>> Disregard the entire topic.
>>> 
>>> Thanks
>>> 
>>> John
>>> 
>>> 
>>> On 5/21/2016 3:08 PM, Drokin, Oleg wrote:
>>> 
>>>> The thing is, when you open a file with no layout (the one you create with 
>>>> O_LOV_DELAY_CREATE) for write the next time - 
>>>> the default layout is created just the same as it would have been created 
>>>> on the first open.
>>>> So if you want custom layouts - you do need to insert setstripe call 
>>>> between the creation and actual open for write.
>>>> 
>>>> On the other hand if you open with O_LOV_DELAY_CREATE and then try to 
>>>> write into that fd - you will get a failure.
>>>> 

Re: [lustre-discuss] more on lustre striping

2016-05-21 Thread Drokin, Oleg
btw I find it strange that you cannot intercept fopen (and in fact intercepting 
every library call like that is counterproductive).

We used to have this "liblustre" library, that you can LD_PRELOAD into your 
application and it would work with Lustre even if you are not root and if 
Lustre is not mounted on that node
(and in fact even if the node is not Linux at all). That had no problems at all 
to intercept all sorts of opens by intercepting syscalls.
I wonder if you can intercept something deeper like sys_open or something like 
that?
Perhaps checkout lustre 1.8 sources (or even 2.1) and see how we did it back 
there?

On May 21, 2016, at 4:25 PM, John Bauer wrote:

> Oleg
> 
> So in my simple test, the second open of the file caused the layout to be 
> created.  Indeed, a write to the original fd did fail.
> That complicates things considerably.
> 
> Disregard the entire topic.
> 
> Thanks
> 
> John
> 
> 
> On 5/21/2016 3:08 PM, Drokin, Oleg wrote:
>> The thing is, when you open a file with no layout (the one you create with 
>> O_LOV_DELAY_CREATE) for write the next time - 
>> the default layout is created just the same as it would have been created on 
>> the first open.
>> So if you want custom layouts - you do need to insert setstripe call between 
>> the creation and actual open for write.
>> 
>> On the other hand if you open with O_LOV_DELAY_CREATE and then try to write 
>> into that fd - you will get a failure.
>> 
>> 
>> On May 21, 2016, at 4:01 PM, John Bauer wrote:
>> 
>> 
>>> Andreas,
>>> 
>>> Thanks for the reply.  For what it's worth, extending a file that does not 
>>> have layout set does work.
>>> 
>>> % rm -f file.dat
>>> % ./no_stripe.exe file.dat
>>> fd=3
>>> % lfs getstripe file.dat
>>> file.dat has no stripe info
>>> % date >> file.dat
>>> % lfs getstripe file.dat
>>> file.dat
>>> lmm_stripe_count:   1
>>> lmm_stripe_size:1048576
>>> lmm_pattern:1
>>> lmm_layout_gen: 0
>>> lmm_stripe_offset:  21
>>> obdidx   objid   objid   group
>>> 21 6143298   0x5dbd420
>>> 
>>> %
>>> The LD_PRELOAD is exactly what I am doing in my I/O library.  
>>> Unfortunately, one can not intercept the open() that results from a call to 
>>> fopen().  That open is hard linked to the open in libc and not satisfied by 
>>> the runtime linker.  This is what is driving this topic for me. I can not 
>>> conveniently set the striping for a file opened with fopen() and other 
>>> functions where the open is called from inside libc. I used to believe that 
>>> not too many applications use stdio for heavy I/O, but I have come 
>>> across several recently.
>>> 
>>> John
>>> 
>>> On 5/21/2016 12:51 AM, Dilger, Andreas wrote:
>>> 
>>>> This is probably getting to be more of a topic for lustre-devel. 
>>>> 
>>>> There currently isn't any way to do what you ask, since (IIRC) it will 
>>>> cause an error for apps that try to write to the files before the layout 
>>>> is set. 
>>>> 
>>>> What you could do is to create an LD_PRELOAD library to intercept the 
>>>> open() calls and set O_LOV_DELAY_CREATE and set the layout explicitly for 
>>>> each file. This might be a win if each file needs a different layout, but 
>>>> since it uses two RPCs per file it would be slower than using the default 
>>>> layout. 
>>>> 
>>>> Cheers, Andreas
>>>> 
>>>> On May 18, 2016, at 16:46, John Bauer 
>>>> 
>>>>  wrote:
>>>> 
>>>> 
>>>>> Since today's topic seems to be Lustre striping, I will revisit a 
>>>>> previous line of questions I had.
>>>>> 
>>>>> Andreas had put me on to O_LOV_DELAY_CREATE which I have been 
>>>>> experimenting with. My question is : Is there a way to flag a directory 
>>>>> with O_LOV_DELAY_CREATE so that a file created in that directory will be 
>>>>> created with O_LOV_DELAY_CREATE also.  Much like a file can inherit a 
>>>>> directory's stripe count and stripe size, it would be convenient if a 
>>>>> file could also inherit O_LOV_DELAY_CREATE?  That way, for open()s that I 
>>>>> can not intercept (and thus can not set O_LOV_DELAY_CREATE in oflags), such 
>>>>> as those issued by fopen(), I can then get the fd with fileno() and set the 
>>>>> striping with ioctl(fd, LL_IOC_LOV_SETSTRIPE, lum).

Re: [lustre-discuss] more on lustre striping

2016-05-21 Thread Drokin, Oleg
The thing is, when you open a file with no layout (the one you create with 
O_LOV_DELAY_CREATE) for write the next time - 
the default layout is created just the same as it would have been created on 
the first open.
So if you want custom layouts - you do need to insert setstripe call between 
the creation and actual open for write.

On the other hand if you open with O_LOV_DELAY_CREATE and then try to write 
into that fd - you will get a failure.
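
In code, the create-then-setstripe sequence described above looks roughly like the sketch below. It only actually runs against a Lustre mount, so treat it as an illustration: O_LOV_DELAY_CREATE, LL_IOC_LOV_SETSTRIPE, and the lov_user_md fields are Lustre's real names, but the header path varies between releases and error handling is trimmed.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <lustre/lustreapi.h>	/* O_LOV_DELAY_CREATE, LL_IOC_LOV_SETSTRIPE */

/* Create a file whose layout we choose ourselves, then hand back an fd that
 * is safe to write through.  Writing before the setstripe ioctl would fail,
 * as discussed above. */
int create_with_layout(const char *path)
{
	struct lov_user_md lum;
	int fd = open(path, O_WRONLY | O_CREAT | O_LOV_DELAY_CREATE, 0644);

	if (fd < 0)
		return -1;

	memset(&lum, 0, sizeof(lum));
	lum.lmm_magic = LOV_USER_MAGIC_V1;
	lum.lmm_stripe_size = 1 << 20;	/* 1 MiB stripes */
	lum.lmm_stripe_count = 4;
	lum.lmm_stripe_offset = -1;	/* -1: let the MDS pick the first OST */

	/* No layout exists yet, so this sets it... */
	if (ioctl(fd, LL_IOC_LOV_SETSTRIPE, &lum) < 0) {
		close(fd);
		return -1;
	}
	/* ...and only now is the fd usable for writes. */
	return fd;
}
```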


On May 21, 2016, at 4:01 PM, John Bauer wrote:

> Andreas,
> 
> Thanks for the reply.  For what it's worth, extending a file that does not 
> have layout set does work.
> 
> % rm -f file.dat
> % ./no_stripe.exe file.dat
> fd=3
> % lfs getstripe file.dat
> file.dat has no stripe info
> % date >> file.dat
> % lfs getstripe file.dat
> file.dat
> lmm_stripe_count:   1
> lmm_stripe_size:1048576
> lmm_pattern:1
> lmm_layout_gen: 0
> lmm_stripe_offset:  21
> obdidx   objid   objid   group
> 21 6143298   0x5dbd420
> 
> %
> The LD_PRELOAD is exactly what I am doing in my I/O library.  Unfortunately, 
> one can not intercept the open() that results from a call to fopen().  That 
> open is hard linked to the open in libc and not satisfied by the runtime 
> linker.  This is what is driving this topic for me. I can not conveniently 
> set the striping for a file opened with fopen() and other functions where the 
> open is called from inside libc. I used to believe that not too many 
> applications use stdio for heavy I/O, but I have come across several 
> recently.
> 
> John
> 
> On 5/21/2016 12:51 AM, Dilger, Andreas wrote:
>> This is probably getting to be more of a topic for lustre-devel. 
>> 
>> There currently isn't any way to do what you ask, since (IIRC) it will cause 
>> an error for apps that try to write to the files before the layout is set. 
>> 
>> What you could do is to create an LD_PRELOAD library to intercept the open() 
>> calls and set O_LOV_DELAY_CREATE and set the layout explicitly for each 
>> file. This might be a win if each file needs a different layout, but since 
>> it uses two RPCs per file it would be slower than using the default layout. 
>> 
>> Cheers, Andreas
>> 
>> On May 18, 2016, at 16:46, John Bauer  wrote:
>> 
>>> Since today's topic seems to be Lustre striping, I will revisit a previous 
>>> line of questions I had.
>>> 
>>> Andreas had put me on to O_LOV_DELAY_CREATE which I have been experimenting 
>>> with. My question is : Is there a way to flag a directory with 
>>> O_LOV_DELAY_CREATE so that a file created in that directory will be created 
>>> with O_LOV_DELAY_CREATE also.  Much like a file can inherit a directory's 
>>> stripe count and stripe size, it would be convenient if a file could also 
>>> inherit O_LOV_DELAY_CREATE?  That way, for open()s that I can not intercept 
>>> ( and thus can not set O_LOV_DELAY_CREATE in oflags) , such as those issued 
>>> by fopen(), I can then get the fd with fileno() and set the striping with 
>>> ioctl(fd, LL_IOC_LOV_SETSTRIPE, lum).
>>> 
>>> Thanks
>>> 
>>> John
>>> -- 
>>> I/O Doctors, LLC
>>> 507-766-0378
>>> 
>>> bau...@iodoctors.com
> 
> -- 
> I/O Doctors, LLC
> 507-766-0378
> 
> bau...@iodoctors.com



Re: [lustre-discuss] Lustre 2.8.0 released

2016-03-28 Thread Drokin, Oleg
We are definitely not planning to replace the existing RPMs, to avoid any 
confusion.

We are debating other ways of relieving the situation without waiting for 2.9 
and we'll update the list once the fix is really ready.
Right now it has not even landed on master, as you can see.

On Mar 28, 2016, at 5:27 PM, Bob Ball wrote:

> Thanks.
> 
> Are you going to replace the 2.8 rpm sets in your repos, or just leave them 
> that way until 2.9?  Yeah, that is a provocative question.  It just seems to 
> me that it is such a fundamental thing, that a new rpm set should be created 
> for distribution.
> 
> bob
> 
> On 3/28/2016 4:20 PM, Glossman, Bob wrote:
>> Bob,
>> This is a known issue.  A fix is already in flight, see LU-7887 
>> https://jira.hpdd.intel.com/browse/LU-7887.
>> 
>> In the short term this can be worked around by building your own from source.
>> It is strictly an issue of our build framework;
>> it doesn't impact manual builds at all.
>> 
>> Bob Glossman
>> HPDD Software Engineer
>> 
>> 
>> 
>> 
>> On 3/28/16, 1:11 PM, "lustre-discuss on behalf of Bob Ball" 
>>  wrote:
>> 
>>> Has anyone else noticed that the lnetctl command is missing from this
>>> rpm set?  I mean, an entire chapter of the manual dedicated to this
>>> command, and it is not present?
>>> 
>>> Will this be fixed?
>>> 
>>> bob
>>> 
>>> On 3/16/2016 6:13 PM, Jones, Peter A wrote:
 We are pleased to announce that the Lustre 2.8.0 Release has been declared 
 GA and is available for download. You can also grab the source from git.
 
 This major release includes new features:
 
 Distributed Namespace (DNE): Asynchronous commit of cross-MDT updates for 
 improved performance, plus remote rename and remote hard link functionality. 
 This completes the work funded by OpenSFS to allow the usage of multiple 
 metadata servers (LU-3534)
 
 LFSCK Phase 4: Performance and efficiency improvements to the online 
 filesystem consistency checker. This completes this OpenSFS-funded work 
 (LU-6361)
 Red Hat 7.x Server Support: This release offers support for both servers 
 and clients with RHEL 7.2 (LU-5022)
 
 SE Linux support for Lustre client: Added the capability to enforce SE 
 Linux security policies for Lustre clients. This work was contributed by 
 Atos. (LU-5560)
 Multiple Metadata RPCs: Support for multiple metadata modifications per 
 client (in the last_rcvd file) to improve the multi-threaded metadata 
 performance of a single client. This work was contributed by Atos. 
 (LU-5319)
 Fuller details can be found in the 2.8 wiki page (including the change log 
 and test matrix).
 The following are known issues in the Lustre 2.8 Release:
 
 LU-7404 – Running with ZFS 0.6.5 and newer versions can result in client 
 evictions while under load. This is due to an upstream ZFS issue.
 LU-7836 – Performing failover with multiple MDTs per MDS can result in 
 excessive memory consumption. It is recommended to deploy with a single MDT 
 per MDS until a fix is in place for this issue.
 Work is in progress for these issues.
 
 Please log any issues found in the issue tracking system.
 We would like to thank OpenSFS for their contributions towards the cost of 
 the release, and also all Lustre community members who have contributed to 
 the release with code, reviews, or testing.
 

Re: [lustre-discuss] Questions about migrate OSTs from ldiskfs to zfs

2016-03-01 Thread Drokin, Oleg

On Mar 1, 2016, at 4:14 PM, Christopher J. Morrone wrote:

> On 03/01/2016 09:18 AM, Alexander I Kulyavtsev wrote:
> 
>> is tag 2.5.3.90 considered stable?
> 
> No.  Generally speaking you do not want to use anything with number 50
> or greater for the fourth number unless you are helping out with testing
> during the development process.

I think you are mixing things up: it is the third number at 50 or above
that indicates development code.


> 2.5.3 was the last official release on branch b2_5 before it was
> discontinued.


In this case 2.5.3.90 is "almost" 2.5.4, but not quite.
We wanted to have a tag on b2_5 before commits there ceased, so that we could
refer to it by a version number rather than "the tip of b2_5".
It contains various fixes on top of 2.5.3, but as far as I know, it did not
undergo the actual release testing that a point release normally would.

Since b2_5 at the time was a maintenance branch, we mostly tried to place
important fixes there that should not have broken anything.
But due to the lack of proper release testing this could not be guaranteed by us,
so using it is still a bit of a leap of faith, though not quite as much as, say,
2.5.51.0 or 2.6.55.0 or the like.

Bye,
Oleg



Re: [lustre-discuss] STOP'd processes on Lustre clients while OSS/OST unavailable?

2016-02-19 Thread Drokin, Oleg
Hello!

   Actually, I have to disagree.
   If the servers go down, but then come back up and complete recovery 
successfully, the locks will be replayed and it all should work transparently.
   Clients would "pause" trying to access those servers for as long as needed 
until the servers come back again.

   Also, file descriptors are something between the MDS and clients, so if an OST 
goes down, file descriptors would not be affected.

   That said, leaving the MDS up while some OSTs go down for a potentially 
prolonged time is not that great an idea, and it might make sense to deactivate 
those OSTs on the MDS before bringing the OSTs down, and to reactivate them 
once they are back.

Bye,
Oleg
On Feb 19, 2016, at 2:53 PM, Patrick Farrell wrote:

> Paul,
> 
> I would say this is not very likely to work and could easily result in 
> corrupted data.  With the servers going down completely, the clients will 
> lose the locks they had (no possibility of recovery with the servers down 
> completely like this), and any data not written out will be lost.  You can 
> guarantee the processes are idle with SIGSTOP, yes, but you can't guarantee 
> all of the data has been written out.
> 
> There are other possible issues as well, but I don't think it's necessary to 
> detail them all.  I would strongly advise against this plan - Just truly stop 
> activity on the clients and unmount Lustre (to be certain), then remount it 
> after the maintenance is complete.
> 
> - Patrick
> On 02/19/2016 01:45 PM, Paul Brunk wrote:
>> Hi all:
>> 
>> We have a Linux cluster (CentOS 6.5, Lustre 1.8.9-wcl) which mounts a
>> Lustre FS from CentOS-based server appliance (Lustre 2.1.0).
>> 
>> The Lustre cluster has 4 OSSes as two failover pairs. Due to bad luck
>> we have one OSS unbootable, and replacing it will require taking its
>> live partner down too (though not any of the other Lustre servers).
>> 
>> We can prevent I/O to the Lustre FS by suspending (kill -STOP) the
>> user processes on the cluster compute nodes before the maintenance
>> work, and resuming them (kill -CONT) afterwards.
>> 
>> I don't know what would happen, though, in those cases where the
>> STOP'd process has an open file descriptor on the Lustre FS. If the
>> relevant OSS/OSTs become unavailable, and then available again, during
>> the STOP'd time, what would happen when the process is CONT'd?
>> 
>> I tried a Web search on this, but the best I could find was stuff
>> which assumed that one of a failover partner set would remain
>> available, or was specifically about evictions (which I guess are a
>> risk of this maintenance procedure anyway). I did find one doc (
>> http://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency
>>  
>> )which suggested that silent data corruption was a possibility in the
>> event of evictions.
>> 
>> But what about non-evicted clients with open filehandles?
>> 
>> Thanks for any insight!
>> 
> 



Re: [lustre-discuss] Why certain commands should not be used on Lustre file system

2016-02-10 Thread Drokin, Oleg
Recursive rm on large directories can be sped up quite a bit if you use Ted 
Tso's fast_readdir.so shared library to sort readdir output by inode numbers.

Naturally it does not work for rm -rf * inside some dir, since in such a 
scenario your shell has already expanded all the names in the directory (and 
typically in alphabetical order too), which is yet
another random order from the fs perspective.

Here's the email thread about it from the dawn of time: 
http://marc.info/?l=mutt-dev&m=107226330912347&w=2

Here's a more modern version: 
https://www.redhat.com/archives/ext3-users/2011-September/msg00013.html


On Feb 10, 2016, at 4:26 AM, Cowe, Malcolm J wrote:

> Recursive delete with rm -r is generally the slowest way to clear out a 
> directory tree (irrespective of file system). I've run tests where even "find 
>  -depth -delete" will complete more quickly than "rm -rf ". 
> There's also an rsync hack that some people like, and there's a funky perl 
> option:
> 
> perl -e 'for(<*>){((stat)[9]<(unlink))}'
> 
> which I dunno, seems like it is trying too hard. Found it on stackoverflow, I 
> think, so I'm not sure I quite trust it.
> 
> Stu's "find ... | xargs ... rm -f" looks like a winner though.
> 
> Malcolm.
> 
> -Original Message-
> From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On 
> Behalf Of Stu Midgley
> Sent: Wednesday, February 10, 2016 6:50 PM
> To: prakrati.agra...@shell.com
> Cc: lustrefs
> Subject: Re: [lustre-discuss] Why certain commands should not be used on 
> Lustre file system
> 
> We actually use
> 
>find  -type f -print0 | xargs -n 100 -P 32 -0 -- rm -f
> 
> which will parallelise the rm... which runs a fair bit faster.
> 
> 
> On Wed, Feb 10, 2016 at 3:33 PM,   wrote:
>> Hi,
>> 
>> Then rm -rf * should not be used on any kind of file system. Why do only 
>> Lustre file systems' best practices have this as a pointer?
>> 
>> Thanks and Regards,
>> Prakrati
>> 
>> -Original Message-
>> From: Dilger, Andreas [mailto:andreas.dil...@intel.com]
>> Sent: Wednesday, February 10, 2016 11:22 AM
>> To: Agrawal, Prakrati PTIN-PTT/ICOE; lustre-discuss@lists.lustre.org
>> Subject: Re: [lustre-discuss] Why certain commands should not be used on 
>> Lustre file system
>> 
>> On 2016/02/09, 21:16, "lustre-discuss on behalf of 
>> prakrati.agra...@shell.com" 
>> mailto:lustre-discuss-boun...@lists.lustre.org>
>>  on behalf of prakrati.agra...@shell.com> 
>> wrote:
>> 
>> I read on Lustre best practices that ls -U should be used instead of ls -l . 
>> I understand that ls -l makes MDS contact all OSS to get all information 
>> about all files and hence loads it. But, what does ls -U do to avoid it?
>> 
>>   -U do not sort; list entries in directory order
>> 
>> This is more important for very large directories, since "ls" will read all 
>> of the entries and stat them before printing anything.  That said, GNU ls 
>> will still read all of the entries before printing them, so for very large 
>> directories "find  -ls" is a lot faster to start printing entries.
>> 
>> Also, it is said that rm-rf * should not be used. Please can someone explain 
>> the reason for that.
>> 
>> It is also said that instead lfs find   --type f -print0 | 
>> xargs -0 rm -f should be used. Please explain the reason for this also.
>> 
>> "rm -rf *" will expand "*" onto the command line (done by bash) and if there 
>> are too many files in the directory (more than about 8MB IIRC) then bash 
>> will fail to execute the command.  Running "lfs find" (or just plain "find") 
>> will only print the filenames onto the output and xargs will process them in 
>> chunks that fit onto a command-line.
>> 
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Principal Architect
>> Intel High Performance Data Division
>> 
> 
> 
> 
> -- 
> Dr Stuart Midgley
> sdm...@sdm900.com



Re: [lustre-discuss] cache management of Lustre

2015-10-18 Thread Drokin, Oleg
Hello!

On Oct 17, 2015, at 9:49 PM, teng wang wrote:

> Can anyone tell me how the Lustre client and server manage the cache?
> Or is there any related resource?

Server-side, the cache is just the straight VFS cache with the usual controls
(though on OSTs you also get to exclude certain file sizes).

On the client it's also the straight VFS cache, but there Lustre locks
play an important role in keeping those caches consistent.
There is an old Lustre Internals book by ORNL that contains a lot
of good info, though it is a bit stale in places. Nonetheless, all
the concepts have remained the same, and in particular all locking
interactions are still current.
http://users.nccs.gov/~fwang2/papers/lustre_report.pdf

Hope that helps.

Bye,
Oleg



Re: [lustre-discuss] [PATCH 04/37] staging/lustre: tracefile: use 64-bit seconds

2015-09-24 Thread Drokin, Oleg

On Sep 24, 2015, at 3:46 AM, Arnd Bergmann wrote:

> On Thursday 24 September 2015 04:02:09 Drokin, Oleg wrote:
>>> The lustre tracefile has a timestamp defined as
>>> 
>>>  __u32 ph_sec;
>>>  __u64 ph_usec;
>>> 
>>> which seems completely backwards, as the microsecond portion of
>>> a time stamp will always fit into a __u32 value, while the second
>>> portion will overflow in 2038 or 2106 (in case of unsigned seconds).
>>> 
>>> This rectifies the situation by swapping out the types to have
>>> 64-bit seconds like everything else.
>>> 
>>> While this constitutes an ABI change, it seems to be reasonable
>>> for a debugging interface to change and is likely what was
>>> originally intended.
>> 
>> This is going to wreak some havoc as the old tools would obviously
>> misrepresent this, but the new tools also cannot assume blindly
>> this change is in place, since people tend to stick to old
>> lustre modules for a long time in production for various reasons,
>> while the tools might get upgraded.
>> So I wonder if we should include some sort of a hint somewhere that
>> the lctl could read and see which format it's going to convert from.
>> Either that or we'd need to play with some heuristic in the tools
>> to observe where the leading zeros are (in little endian) in one
>> and the other case (if the year is not quite 2038 yet) and
>> make a decision based on that.
> 
> Ok, I see.
> 
> If you can prove that the user space tools interpret this value
> as a unsigned 32-bit number, that would work until 2106 and we could
> document it as a restriction that way.

Well, let's see.
The same structure definition is used as in the kernel (in fact it's
the same header, just before it got adopted into the staging tree).

The users are in lnet/utils/debug.c in the same lustre tree I pointed you at
(tools live there too):

static int cmp_rec(const void *p1, const void *p2)
{
struct dbg_line *d1 = *(struct dbg_line **)p1;
struct dbg_line *d2 = *(struct dbg_line **)p2;

if (d1->hdr->ph_sec < d2->hdr->ph_sec)
return -1;
if (d1->hdr->ph_sec == d2->hdr->ph_sec &&
d1->hdr->ph_usec < d2->hdr->ph_usec)
return -1;
if (d1->hdr->ph_sec == d2->hdr->ph_sec &&
d1->hdr->ph_usec == d2->hdr->ph_usec)
return 0;
return 1;
}

   bytes = snprintf(out, sizeof(out),
"%08x:%08x:%u.%u%s:%u.%06llu:%u:%u:%u:"
"(%s:%u:%s()) %s",
hdr->ph_subsys, hdr->ph_mask,
hdr->ph_cpu_id, hdr->ph_type,
hdr->ph_flags & PH_FLAG_FIRST_RECORD ? "F" : "",
hdr->ph_sec, (unsigned long long)hdr->ph_usec,
hdr->ph_stack, hdr->ph_pid, hdr->ph_extern_pid,
line->file, hdr->ph_line_num, line->fn,
line->text);


fprintf(stderr, "  seconds = %u\n", hdr->ph_sec);


if (hdr->ph_len > 4094 ||   /* is this header bogus? */
hdr->ph_stack > 65536 ||
hdr->ph_sec < (1 << 30) ||
hdr->ph_usec > 10 ||
hdr->ph_line_num > 65536) {


These are all the users.
Seems to be pretty much u32 (or unsigned int when printing) everywhere.

> Another option would be to change the code storing the times there
> to do:
> 
>   header->ph_sec = (u32)ts.tv_sec;
>   header->ph_usec = (ts.tv_sec & 0xffffffff00000000ull) | 
>  (ts.tv_nsec / NSEC_PER_USEC);
> 
> and do the reverse on the user space side. This would be both endian-
> safe and backwards compatible, although rather ugly.

Indeed, that's quite ugly.
I think in the next 90 years we will need some other adjustments to the
log format and should be able to include this one with them
(e.g. there are only 16 bits for the CPU id; considering we have 260+ core
CPUs on the market today, we might as well overflow that quite soon).
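
For reference, the packed encoding quoted above round-trips like this (a sketch with the header reduced to the two timestamp fields):

```c
#include <stdint.h>

/* Reduced view of the tracefile header: 32-bit seconds, 64-bit "usec" field. */
struct ph_stamp {
	uint32_t ph_sec;
	uint64_t ph_usec;
};

/* The proposal: the low 32 bits of the epoch seconds stay in ph_sec (so old
 * tools keep working until 2106), the high 32 bits hide in the upper half of
 * ph_usec, whose real payload (microseconds < 10^6) only needs 20 bits. */
struct ph_stamp pack_stamp(int64_t sec, uint32_t usec)
{
	struct ph_stamp s;

	s.ph_sec = (uint32_t)sec;
	s.ph_usec = ((uint64_t)sec & 0xffffffff00000000ULL) | usec;
	return s;
}

int64_t unpack_sec(struct ph_stamp s)
{
	return (int64_t)((s.ph_usec & 0xffffffff00000000ULL) | s.ph_sec);
}

uint32_t unpack_usec(struct ph_stamp s)
{
	return (uint32_t)(s.ph_usec & 0xffffffffULL);
}
```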




Re: [lustre-discuss] [PATCH 32/37] staging/lustre: use 64-bit times for exp_last_request_time

2015-09-24 Thread Drokin, Oleg

On Sep 24, 2015, at 11:18 AM, Arnd Bergmann wrote:

> On Thursday 24 September 2015 03:55:20 Drokin, Oleg wrote:
>> On Sep 23, 2015, at 3:13 PM, Arnd Bergmann wrote:
>> 
>>> The last request time is stored as an 'unsigned long', which is
>>> good enough until 2106, but it is then converted to 'long' in
>>> some places, which overflows in 2038.
>>> 
>>> This changes the type to time64_t to avoid those problems.
>> 
>> Hm…
>> All this code actually only makes sense on the server and is unused 
>> otherwise,
>> so it's probably best to drop ping_evictor_start, ping_evictor_main, 
>> exp_expired,
>> class_disconnect_export_list (and two places where it's called from) 
>> functions
>> and exp_last_request_time field.
>> And with ping evictor gone, we should also drop ptlrpc_update_export_timer.
>> 
>> While clients do retain the request handling code (to process 
>> server-originated
>> requests like lock callbacks), they are not going to evict the servers 
>> because
>> the server have not talked to us in a while or anything of the sort.
> 
> I tried doing this, but could not figure out how to get rid of
> class_disconnect_exports().

It's only called from class_cleanup like this:

/* The three references that should be remaining are the
 * obd_self_export and the attach and setup references. */
if (atomic_read(&obd->obd_refcount) > 3) {
/* refcount - 3 might be the number of real exports
   (excluding self export). But class_incref is called
   by other things as well, so don't count on it. */
CDEBUG(D_IOCTL, "%s: forcing exports to disconnect: %d\n",
   obd->obd_name, atomic_read(&obd->obd_refcount) - 3);
dump_exports(obd, 0);
class_disconnect_exports(obd);
}


This is only true on the servers, so we can replace it with a corresponding LASSERT, I imagine.

> However, I started removing dead code and ended up with a huge patch
> that would not make it to mailing list servers and that I therefor
> pasted on http://pastebin.com/uncZQNh7 for reference.

Wow, this is a large patch indeed.
Some parts of it I was contemplating on my own, like the whole dt_object.[ch] removal with all the ties into the llog code, though it was somewhat tricky as some bits in mgc seem to be using that somehow.

I have not finished a detailed runthrough yet, but on the surface: why did you remove the suppress_pings parameter that is still valid on clients, to let them not ping servers?
Also ptlrpc_ping_import_soon and friends - all that code is actually needed.

Only the "ping evictor" is not needed on the client, as it's only servers that are going to kick out clients that are silent for too long. Clients are still expected to send their "keep alive" pings periodically (unless the suppress_pings option is enabled).

Hm…. I now see it is not fully implemented; that's why you are removing it, as it's really not called from anywhere.


Anyway this does remove a lot of stuff that we don't really need in the client.
I'll try to get it built and tested just to make sure it does not really break anything (unfortunately it does not seem to apply cleanly to the tip of the staging-next tree).

Thank you!

Bye,
Oleg


Re: [lustre-discuss] [PATCH 04/37] staging/lustre: tracefile: use 64-bit seconds

2015-09-23 Thread Drokin, Oleg

On Sep 23, 2015, at 3:13 PM, Arnd Bergmann wrote:

> The lustre tracefile has a timestamp defined as
> 
>   __u32 ph_sec;
>   __u64 ph_usec;
> 
> which seems completely backwards, as the microsecond portion of
> a time stamp will always fit into a __u32 value, while the second
> portion will overflow in 2038 or 2106 (in case of unsigned seconds).
> 
> This rectifies the situation by swapping out the types to have
> 64-bit seconds like everything else.
> 
> While this constitutes an ABI change, it seems to be reasonable
> for a debugging interface to change and is likely what was
> originally intended.

This is going to wreak some havoc as the old tools would obviously
misrepresent this, but the new tools also cannot assume blindly
this change is in place, since people tend to stick to old
lustre modules for a long time in production for various reasons,
while the tools might get upgraded.
So I wonder if we should include some sort of a hint somewhere that
the lctl could read and see which format it's going to convert from.
Either that, or we'd need to play with some heuristic in the tools to observe where the leading zeros are (in little endian) in one case and in the other (if the year is not quite 2038 yet) and make a decision based on that.
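That leading-zeros heuristic could look roughly like the following. The 12-byte timestamp layouts and the decision rule are my own illustration (and it misreads a record whose old-format microsecond count happens to be exactly zero):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum fmt { FMT_OLD_SEC32, FMT_NEW_SEC64 };

/* buf points at the 12-byte timestamp area of one record, little-endian:
 *   old: u32 seconds | u64 microseconds
 *   new: u64 seconds | u32 nanoseconds
 * Until 2106 the upper half of a 64-bit seconds value is all zero,
 * while the low half of a non-zero microsecond count is not, so the
 * four bytes after the first word tell the two layouts apart. */
static enum fmt guess_format(const unsigned char *buf)
{
	uint32_t mid;

	memcpy(&mid, buf + 4, sizeof(mid)); /* assumes an LE host too */
	return mid == 0 ? FMT_NEW_SEC64 : FMT_OLD_SEC32;
}
```

A real lctl would presumably combine this with a version hint when one is available, falling back to the guess only for old dumps.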

> 
> Signed-off-by: Arnd Bergmann 
> ---
> drivers/staging/lustre/include/linux/libcfs/libcfs_debug.h   | 4 ++--
> drivers/staging/lustre/lustre/libcfs/linux/linux-tracefile.c | 8 
> 2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_debug.h 
> b/drivers/staging/lustre/include/linux/libcfs/libcfs_debug.h
> index a3aa644154e2..dfb81022397d 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_debug.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_debug.h
> @@ -73,8 +73,8 @@ struct ptldebug_header {
>   __u32 ph_mask;
>   __u16 ph_cpu_id;
>   __u16 ph_type;
> - __u32 ph_sec;
> - __u64 ph_usec;
> + __u64 ph_sec;
> + __u32 ph_nsec;
>   __u32 ph_stack;
>   __u32 ph_pid;
>   __u32 ph_extern_pid;
> diff --git a/drivers/staging/lustre/lustre/libcfs/linux/linux-tracefile.c 
> b/drivers/staging/lustre/lustre/libcfs/linux/linux-tracefile.c
> index 87d844953522..fad272d559c4 100644
> --- a/drivers/staging/lustre/lustre/libcfs/linux/linux-tracefile.c
> +++ b/drivers/staging/lustre/lustre/libcfs/linux/linux-tracefile.c
> @@ -191,16 +191,16 @@ cfs_set_ptldebug_header(struct ptldebug_header *header,
>   struct libcfs_debug_msg_data *msgdata,
>   unsigned long stack)
> {
> - struct timeval tv;
> + struct timespec64 ts;
> 
> - do_gettimeofday(&tv);
> + ktime_get_real_ts64(&ts);
> 
>   header->ph_subsys = msgdata->msg_subsys;
>   header->ph_mask = msgdata->msg_mask;
>   header->ph_cpu_id = smp_processor_id();
>   header->ph_type = cfs_trace_buf_idx_get();
> - header->ph_sec = (__u32)tv.tv_sec;
> - header->ph_usec = tv.tv_usec;
> + header->ph_sec = ts.tv_sec;
> + header->ph_nsec = ts.tv_nsec;
>   header->ph_stack = stack;
>   header->ph_pid = current->pid;
>   header->ph_line_num = msgdata->msg_line;
> -- 
> 2.1.0.rc2
> 



Re: [lustre-discuss] lustre support on AArch64 (ARMv8)

2015-09-01 Thread Drokin, Oleg

On Sep 1, 2015, at 12:06 PM, Oleg Drokin wrote:

> Hello!
> 
> On Sep 1, 2015, at 6:51 AM, Sreenidhi Bharathkar Ramesh wrote:
> 
>> Hi,
>> 
>> I am interested in trying out Lustre , both server and client, on AArch64 
>> (ARMv8) platform.  
>> 
>> Setup: 
>> - Juno development board with linaro kernel (3.10 or 3.19) 
>> - plan to build lustre from sources.
>> - Ubuntu 14.04 or fc21 rootfs
>> 
>> 1. platform question:  Is Lustre supported on AArch64 platform ?
>> 
>> 2. kernel version question: is there any specific kernel version requirement 
>> ?
> 
> If you want server support, you need to stick to RHEL6.x, rhel7.x or SLES12 
> kernels.
> Outside of those you are on your own for ldiskfs patches, I guess.
> 
> You'll need this patch http://review.whamcloud.com/#/c/15395/ to enable arm 
> building on our tree.

BTW, if you end up trying this patch, please report back whether it actually works, since we have no ARM hardware to even try building on, and nobody so far has come forward with a builder that can build ARM code (and ideally also test it at least occasionally).

Thanks.

Bye,
Oleg


Re: [lustre-discuss] lustre support on AArch64 (ARMv8)

2015-09-01 Thread Drokin, Oleg
Hello!

On Sep 1, 2015, at 6:51 AM, Sreenidhi Bharathkar Ramesh wrote:

> Hi,
> 
> I am interested in trying out Lustre , both server and client, on AArch64 
> (ARMv8) platform.  
> 
> Setup: 
> - Juno development board with linaro kernel (3.10 or 3.19) 
> - plan to build lustre from sources.
> - Ubuntu 14.04 or fc21 rootfs
> 
> 1. platform question:  Is Lustre supported on AArch64 platform ?
> 
> 2. kernel version question: is there any specific kernel version requirement ?

If you want server support, you need to stick to RHEL6.x, RHEL7.x or SLES12 kernels. Outside of those you are on your own for ldiskfs patches, I guess.

You'll need this patch http://review.whamcloud.com/#/c/15395/ to enable ARM building on our tree.

Clients would likely build with a wider set of kernels, including various Ubuntu variants.

Bye,
Oleg


Re: [lustre-discuss] FIEMAP support for Lustre

2015-08-24 Thread Drokin, Oleg
Hello!

   fiemap is not implemented on ZFS backends at this time.

Bye,
Oleg
On Aug 24, 2015, at 3:22 PM, Alexander I Kulyavtsev wrote:

> Hi Oleg,
> does ZFS based lustre supports FIEMAP?
> 
> We have lustre 2.5 with zfs installed. Otherwise we will need to setup 
> separate test system with ldiskfs.
> 
> But: please review separate reply, I think this can be addressed through 
> multirail, NRS, file striping.
> 
> Best regards, Alex.
> 
> On Aug 24, 2015, at 11:06 AM, Drokin, Oleg  wrote:
> 
>> Hello!
>> 
>> On Aug 24, 2015, at 11:57 AM, Wenji Wu wrote:
>> 
>>> Hello, everybody,
>>> 
>>> I understand that ext2/3/4 support FIEMAP to get file extent mapping. 
>>> 
>>> Does Lustre supports similar feature like FIEMAP? Can Lustre client gets 
>>> FIEMAP-like information on a Luster file system?
>> 
>> Yes, Lustre does support fiemap.
>> You can see patched ext4progs and the filefrag included there works on top 
>> of Lustre too, as an example.
>> 
>> lustre/tests/checkfiemap.c in the lustre source tree is another example user 
>> of this functionality that you can consult.
>> 
>> Bye,
>>   Oleg
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



Re: [lustre-discuss] FIEMAP support for Lustre

2015-08-24 Thread Drokin, Oleg
Hello!

On Aug 24, 2015, at 11:57 AM, Wenji Wu wrote:

> Hello, everybody,
> 
> I understand that ext2/3/4 support FIEMAP to get file extent mapping. 
> 
> Does Lustre supports similar feature like FIEMAP? Can Lustre client gets 
> FIEMAP-like information on a Luster file system?

Yes, Lustre does support fiemap.
You can see the patched ext4progs as an example; the filefrag included there works on top of Lustre too.

lustre/tests/checkfiemap.c in the lustre source tree is another example user of this functionality that you can consult.
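For reference, a minimal standalone FIEMAP probe in the same spirit as checkfiemap.c. This is my own sketch, not the Lustre test; with fm_extent_count set to 0 the ioctl only reports how many extents back the file, without filling in an extent array:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Return the number of extents backing fd, or -1 if the
 * filesystem does not implement FIEMAP. */
static int count_extents(int fd)
{
	struct fiemap fm;

	memset(&fm, 0, sizeof(fm));
	fm.fm_start = 0;
	fm.fm_length = FIEMAP_MAX_OFFSET;
	fm.fm_extent_count = 0;		/* just ask for the count */
	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
		return -1;
	return (int)fm.fm_mapped_extents;
}
```

filefrag -v performs the same call with full extent decoding; on a striped Lustre file you should see one or more extents per stripe object.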

Bye,
Oleg


Re: [lustre-discuss] Target names changing when mounting

2015-07-31 Thread Drokin, Oleg
I see.
Well, using those symlinks was certainly not expected when this code was written.
So I imagine you should avoid using them and perhaps use /dev/disk/by-uuid, which should be more stable in the face of writeconfs?

On Jul 31, 2015, at 2:36 PM, Andrus, Brian Contractor wrote:

> Oleg, 
> 
> When it changes the e2label, the links in /dev/disk/by-label get changed too.
> 
> 
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
> 
> 
> 
> 
> -----Original Message-
> From: Drokin, Oleg [mailto:oleg.dro...@intel.com] 
> Sent: Friday, July 31, 2015 11:31 AM
> To: Andrus, Brian Contractor
> Cc: lustre-discuss@lists.lustre.org
> Subject: Re: [lustre-discuss] Target names changing when mounting
> 
> Hello!
> 
> On Jul 31, 2015, at 2:08 PM, Andrus, Brian Contractor wrote:
> 
>> All,
>> 
>> I find this very confusing..
>> I created a new lustre filesystem (WORK2) by creating and MDT and a single 
>> OST.
>> When I first create them, the names shown for Target have different values 
>> from 'Read precious values' and 'Permanent disk data'
>> The difference is a hyphen '-' versus a colon ':'
> 
> This is internal bookkeeping to show target status:
> 
>         /* svname of the form lustre:OST1234 means never registered */
>         rc = strlen(ldd->ldd_svname);
>         if (ldd->ldd_svname[rc - 8] == ':') {
>                 ldd->ldd_svname[rc - 8] = '-';
>                 ldd->ldd_flags |= LDD_F_VIRGIN;
>         } else if (ldd->ldd_svname[rc - 8] == '=') {
>                 ldd->ldd_svname[rc - 8] = '-';
>                 ldd->ldd_flags |= LDD_F_WRITECONF;
>         }
> 
>> 
>>   Read previous values:
>> Target: WORK2-MDT
>> 
>>   Permanent disk data:
>> Target: WORK2:MDT
>> 
>> The ext2 label on the disk matches the Permanent disk data.
>> BUT
>> When I mount it (mount -t lustre) the Permanent disk data changes to use a 
>> colon and the e2label changes
>>   Read previous values:
>> Target: WORK2-MDT
>> 
>>   Permanent disk data:
>> Target: WORK2-MDT
>> 
>> THEN...
>> If I do a tunefs.lustre --writeconf , it changes the Permanent disk data AND 
>> the e2label to use an equal '=' sign:
>> 
>>   Read previous values:
>> Target: WORK2-MDT
>> 
>>   Permanent disk data:
>> Target: WORK2=MDT
>> 
>> Next time I mount it changes it back to the hyphen '-' So.. What the 
>> heck???
> 
> Matches the code so far.
> 
>> Now the problem/annoyance is the fact that this changes links in /dev...
> 
> What links? This is internal lustre target name, your block device name 
> should still stay the same as it was.
> 
>> And an equal sign '=' is NOT a happy character for device names... I am 
>> also highly suspicious this affects some of the mappings in the MGS 
>> and MDT which is causing great grief...
> 
> Should not, really, since this is expected and we treat those changes as 
> meaningful to indicate the target status.
> 
> Bye,
>Oleg
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



Re: [lustre-discuss] Target names changing when mounting

2015-07-31 Thread Drokin, Oleg
Hello!

On Jul 31, 2015, at 2:08 PM, Andrus, Brian Contractor wrote:

> All,
>  
> I find this very confusing..
> I created a new lustre filesystem (WORK2) by creating and MDT and a single 
> OST.
> When I first create them, the names shown for Target have different values 
> from ‘Read precious values’ and ‘Permanent disk data’
> The difference is a hyphen ‘-‘ versus a colon ‘:’

This is internal bookkeeping to show target status:

/* svname of the form lustre:OST1234 means never registered */
rc = strlen(ldd->ldd_svname);
if (ldd->ldd_svname[rc - 8] == ':') {
ldd->ldd_svname[rc - 8] = '-';
ldd->ldd_flags |= LDD_F_VIRGIN;
} else if (ldd->ldd_svname[rc - 8] == '=') {
ldd->ldd_svname[rc - 8] = '-';
ldd->ldd_flags |= LDD_F_WRITECONF;
}
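The same suffix check can be lifted into a tiny standalone helper to make the three states explicit (the enum and function names here are illustrative, not Lustre's own):

```c
#include <assert.h>
#include <string.h>

enum ldd_state { LDD_REGISTERED, LDD_VIRGIN, LDD_WRITECONF, LDD_BAD };

/* Classify a target name like "WORK2:MDT0000" by the separator
 * character 8 positions from the end, mirroring the mount-time logic. */
static enum ldd_state svname_state(const char *svname)
{
	size_t len = strlen(svname);

	if (len < 8)
		return LDD_BAD;
	switch (svname[len - 8]) {
	case ':': return LDD_VIRGIN;     /* never registered */
	case '=': return LDD_WRITECONF;  /* writeconf requested */
	case '-': return LDD_REGISTERED; /* normal, registered target */
	default:  return LDD_BAD;
	}
}
```

So the ':' and '=' forms are transient markers that mount-time processing rewrites back to '-', which is exactly the label churn observed in the report.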

>  
>Read previous values:
> Target: WORK2-MDT
>  
>Permanent disk data:
> Target: WORK2:MDT
>  
> The ext2 label on the disk matches the Permanent disk data.
> BUT….
> When I mount it (mount -t lustre) the Permanent disk data changes to use a 
> colon and the e2label changes….
>Read previous values:
> Target: WORK2-MDT
>  
>Permanent disk data:
> Target: WORK2-MDT
>  
> THEN…
> If I do a tunefs.lustre --writeconf , it changes the Permanent disk data AND 
> the e2label to use an equal ‘=’ sign:
>  
>Read previous values:
> Target: WORK2-MDT
>  
>Permanent disk data:
> Target: WORK2=MDT
>  
> Next time I mount it changes it back to the hyphen ‘-‘
> So.. What the heck???

Matches the code so far.
 
> Now the problem/annoyance is the fact that this changes links in /dev…

What links? This is internal lustre target name, your block device
name should still stay the same as it was.

> And an equal sign ‘=’ is NOT a happy character for device names…
> I am also highly suspicious this affects some of the mappings in the MGS and 
> MDT which is causing great grief…

Should not, really, since this is expected and we treat those changes as 
meaningful to
indicate the target status.

Bye,
Oleg


Re: [lustre-discuss] Lustre 2.7.0 stable enough?

2015-05-27 Thread Drokin, Oleg
Hello!

Um, have you tried disabling SELinux yet?

Bye,
Oleg
On May 27, 2015, at 1:56 PM, Lukas Hejtmanek wrote:

> Hello,
> 
> trying to set up my test Lustre testbed, however, I didn't get very far.
> 
> mkfs.lustre --fsname=Lustre1 --mgs --mdt --index=0 /dev/sda4
> 
> mount /dev/sda4 /mnt/lustre-mgs -t lustre
> 
> and mount hangs.
> 
> INFO: task mount.lustre:12281 blocked for more than 120 seconds.
>  Not tainted 2.6.32-504.16.2.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> mount.lustre  D  0 12281  12280 0x0080
> 88186b937c70 0082  
> ffea 88186b937c78 005eeb2941f8 88186b937c48
> 0286 00010001a232 8818576d1068 88186b937fd8
> Call Trace:
> [] rwsem_down_failed_common+0x95/0x1d0
> [] rwsem_down_read_failed+0x26/0x30
> [] call_rwsem_down_read_failed+0x14/0x30
> [] ? down_read+0x24/0x30
> [] ldiskfs_quota_off+0x1bc/0x1f0 [ldiskfs]
> [] ? selinux_sb_copy_data+0x14a/0x1e0
> [] deactivate_locked_super+0x46/0x90
> [] vfs_kern_mount+0x10d/0x1b0
> [] do_kern_mount+0x52/0x130
> [] do_mount+0x2fb/0x930
> [] sys_mount+0x90/0xe0
> [] system_call_fastpath+0x16/0x1b
> 
> 
> Am I doing something terribly wrong?
> 
> -- 
> Lukáš Hejtmánek
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



Re: [lustre-discuss] Coexistence of Luster client and server on a single machine

2015-05-22 Thread Drokin, Oleg
Hello!

On May 22, 2015, at 7:33 AM, Lukas Hejtmanek wrote:
> On Fri, May 22, 2015 at 04:27:12AM +, Dilger, Andreas wrote:
>> There has been occasional work to fix problems with memory pressure
>> deadlocks from a client on the same node as the OSS, but this has never
>> been a primary goal of any users/developers.
>> 
>> The main problem with running a client on the same node as an OSS or MDS
>> is that this prevents Lustre recovery from working, because the client on
>> that node will never be available during recovery.
>> 
>> Depending on your requirements and your usage, this may be acceptable, but
>> it is definitely not a way that most people are using Lustre and you need
>> to be aware of that.  If you are willing to test this out I think you'd be
>> doing other users a service to know how this is working.
> 
> thank you for info. My idea was to try to replace my GPFS setup, that uses one
> disk array (several LUNs) and three front-ends. GPFS is exported via Samba.
> 
> So I think about Lustre OSS and MDS to run on those three front-ends, but to
> be able to export Lustre via Samba, I would need Lustre mounted on at least
> one of the front-ends for Samba backend.

I imagine the samba workload is a lot simpler than a typical client workload, which might do all sorts of crazy stuff.
There's a big advantage to having the samba re-exporter (or NFS) on the MDS, since you save quite a bunch on those synchronous metadata RPCs.
Chances of it working are pretty good, but that's still not really a supported configuration unless you get your vendor to sign off on that somehow.

Bye,
Oleg


Re: [lustre-discuss] Size difference between du and quota

2015-05-20 Thread Drokin, Oleg
Hello!

 I imagine this could be related to the number of blocks; also, du would not report any unlinked but still-open files that still take disk space, as they are no longer in the namespace, but quota still sees them.
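The block-versus-byte distinction is visible with plain stat(2): st_size is the apparent byte count, while st_blocks (in 512-byte units) is what block-based accounting charges. A small sketch of mine, not anything Lustre-specific - sparse files show the gap in one direction, while block rounding and unlinked-but-open files push it the other way:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Apparent size minus allocated size, in bytes: positive for sparse
 * files, negative where block rounding exceeds the byte count. */
static long long sparse_slack(const char *path)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return 0;
	return (long long)st.st_size - (long long)st.st_blocks * 512;
}
```

Comparing `du -sk` against `du -sk --apparent-size` on the directory in question would show the same effect without any code.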

Bye,
Oleg
On May 20, 2015, at 4:50 AM, Phill Harvey-Smith wrote:

> Hi all,
> 
> One of my users is reporting a massive size difference between the figures 
> reported by du and quota.
> 
> doing a du -hs on his directory reports :
> du -hs .
> 529G.
> 
> doing a lfs quota -u username /storage reports
> Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
> /storage 621775192  64000 64001   -  601284  100 110  
>  -
> 
> Though this user does have a lot of files :
> 
> find . -type f | wc -l
> 581519
> 
> So I suspect that it is the typical thing that quota is reporting used blocks 
> whilst du is reporting used bytes, which can of course be wildly different 
> due to filesystem overhead and wasted unused space at the end of files where 
> a block is allocated but only partially used.
> 
> Is this likely to be the case ?
> 
> I'm also not entirely sure what versions of lustre the client machines and 
> MDS / OSS servers are running, as I didn't initially set the system up.
> 
> Cheers.
> 
> Phill.
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



Re: [lustre-discuss] jobstats, SLURM_JOB_ID, array jobs and pain.

2015-05-08 Thread Drokin, Oleg
Hello!

On Apr 30, 2015, at 6:02 PM, Scott Nolin wrote:

> Has anyone been working with the lustre jobstats feature and SLURM? We have 
> been, and it's OK. But now that I'm working on systems that run a lot of 
> array jobs and a fairly recent slurm version we found some ugly stuff.
> 
> Array jobs report their do SLURM_JOBID as a variable, and it's unique for 
> every job. But they use other IDs too that appear only for array jobs.
> 
> http://slurm.schedmd.com/job_array.html
> 
> However, that unique SLURM_JOBID as far as I can tell is only truly exposed 
> in command line tools via 'scontrol' - which is only valid while the job is 
> running. If you want to look at older jobs with sacct for example, things are 
> troublesome.
> 
> Here's what my coworker and I have figured out:
> 
> - You submit a (non-array) job that gets jobid 100.
> - The next job gets jobid 101.
> - Then submit a 10 task array job. That gets jobid 102. The sub tasks get 9 
> more job ids. If nothing else is happening with the system, that means you 
> use jobid 102 to 112.
> 
> If things were that orderly, you could cope with using SLURM_JOB_ID in lustre 
> jobstats pretty easily. Use sacct and you see job 102_2 - you know that is 
> jobid 103 in lustre jobstats.
> 
> But, if other jobs get submitted during set up (as of course they do), they 
> can take jobid 103. So, you've got problems.
> 
> I think we may try to set a magic variable in the slurm prolog and use that 
> for the jobstats_var, but who knows.

There's another method planned for doing the jobid stuff, now mainly featured in the kernel staging tree, but it will make its way to the lustre tree too.

It's to just write your jobid directly into lustre from your prologue script (and clear it from the epilogue).

That way you can set it to whatever you like without ugly messing with shell variables (and equally ugly parsing of those variables from the kernel!).

For some reason I cannot find the corresponding master patch, though I have a passing memory of writing it, so this needs to be addressed separately.

Bye,
Oleg


Re: [lustre-discuss] What happens if my stripe count is set to more than my number of stripes

2015-04-20 Thread Drokin, Oleg
Hello!

   The allocator is not really playing its role just yet; this is just code that stores your desired striping in the directory for later use.
   But if you call into the allocator (at file creation time) with these parameters, it would just assume you want all your current OSTs and that's it.

Bye,
Oleg
On Apr 20, 2015, at 1:43 PM, Michael Kluge wrote:

> Hi Oleg,
> 
> I tried it and looks like it actually stores the 128 stripe size (at least 
> for dirs). lfs getstripe tells me that my dir is now striped over 128 OSTs (I 
> have 48).
> 
> [/scratch/mkluge] lctl dl | grep osc | wc -l
> 48
> [/scratch/mkluge] mkdir  p
> [/scratch/mkluge] lfs setstripe -c 128 p
> [/scratch/mkluge] lfs getstripe p
> p
> stripe_count:   128 stripe_size:1048576 stripe_offset:  -1
> 
> 
> Regards, Michael
> 
> Am 20.04.2015 um 18:44 schrieb Drokin, Oleg:
>> Hello!
>> 
>> Current allocator behaviour is such that when you specify more
>> stripes than you have OSTs, it'll treat it the same as if you set
>> stripe count to -1 (that is - the maximum possible stripes).
>> 
>> Bye, Oleg On Apr 20, 2015, at 4:47 AM, 
>>  wrote:
>> 
>>> Hi,
>>> 
>>> I have a doubt regarding Lustre file system. If I have a file of
>>> size 64 GB and I set stripe size to 1GB, my number of stripes
>>> become 64. But if I set my stripe count as 128, what does the
>>> Lustre do in that case?
>>> 
>>> Thanks and Regards, Prakrati
>>> ___ lustre-discuss
>>> mailing list lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> 
>> ___ lustre-discuss
>> mailing list lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> 
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



Re: [lustre-discuss] What happens if my stripe count is set to more than my number of stripes

2015-04-20 Thread Drokin, Oleg
Oh, I see.
Well, nothing will be written on the other OSTs since this file does not have data there, but objects would still be allocated and sit empty. If you decide to write more data to that file, the data will go onto the additional OSTs. If no data ever goes there, the objects will just sit zero-sized.

On Apr 20, 2015, at 12:53 PM, 
 wrote:

> Hi,
> 
> I have 165 OSTs. Right now the problem is that because of the file size and 
> stripe size set as 64 GB and 1GB respectively and 64 ranks, the stripes to be 
> written by each rank come out to be 1GB each.
> Now if I set my stripe count to 128, rank1 will write stripe 1 on OST 1, rank 
> 2 will write stripe 2 on OST 2 and so on. After 64 stripes, all data is 
> written, then what will be written on the rest of the remaining 64 OSTs.
> 
> 
> -----Original Message-
> From: Drokin, Oleg [mailto:oleg.dro...@intel.com] 
> Sent: Monday, April 20, 2015 10:14 PM
> To: Agrawal, Prakrati PTIN-PTT/ICOE
> Cc: 
> Subject: Re: [lustre-discuss] What happens if my stripe count is set to more 
> than my number of stripes
> 
> Hello!
> 
>   Current allocator behaviour is such that when you specify more stripes than 
> you have OSTs, it'll treat it the same as if you set stripe count to -1 (that 
> is - the maximum possible stripes).
> 
> Bye,
>Oleg
> On Apr 20, 2015, at 4:47 AM,   
>  wrote:
> 
>> Hi,
>> 
>> I have a doubt regarding Lustre file system. If I have a file of size 64 GB 
>> and I set stripe size to 1GB, my number of stripes become 64. But if I set 
>> my stripe count as 128, what does the Lustre do in that case?
>> 
>> Thanks and Regards,
>> Prakrati
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



Re: [lustre-discuss] What happens if my stripe count is set to more than my number of stripes

2015-04-20 Thread Drokin, Oleg
Hello!

   Current allocator behaviour is such that when you specify more stripes than you have OSTs, it'll treat it the same as if you set the stripe count to -1 (that is, the maximum possible stripes).
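In effect the creation-time behaviour amounts to a clamp along these lines (an illustrative sketch, not the actual allocator code):

```c
#include <assert.h>

/* -1 means "stripe over everything"; anything larger than the
 * number of available OSTs degenerates to the same thing. */
static int effective_stripe_count(int requested, int num_osts)
{
	if (requested == -1 || requested > num_osts)
		return num_osts;
	return requested;
}
```

So a request for 128 stripes on a 48-OST filesystem creates a 48-stripe file, even though getstripe on a directory may still report the stored value of 128.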

Bye,
Oleg
On Apr 20, 2015, at 4:47 AM, 
  wrote:

> Hi,
>  
> I have a doubt regarding Lustre file system. If I have a file of size 64 GB 
> and I set stripe size to 1GB, my number of stripes become 64. But if I set my 
> stripe count as 128, what does the Lustre do in that case?
>  
> Thanks and Regards,
> Prakrati
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



Re: [Lustre-discuss] RPC packing

2014-10-07 Thread Drokin, Oleg
btw, it should be noted that by "the same file" we really mean "the same object".
If you have a file that's striped two-way with a stripe size of 1M, then two 4k writes, one at offset 0 and another at offset 1M, would not be merged into a single RPC because they would be sent to different servers, but two writes at offsets 0 and 512K would be merged into the same RPC.
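Which object (and hence which server) a given offset hits follows from plain RAID-0-style striping arithmetic; a sketch of the mapping just described:

```c
#include <assert.h>
#include <stdint.h>

/* Which stripe (and hence which OST object) does a file offset hit?
 * Offsets within one stripe_size window map to one object; windows
 * rotate round-robin across the stripe_count objects. */
static int stripe_index(uint64_t offset, uint64_t stripe_size,
			int stripe_count)
{
	return (int)((offset / stripe_size) % (uint64_t)stripe_count);
}
```

Two writes merge into one RPC only when they land on the same stripe index; with a 1M stripe size and two stripes, offsets 0 and 1M map to different objects while 0 and 512K share one.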

On Oct 7, 2014, at 8:33 PM, Drokin, Oleg wrote:

> Hello!
> 
> On Oct 7, 2014, at 11:03 AM, teng wang wrote:
>> I know the RPC will pack the
>> client's data into 4MB buffer and transfer. The question is does the
>> client pack the writes even the writes on the client belong
>> to noncontiguous extents of the same file? Say, 1 client is issuing
>> two writes on two noncontiguous extents of the same file, the first write
>> from offset 0 to 16KB-1, the second write from 32KB to 
>> 48KB-1,  apparently these two writes are noncontiguous, so
>> will the two writes still be packed into the 4MB buffer?
> 
> Yes, noncontiguous writes of the same type (i.e. you cannot mix diret io
> and normal directio) would be packed in the same RPC
> 
> Bye,
>Oleg
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss



Re: [Lustre-discuss] RPC packing

2014-10-07 Thread Drokin, Oleg
Hello!

On Oct 7, 2014, at 11:03 AM, teng wang wrote:
>  I know the RPC will pack the
> client's data into 4MB buffer and transfer. The question is does the
> client pack the writes even the writes on the client belong
> to noncontiguous extents of the same file? Say, 1 client is issuing
> two writes on two noncontiguous extents of the same file, the first write
> from offset 0 to 16KB-1, the second write from 32KB to 
> 48KB-1,  apparently these two writes are noncontiguous, so
> will the two writes still be packed into the 4MB buffer?

Yes, noncontiguous writes of the same type (i.e. you cannot mix direct I/O and normal buffered I/O) would be packed in the same RPC.

Bye,
Oleg


Re: [Lustre-discuss] Lustre 2.5.2 Client Errors

2014-08-26 Thread Drokin, Oleg
Hello!

   I guess I don't have any smart ideas about what's going on.
   Is the number after "LustreError" the same most of the time? That's the pid, so you can check what that process is doing.
   Also I imagine you can add a dump_stack() call to that condition in the source so that a backtrace is printed and we know what the call path is.

Bye,
Oleg
On Aug 26, 2014, at 12:28 AM, Murshid Azman wrote:

> Hello Oleg,
> 
> Thanks for your response.
> 
> The kernel I'm using is 2.6.32-279.19.1.el6.x86_64
> 
> This client mounts NFS shares from another server. They're not mounted 
> directly onto Lustre filesystem, but rather onto a tmpfs filesystem residing 
> in the memory.
> 
> [root@node01 ~]# df -h /
> FilesystemSize  Used Avail Use% Mounted on
> tmpfs  50M  484K   50M   1% /
> 
> I've removed the NFS mounts but can still see the same error. This client 
> does not share NFS to others.
> 
> Thank you.
> 
> Murshid Azman.
> 
> 
> 
> On Fri, Aug 22, 2014 at 9:12 PM, Drokin, Oleg  wrote:
> Hello!
> 
> On Aug 22, 2014, at 3:28 AM, Murshid Azman wrote:
> > We're trying to run a cluster image on Lustre filesystem version 2.5.2 and 
> > repeatedly seeing the following message. Haven't seen anything bizarre on 
> > this machine other than this:
> >
> > 2014-08-22T13:52:01+07:00 node01 kernel: LustreError: 
> > 4271:0:(namei.c:530:ll_lookup_
> > it()) Tell Peter, lookup on mtpt, it open
> > 2014-08-22T13:52:01+07:00 node01 kernel: LustreError: 
> > 4271:0:(namei.c:530:ll_lookup_it()) Skipped 128 previous similar messages
> >
> > This doesn't happen to our desktop Lustre clients.
> >
> > I'm wondering if anyone has any idea what this means.
> 
> This is one of those "impossible condition has been met" messages. 
> Essentially it means that we got a lookup call for a mountpoint,
> which should not happen because it's a mountpoint, so it's always valid and 
> pinned in memory at the very least.
> 
> What kernel do you use? Anything else interesting about this client, e.g. 
> NFS re-export?
> 
> Bye,
> Oleg
> 



Re: [Lustre-discuss] Lustre 2.5.2 Client Errors

2014-08-22 Thread Drokin, Oleg
Hello!

On Aug 22, 2014, at 3:28 AM, Murshid Azman wrote:
> We're trying to run a cluster image on Lustre filesystem version 2.5.2 and 
> repeatedly seeing the following message. Haven't seen anything bizarre on 
> this machine other than this:
> 
> 2014-08-22T13:52:01+07:00 node01 kernel: LustreError: 
> 4271:0:(namei.c:530:ll_lookup_
> it()) Tell Peter, lookup on mtpt, it open
> 2014-08-22T13:52:01+07:00 node01 kernel: LustreError: 
> 4271:0:(namei.c:530:ll_lookup_it()) Skipped 128 previous similar messages
> 
> This doesn't happen to our desktop Lustre clients.
> 
> I'm wondering if anyone has any idea what this means.

This is one of those "impossible condition has been met" messages. Essentially 
it means that we got a lookup call for a mountpoint,
which should not happen because it's a mountpoint, so it's always valid and 
pinned in memory at the very least.

What kernel do you use? Anything else interesting about this client, e.g. NFS 
re-export?

Bye,
Oleg


Re: [Lustre-discuss] How do I downgrade from 2.5.0 to 2.1.6?

2014-01-08 Thread Drokin, Oleg

On Jan 8, 2014, at 10:15 AM, Oliver Mangold wrote:
>> Is this some sort of kerberos enabled deployment?
> No nothing like that. Even group upcall is disabled.

There's all this sptlrpc chatter coming out from somewhere that should not be 
there.
I suspect there might be a bad entry in your mgs config somewhere that throws 
things off.
You might need to mount mgs fs as ldiskfs in order to remove it and then it's 
going to be in CONFIGS dir.

Hm, in fact I don't think we even test downgrading a 2.5-formatted fs to 2.1, so 
I am not sure that will even work.
My 2.5 configs certainly don't seem to have any sptlrpc entries according 
to my logs, so I am not sure how you got those.

Also rereading your email, can you please elaborate more on the 2.5 stability 
issues?

Bye,
Oleg


Re: [Lustre-discuss] How do I downgrade from 2.5.0 to 2.1.6?

2014-01-08 Thread Drokin, Oleg
Hello!

On Jan 8, 2014, at 9:58 AM, Oliver Mangold wrote:
> Lustre: MGS MGS started
> LustreError: 5476:0:(mgc_request.c:76:mgc_name2resid()) missing name: 
> -sptlrpc
> Lustre: 5528:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection 
> LustreError: 5476:0:(mgc_request.c:286:config_log_add()) can't create 
> sptlrpc log: -sptlrpc

Is this some sort of kerberos enabled deployment?

Bye,
Oleg


Re: [Lustre-discuss] Poor Direct-IO Performance with Lustre-2.1.5

2013-06-22 Thread Drokin, Oleg
Hello!

On Jun 22, 2013, at 7:50 PM, Chan Ching Yu, Patrick wrote:
> Record Size 1024 KB

This is your problem.
DirectIO is synchronous in nature, so only as much data as fits into a single 
syscall will be in flight at any given moment.
In this case you only have 1 RPC in flight.
The key to adequately performing DirectIO is to have read/write syscalls with a 
lot of data. 8*num_stripes is the very minimum, I would think.
Even then you likely won't get cached read/write performance unless you run a 
very underpowered client that gets choked on memory copies.

Bye,
Oleg


Re: [Lustre-discuss] Slow Copy (Small Files) 1 RPC In Flight?

2013-06-21 Thread Drokin, Oleg
Hello!

On Jun 21, 2013, at 9:07 PM, Andrew Mast wrote:
> Very clear, thank you for the explanation, I misunderstood readahead. Yes, the 
> 1 GB and 10 GB file transfer tests were on par with NFS.
> 
> Our use case is typically compiling and find/grep through (30gb) amounts of 
> source code so it seems we are stuck with small files.

Generally this sort of workload is pretty bad for network filesystems due to 
large amounts of synchronous RPC traffic that you cannot easily predict.
You can get certain speedup by doing several copies in parallel (e.g. one copy 
per top level subtree or whatever) as then you'll at least get concurrent RPCs.

I know some people try to combat this by running a block device on top of a 
network filesystem and then running some sort of a local fs (say, ext4)
on top of that block device (loopback based). That allows readahead and 
caching to work much better, and so on. But this is not without limitations 
either: only a single node can have this filesystem-file mounted at any one time.

If you do not have any significant writes to this fileset (if any at all) but a 
lot of consecutive reads/greps, you might want to just store the entire workset 
as a tar file that you read and unpack locally on a client (should be pretty 
fast) to, say, a ramfs (needs tons of RAM, of course) and then do the searches. 
Also not ideal, but at least the network filesystem would then be doing what 
it's best suited for - large transfers.

If you can come up with some other way of storing a large number of smaller 
files in a single large combined file that you then access with special tools 
(like, I dunno, fuse-tarfs or whatever - assuming those don't read unneeded 
data, but just skip over it, or something more specific to your case) - this 
might be a winner too.

Bye,
Oleg


Re: [Lustre-discuss] Slow Copy (Small Files) 1 RPC In Flight?

2013-06-21 Thread Drokin, Oleg
Hello!

On Jun 21, 2013, at 5:42 PM, Andrew Mast wrote:

> Hello, I am new to Lustre and wanted to run a simple small-file copy test 
> between 2 virtual machines, from the MDT/OST server to the client's local disk.

> I realize small file performance is never fast, but this seems particularly 
> slow considering the data is all buffered in memory with little to no disk 
> activity.
> 
> RPC stats on the client shows only 1 RPC in flight at a time. max inflight is 
> set to 64. Is that expected behavior for a copy?

Well, it seems you are reading from Lustre. Small files, too.
So Lustre reads a single file at a time (I assume you copy with something like 
cp, single-threaded); readahead does not come into play because the file size
is smaller than 1 RPC.
So before we are done with a single file, we cannot guess there'd be a request 
for the next file. That's why you have only one RPC in flight.

Also Lustre metadata protocol is somewhat more heavy than NFS, which would 
explain why it's slower than NFS.
Situation should improve once you start trying bigger files.

Bye,
Oleg


Re: [Lustre-discuss] OST corruption?

2013-04-26 Thread Drokin, Oleg
Hello!

   In reality there's no "used" or "reserved" in the statfs(2) output. All you 
get is total, free and avail.
   df and the likes typically calculate used as total - free. Reserved would be 
free - avail.

Bye,
Oleg
On Apr 26, 2013, at 10:56 AM, Andrus, Brian Contractor wrote:

> FWIW, your total should be used+free+reserved
> There is normally a % set aside for root only 
> This is changeable too.
> tune2fs -m 0 
> 
> 
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
> 
> 
>> -Original Message-
>> From: lustre-discuss-boun...@lists.lustre.org [mailto:lustre-discuss-
>> boun...@lists.lustre.org] On Behalf Of Verduzco, Benjamin P.
>> Sent: Monday, April 22, 2013 9:49 AM
>> To: Mohr Jr, Richard Frank (Rick Mohr)
>> Cc: Lustre-discuss@lists.lustre.org
>> Subject: Re: [Lustre-discuss] OST corruption?
>> 
>> That's correct, the used and free don't equal the total. While there
>> is just about a 5% difference, I know the file system was reporting a more
>> reasonable 158 T used before the reboot.
>> 
>> Despite the file system size oddness, it seems to be working, so we'll keep 
>> it
>> in service for now, but we're moving up our plans to replace our MDS and
>> have consistent versions of both Lustre and the OS across all hosts.
>> 
>> 
>> Thanks,
>> Ben
>> -Original Message-
>> From: Mohr Jr, Richard Frank (Rick Mohr) [mailto:rm...@utk.edu]
>> Sent: Wednesday, April 17, 2013 5:07 PM
>> To: Verduzco, Benjamin P.
>> Cc: 
>> Subject: Re: [Lustre-discuss] OST corruption?
>> 
>> 
>> On Apr 3, 2013, at 1:52 PM, "Verduzco, Benjamin P."
>>  wrote:
>> 
>>> * When I brought the system up, it reported the used space
>> incorrectly (df -h shows 166T total, 150T used and 7.1 T free)
>> 
>> When you say that the used space is being reported incorrectly, do you
>> mean that the sum of the used and free space does not match the total
>> space?  Or do you have a reason to believe that 150 TB is not actually
>> used?  If it is the former, then that could be explained by the fact
>> that ext file systems by default reserve 5% of the disk space for the
>> root user.
>> 
>> --
>> Rick Mohr
>> Senior HPC System Administrator
>> National Institute for Computational Sciences
>> http://www.nics.tennessee.edu
>> 
>> 



Re: [Lustre-discuss] Error on a specifc lustre directory

2013-04-16 Thread Drokin, Oleg
Hello!

On Apr 9, 2013, at 4:02 AM, Alfonso Pardo wrote:
 
> But if I try to access or delete I only get:
>  
> ls: cannot access hubble_psf_660__691_555_.psf: Interrupted system call
>  
> or if I try to delete:
>  
> rm: cannot access hubble_psf_660__691_555_.psf: Interrupted system call

Does it fail right away, or after some waiting?
Anything interesting in dmesg after that?
If not, perhaps try to capture a Lustre log of the operation at an extended 
log level and see if there are any clues there?

Bye,
Oleg


Re: [Lustre-discuss] How to debug a client's eviction.

2013-02-06 Thread Drokin, Oleg
Hello!

On Feb 4, 2013, at 12:06 PM, Theodoros Stylianos Kondylis wrote:
> I try to debug this situation so I did the following ::
> 
> >> echo 1 > /proc/sys/lustre/dump_on_eviction
> >> echo 1 > /proc/sys/lustre/dump_on_timeout
> 
> And in the /proc/sys/lnet/debug file there is ::
> 
> ioctl neterror warning error emerg ha config console

rpctrace and dlmtrace seem to me the two important ones to see what was sent 
and received where.
After you check those and narrow the problem down to something more specific, 
you might want to enable some more debug and retry.

Bye,
Oleg