Re: [Gluster-devel] NetBSD regressions, memory corruption
looks like the iobref (and the iobuf) was allocated in protocol/server..

(gdb) x/16x (ie->ie_iobref->iobrefs - 8)
0xbb11a438:  0xbb18ba80  0x0001  0x0068  0x0040
0xbb11a448:  0xbb1e2018  0xcafebabe  0x  0x
0xbb11a458:  0x0003  0x0003  0x0008  0x0003
0xbb11a468:  0x000c  0x0003  0x000e  0x0003

8 bytes before the magic header (0xcafebabe) lives the xlator (this)
that invoked GF_MALLOC. Here it's:

(gdb) p *(xlator_t *)0xbb1e2018
$9 = {name = 0xbb1dbb08 "patchy-server",
      type = 0xbb1dbb38 "protocol/server", next = 0xbb1e1018, prev = 0x0,
      parents = 0x0, children = 0xbb1dbbc8, options = 0xbb18a028,
      dlhandle = 0xb9b7d000, fops = 0xb9adf0e0 <fops>,
      cbks = 0xb9adc8cc <cbks>, dumpops = 0xb9ade460 <dumpops>,
      volume_options = {next = 0xbb1dbb68, prev = 0xbb1dbbf8},
      fini = 0xb9ab539d <fini>, init = 0xb9ab48a5 <init>,
      reconfigure = 0xb9ab418c <reconfigure>,
      mem_acct_init = 0xb9ab3cb1 <mem_acct_init>,
      notify = 0xb9ab53a3 <notify>, loglevel = GF_LOG_NONE,
      latencies = {{min = 0, max = 0, total = 0, std = 0, mean = 0,
          count = 0} <repeats 50 times>}, history = 0x0,
      ctx = 0xbb109000, graph = 0xbb1c30f8, itable = 0x0,
      init_succeeded = 1 '\001', private = 0xbb1e3018,
      mem_acct = {num_types = 144, rec = 0xbb1c6000}, winds = 0,
      switched = 0 '\000', local_pool = 0x0, is_autoloaded = _gf_false}

looking into it more. if the above strikes a bell to someone, let us know.

-venky

On Tue, Mar 24, 2015 at 11:28 PM, Niels de Vos <nde...@redhat.com> wrote:
> On Tue, Mar 24, 2015 at 05:18:44PM +, Emmanuel Dreyfus wrote:
> > Hi
> >
> > The merge of http://review.gluster.org/9953/ removed a few crashes
> > from NetBSD regression tests, but the thing remains utterly broken
> > since the merge of http://review.gluster.org/9708/ though I cannot
> > tell if I have bugs leftover from this commit or if I face new
> > problems.
> >
> > Here are the known problems so far:
>
> ...snip! I'll only give some info to your 2nd point.
>
> > 2) I still experience memory corruption, which usually crashes
> > glusterfsd because some pointer was replaced by value 0x3.
> > This strikes on iobref most of the time, but it can happen elsewhere.
> > I would be glad if someone could help here.
> >
> > On nbslave70:/autobuild I added code to check for iobref/iobuf sanity
> > at random places (by calling iobref_sanity()). I do this in
> > synctask_wrap and in STACK_WIND/UNWIND, but I have not been able to
> > spot the source of the problem yet.
> >
> > The weird thing is that memory seems to always be overwritten by the
> > same values, and the magic 0xcafebabe number before the buffer is
> > preserved. Here is an example, where iobref->iobrefs = 0xbb11a458:
> >
> > 0xbb11a44c:  0xcafebabe  0x  0x  0x0003
> > 0xbb11a45c:  0x0003  0x0008  0x0003  0x000c
> > 0xbb11a46c:  0x0003  0x000e  0x0003  0x0010
> > 0xbb11a47c:  0x0003  0x0009  0x0003  0x000d
> > 0xbb11a48c:  0x0003  0x0015  0x0003  0x0016
> > 0xbb11a49c:  0x0003  0x0032  0x0034  0xbb1e2018
> > 0xbb11a4ac:  0xcafebabe  0x  0x  0xbb11a5d8
>
> Recently I was looking into something that involved some more
> understanding of GF_MALLOC(). I did not really continue with it because
> other things got a higher priority. But, maybe this layout helps you a
> little:
>
>       :                      :
>       +----------------------+
>       | GF_MEM_TRAILER_MAGIC |
>       +----------------------+
>       |                      |
>       |         ...          |
>       |                      |
>       +----------------------+
>       |       8 bytes        |
>       +----------------------+
>       | GF_MEM_HEADER_MAGIC  |
>       +----------------------+
>       |      *xlator_t       |
>       +----------------------+
>       |         size         |
>       +----------------------+
>       |         type         |
>       +----------------------+
>       :                      :
>
>     #define GF_MEM_HEADER_MAGIC  0xCAFEBABE
>     #define GF_MEM_TRAILER_MAGIC 0xBAADF00D
>
> Because there is no 0xbaadf00d in your memory dump, I would assume that
> the memory has just been allocated, and the 0xcafebabe at 0xbb11a4ac is
> a left over from a previous allocation.
>
> You could try to run a test with more strict memory enforcing. All the
> GF_ASSERT() calls will actually call abort() in that case, and it may
> make things a little easier to debug. You would pass --enable-debug to
> the configure commandline:
>
>     $ ./configure --enable-debug
>
> I hope that we will be able to setup scheduled automated regression
> tests with --enable-debug build binaries. It may be helpful to catch
> unintended NULL usage a little earlier.
>
> HTH,
> Niels

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Responsibilities and expectations of our maintainers
Hi all,

With many new features getting merged for glusterfs-3.7, I would like to
get your attention to some of the more 'boring' bits that come with
proposing and implementing a feature.

1. Who is going to maintain the new features?

Features that are pretty self-contained, like adding a new xlator,
daemon or the like, should get added to our MAINTAINERS file. Only very
few features have provided patches for this; the others would still need
to do so, or we can collect a bunch of them and add them all at once
(might be easier to prevent conflicts).

Some features only add functionality to existing components. If the
current component maintainer asks for your support in maintaining your
added changes, please be very responsive.

2. Maintainers should be active in responding to users

As a maintainer of a component, you (or the group of maintainers) have
the (end) responsibility to respond to questions and problems reported
by users. This does not mean that you are required to respond to all
questions and problems yourself, but try to track responses by others
and answer outstanding questions. There are several ways our community
users can ask questions and report problems:

- the gluster-us...@gluster.org and gluster-devel@gluster.org lists
- #gluster and #gluster-dev on Freenode IRC
- https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS

Maintainers are expected to keep an eye on relevant topics, questions
and bugs through these channels.

3. What about reported bugs, is there Bug Triaging in place?

Indeed, at the moment we have a weekly Community Bug Triage meeting.
This meeting is intended as a fall-back for bugs that have not been
triaged by community members (users, developers, managers, ...). It
seems that most new bugs get triaged during the meeting, but this is an
activity that can happen completely independently from the meeting.
Maintainers and developers are strongly encouraged to triage bugs that
are reported against the components that they work on.
The following links contain the workflow for triaging, and Bugzilla
queries that show the untriaged bugs:

- http://www.gluster.org/community/documentation/index.php/Bug_triage
- https://public.pad.fsfe.org/p/gluster-bug-triage
- http://www.gluster.org/pipermail/gluster-devel/2015-March/044114.html

Reminder: anyone is welcome to join the Bug Triage meeting.

4. Maintainers should keep an eye on open bugs affecting their component

When a bug has been triaged, someone would need to work on getting the
problem fixed. Bugs move their status like this:

    NEW -> NEW+Triaged -> ASSIGNED -> POST -> MODIFIED -> ...

What happens after MODIFIED is for the release maintainers and QA (also
community) teams. Maintainers would mostly focus on the first steps of
the process. To assist with this, I have created a Bugzilla report where
you can click on the component, or the component/status, and get a list
of all bugs (without the FutureFeature keyword):

- http://red.ht/1BKWsRq

There is still an ongoing action item to find someone who has a good
overview of how busy developers (mostly Red Hat) are and which community
reported bugs should get fixed with priority. Maintainers are not
expected to be managers that can force other developers to work on
certain issues, but in most cases a friendly request does the trick too
;-)

5. Maintainers are expected to be responsive on patch reviews

When a developer posts a patch to Gerrit, they are eager to hear about
any changes they would need to make. Responding fast with a review also
helps in getting the posted change updated quicker. Developers tend to
switch between many tasks, and having the change fresh in their memory
helps with their responsiveness too. Our Guidelines for Maintainers list
a few ways to receive email notifications and to display a list of
changes in Gerrit:

- http://www.gluster.org/community/documentation/index.php/Guidelines_For_Maintainers

6. Maintainers should try to attend IRC meetings

There is the weekly Gluster Community meeting on IRC, scheduled for
every Wednesday. Maintainers and active developers are expected to
attend these meetings whenever they can. More information about these
meetings can be found here:

- https://public.pad.fsfe.org/p/gluster-community-meetings

Note that these are mostly the expectations I have of maintainers, and I
try hard to fulfill them myself too. Let me know if you have any
questions, objections, additions or ideas about this topic. When you
reply, do so by inlining or bottom-posting your comments, and feel free
to trim unrelated parts of this email in your response.

Thanks,
Niels
Re: [Gluster-devel] Responsibilities and expectations of our maintainers
On Wed, Mar 25, 2015 at 02:04:10PM +0100, Niels de Vos wrote:
> 1. Who is going to maintain the new features?
> 2. Maintainers should be active in responding to users
> 3. What about reported bugs, is there Bug Triaging in place?
> 4. Maintainers should keep an eye on open bugs affecting their component
> 5. Maintainers are expected to be responsive on patch reviews
> 6. Maintainers should try to attend IRC meetings

May I suggest a personal item:

7. Check your feature does not break NetBSD regression

NetBSD regression does not vote but is reported in Gerrit. Please seek
help resolving breakage before merging.

-- 
Emmanuel Dreyfus
m...@netbsd.org
Re: [Gluster-devel] [Gluster-users] REMINDER: Weekly Gluster Community meeting today at 12:00 UTC
On 03/25/2015 04:23 PM, Vijay Bellur wrote:
> Hi all,
>
> In about 70 minutes from now we will have the regular weekly Gluster
> Community meeting.

Meeting minutes from today can be found at [1].

Thanks,
Vijay

[1] http://meetbot.fedoraproject.org/gluster-meeting/2015-03-25/gluster-meeting.2015-03-25-12.00.html
Re: [Gluster-devel] Update on 3.7 feature freeze branching
On 03/19/2015 11:52 PM, Vijay Bellur wrote:
> Hi All,
>
> The last few days have been busy for us with feature patches getting
> reviewed, rebased and merged. As of now we have been able to review and
> merge most features that we wanted to have in 3.7. Parts of the bitrot
> detection and tiering features have already been merged on mainline and
> we are awaiting a few more patches from these areas to declare
> completion with respect to features. We seem to be well set to feature
> freeze 3.7 tomorrow.

The missing pieces of bitrot detection and tiering are in now. So we are
good to declare feature freeze now. Kudos to all of us in reaching here
:).

> I will delay branching of 3.7 by a week or so after tomorrow to
> primarily include several bug fixes and minor improvements in areas
> like logging that have not received enough attention from us in the
> recent past. I will send out an update as we have a release-3.7 branch.

Branching for 3.7 will happen after we merge coverity fixes, logging
improvements and critical bug fixes.

Maintainers - can you please provide an ACK for the sanity of the
quality of your component(s) in 3.7? Basically we should aim to avoid
functional regressions and critical issues before we branch. Once we
have both of these in place, I will branch release-3.7.

Thanks,
Vijay
[Gluster-devel] New Defects reported by Coverity Scan for gluster/glusterfs
Hi,

Please find the latest report on new defect(s) introduced to
gluster/glusterfs found with Coverity Scan.

33 new defect(s) introduced to gluster/glusterfs found with Coverity
Scan. 9 defect(s), reported by Coverity Scan earlier, were marked fixed
in the recent build analyzed by Coverity Scan.

New defect(s) Reported-by: Coverity Scan
Showing 20 of 33 defect(s)

** CID 1291734: Error handling issues (CHECKED_RETURN)
/xlators/cluster/dht/src/tier.c: 451 in tier_build_migration_qfile()

    445     {
    446             gfdb_time_t current_time;
    447             _gfdb_brick_dict_info_t gfdb_brick_dict_info;
    448             gfdb_time_t time_in_past;
    449             int ret = -1;
    450
    CID 1291734: Error handling issues (CHECKED_RETURN)
    Calling remove((is_promotion ? "/var/run/gluster/promotequeryfile" :
    "/var/run/gluster/demotequeryfile")) without checking return value.
    This library function may fail and return an error code.
    451             remove (GET_QFILE_PATH (is_promotion));
    452             time_in_past.tv_sec = args->freq_time;
    453             time_in_past.tv_usec = 0;
    454             if (gettimeofday (&current_time, NULL) == -1) {
    455                     gf_log (args->this->name, GF_LOG_ERROR,
    456                             "Failed to get current time\n");

To view the defects in Coverity Scan visit,
https://scan.coverity.com/projects/987?tab=overview

To manage Coverity Scan email notifications for gluster-devel@gluster.org, click
https://scan.coverity.com/subscriptions/edit?email=gluster-devel%40gluster.org&token=7dffab14bc5a7180e75b0d047539f148 .
Re: [Gluster-devel] NetBSD regressions, memory corruption
On Wed, Mar 25, 2015 at 10:32:08PM +0530, Venky Shankar wrote:
> Could I run some tests on nbslave70

Sure, I stopped doing tests since I assumed you were already doing some.

> (I plan to disable some translators). Just running AFR test cases
> should trigger the segfault, correct?

Yes, if you are lucky you can pass one or two tests, but it crashes
quite reliably.

I already tried to disable some translators: the recently introduced
upcall for instance, and of course changelog, which is the component
that was modified when things started to break. The problem is that the
absence of the translator seems to break the tests because of the lack
of functionality.

-- 
Emmanuel Dreyfus
m...@netbsd.org
Re: [Gluster-devel] Update on 3.7 feature freeze branching
On 03/25/2015 10:43 PM, Vijay Bellur wrote:
> On 03/19/2015 11:52 PM, Vijay Bellur wrote:
> > Hi All,
> >
> > The last few days have been busy for us with feature patches getting
> > reviewed, rebased and merged. As of now we have been able to review
> > and merge most features that we wanted to have in 3.7. Parts of the
> > bitrot detection and tiering features have already been merged on
> > mainline and we are awaiting a few more patches from these areas to
> > declare completion with respect to features. We seem to be well set
> > to feature freeze 3.7 tomorrow.
>
> The missing pieces of bitrot detection and tiering are in now. So we
> are good to declare feature freeze now. Kudos to all of us in reaching
> here :).
>
> > I will delay branching of 3.7 by a week or so after tomorrow to
> > primarily include several bug fixes and minor improvements in areas
> > like logging that have not received enough attention from us in the
> > recent past. I will send out an update as we have a release-3.7
> > branch.
>
> Branching for 3.7 will happen after we merge coverity fixes, logging
> improvements and critical bug fixes.
>
> Maintainers - can you please provide an ACK for the sanity of the
> quality of your component(s) in 3.7? Basically we should aim to avoid
> functional regressions and critical issues before we branch. Once we
> have both of these in place, I will branch release-3.7.

Forgot to add - one more criterion for branching, as discussed in
today's community meeting, is to have no known spurious regression test
failures.

Thanks,
Vijay
Re: [Gluster-devel] NetBSD regressions, memory corruption
On Wed, Mar 25, 2015 at 4:28 PM, Niels de Vos <nde...@redhat.com> wrote:
> Ai, top posting, this makes it really difficult to follow the email if
> you have not read the first parts :-/ Please remember to inline or
> bottom post when replying.
>
> On Wed, Mar 25, 2015 at 03:21:28PM +0530, Venky Shankar wrote:
> > looks like the iobref (and the iobuf) was allocated in protocol/server..
> >
> > (gdb) x/16x (ie->ie_iobref->iobrefs - 8)
> > 0xbb11a438:  0xbb18ba80  0x0001  0x0068  0x0040
> > 0xbb11a448:  0xbb1e2018  0xcafebabe  0x  0x
> > 0xbb11a458:  0x0003  0x0003  0x0008  0x0003
> > 0xbb11a468:  0x000c  0x0003  0x000e  0x0003
> >
> > 8 bytes before the magic header (0xcafebabe) lives the xlator (this)
> > that invoked GF_MALLOC. Here it's:
> >
> > (gdb) p *(xlator_t *)0xbb1e2018
> > $9 = {name = 0xbb1dbb08 "patchy-server",
> >       type = 0xbb1dbb38 "protocol/server", next = 0xbb1e1018,
> >       prev = 0x0, parents = 0x0, children = 0xbb1dbbc8,
> >       options = 0xbb18a028, dlhandle = 0xb9b7d000,
> >       fops = 0xb9adf0e0 <fops>, cbks = 0xb9adc8cc <cbks>,
> >       dumpops = 0xb9ade460 <dumpops>,
> >       volume_options = {next = 0xbb1dbb68, prev = 0xbb1dbbf8},
> >       fini = 0xb9ab539d <fini>, init = 0xb9ab48a5 <init>,
> >       reconfigure = 0xb9ab418c <reconfigure>,
> >       mem_acct_init = 0xb9ab3cb1 <mem_acct_init>,
> >       notify = 0xb9ab53a3 <notify>, loglevel = GF_LOG_NONE,
> >       latencies = {{min = 0, max = 0, total = 0, std = 0, mean = 0,
> >           count = 0} <repeats 50 times>}, history = 0x0,
> >       ctx = 0xbb109000, graph = 0xbb1c30f8, itable = 0x0,
> >       init_succeeded = 1 '\001', private = 0xbb1e3018,
> >       mem_acct = {num_types = 144, rec = 0xbb1c6000}, winds = 0,
> >       switched = 0 '\000', local_pool = 0x0, is_autoloaded = _gf_false}
> >
> > looking into it more. if the above strikes a bell to someone, let us
> > know.
>
> Going by the output from gdb above and the below layout:
>
>     $ printf 'type=%d\nsize=%d\n' 0x0068 0x0040
>     type=104
>     size=64
>
> This means that the protocol/server did a GF_?ALLOC(64, 104). The 104
> is an enum for the mem-type and libglusterfs/src/mem-types.h points to
> gf_common_mt_iobrefs. There is only one function that uses
> gf_common_mt_iobrefs, which is iobref_new().
> protocol/server calls iobref_new() only once directly (there could be
> some other indirect calls too) in server_submit_reply().

yes, that's the only place in protocol/server that calls iobref_new().

> I do not quickly see how the issue can happen with the analyzed data
> in this email. Possibly an allocation before (memory address wise)
> this went awry and caused the wreckage. We may need to follow these
> diagnostic steps back upwards and try to find the first occurrence
> where 0xcafebabe is followed by 0xcafebabe instead of 0xbaadf00d.

What's interesting is that the number of used iobufs is zero, but
->iobrefs points to a memory address (iobref_unref() iterates ->alloced
times and frees anything which isn't NULL). There's someone who put it
there.

(gdb) p *ie->ie_iobref
$1 = {lock = {pts_magic = 2004287495, pts_spin = 0 '\000',
      pts_flags = 0}, ref = 1, iobrefs = 0xbb11a458, alloced = 16,
      used = 0}

Emmanuel,

Could I run some tests on nbslave70 (I plan to disable some
translators)? Just running AFR test cases should trigger the segfault,
correct?

> That's the only idea I have for now, but I'll keep thinking of
> something that could make this easier.
>
> Note: the iobref structure is used really a lot, this makes it a
> likely structure to blow away other structures when something else
> frees some memory, but wants to use it afterwards. I think a
> use-after-free could be one cause for this.
>
> Niels
>
> > -venky
> >
> > On Tue, Mar 24, 2015 at 11:28 PM, Niels de Vos <nde...@redhat.com> wrote:
> > > On Tue, Mar 24, 2015 at 05:18:44PM +, Emmanuel Dreyfus wrote:
> > > > Hi
> > > >
> > > > The merge of http://review.gluster.org/9953/ removed a few
> > > > crashes from NetBSD regression tests, but the thing remains
> > > > utterly broken since the merge of http://review.gluster.org/9708/
> > > > though I cannot tell if I have bugs leftover from this commit or
> > > > if I face new problems.
> > > >
> > > > Here are the known problems so far:
> > >
> > > ...snip! I'll only give some info to your 2nd point.
> > >
> > > > 2) I still experience memory corruption, which usually crashes
> > > > glusterfsd because some pointer was replaced by value 0x3.
> > > > This strikes on iobref most of the time, but it can happen
> > > > elsewhere. I would be glad if someone could help here.
> > > >
> > > > On nbslave70:/autobuild I added code to check for iobref/iobuf
> > > > sanity at random places (by calling iobref_sanity()). I do this
> > > > in synctask_wrap and in STACK_WIND/UNWIND, but I have not been
> > > > able to spot the source of the problem yet.
> > > >
> > > > The weird thing is that memory seems to always be overwritten by
> > > > the same values, and the magic 0xcafebabe number before the
> > > > buffer is preserved. Here is an example, where iobref->iobrefs =
> > > > 0xbb11a458:
> > > >
> > > > 0xbb11a44c:  0xcafebabe  0x  0x  0x0003
> > > > 0xbb11a45c:  0x0003  0x0008  0x0003  0x000c
Re: [Gluster-devel] New Defects reported by Coverity Scan for gluster/glusterfs
On Wed, Mar 25, 2015 at 10:59:24AM -0700, scan-ad...@coverity.com wrote:
> Hi,
>
> Please find the latest report on new defect(s) introduced to
> gluster/glusterfs found with Coverity Scan.
>
> 33 new defect(s) introduced to gluster/glusterfs found with Coverity
> Scan. 9 defect(s), reported by Coverity Scan earlier, were marked fixed
> in the recent build analyzed by Coverity Scan.
>
> New defect(s) Reported-by: Coverity Scan
> Showing 20 of 33 defect(s)
>
> ** CID 1291734: Error handling issues (CHECKED_RETURN)
> /xlators/cluster/dht/src/tier.c: 451 in tier_build_migration_qfile()

Dan already posted a fix for this:
http://review.gluster.org/1

Niels

> *** CID 1291734: Error handling issues (CHECKED_RETURN)
> /xlators/cluster/dht/src/tier.c: 451 in tier_build_migration_qfile()
>
>     445     {
>     446             gfdb_time_t current_time;
>     447             _gfdb_brick_dict_info_t gfdb_brick_dict_info;
>     448             gfdb_time_t time_in_past;
>     449             int ret = -1;
>     450
>     CID 1291734: Error handling issues (CHECKED_RETURN)
>     Calling remove((is_promotion ? "/var/run/gluster/promotequeryfile" :
>     "/var/run/gluster/demotequeryfile")) without checking return value.
>     This library function may fail and return an error code.
>     451             remove (GET_QFILE_PATH (is_promotion));
>     452             time_in_past.tv_sec = args->freq_time;
>     453             time_in_past.tv_usec = 0;
>     454             if (gettimeofday (&current_time, NULL) == -1) {
>     455                     gf_log (args->this->name, GF_LOG_ERROR,
>     456                             "Failed to get current time\n");
>
> To view the defects in Coverity Scan visit,
> https://scan.coverity.com/projects/987?tab=overview
>
> To manage Coverity Scan email notifications for
> gluster-devel@gluster.org, click
> https://scan.coverity.com/subscriptions/edit?email=gluster-devel%40gluster.org&token=7dffab14bc5a7180e75b0d047539f148 .
[Gluster-devel] New features and their location in packages
With all the new features merged, we need to know on what side of the
system the new xlators, libraries and scripts are used. There always are
questions about reducing the installation size on client systems, so
anything that is not strictly needed client-side should not be in the
client packages.

I would like to hear from all feature owners which files, libraries,
scripts, docs or other bits are required for clients to operate. For
example, here is what tiering does:

- client-side: cluster/tiering xlator
- server-side: libgfdb (with sqlite dependency)

By default, any library is included in the glusterfs-libs RPM. If the
library is only useful on a system with glusterd installed, it should
move to the glusterfs-server RPM. Clients include fuse and libgfapi; the
common package for client-side bits (and shared client/server bits) is
'glusterfs'.

There is no need for you to post patches for moving files around in the
RPMs, that is something I can do in one go. Just let me know which files
are needed where.

Thanks!
Niels
[Gluster-devel] REMINDER: Weekly Gluster Community meeting today at 12:00 UTC
Hi all,

In about 70 minutes from now we will have the regular weekly Gluster
Community meeting.

Meeting details:
- location: #gluster-meeting on Freenode IRC
- date: every Wednesday
- time: 8:00 EDT, 12:00 UTC, 13:00 CET, 17:30 IST
  (in your terminal, run: date -d "12:00 UTC")
- agenda: available at [1]

Currently the following items are listed:
* Roll Call
* Status of last week's action items
* GlusterFS 3.6
* GlusterFS 3.5
* GlusterFS 3.4
* GlusterFS Next
* Open Floor
  - Fix regression tests with spurious failures
  - docs
  - Awesum Web Presence
  - Gluster Summit Barcelona, second week in May
  - Gluster Summer of Code

The last topic has space for additions. If you have a suitable topic to
discuss, please add it to the agenda.

Thanks,
Vijay

[1] https://public.pad.fsfe.org/p/gluster-community-meetings
Re: [Gluster-devel] NetBSD regressions, memory corruption
Ai, top posting, this makes it really difficult to follow the email if
you have not read the first parts :-/ Please remember to inline or
bottom post when replying.

On Wed, Mar 25, 2015 at 03:21:28PM +0530, Venky Shankar wrote:
> looks like the iobref (and the iobuf) was allocated in protocol/server..
>
> (gdb) x/16x (ie->ie_iobref->iobrefs - 8)
> 0xbb11a438:  0xbb18ba80  0x0001  0x0068  0x0040
> 0xbb11a448:  0xbb1e2018  0xcafebabe  0x  0x
> 0xbb11a458:  0x0003  0x0003  0x0008  0x0003
> 0xbb11a468:  0x000c  0x0003  0x000e  0x0003
>
> 8 bytes before the magic header (0xcafebabe) lives the xlator (this)
> that invoked GF_MALLOC. Here it's:
>
> (gdb) p *(xlator_t *)0xbb1e2018
> $9 = {name = 0xbb1dbb08 "patchy-server",
>       type = 0xbb1dbb38 "protocol/server", next = 0xbb1e1018,
>       prev = 0x0, parents = 0x0, children = 0xbb1dbbc8,
>       options = 0xbb18a028, dlhandle = 0xb9b7d000,
>       fops = 0xb9adf0e0 <fops>, cbks = 0xb9adc8cc <cbks>,
>       dumpops = 0xb9ade460 <dumpops>,
>       volume_options = {next = 0xbb1dbb68, prev = 0xbb1dbbf8},
>       fini = 0xb9ab539d <fini>, init = 0xb9ab48a5 <init>,
>       reconfigure = 0xb9ab418c <reconfigure>,
>       mem_acct_init = 0xb9ab3cb1 <mem_acct_init>,
>       notify = 0xb9ab53a3 <notify>, loglevel = GF_LOG_NONE,
>       latencies = {{min = 0, max = 0, total = 0, std = 0, mean = 0,
>           count = 0} <repeats 50 times>}, history = 0x0,
>       ctx = 0xbb109000, graph = 0xbb1c30f8, itable = 0x0,
>       init_succeeded = 1 '\001', private = 0xbb1e3018,
>       mem_acct = {num_types = 144, rec = 0xbb1c6000}, winds = 0,
>       switched = 0 '\000', local_pool = 0x0, is_autoloaded = _gf_false}
>
> looking into it more. if the above strikes a bell to someone, let us
> know.

Going by the output from gdb above and the below layout:

    $ printf 'type=%d\nsize=%d\n' 0x0068 0x0040
    type=104
    size=64

This means that the protocol/server did a GF_?ALLOC(64, 104). The 104 is
an enum for the mem-type and libglusterfs/src/mem-types.h points to
gf_common_mt_iobrefs. There is only one function that uses
gf_common_mt_iobrefs, which is iobref_new().
protocol/server calls iobref_new() only once directly (there could be
some other indirect calls too) in server_submit_reply().

I do not quickly see how the issue can happen with the analyzed data in
this email. Possibly an allocation before (memory address wise) this
went awry and caused the wreckage. We may need to follow these
diagnostic steps back upwards and try to find the first occurrence where
0xcafebabe is followed by 0xcafebabe instead of 0xbaadf00d.

That's the only idea I have for now, but I'll keep thinking of something
that could make this easier.

Note: the iobref structure is used really a lot, this makes it a likely
structure to blow away other structures when something else frees some
memory, but wants to use it afterwards. I think a use-after-free could
be one cause for this.

Niels

> -venky
>
> On Tue, Mar 24, 2015 at 11:28 PM, Niels de Vos <nde...@redhat.com> wrote:
> > On Tue, Mar 24, 2015 at 05:18:44PM +, Emmanuel Dreyfus wrote:
> > > Hi
> > >
> > > The merge of http://review.gluster.org/9953/ removed a few crashes
> > > from NetBSD regression tests, but the thing remains utterly broken
> > > since the merge of http://review.gluster.org/9708/ though I cannot
> > > tell if I have bugs leftover from this commit or if I face new
> > > problems.
> > >
> > > Here are the known problems so far:
> >
> > ...snip! I'll only give some info to your 2nd point.
> >
> > > 2) I still experience memory corruption, which usually crashes
> > > glusterfsd because some pointer was replaced by value 0x3. This
> > > strikes on iobref most of the time, but it can happen elsewhere. I
> > > would be glad if someone could help here.
> > >
> > > On nbslave70:/autobuild I added code to check for iobref/iobuf
> > > sanity at random places (by calling iobref_sanity()). I do this in
> > > synctask_wrap and in STACK_WIND/UNWIND, but I have not been able to
> > > spot the source of the problem yet.
> > >
> > > The weird thing is that memory seems to always be overwritten by
> > > the same values, and the magic 0xcafebabe number before the buffer
> > > is preserved.
> > > Here is an example, where iobref->iobrefs = 0xbb11a458:
> > >
> > > 0xbb11a44c:  0xcafebabe  0x  0x  0x0003
> > > 0xbb11a45c:  0x0003  0x0008  0x0003  0x000c
> > > 0xbb11a46c:  0x0003  0x000e  0x0003  0x0010
> > > 0xbb11a47c:  0x0003  0x0009  0x0003  0x000d
> > > 0xbb11a48c:  0x0003  0x0015  0x0003  0x0016
> > > 0xbb11a49c:  0x0003  0x0032  0x0034  0xbb1e2018
> > > 0xbb11a4ac:  0xcafebabe  0x  0x  0xbb11a5d8
> >
> > Recently I was looking into something that involved some more
> > understanding of GF_MALLOC(). I did not really continue with it
> > because other things got a higher priority. But, maybe this layout
> > helps you a little:
> >
> >     :                      :