Re: list corruption in IPOIB

2013-05-20 Thread Jinpu Wang
which list_del do you mean? in ipoib_cm_tx_start?

On Mon, May 20, 2013 at 11:05 AM, Or Gerlitz ogerl...@mellanox.com wrote:
 On 19/05/2013 12:17, Jack Wang wrote:

 we added inject_bug sysfs node to make function run into error case, like
 something below. Yes, you are right, we want to speedup the bug reproduce
 process, and we saw the warning and come to conclusion the neigh-list
 corrupted some where. What's your opinion?


 Yes, for the synthetic experiment you made, there's a possible point here:

 Under the CM error flow (e.g your injected error), we delete a neighbour,
 but we also do it from the flow that flush neighbours (e.g when changing the
 device mode from UD to CM, as you did in your script). When this happens
 concurrently, these two code pieces call for deleting the neighbour from the
 list. So the spinlock might not be enough and we should have do
 list_del_init(neigh-list) instead of list_del, helps?

 Or.



-- 
Mit freundlichen Grüßen,
Best Regards,

Jack Wang

Linux Kernel Developer Storage
ProfitBricks GmbH  The IaaS-Company.

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 30 6098 56991-308
Fax: +49 30 6098 56992-203
Email: jinpu.w...@profitbricks.com
URL: http://www.profitbricks.com

Sitz der Gesellschaft: Berlin.
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B.
Geschäftsführer: Andreas Gauger, Achim Weiss.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: list corruption in IPOIB

2013-05-20 Thread Or Gerlitz

On 20/05/2013 12:10, Jinpu Wang wrote:

which list_del do you mean? in ipoib_cm_tx_start?
yes, but not only, you can start with 5KG hammer and convert all 
thesehits to list_del_init


linux-2.6]# grep list_del drivers/infiniband/ulp/ipoib/*.c | grep neigh
drivers/infiniband/ulp/ipoib/ipoib_cm.c: list_del(neigh-list);
drivers/infiniband/ulp/ipoib/ipoib_cm.c: list_del(neigh-list);
drivers/infiniband/ulp/ipoib/ipoib_cm.c: list_del(neigh-list);
drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list);
drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list);
drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list);
drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list);
drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list);
drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list);

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


libibverbs / libmlx4 release

2013-05-20 Thread Or Gerlitz

Hi Roland,

Following what we discussed last week during the Linux Foundation EU 
summit, I think it would be good to follow what you said and have a 
point release for libibverbs and libmlx4 before we pull in the verbs 
extensions framework and features that use it (XRC, Flow-Steering, etc 
more fun).


I mentioned to you that we have some more libmlx4 patches, but its 
totally OK for us to submit them after that release, makes sense?


Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: list corruption in IPOIB

2013-05-20 Thread Jinpu Wang
A quick test show the list_corruption warning is gone, after I convert
 all list_del(neigh-list) to  list_del_list(neigh-list).

Test is still running, will update status if anything wrong.

Thanks Or.



On Mon, May 20, 2013 at 12:58 PM, Or Gerlitz ogerl...@mellanox.com wrote:
 On 20/05/2013 12:10, Jinpu Wang wrote:

 which list_del do you mean? in ipoib_cm_tx_start?

 yes, but not only, you can start with 5KG hammer and convert all thesehits
 to list_del_init

 linux-2.6]# grep list_del drivers/infiniband/ulp/ipoib/*.c | grep neigh
 drivers/infiniband/ulp/ipoib/ipoib_cm.c: list_del(neigh-list);
 drivers/infiniband/ulp/ipoib/ipoib_cm.c: list_del(neigh-list);
 drivers/infiniband/ulp/ipoib/ipoib_cm.c: list_del(neigh-list);
 drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list);
 drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list);
 drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list);
 drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list);
 drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list);
 drivers/infiniband/ulp/ipoib/ipoib_main.c: list_del(neigh-list);




-- 
Mit freundlichen Grüßen,
Best Regards,

Jack Wang

Linux Kernel Developer Storage
ProfitBricks GmbH  The IaaS-Company.

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 30 6098 56991-308
Fax: +49 30 6098 56992-203
Email: jinpu.w...@profitbricks.com
URL: http://www.profitbricks.com

Sitz der Gesellschaft: Berlin.
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B.
Geschäftsführer: Andreas Gauger, Achim Weiss.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: list corruption in IPOIB

2013-05-20 Thread Or Gerlitz

On 20/05/2013 15:46, Jinpu Wang wrote:

A quick test show the list_corruption warning is gone, after I convert
  all list_del(neigh-list) to  list_del_list(neigh-list).


yes, but this wasn't your original problem or was it?

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: list corruption in IPOIB

2013-05-20 Thread Shlomo Pongratz

On 5/20/2013 3:58 PM, Jack Wang wrote:
I haven't reproduced the original bug we saw in our production 
environment

BUG: unable to handle kernel
  at 0008
IP: [a0206c30] ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib]
 ...
RIP: 0010:[a0206c30] [a0206c30] 
ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib]

RSP: 0018:8825fdcbddb0  EFLAGS: 00010086
RAX: 0246 RBX: 8807b59c29c0 RCX: 
RDX: 44000602 RSI: 0246 RDI: 8810026527c0
RBP: 881002652000 R08: 00015360 R09: dead00200200
R10: dead00100100 R11: 0001 R12: 0001
R13:  R14: 8810026523a0 R15: 8810026527c0
FS:  7f4c9a325700() GS:880807c0() 
knlGS:

CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 0008 CR3: 002605e3a000 CR4: 000407f0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process kworker/u:3 (pid: 61374, threadinfo 8825fdcbc000, task 
8807fd0eafb0)

Stack:
 8820043303c0
 880807d52700
 8807fd0eafb0
 8825fdcbdde0

 8810026533b8
 a0039868
 8825fdcbdde0
 8805fd549a00

 81b9d480
 8807fd2f4000
 a0206b50
 

Call Trace:
 [a0039868] ? process_req+0xe8/0x1a0 [ib_addr]
 [a0206b50] ? ipoib_cm_tx_handler+0x2d0/0x2d0 [ib_ipoib]
 [81052d64] ? process_one_work+0x114/0x470
 [81055033] ? worker_thread+0x163/0x3e0
 [81054ed0] ? manage_workers+0x200/0x200
 [81054ed0] ? manage_workers+0x200/0x200
 [8105963e] ? kthread+0x9e/0xb0
 [8167e9e4] ? kernel_thread_helper+0x4/0x10
 [810595a0] ? kthread_freezable_should_stop+0x60/0x60
 [8167e9e0] ? gs_change+0x13/0x13
 ...
 [a01fec30] ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib]
 RSP 881d275f1db0
---[ end trace 38ff082cbc03dd75 ]---
Kernel panic - not syncing: Fatal exception in interrupt



, only the A variant of the crash in has been reproduced:

WARNING: at lib/list_debug.c:49 __list_del_entry+0x63/0xd0()
Hardware name: System Product Name
list_del corruption, 88020dbd3080-next is LIST_POISON1 
(dead00100100)

Modules linked in: ...
Pid: 16248, comm: iperf Tainted: GW  3.4.23-pserver+ #76
Call Trace:
 IRQ  [8103c21f] warn_slowpath_common+0x7f/0xc0
 [8103c316] warn_slowpath_fmt+0x46/0x50
 [81428563] ? do_raw_spin_lock+0xd3/0x140
 [81428883] __list_del_entry+0x63/0xd0
 [81428901] list_del+0x11/0x40
 [a02f64c5] ipoib_cm_handle_tx_wc+0x225/0x380 [ib_ipoib]
 [a02eea44] ipoib_poll+0x164/0x190 [ib_ipoib]
 [815d91fd] net_rx_action+0x13d/0x320
 [81044f29] ? __do_softirq+0x89/0x380
 [81044f98] __do_softirq+0xf8/0x380
 [8174632c] call_softirq+0x1c/0x30
 EOI  [81004305] do_softirq+0x95/0xd0
 [815daacc] ? dev_queue_xmit+0x29c/0xbf0
 [8104461b] local_bh_enable+0xeb/0xf0
 [815daacc] dev_queue_xmit+0x29c/0xbf0
 [815da830] ? ptype_seq_start+0xb0/0xb0
 [815e0d87] neigh_connected_output+0xc7/0x110
 [8109f36d] ? trace_hardirqs_on+0xd/0x10
 [81617386] ip_finish_output2+0x1c6/0x460
 [8161723a] ? ip_finish_output2+0x7a/0x460
 [81619033] ip_finish_output+0xc3/0x230
 [81619510] ip_output+0xa0/0x110
 [8161764d] ip_local_out+0x2d/0x90
 [816176cb] ip_send_skb+0x1b/0x60
 [8163f27b] udp_send_skb+0x10b/0x380
 [815c3a70] ? sock_def_wakeup+0x1b0/0x1b0
 [81616e90] ? ip_append_page+0x530/0x530
 [81641462] udp_sendmsg+0x3b2/0xb50
 [8173c530] ? retint_restore_args+0x13/0x13
 [8164d9a0] ? inet_create+0x5b0/0x5b0
 [815c2310] ? sock_update_classid+0x150/0x2b0
 [8164dacb] inet_sendmsg+0x12b/0x240
 [8164d9a0] ? inet_create+0x5b0/0x5b0
 [815c2272] ? sock_update_classid+0xb2/0x2b0
 [815c2310] ? sock_update_classid+0x150/0x2b0
 [815bda40] sock_aio_write+0x190/0x1b0
 [8142214e] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [8116e82a] do_sync_write+0xea/0x130
 [8109bfdd] ? trace_hardirqs_off+0xd/0x10
 [811713d3] ? fget_light+0x43/0x490
 [813b14f3] ? security_file_permission+0x23/0x90
 [8116ee82] vfs_write+0x172/0x190
 [8116ef91] sys_write+0x51/0x90
 [81744de9] system_call_fastpath+0x16/0x1b
---[ end trace 66110390802a41db ]---

after apply
commit fa16ebed31f336e41970f3f0ea9e8279f6be2d27
  Author: Shlomo Pongratz shlo...@mellanox.com 
mailto:shlo...@mellanox.com

  Date:   Mon Aug 13 14:39:49 2012 +

  IB/ipoib: Add missing locking when CM object is deleted

Above warning is gone, but we still see the warning at the begin of 
this thread.




2013/5/20 Or Gerlitz ogerl...@mellanox.com 
mailto:ogerl...@mellanox.com


On 20/05/2013 15:46, Jinpu Wang wrote:

A 

Re: list corruption in IPOIB

2013-05-20 Thread Jack Wang
 
 Hi Jack,
 
 I don't understand what is the current status, that is what do you see
 now after applying the patches.
 If you don't get the original bug why did you gave the trace of it? Or
 is it a new trace? It is not clear from your mail.
 Please add only the trace of the current issue.
 
 Best regards,
 
 S.P.

Hi Shlomo,

Sorry for confusion.
Current list corruption is gone in my preliminary test, after I changed
list_del to list_del_init as Or suggested.

As Or asked for the original bug, so I just want to show him the whole
story.

Regards,
Jack


 
 

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MLX4 Cq Question

2013-05-20 Thread Jack Morgenstein
On Saturday 18 May 2013 00:37, Roland Dreier wrote:
 On Fri, May 17, 2013 at 12:25 PM, Tom Tucker t...@opengridcomputing.com 
 wrote:
  I'm looking at the Linux MLX4 net driver and found something that confuses 
  me mightily. In particular in the file net/ethernet/mellanox/mlx4/cq.c, the 
  mlx4_ib_completion function does not take any kind of lock when looking up 
  the SW CQ in the radix tree, however, the mlx4_cq_event function does. In 
  addition if I go look at the code paths where cq are removed from this 
  tree, they are protected by spin_lock_irq. So I am baffled at this point as 
  to what the locking strategy is and how this is supposed to work. I'm sure 
  I'm missing something and would greatly appreciate it if someone would 
  explain this.
 
 This is a bit tricky.  If you look at
 
 void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
 {
 struct mlx4_priv *priv = mlx4_priv(dev);
 struct mlx4_cq_table *cq_table = priv-cq_table;
 int err;
 
 err = mlx4_HW2SW_CQ(dev, NULL, cq-cqn);
 if (err)
 mlx4_warn(dev, HW2SW_CQ failed (%d) for CQN %06x\n,
 err, cq-cqn);
 
 synchronize_irq(priv-eq_table.eq[cq-vector].irq);
 
 spin_lock_irq(cq_table-lock);
 radix_tree_delete(cq_table-tree, cq-cqn);
 spin_unlock_irq(cq_table-lock);
 
 if (atomic_dec_and_test(cq-refcount))
 complete(cq-free);
 wait_for_completion(cq-free);
 
 mlx4_cq_free_icm(dev, cq-cqn);
 }
 
 you see that when freeing a CQ, we first do the HW2SW_CQ firmware
 command; once this command completes, no more events will be generated
 for that CQ.  Then we do synchronize_irq for the CQ's interrupt
 vector.  Once that completes, no more completion handlers will be
 running for the CQ, so we can safely delete the CQ from the radix tree
 (relying on the radix tree's safety of deleting one entry while
 possibly looking up other entries, so no lock is needed).  We also use
 the lock to synchronize against the CQ event function, which as you
 noted does take the lock too.
 
 Basic idea is that we're tricky and careful so we can make the fast
 path (completion interrupt handling) lock-free, but then use locks and
 whatever else needed in the slow path (CQ async event handling, CQ
 destroy).
 
  - R.
===

Roland, unfortunately we have seen that we need some locking on the
cq completion handler (there is a stack trace which resulted from this
lack of proper locking).
In our current driver, we are using the patch below (which uses RCU locking
instead of spinlocks).  I can prepare a proper patch for the upstream kernel.

===
net/mlx4_core: Fix racy flow in the driver CQ completion handler

The mlx4 CQ completion handler, mlx4_cq_completion, doesn't bother to lock
the radix tree which is used to manage the table of CQs, nor does it increase
the reference count of the CQ before invoking the user provided callback
(and decrease it afterwards).

This is racy and can cause use-after-free, null pointer dereference, etc, which
result in kernel crashes.

To fix this, we must do the following in mlx4_cq_completion:
- increase the ref count on the cq before invoking the user callback, and
  decrement it after the callback.
- Place a lock around the radix tree lookup/ref-count-increase

Using an irq spinlock will not fix this issue. The problem is that under VPI,
the ETH interface uses multiple msix irq's, which can result in one cq 
completion
event interrupting another in-progress cq completion event. A deadlock results
when the handler for the first cq completion grabs the spinlock, and is
interrupted by the second completion before it has a chance to release the 
spinlock.
The handler for the second completion will deadlock waiting for the spinlock
to be released.

The proper fix is to use the RCU mechanism for locking radix-tree accesses in
the cq completion event handler (The radix-tree implementation uses the RCU
mechanism, so rcu_read_lock/unlock in the reader, with rcu_synchronize in the
updater, will do the job).

Note that the same issue exists in mlx4_cq_event() (the cq async event
handler), which also takes the same lock on the radix tree. Here, we replace the
spinlock with an rcu_read_lock().

This patch was motivated by the following report from the field:

[...] box panic'ed when trying to find a completion queue. There is
no corruption but there is a possible race which could result in
mlx4_cq_completion getting wrong height of the radix tree and
following a bit too deep into the chains. In the other code which uses
this radix tree the access is protected by the lock but
mlx4_cq_completion is running in the interrupt context and cannot
take locks, so instead it runs without any protection whatsoever.

The stack trace below is from the mlnx ofed 1.5.3 driver running under RHEL5.7.
(this driver uses the upstream kernel 

Re: MLX4 Cq Question

2013-05-20 Thread Roland Dreier
On Mon, May 20, 2013 at 7:53 AM, Jack Morgenstein
ja...@dev.mellanox.co.il wrote:
 This is racy and can cause use-after-free, null pointer dereference, etc, 
 which
 result in kernel crashes.

Sounds fine and I'd be happy to apply your final patch, but I'd be
curious to know what the race is in more detail.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: libibverbs / libmlx4 release

2013-05-20 Thread Roland Dreier
On Mon, May 20, 2013 at 5:37 AM, Or Gerlitz ogerl...@mellanox.com wrote:
 Following what we discussed last week during the Linux Foundation EU summit, 
 I think it would be good to follow what you said and have a point release for 
 libibverbs and libmlx4 before we pull in the verbs extensions framework and 
 features that use it (XRC, Flow-Steering, etc more fun).

 I mentioned to you that we have some more libmlx4 patches, but its totally OK 
 for us to submit them after that release, makes sense?

That's fine, I'll do the releases this week.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH libibverbs v5 5/7] libibverbs: Add ibv_open_qp

2013-05-20 Thread Steve Wise

On 3/26/2013 4:14 PM, sean.he...@intel.com wrote:

From: Sean Hefty sean.he...@intel.com

XRC receive QPs are shareable across multiple processes.  Allow
any process with access to the xrc domain to open an existing
QP.  After opening the QP, the process will receive events
related to the QP and be able to modify the QP.



Hey Sean,

So is it intended that cooperating processes sharing an xrc domain will 
choose one process to create the xrc qp, and the rest will open it?


Steve.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH libibverbs v5 5/7] libibverbs: Add ibv_open_qp

2013-05-20 Thread Hefty, Sean
 So is it intended that cooperating processes sharing an xrc domain will
 choose one process to create the xrc qp, and the rest will open it?

yes - The QPN of the shared QP must be known between the cooperating processes.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: libibverbs / libmlx4 release

2013-05-20 Thread Doug Ledford
On 05/20/2013 12:49 PM, Roland Dreier wrote:
 On Mon, May 20, 2013 at 5:37 AM, Or Gerlitz ogerl...@mellanox.com wrote:
 Following what we discussed last week during the Linux Foundation EU summit, 
 I think it would be good to follow what you said and have a point release 
 for libibverbs and libmlx4 before we pull in the verbs extensions framework 
 and features that use it (XRC, Flow-Steering, etc more fun).

 I mentioned to you that we have some more libmlx4 patches, but its totally 
 OK for us to submit them after that release, makes sense?
 
 That's fine, I'll do the releases this week.

If you haven't already done so, can you please update to the latest
autotools and reroll the autotools files when you do the release?

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: list corruption in IPOIB

2013-05-20 Thread Or Gerlitz
On Mon, May 20, 2013 at 5:36 PM, Jack Wang jinpu.w...@profitbricks.com wrote:
 Sorry for confusion. Current list corruption is gone in my preliminary test, 
 after I changed
 list_del to list_del_init as Or suggested.
 As Or asked for the original bug, so I just want to show him the whole story.

I am still not clear if  the bug you saw in your production
environment is gone with the list_del_init patch applied, please
clarify.

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


rockets feedbacks?

2013-05-20 Thread Or Gerlitz
Hi Sean,

Do we have some public quoted usages/feedback for rsockets? I think
you've mentioned something during the panel at the Linux EU summit
last week but I am not sure...

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: list corruption in IPOIB

2013-05-20 Thread Jack Wang
On 2013年05月20日 21:00, Or Gerlitz wrote:
 On Mon, May 20, 2013 at 5:36 PM, Jack Wang jinpu.w...@profitbricks.com 
 wrote:
 Sorry for confusion. Current list corruption is gone in my preliminary test, 
 after I changed
 list_del to list_del_init as Or suggested.
 As Or asked for the original bug, so I just want to show him the whole story.
 
 I am still not clear if  the bug you saw in your production
 environment is gone with the list_del_init patch applied, please
 clarify.
 
 Or.
 
The bug in our production environment is introduced in our backport
about ipoib fixes from mainline, and when we hit that bug we reverted
back to old kernel without the backport patch, and the bug didn't happen
for now.

This list_del_init patch do fix list corruption warning, but it's not
the one we hit in production, the list corruption is reproduced in our
test setup with bug injection patch  iperf -P 50  mode switch.

Is this clear for you now?

Jack
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: list corruption in IPOIB

2013-05-20 Thread Or Gerlitz
On Mon, May 20, 2013 at 10:38 PM, Jack Wang jinpu.w...@profitbricks.com wrote:

 The bug in our production environment is introduced in our backport
 about ipoib fixes from mainline, and when we hit that bug we reverted
 back to old kernel without the backport patch, and the bug didn't happen for 
 now.

 This list_del_init patch do fix list corruption warning, but it's not the one 
 we hit in production, the list corruption is reproduced in our test setup 
 with bug injection patch  iperf -P 50  mode switch.

 Is this clear for you now?


NO, you say that the list_del_init patch eliminates the list
corruption warning, does the list corruption is still reproduced in
your test setup even when  the patch is applied?! what's the trace?
and what is the trace you see in your production when using kernel X
(which?) patches with commit Y (which?)

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MLX4 Cq Question

2013-05-20 Thread Tom Tucker

Hi Guys,

One other quick one. I've received conflicting claims on the validity of 
the wc.opcode when wc.status != 0 for mlx4 hardware.


My reading of the code (i.e. hw/mlx4/cq.c) is that the hardware cqe 
owner_sr_opcode field contains MLX4_CQE_OPCODE_ERROR when there is an 
error and therefore, the only way to recover what the opcode was is 
through the wr_id you used when submitting the WR.


Is my reading of the code correct?

Thanks,
Tom

On 5/20/13 9:53 AM, Jack Morgenstein wrote:

On Saturday 18 May 2013 00:37, Roland Dreier wrote:

On Fri, May 17, 2013 at 12:25 PM, Tom Tucker t...@opengridcomputing.com wrote:

I'm looking at the Linux MLX4 net driver and found something that confuses me 
mightily. In particular in the file net/ethernet/mellanox/mlx4/cq.c, the 
mlx4_ib_completion function does not take any kind of lock when looking up the 
SW CQ in the radix tree, however, the mlx4_cq_event function does. In addition 
if I go look at the code paths where cq are removed from this tree, they are 
protected by spin_lock_irq. So I am baffled at this point as to what the 
locking strategy is and how this is supposed to work. I'm sure I'm missing 
something and would greatly appreciate it if someone would explain this.

This is a bit tricky.  If you look at

void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
{
 struct mlx4_priv *priv = mlx4_priv(dev);
 struct mlx4_cq_table *cq_table = priv-cq_table;
 int err;

 err = mlx4_HW2SW_CQ(dev, NULL, cq-cqn);
 if (err)
 mlx4_warn(dev, HW2SW_CQ failed (%d) for CQN %06x\n,
err, cq-cqn);

 synchronize_irq(priv-eq_table.eq[cq-vector].irq);

 spin_lock_irq(cq_table-lock);
 radix_tree_delete(cq_table-tree, cq-cqn);
 spin_unlock_irq(cq_table-lock);

 if (atomic_dec_and_test(cq-refcount))
 complete(cq-free);
 wait_for_completion(cq-free);

 mlx4_cq_free_icm(dev, cq-cqn);
}

you see that when freeing a CQ, we first do the HW2SW_CQ firmware
command; once this command completes, no more events will be generated
for that CQ.  Then we do synchronize_irq for the CQ's interrupt
vector.  Once that completes, no more completion handlers will be
running for the CQ, so we can safely delete the CQ from the radix tree
(relying on the radix tree's safety of deleting one entry while
possibly looking up other entries, so no lock is needed).  We also use
the lock to synchronize against the CQ event function, which as you
noted does take the lock too.

Basic idea is that we're tricky and careful so we can make the fast
path (completion interrupt handling) lock-free, but then use locks and
whatever else needed in the slow path (CQ async event handling, CQ
destroy).

  - R.

===

Roland, unfortunately we have seen that we need some locking on the
cq completion handler (there is a stack trace which resulted from this
lack of proper locking).
In our current driver, we are using the patch below (which uses RCU locking
instead of spinlocks).  I can prepare a proper patch for the upstream kernel.

===
net/mlx4_core: Fix racy flow in the driver CQ completion handler

The mlx4 CQ completion handler, mlx4_cq_completion, doesn't bother to lock
the radix tree which is used to manage the table of CQs, nor does it increase
the reference count of the CQ before invoking the user provided callback
(and decrease it afterwards).

This is racy and can cause use-after-free, null pointer dereference, etc, which
result in kernel crashes.

To fix this, we must do the following in mlx4_cq_completion:
- increase the ref count on the cq before invoking the user callback, and
   decrement it after the callback.
- Place a lock around the radix tree lookup/ref-count-increase

Using an irq spinlock will not fix this issue. The problem is that under VPI,
the ETH interface uses multiple msix irq's, which can result in one cq 
completion
event interrupting another in-progress cq completion event. A deadlock results
when the handler for the first cq completion grabs the spinlock, and is
interrupted by the second completion before it has a chance to release the 
spinlock.
The handler for the second completion will deadlock waiting for the spinlock
to be released.

The proper fix is to use the RCU mechanism for locking radix-tree accesses in
the cq completion event handler (The radix-tree implementation uses the RCU
mechanism, so rcu_read_lock/unlock in the reader, with rcu_synchronize in the
updater, will do the job).

Note that the same issue exists in mlx4_cq_event() (the cq async event
handler), which also takes the same lock on the radix tree. Here, we replace the
spinlock with an rcu_read_lock().

This patch was motivated by the following report from the field:

[...] box panic'ed when trying to find a completion queue. There is
no corruption but there is a possible race which could 

RE: rockets feedbacks?

2013-05-20 Thread Hefty, Sean
 Do we have some public quoted usages/feedback for rsockets? I think
 you've mentioned something during the panel at the Linux EU summit
 last week but I am not sure...

Most feedback I can think of has come via private emails or personal 
interactions, especially specific details of various usage models.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: list corruption in IPOIB

2013-05-20 Thread Jack Wang
On 2013年05月20日 21:50, Or Gerlitz wrote:
 On Mon, May 20, 2013 at 10:38 PM, Jack Wang jinpu.w...@profitbricks.com 
 wrote:

 The bug in our production environment is introduced in our backport
 about ipoib fixes from mainline, and when we hit that bug we reverted
 back to old kernel without the backport patch, and the bug didn't happen for 
 now.

 This list_del_init patch do fix list corruption warning, but it's not the 
 one we hit in production, the list corruption is reproduced in our test 
 setup with bug injection patch  iperf -P 50  mode switch.

 Is this clear for you now?
 
 
 NO, you say that the list_del_init patch eliminates the list
 corruption warning, does the list corruption is still reproduced in
 your test setup even when  the patch is applied?! what's the trace?
 and what is the trace you see in your production when using kernel X
 (which?) patches with commit Y (which?)
 
 Or.
 

No, the list_del_init fixed the list corruption warning, no list
corruption anymore.

The trace is in my earlier mail in this thread.

Jack



--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: MLX4 Cq Question

2013-05-20 Thread Hefty, Sean
 My reading of the code (i.e. hw/mlx4/cq.c) is that the hardware cqe
 owner_sr_opcode field contains MLX4_CQE_OPCODE_ERROR when there is an
 error and therefore, the only way to recover what the opcode was is
 through the wr_id you used when submitting the WR.
 
 Is my reading of the code correct?

I believe this is true wrt the IB spec.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MLX4 Cq Question

2013-05-20 Thread Tom Tucker

On 5/20/13 2:58 PM, Hefty, Sean wrote:

My reading of the code (i.e. hw/mlx4/cq.c) is that the hardware cqe
owner_sr_opcode field contains MLX4_CQE_OPCODE_ERROR when there is an
error and therefore, the only way to recover what the opcode was is
through the wr_id you used when submitting the WR.

Is my reading of the code correct?

I believe this is true wrt the IB spec.

Thanks, this was my recollection as well.

Tom

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rockets feedbacks?

2013-05-20 Thread Or Gerlitz
On Mon, May 20, 2013 at 10:52 PM, Hefty, Sean sean.he...@intel.com wrote:
 Do we have some public quoted usages/feedback for rsockets? I think
 you've mentioned something during the panel at the Linux EU summit
 last week but I am not sure...

 Most feedback I can think of has come via private emails or personal 
 interactions, especially specific details of various usage models.

So if you were pushing these private conversations to linux-rdma, more
have been known on rsockets for the benefit of all... oh well. I think
you mentioned something re Intel HPC group, or I am wrong?

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: rockets feedbacks?

2013-05-20 Thread Hefty, Sean
 So if you were pushing these private conversations to linux-rdma, more
 have been known on rsockets for the benefit of all... oh well. I think
 you mentioned something re Intel HPC group, or I am wrong?

rsockets will continue to be supported by myself and Intel going forward.  The 
rsocket work originated from HPC related work done that I was doing with Xeon 
Phi, and targeted Intel MPI and OSU MPI as its initial application.  The Xeon 
Phi processors are slower than the traditional Xeon cores; I believe I 
mentioned that this was a problem, since the socket API requires data copies in 
practice.  So, extensions to the rsocket API is needed to support zero-copy 
data transfers.

- Sean
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] make read_config() more robust

2013-05-20 Thread Yann Droneaud
Hi,

Please find three patches to protect libibverbs from using invalid,
unsecure configuration files.

Thoses configurations files are usually located in
/etc/libibverbs.d/ and contains the name of a shared library
to dlopen().

Only legitimate shared libraries should be loaded by libibverbs,
so it must be careful on the configuration files used.

Regards.

Yann Droneaud (3):
  read_config: ignore files beginning with '.'
  read_config: ignore directory entry with backup suffix (~)
  read_config: skip file/directory with unsecure permissions

 src/init.c | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

-- 
1.7.11.7

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] read_config: ignore files beginning with '.'

2013-05-20 Thread Yann Droneaud
Files beginning with a dot are mostly current and parent directories or,
by convention, hidden files.

Those path are skipped in find_sysfs_dev().

Signed-off-by: Yann Droneaud ydrone...@opteya.com
---
 src/init.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/init.c b/src/init.c
index 96ecf04..c880b68 100644
--- a/src/init.c
+++ b/src/init.c
@@ -308,6 +308,9 @@ static void read_config(void)
while ((dent = readdir(conf_dir))) {
struct stat buf;
 
+   if (dent-d_name[0] == '.')
+   continue;
+
if (asprintf(path, %s/%s, IBV_CONFIG_DIR, dent-d_name)  0) 
{
fprintf(stderr, PFX Warning: couldn't read config file 
%s/%s.\n,
IBV_CONFIG_DIR, dent-d_name);
-- 
1.7.11.7

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] read_config: ignore directory entry with backup suffix (~)

2013-05-20 Thread Yann Droneaud
Try to protect libibverbs from hand modified configuration files.

Signed-off-by: Yann Droneaud ydrone...@opteya.com
---
 src/init.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/init.c b/src/init.c
index c880b68..1981da7 100644
--- a/src/init.c
+++ b/src/init.c
@@ -311,6 +311,9 @@ static void read_config(void)
if (dent-d_name[0] == '.')
continue;
 
+   if (dent-d_name[0] == '\0' || 
dent-d_name[strlen(dent-d_name) - 1] == '~')
+   continue;
+
if (asprintf(path, %s/%s, IBV_CONFIG_DIR, dent-d_name)  0) 
{
fprintf(stderr, PFX Warning: couldn't read config file 
%s/%s.\n,
IBV_CONFIG_DIR, dent-d_name);
-- 
1.7.11.7

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] read_config: skip file/directory with unsecure permissions

2013-05-20 Thread Yann Droneaud
libibverbs must refuse to load arbitrary shared objects.

This patch check the configuration directory and files for
- being owned by root;
- not being writable by others.

Signed-off-by: Yann Droneaud ydrone...@opteya.com
---
 src/init.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/src/init.c b/src/init.c
index 1981da7..a37b57d 100644
--- a/src/init.c
+++ b/src/init.c
@@ -294,10 +294,24 @@ static void read_config_file(const char *path)
 
 static void read_config(void)
 {
+   struct stat buf;
DIR *conf_dir;
struct dirent *dent;
char *path;
 
+   if (stat(IBV_CONFIG_DIR, buf) || !S_ISDIR(buf.st_mode)) {
+   fprintf(stderr, PFX Warning: couldn't stat config directory 
'%s'.\n,
+   IBV_CONFIG_DIR);
+   return;
+   }
+
+   if (buf.st_uid != 0 || buf.st_gid != 0 ||
+   (buf.st_mode  S_IWOTH) != 0) {
+   fprintf(stderr, PFX Warning: unsecure config directory 
'%s'.\n,
+   IBV_CONFIG_DIR);
+   return;
+   }
+
conf_dir = opendir(IBV_CONFIG_DIR);
if (!conf_dir) {
fprintf(stderr, PFX Warning: couldn't open config directory 
'%s'.\n,
@@ -306,8 +320,6 @@ static void read_config(void)
}
 
while ((dent = readdir(conf_dir))) {
-   struct stat buf;
-
if (dent-d_name[0] == '.')
continue;
 
@@ -329,6 +341,13 @@ static void read_config(void)
if (!S_ISREG(buf.st_mode))
goto next;
 
+   if (buf.st_uid != 0 || buf.st_gid != 0 ||
+   (buf.st_mode  S_IWOTH) != 0) {
+   fprintf(stderr, PFX Warning: unsecure config file 
'%s'.\n,
+   path);
+   goto next;
+   }
+
read_config_file(path);
 next:
free(path);
-- 
1.7.11.7

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] add RAW Packet QP type

2013-05-20 Thread Shawn Bohrer
On Tue, Jan 17, 2012 at 10:21:28AM -0600, Steve Wise wrote:
 On 01/17/2012 09:59 AM, Or Gerlitz wrote:
 On 1/17/2012 5:08 PM, Steve Wise wrote:
 I think this series should add some new send flags for HW that
 does checksum offload [...] also, on ingress, most hardware can
 do INET checksum validation, and a way to indicate the results
 to the application is needed.  Perhaps flags in the CQE? [...]
 another form of HW assist is with VLAN insertion/extraction.
 The API should provide a way to specify if a VLAN ID should be
 inserted by HW and removed from a packet on ingress (and passed
 to the app via the CQE).  In fact, we probably want a way to
 associate a VLAN with a RAW QP, maybe as a QP attribute?
 
 Hi Steve,
 
 Nice to see how a discussion is quickly revived when the subject
 is close to people's needs... anyway, yep, sure, this submission
 allows for basic functionality, as I mentioned and there's more to
 add here, but this could (should...) be done gradually,  i.e in
 steps that adds on the basic patch. Now to the points you've
 raised:
 
 Sure, we want to be able to do checksum offload on TX and on RX as
 well. Adding checksum offload support will be done with a patch to
 libibverbs which is similar in spirit to commit e0605d IB/core:
 Add IP checksum offload support subject to reporting the RX
 checksum okay through new IB_WC_IP_CSUM_OK bit in ibv_wc-wc_flags
 along the lines of the patch I sent last week to the kernel. So we
 will have a new device capability and two new bit flags
 IB_SEND_IP_CSUM and IB_WC_IP_CSUM_OK, no ABI breaking... happy.
 
 Sounds good, with the caveat that we need more bits to specify L4 CSUM 
 generation as well as L3 (and L2 CRC).

Hi Or,

I appologize if I missed it, but did any support for L3/L4 CSUM
generation get added?  Doesn't look like the upstream libibverbs has
it, and I don't seem to see any patches floating around.

Thanks,
Shawn
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] configure: use automake's option subdir-objects

2013-05-20 Thread Yann Droneaud
Following advice in Autotool Mythbuster [1],
option subdir-objects can be used to have Makefiles
create object files in the same directory than
theirs source files.

It reduces clobbering in the build directory.

[1] Autotool Mythbuster, by Diego Elio Flameeyes Petten`o
http://www.flameeyes.eu/autotools-mythbuster/automake/nonrecursive.html

Signed-off-by: Yann Droneaud ydrone...@opteya.com
---
 configure.ac | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/configure.ac b/configure.ac
index a1d5365..231083c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -6,7 +6,7 @@ AC_CONFIG_SRCDIR([src/ibverbs.h])
 AC_CONFIG_AUX_DIR(config)
 AC_CONFIG_MACRO_DIR(config)
 AC_CONFIG_HEADER(config.h)
-AM_INIT_AUTOMAKE([foreign])
+AM_INIT_AUTOMAKE([foreign subdir-objects])
 m4_ifdef([AM_SILENT_RULES], [AM_SILENT_RULES([yes])])
 
 dnl Checks for programs
-- 
1.7.11.7

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] configure: apply updates proposed by autoupdate

2013-05-20 Thread Yann Droneaud
'autoupdate' is a tool to help developer to update configure.ac.

This patch apply a few fixes as suggested by autoupdate.

It was tested on Debian 6.0.7 (Squeeze) and Fedora 17 (Beefy Miracle).

Signed-off-by: Yann Droneaud ydrone...@opteya.com
---
 configure.ac | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/configure.ac b/configure.ac
index 8f79891..a1d5365 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1,7 +1,7 @@
 dnl Process this file with autoconf to produce a configure script.
 
-AC_PREREQ(2.57)
-AC_INIT(libibverbs, 1.1.6, linux-rdma@vger.kernel.org)
+AC_PREREQ([2.67])
+AC_INIT([libibverbs],[1.1.6],[linux-rdma@vger.kernel.org])
 AC_CONFIG_SRCDIR([src/ibverbs.h])
 AC_CONFIG_AUX_DIR(config)
 AC_CONFIG_MACRO_DIR(config)
@@ -11,14 +11,12 @@ m4_ifdef([AM_SILENT_RULES], [AM_SILENT_RULES([yes])])
 
 dnl Checks for programs
 AC_PROG_CC
-AC_GNU_SOURCE
+AC_USE_SYSTEM_EXTENSIONS
 AC_PROG_LN_S
-AC_PROG_LIBTOOL
-
 LT_INIT
 
 AC_ARG_WITH([valgrind],
-AC_HELP_STRING([--with-valgrind],
+AS_HELP_STRING([--with-valgrind],
 [Enable Valgrind annotations (small runtime overhead, default NO)]))
 if test x$with_valgrind = x || test x$with_valgrind = xno; then
 want_valgrind=no
@@ -84,9 +82,10 @@ fi
 AC_SUBST(LIBIBVERBS_VERSION_SCRIPT)
 
 AC_CACHE_CHECK(for .symver assembler support, ac_cv_asm_symver_support,
-[AC_TRY_COMPILE(, [asm(symbol:\n.symver symbol, api@ABI\n);],
-ac_cv_asm_symver_support=yes,
-ac_cv_asm_symver_support=no)])
+[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[]],
+[[asm(symbol:\n.symver symbol, api@ABI\n);]])],
+[ac_cv_asm_symver_support=yes],
+[ac_cv_asm_symver_support=no])])
 if test $ac_cv_asm_symver_support = yes; then
 AC_DEFINE([HAVE_SYMVER_SUPPORT], 1, [assembler has .symver support])
 fi
-- 
1.7.11.7

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] open files with close on exec flag

2013-05-20 Thread Yann Droneaud
File opened by libibverbs are not supposed to be inherited
across exec*(), most of the files are of no use for another program,
and others cannot be used without the associated memory mapping.

This patch changes open() and fopen() to always set close on exec flag.

This patch also add checks to configure to guess if fopen() supports e
flag. If O_CLOEXEC and SOCK_CLOEXEC are supported, fopen() should support
e. If not supported, its discarded according to POSIX. Many operating systems
have support for fopen(e).

You might find more information about close on exec in the following articles:

- Excuse me son, but your code is leaking !!! by Dan Walsh
http://danwalsh.livejournal.com/53603.html

- Secure File Descriptor Handling by Ulrich Drepper
http://udrepper.livejournal.com/20407.html

Note: this patch won't set close on exec flag on file descriptors
created by the kernel for completion channel and such. This should
be addressed by a kernel patch.

Signed-off-by: Yann Droneaud ydrone...@opteya.com
---
 configure.ac | 21 +
 src/device.c |  2 +-
 src/init.c   |  2 +-
 src/memory.c |  2 +-
 src/sysfs.c  |  2 +-
 5 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/configure.ac b/configure.ac
index efdc5ac..8f79891 100644
--- a/configure.ac
+++ b/configure.ac
@@ -44,9 +44,30 @@ AC_CHECK_HEADER(valgrind/memcheck.h,
 [if test $want_valgrind = yes; then
 AC_MSG_ERROR([Valgrind memcheck support requested, but 
valgrind/memcheck.h not found.])
 fi])
+AC_CHECK_HEADERS([fcntl.h sys/socket.h])
 
 dnl Checks for typedefs, structures, and compiler characteristics.
 AC_C_CONST
+AC_CHECK_DECLS([O_CLOEXEC],,[AC_DEFINE([O_CLOEXEC],[0], [Defined to 0 if not 
provided])],
+[[
+#ifdef HAVE_FCNTL_H
+# include fcntl.h
+#endif
+]])
+AC_CHECK_DECLS([SOCK_CLOEXEC],,[AC_DEFINE([SOCK_CLOEXEC],[0],[Defined to 0 if 
not provided])],
+[[
+#ifdef HAVE_SYS_SOCKET_H
+# include sys/socket.h
+#endif
+]])
+
+AC_CACHE_CHECK(for close on exec modifier for fopen(), 
ac_cv_feature_stream_cloexec_flag,
+   [if test $ac_cv_have_decl_O_CLOEXEC = yes ; then
+if test $ac_cv_have_decl_SOCK_CLOEXEC = yes ; then
+ac_cv_feature_stream_cloexec_flag=e
+fi
+fi])
+AC_DEFINE_UNQUOTED([STREAM_CLOEXEC], $ac_cv_feature_stream_cloexec_flag, 
[fopen() modifier for setting close on exec flag])
 
 AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script,
 [if test -n `$LD --help  /dev/null 2/dev/null | grep version-script`; 
then
diff --git a/src/device.c b/src/device.c
index 5798895..1923fa5 100644
--- a/src/device.c
+++ b/src/device.c
@@ -135,7 +135,7 @@ struct ibv_context *__ibv_open_device(struct ibv_device 
*device)
 * We'll only be doing writes, but we need O_RDWR in case the
 * provider needs to mmap() the file.
 */
-   cmd_fd = open(devpath, O_RDWR);
+   cmd_fd = open(devpath, O_RDWR | O_CLOEXEC);
free(devpath);
 
if (cmd_fd  0)
diff --git a/src/init.c b/src/init.c
index 8d6786e..8e93f3f 100644
--- a/src/init.c
+++ b/src/init.c
@@ -243,7 +243,7 @@ static void read_config_file(const char *path)
size_t buflen = 0;
ssize_t len;
 
-   conf = fopen(path, r);
+   conf = fopen(path, r STREAM_CLOEXEC);
if (!conf) {
fprintf(stderr, PFX Warning: couldn't read config file %s.\n,
path);
diff --git a/src/memory.c b/src/memory.c
index 7d97e55..e9d1eec 100644
--- a/src/memory.c
+++ b/src/memory.c
@@ -109,7 +109,7 @@ static unsigned long get_page_size(void *base)
pid = getpid();
snprintf(buf, sizeof(buf), /proc/%d/smaps, pid);
 
-   file = fopen(buf, r);
+   file = fopen(buf, r STREAM_CLOEXEC);
if (!file)
goto out;
 
diff --git a/src/sysfs.c b/src/sysfs.c
index 85aee39..e031631 100644
--- a/src/sysfs.c
+++ b/src/sysfs.c
@@ -85,7 +85,7 @@ int ibv_read_sysfs_file(const char *dir, const char *file,
if (asprintf(path, %s/%s, dir, file)  0)
return -1;
 
-   fd = open(path, O_RDONLY);
+   fd = open(path, O_RDONLY | O_CLOEXEC);
if (fd  0) {
free(path);
return -1;
-- 
1.7.11.7

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 0/9] Add receive Flow Steering support

2013-05-20 Thread Shawn Bohrer
On Wed, Apr 24, 2013 at 04:58:43PM +0300, Or Gerlitz wrote:
 Hi Roland, all
 
 The first five patches in the series are mlx4 DMFS (Device Managed Flow 
 Steering) pre-patches needed for flow steering access from the mlx4 IB driver.
 
   net/mlx4_core: Move DMFS HW structs to common header file
   net/mlx4: Match DMFS promiscuous field names to firmware spec
   net/mlx4_core: Change few DMFS fields names to match firmare spec
   net/mlx4_core: Directly expose fields of DMFS HW rule control segment
   net/mlx4_core: Expose few helpers to fill DMFS HW strucutures
 
 We're submitting them through the rdma tree as the next patch depend on them,
 please apply them for 3.10 this way (accepting also the next patches) or 
 another (even if they are not accepted now), so we will not have extra 
 dependencies with further changes on that area.
 
 The next four patches add Flow Steering support to the kernel IB core, 
 to uverbs and to the mlx4 IB (verbs) driver along with one patch
 to uverbs which adds some code to support extensions.
 
   IB/core: Add receive Flow Steering support
   IB/core: Infra-structure to support verbs extensions through uverbs
   IB/core: Export ib_create/destroy_flow through uverbs
   IB/mlx4: Add receive Flow Steering support
 
 The main patch which introduces the Flow-Steering API is IB/core: Add 
 receive Flow 
 Steering support, see its change log. Looking on the Network Adapter Flow 
 Steering 
 slides from Tzahi Oved which he presented on the annual OFA 2012 meeting 
 could be helpful
 https://www.openfabrics.org/resources/document-downloads/presentations/doc_download/518-network-adapter-flow-steering.html

Hi Or,

Are there any patches for libibverbs to add
ibv_create_flow/ibv_destroy_flow?  And are there any needed patches
for libmlx4?  I'm building up a stack so we can begin testing this
series.

Thanks,
Shawn
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html