Re: [RFC] Direct Sockets Support??

2001-05-08 Thread 'Pete Wyckoff'

[EMAIL PROTECTED] said:
>   > But in the case of an application which fits in main memory, and
>   > has been running for a while (so all pages are present and
>   > dirty), all you'd really have to do is verify the page tables are
>   > in the proper state and skip the TLB flush, right?
> 
>   We really cannot assume this. There are two cases 
>   a. when a user app wants to receive some data, it allocates
> memory(using malloc) and waits for the hw to do zero-copy read. The kernel
> does not allocate physical page frames for the entire memory region
> allocated. We need to lock the memory (and locking is expensive due to
> costly TLB flushes) to do this
> 
>   b. when a user app wants to send data, he fills the buffer
> and waits for the hw to transmit data, but under heavy physical memory
> pressure, the swapper might swap the pages we want to transmit. So we need
> to lock the memory to be 100% sure.

You're right, of course.  But I suspect that the fast path of
re-locking memory which is happily in core will go much faster
by removing the multi-processor TLB purge.  And it can't hurt,
unless I'm missing something.

-- Pete

--- linux-2.4.4-stock/mm/mlock.c	Tue May  8 17:26:34 2001
+++ linux/mm/mlock.c	Tue May  8 17:24:13 2001
@@ -114,6 +114,10 @@
 	return 0;
 }
 
+/* implemented in mm/memory.c */
+extern int mlock_make_pages_present(struct vm_area_struct *vma,
+	unsigned long addr, unsigned long end);
+
 static int mlock_fixup(struct vm_area_struct * vma, 
 	unsigned long start, unsigned long end, unsigned int newflags)
 {
@@ -138,7 +142,7 @@
 		pages = (end - start) >> PAGE_SHIFT;
 		if (newflags & VM_LOCKED) {
 			pages = -pages;
-			make_pages_present(start, end);
+			mlock_make_pages_present(vma, start, end);
 		}
 		vma->vm_mm->locked_vm -= pages;
 	}

--- linux-2.4.4-stock/mm/memory.c	Tue May  8 17:25:36 2001
+++ linux/mm/memory.c	Tue May  8 17:24:40 2001
@@ -1438,3 +1438,80 @@
 	} while (addr < end);
 	return 0;
 }
+
+/*
+ * Specialized version of make_pages_present which does not require
+ * a multi-processor TLB purge for every page if nothing about the PTE
+ * was modified.
+ */
+int mlock_make_pages_present(struct vm_area_struct *vma,
+	unsigned long addr, unsigned long end)
+{
+	int ret, write;
+	struct mm_struct *mm = current->mm;
+
+	write = (vma->vm_flags & VM_WRITE) != 0;
+
+	/*
+	 * We need the page table lock to synchronize with kswapd
+	 * and the SMP-safe atomic PTE updates.
+	 */
+	spin_lock(&mm->page_table_lock);
+
+	ret = 0;
+	for (ret=0; !ret && addr < end; addr += PAGE_SIZE) {
+		pgd_t *pgd;
+		pmd_t *pmd;
+		pte_t *pte, entry;
+		int modified;
+
+		current->state = TASK_RUNNING;
+		pgd = pgd_offset(mm, addr);
+		pmd = pmd_alloc(mm, pgd, addr);
+		if (!pmd) {
+			ret = -1;
+			break;
+		}
+		pte = pte_alloc(mm, pmd, addr);
+		if (!pte) {
+			ret = -1;
+			break;
+		}
+		entry = *pte;
+		if (!pte_present(entry)) {
+			/*
+			 * If it truly wasn't present, we know that kswapd
+			 * and the PTE updates will not touch it later. So
+			 * drop the lock.
+			 */
+			if (pte_none(entry)) {
+				ret = do_no_page(mm, vma, addr, write, pte);
+				continue;
+			}
+			ret = do_swap_page(mm, vma, addr, pte,
+				pte_to_swp_entry(entry), write);
+			continue;
+		}
+
+		modified = 0;
+		if (write) {
+			if (!pte_write(entry)) {
+				ret = do_wp_page(mm, vma, addr, pte, entry);
+				continue;
+			}
+			if (!pte_dirty(entry)) {
+				entry = pte_mkdirty(entry);
+				modified = 1;
+			}
+		}
+		if (!pte_young(entry)) {
+			entry = pte_mkyoung(entry);
+			modified = 1;
+		}
+		if (modified)
+			establish_pte(vma, addr, pte, entry);
+	}
+
+	spin_unlock(&mm->page_table_lock);
+	return ret;
+}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] Direct Sockets Support??

2001-05-08 Thread Alan Cox

>   a. when a user app wants to receive some data, it allocates
> memory(using malloc) and waits for the hw to do zero-copy read. The kernel
> does not allocate physical page frames for the entire memory region
> allocated. We need to lock the memory (and locking is expensive due to
> costly TLB flushes) to do this
> 
>   b. when a user app wants to send data, he fills the buffer
> and waits for the hw to transmit data, but under heavy physical memory
> pressure, the swapper might swap the pages we want to transmit. So we need
> to lock the memory to be 100% sure.
> 

Or c) you prealloc two ring buffers.
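
One way to read option (c), as a rough sketch rather than anything posted in this thread: allocate a fixed ring of buffers once, and recycle its slots so the per-transfer path never allocates or locks memory. Everything named here (struct ring, ring_init, ring_put, ring_get, the slot count and slot size) is made up for illustration. A real setup would presumably have one such ring for transmit and one for receive, pinned once at startup and shared with the hardware; the copy into the slot below only stands in for the application filling a buffer.

```c
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 8u      /* hypothetical slot count; must be a power of two */
#define SLOT_BYTES 2048u   /* hypothetical fixed slot size */

struct ring {
    unsigned char slot[RING_SLOTS][SLOT_BYTES]; /* allocated once, never resized */
    size_t len[RING_SLOTS];                     /* valid bytes in each slot */
    unsigned head;                              /* number of slots ever filled */
    unsigned tail;                              /* number of slots ever drained */
};

/* Zero the ring; as a side effect this touches every page it occupies. */
void ring_init(struct ring *r)
{
    memset(r, 0, sizeof(*r));
}

/* Fill the next free slot. Returns 0, or -1 if the ring is full or n is too big. */
int ring_put(struct ring *r, const void *data, size_t n)
{
    unsigned i;

    if (n > SLOT_BYTES || r->head - r->tail == RING_SLOTS)
        return -1;
    i = r->head & (RING_SLOTS - 1);
    memcpy(r->slot[i], data, n);
    r->len[i] = n;
    r->head++;
    return 0;
}

/* Drain the oldest slot into out. Returns its length, or -1 if the ring is
 * empty or out is too small (the slot is then left in place). */
long ring_get(struct ring *r, void *out, size_t cap)
{
    unsigned i;
    size_t n;

    if (r->head == r->tail)
        return -1;
    i = r->tail & (RING_SLOTS - 1);
    n = r->len[i];
    if (n > cap)
        return -1;
    memcpy(out, r->slot[i], n);
    r->tail++;
    return (long)n;
}
```

The head and tail counters only ever increase; masking with RING_SLOTS - 1 maps them onto slots, which is why the slot count has to be a power of two.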




RE: [RFC] Direct Sockets Support??

2001-05-08 Thread Venkatesh Ramamurthy

> But in the case of an application which fits in main memory, and
> has been running for a while (so all pages are present and
> dirty), all you'd really have to do is verify the page tables are
> in the proper state and skip the TLB flush, right?

We really cannot assume this. There are two cases:
a. When a user app wants to receive some data, it allocates
memory (using malloc) and waits for the hw to do a zero-copy read. The kernel
does not allocate physical page frames for the entire memory region
allocated. We need to lock the memory (and locking is expensive due to
costly TLB flushes) to do this.

b. When a user app wants to send data, it fills the buffer
and waits for the hw to transmit the data, but under heavy physical memory
pressure, the swapper might swap out the pages we want to transmit. So we need
to lock the memory to be 100% sure.






Re: [RFC] Direct Sockets Support??

2001-05-08 Thread Pete Wyckoff

[EMAIL PROTECTED] said:
> > A couple of concerns I have:
> >  * How to pin or pagelock the application buffer without
> > making a kernel transition.
> 
> You need to pin them in advance. And pinning pages is _expensive_ so you dont
> want to keep pinning/unpinning pages

I can't convince myself why this has to be so expensive.  The
current implementation does this for mlock:

1.  Split vma if only a subset of the pages are being locked.
2.  Mark bit in vma.
3.  Make sure the pages are in core.

That third step has the potential of being the most expensive,
as changing the page tables requires invalidating the TLBs on all
processors.  Currently make_pages_present() does the work for 3.

But in the case of an application which fits in main memory, and
has been running for a while (so all pages are present and
dirty), all you'd really have to do is verify the page tables are
in the proper state and skip the TLB flush, right?

Then 3 turns into a single spin_lock pair for the page_table_lock, 
and walking down the page table.

The VMA splitting can be nasty, as it might require a couple of
slab allocations, and doing an AVL insertion.  (More nastiness in
the case of shared memory or file mapping, too.)  But nothing
like playing with TLBs.

Any reason why make_pages_present() is not the really oversized
hammer it seems to be?

-- Pete



Re: [RFC] Direct Sockets Support??

2001-05-08 Thread Alan Cox

> A couple of concerns I have:
>  * How to pin or pagelock the application buffer without
> making a kernel transition.

You need to pin them in advance. And pinning pages is _expensive_ so you don't
want to keep pinning/unpinning pages.
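
As a user-space illustration of pinning in advance (a sketch, not code from this thread): lock the buffer once at setup with the standard mlock() call, reuse it for every transfer, and unlock only at teardown. struct pinned_buf and its helpers are hypothetical names. mlock() can fail, for example with ENOMEM or EPERM when the locked-memory limit (RLIMIT_MEMLOCK) is too low, so the sketch records whether pinning succeeded instead of assuming it did.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

struct pinned_buf {
    void *addr;          /* buffer reused for every transfer */
    size_t len;
    int locked;          /* 1 if mlock() succeeded at setup */
    unsigned long uses;  /* how many transfers reused the buffer */
};

/* Setup path: allocate, fault the pages in, and try to pin them, once. */
int pinned_buf_setup(struct pinned_buf *b, size_t len)
{
    b->addr = malloc(len);
    if (!b->addr)
        return -1;
    b->len = len;
    b->uses = 0;
    memset(b->addr, 0, len);                 /* make the pages present */
    b->locked = (mlock(b->addr, len) == 0);  /* Linux rounds to page boundaries */
    return 0;
}

/* Per-transfer path: no mlock()/munlock(), just hand out the same buffer. */
void *pinned_buf_acquire(struct pinned_buf *b)
{
    b->uses++;
    return b->addr;
}

/* Teardown path: unpin (if pinned) and free, once. */
void pinned_buf_teardown(struct pinned_buf *b)
{
    if (b->locked)
        munlock(b->addr, b->len);
    free(b->addr);
    b->addr = NULL;
    b->locked = 0;
}
```

This still makes kernel transitions, but only at setup and teardown, not per transfer.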

>  * Assuming the memory can be locked down, how can a list 
> of physical memory ranges be obtained (necessary to support 
> scatter/gather DMA? Is kiobufs suitable with it's page-alignment 
> constraints? If kiobufs will work, how can the kernel transition be 
> avoided?

kiovecs will do that. It might be a little heavyweight but that should improve
in 2.5 as we move to a slightly lighter model

> WinSock Direct seems to address these concerns.  These issues
> become important at 1Gb and 10Gb speeds.

1Gbit - not really, 10Gbit yes
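
To put rough numbers on that, a back-of-the-envelope sketch (arithmetic only, not a measurement from this thread): the time one full-size frame occupies the wire is roughly the per-packet budget the host has if it wants to keep the link busy. The function name and the 1500-byte frame size are assumptions, and preamble, headers, and inter-frame gap are ignored.

```c
#include <stdint.h>

/*
 * Nanoseconds that one frame of frame_bytes occupies a link running at
 * gbit_per_s Gbit/s. A 1 Gbit/s link moves one bit per nanosecond, so the
 * result is simply bits / gbit_per_s (integer division).
 */
uint64_t frame_time_ns(uint64_t frame_bytes, uint64_t gbit_per_s)
{
    return (frame_bytes * 8) / gbit_per_s;
}
```

For 1500-byte frames this gives 12000 ns per frame at 1 Gbit/s and 1200 ns at 10 Gbit/s, so any fixed per-packet cost, such as pinning and unpinning, has a tenth of the time to fit into.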



Re: [RFC] Direct Sockets Support??

2001-05-03 Thread Alan Cox

>   Thats exactly my point, we need to define a new protocol family to
> support it. This means that all applications using PF_INET needs to be
> changed and recompiled. My basic argument goes like this if hardware can

Thanks to the magic of shared libraries and LD_PRELOAD a library hook can
actually make the decision underneath the application
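
A minimal sketch of such a library hook, under stated assumptions: the shared object defines its own socket(), decides which family to open, and forwards to the real libc function found with dlsym(RTLD_NEXT, ...). PF_DIRECT_HYPOTHETICAL and pick_family() are invented for illustration; no such protocol family is assumed to exist, so the wrapper below always passes the request through unchanged.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for RTLD_NEXT on glibc */
#endif
#include <dlfcn.h>
#include <sys/socket.h>

#define PF_DIRECT_HYPOTHETICAL 0x7ffe   /* placeholder, not a real family */

/*
 * Policy: redirect plain PF_INET stream sockets to the direct family when
 * direct-capable hardware is available; leave everything else alone.
 * (Flags OR-ed into type, such as SOCK_NONBLOCK, are not handled here.)
 */
int pick_family(int domain, int type, int direct_available)
{
    if (domain == PF_INET && type == SOCK_STREAM && direct_available)
        return PF_DIRECT_HYPOTHETICAL;
    return domain;
}

/* Interposed socket(): same signature as libc, so applications need no change. */
int socket(int domain, int type, int protocol)
{
    static int (*real_socket)(int, int, int);

    if (!real_socket)
        real_socket = (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");
    if (!real_socket)
        return -1;
    /* direct_available is 0: no direct hardware is assumed in this sketch. */
    return real_socket(pick_family(domain, type, 0), type, protocol);
}
```

Built as a shared object (for example gcc -shared -fPIC hook.c -o hook.so -ldl) and loaded with LD_PRELOAD=./hook.so, the wrapper sits between an unmodified, unrecompiled application and libc.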



RE: [RFC] Direct Sockets Support??

2001-05-03 Thread Venkatesh Ramamurthy

>> technology is Infiniband . In Infiniband, the hardware supports IPv6 . For
>> this type of devices there is no need for software TCP/IP. But for
>> networking application, which mostly uses sockets, there is a performance
>> penalty with using software TCP/IP over this hardware. 

> IPv6 is only the bottom layer of the stack. TCP does a lot lot more.

Sorry to have confused you. IB supports the notion of a connection
over IPv6, not exactly TCP. I used TCP interchangeably with the notion of a
connection provided by Infiniband. Infiniband is a cluster of technologies like
VI, IP, etc., so I felt that we could take advantage of this to do networking.
Because the speed of IB ranges from 2.5Gbps to 30Gbps, even a slight overhead
in software will affect performance very badly.





RE: [RFC] Direct Sockets Support??

2001-05-03 Thread Venkatesh Ramamurthy

> For the case where the routing will be external. Thats conveniently something
> you can deduce in advance. In theory nothing stops you implementing this.
> Conventionally you would do that with BSD sockets by implementing a new
> socket family PF_INFINIBAND. You might then choose to make the selection
> of that either done by the application or under it by C library overrides.
> 
That's exactly my point: we need to define a new protocol family to
support it. This means that all applications using PF_INET need to be
changed and recompiled. My basic argument goes like this: if hardware can
support the notion of a connection, the sockets layer should be aware of this
and send all requests to the hw. I can assign an IPv4 address (for the sake of
backward compatibility) and get away w/o software TCP/IP. I get the
performance benefit of hardware TCP/IP (notion of connection). 

The windoze 2000 DDK has an interesting section about WinSock
Direct(r), which lets SAN hardware (like IB) still use traditional
PF_INET.

Also one interesting whitepaper 

http://servernet.himalaya.compaq.com/snet2/whitepapers/WSD_Perf_White_Paper_3-21-01.doc





Re: [RFC] Direct Sockets Support??

2001-05-03 Thread Alan Cox

> different topology subnets. Fabrics like Infiniband provide security on
> hardware, so there is no need to worry about it. The simple point  is that
> hw supports TCP/IP, then why do we need a software TCP/IP over it?

For the case where the routing will be external. Thats conveniently something
you can deduce in advance. In theory nothing stops you implementing this.
Conventionally you would do that with BSD sockets by implementing a new
socket family PF_INFINIBAND. You might then choose to make the selection
of that either done by the application or under it by C library overrides.
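
For the first of those two choices, selection done by the application, a hedged sketch could look like this: try the new family first and fall back to PF_INET when the kernel does not know it. No PF_INFINIBAND exists in the kernel discussed here, so the family number is a placeholder passed in by the caller, and open_stream_socket() is a hypothetical helper, not an existing API.

```c
#include <errno.h>
#include <sys/socket.h>

/*
 * Open a stream socket in preferred_family if the kernel supports it,
 * otherwise fall back to PF_INET. On success returns the descriptor and
 * stores the family actually used in *family_used; returns -1 on failure.
 */
int open_stream_socket(int preferred_family, int *family_used)
{
    int fd = socket(preferred_family, SOCK_STREAM, 0);

    if (fd >= 0) {
        *family_used = preferred_family;
        return fd;
    }
    /* Fall back only when the family or protocol is simply not there. */
    if (errno != EAFNOSUPPORT && errno != EPROTONOSUPPORT)
        return -1;

    fd = socket(PF_INET, SOCK_STREAM, 0);
    if (fd >= 0)
        *family_used = PF_INET;
    return fd;
}
```

The cost of this variant is the one raised elsewhere in the thread: every application has to be changed and recompiled to call the helper, which is what a C library override avoids.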

A network protocol stack is also not required to use sk_buffs, or to use
conventional dev_queue_foo() models so you can write a fairly thin layer.
What I am not sure about would be the best way to implement read/write
operations if the hardware can support these without kernel calls - ie
via mmap and secure page access.

That bit is an interesting problem

Alan






RE: [RFC] Direct Sockets Support??

2001-05-03 Thread Jesse Pollard

-  Received message begins Here  -

> 
> 
>   > Doesn't this bypass all of the network security controls? Granted - it is
>   > completely reasonable in a dedicated environment, but I would think the
>   > security loss would prevent it from being used for most usage.
> 
>   Direct Sockets makes sense only in clustering (server farms) to
> reduce intra-farm communication. It is *not* supposed to be used for regular
> internet. Direct Sockets over subnets is also tough to implement it across
> different topology subnets. Fabrics like Infiniband provide security on
> hardware, so there is no need to worry about it. The simple point  is that
> hw supports TCP/IP, then why do we need a software TCP/IP over it?

Because the hardware doesn't have the user's security context. All it can
see are addresses, socket numbers and protocol. Neither can it be extended
with that information (IPSec). Authentication of the connections is not
possible.

Now... If the server farm only runs one job at a time, it is irrelevant...

-
Jesse I Pollard, II
Email: [EMAIL PROTECTED]

Any opinions expressed are solely my own.



Re: [RFC] Direct Sockets Support??

2001-05-03 Thread Alan Cox

> technology is Infiniband . In Infiniband, the hardware supports IPv6 . For
> this type of devices there is no need for software TCP/IP. But for
> networking application, which mostly uses sockets, there is a performance
> penalty with using software TCP/IP over this hardware. 

IPv6 is only the bottom layer of the stack. TCP does a lot lot more.

> > access setup is actually needed.
> > 
>   My point is that if the hardware is capable of doing TCP/IP , we
> should let the sockets layer talk directly to it (direct sockets). Thereby
> the application which uses the sockets will get better performance.

That depends on where your overheads are. Remember that for every direct
access you make you trade off kernel syscall overhead against userspace
scheduling and locking overhead. 

The VI architecture tries to design well to handle this. I've not seen enough
about Infiniband to know whether it does. The 'better performance' is an
assumption that isn't always as simple as it seems - especially with high mtu
values and real world applications.




RE: [RFC] Direct Sockets Support??

2001-05-03 Thread Venkatesh Ramamurthy


> Doesn't this bypass all of the network security controls? Granted - it is
> completely reasonable in a dedicated environment, but I would think the
> security loss would prevent it from being used for most usage.

Direct Sockets makes sense only in clustering (server farms) to
reduce intra-farm communication. It is *not* supposed to be used for regular
internet. Direct Sockets is also tough to implement across subnets with
different topologies. Fabrics like Infiniband provide security in
hardware, so there is no need to worry about it. The simple point is that
if the hw supports TCP/IP, why do we need a software TCP/IP over it?




RE: [RFC] Direct Sockets Support??

2001-05-03 Thread Jesse Pollard


>   > Define 'direct sockets' firstly.
>   Direct Sockets is the ablity by which the application(using sockets)
> can use the hardwares features to provide connection, flow control,
> etc.,instead of the TCP and IP software module. A typical hardware
> technology is Infiniband . In Infiniband, the hardware supports IPv6 . For
> this type of devices there is no need for software TCP/IP. But for
> networking application, which mostly uses sockets, there is a performance
> penalty with using software TCP/IP over this hardware. 
> 
> > I have seen several lines of attack on very high bandwidth devices. Firstly
> > the linux projects a while ago doing usermode message passing directly over
> > network cards for ultra low latency. Secondly there was a VI based project
> > that was mostly driven from userspace.
> > 
>   The application needs to rewritten to use VIPL, but if we could
> provide a sockets over VI (or Sockets over IB), then the existing
> applications can run with a known environment. 
> 
> 
> > One thing that remains unresolved is the question as to whether the very low
> > cost Linux syscalls and zero copy are enough to achieve this using a
> > conventional socket API and the kernel space, or whether a hybrid direct 
> > access setup is actually needed.
> > 
>   My point is that if the hardware is capable of doing TCP/IP , we
> should let the sockets layer talk directly to it (direct sockets). Thereby
> the application which uses the sockets will get better performance.

Doesn't this bypass all of the network security controls? Granted - it is
completely reasonable in a dedicated environment, but I would think the
security loss would prevent it from being used for most usage.

-
Jesse I Pollard, II
Email: [EMAIL PROTECTED]

Any opinions expressed are solely my own.



RE: [RFC] Direct Sockets Support??

2001-05-03 Thread Venkatesh Ramamurthy


> Define 'direct sockets' firstly.
Direct Sockets is the ability by which the application (using sockets)
can use the hardware's features to provide connection, flow control,
etc., instead of the TCP and IP software module. A typical hardware
technology is Infiniband. In Infiniband, the hardware supports IPv6. For
this type of device there is no need for software TCP/IP. But for
networking applications, which mostly use sockets, there is a performance
penalty with using software TCP/IP over this hardware. 

> I have seen several lines of attack on very high bandwidth devices. Firstly
> the linux projects a while ago doing usermode message passing directly over
> network cards for ultra low latency. Secondly there was a VI based project
> that was mostly driven from userspace.
> 
The application needs to be rewritten to use VIPL, but if we could
provide sockets over VI (or sockets over IB), then the existing
applications can run in a known environment. 


> One thing that remains unresolved is the question as to whether the very low
> cost Linux syscalls and zero copy are enough to achieve this using a
> conventional socket API and the kernel space, or whether a hybrid direct 
> access setup is actually needed.
> 
My point is that if the hardware is capable of doing TCP/IP, we
should let the sockets layer talk directly to it (direct sockets). Thereby
the application which uses the sockets will get better performance.






Re: [RFC] Direct Sockets Support??

2001-05-03 Thread Alan Cox

> With the advent of VI and Infiniband, there is a growing need to support
> Sockets over such new technologies. I studied recent performance
> analysis of sockets vs direct sockets and found that there is a 250%
> performance hike and 30% decrease in latency time. Also CPU bandwidth is
> significantly reduced.

Define 'direct sockets' firstly.

I have seen several lines of attack on very high bandwidth devices. Firstly
the linux projects a while ago doing usermode message passing directly over
network cards for ultra low latency. Secondly there was a VI based project
that was mostly driven from userspace.

One thing that remains unresolved is the question as to whether the very low
cost Linux syscalls and zero copy are enough to achieve this using a
conventional socket API and the kernel space, or whether a hybrid direct 
access setup is actually needed.




RE: [RFC] Direct Sockets Support??

2001-05-03 Thread Venkatesh Ramamurthy


> Define 'direct sockets' firstly.

Direct Sockets is the ability by which an application (using sockets)
can use the hardware's features to provide connection, flow control,
etc., instead of the TCP and IP software modules. A typical hardware
technology is Infiniband. In Infiniband, the hardware supports IPv6. For
this type of device there is no need for software TCP/IP. But
networking applications, which mostly use sockets, pay a performance
penalty when they run software TCP/IP over this hardware.

> I have seen several lines of attack on very high bandwidth devices.
> Firstly the linux projects a while ago doing usermode message passing
> directly over network cards for ultra low latency. Secondly there was
> a VI based project that was mostly driven from userspace.

The application needs to be rewritten to use VIPL, but if we could
provide sockets over VI (or Sockets over IB), then the existing
applications could run in a known environment.


> One thing that remains unresolved is the question as to whether the very
> low cost Linux syscalls and zero copy are enough to achieve this using a
> conventional socket API and the kernel space, or whether a hybrid direct
> access setup is actually needed.

My point is that if the hardware is capable of doing TCP/IP, we
should let the sockets layer talk directly to it (direct sockets). Thereby
the application which uses the sockets will get better performance.






RE: [RFC] Direct Sockets Support??

2001-05-03 Thread Jesse Pollard


>> Define 'direct sockets' firstly.
>
> Direct Sockets is the ability by which an application (using sockets)
> can use the hardware's features to provide connection, flow control,
> etc., instead of the TCP and IP software modules. A typical hardware
> technology is Infiniband. In Infiniband, the hardware supports IPv6. For
> this type of device there is no need for software TCP/IP. But
> networking applications, which mostly use sockets, pay a performance
> penalty when they run software TCP/IP over this hardware.
>
>> I have seen several lines of attack on very high bandwidth devices.
>> Firstly the linux projects a while ago doing usermode message passing
>> directly over network cards for ultra low latency. Secondly there was
>> a VI based project that was mostly driven from userspace.
>
> The application needs to be rewritten to use VIPL, but if we could
> provide sockets over VI (or Sockets over IB), then the existing
> applications could run in a known environment.
>
>> One thing that remains unresolved is the question as to whether the very
>> low cost Linux syscalls and zero copy are enough to achieve this using a
>> conventional socket API and the kernel space, or whether a hybrid direct
>> access setup is actually needed.
>
> My point is that if the hardware is capable of doing TCP/IP, we
> should let the sockets layer talk directly to it (direct sockets). Thereby
> the application which uses the sockets will get better performance.

Doesn't this bypass all of the network security controls? Granted - it is
completely reasonable in a dedicated environment, but I would think the
security loss would prevent it from being used for most usage.

-
Jesse I Pollard, II
Email: [EMAIL PROTECTED]

Any opinions expressed are solely my own.



RE: [RFC] Direct Sockets Support??

2001-05-03 Thread Venkatesh Ramamurthy


> Doesn't this bypass all of the network security controls? Granted - it is
> completely reasonable in a dedicated environment, but I would think the
> security loss would prevent it from being used for most usage.

Direct Sockets makes sense only in clustering (server farms) to
reduce intra-farm communication overhead. It is *not* supposed to be used
for the regular internet. Direct Sockets is also tough to implement across
subnets with different topologies. Fabrics like Infiniband provide security
in hardware, so there is no need to worry about it. The simple point is: if
the hw supports TCP/IP, why do we need a software TCP/IP over it?




Re: [RFC] Direct Sockets Support??

2001-05-03 Thread Alan Cox

> technology is Infiniband. In Infiniband, the hardware supports IPv6. For
> this type of device there is no need for software TCP/IP. But
> networking applications, which mostly use sockets, pay a performance
> penalty when they run software TCP/IP over this hardware.

IPv6 is only the bottom layer of the stack. TCP does a lot lot more.

>> access setup is actually needed.
>
> My point is that if the hardware is capable of doing TCP/IP, we
> should let the sockets layer talk directly to it (direct sockets). Thereby
> the application which uses the sockets will get better performance.

That depends on where your overheads are. Remember that for every direct
access you make you trade off kernel syscall overhead against userspace
scheduling and locking overhead. 

The VI architecture tries to design well to handle this; I've not seen enough
about Infiniband to know whether it does. The 'better performance' is an
assumption that isn't always as simple as it seems - especially with high mtu
values and real world applications.




RE: [RFC] Direct Sockets Support??

2001-05-03 Thread Jesse Pollard

>> Doesn't this bypass all of the network security controls? Granted - it is
>> completely reasonable in a dedicated environment, but I would think the
>> security loss would prevent it from being used for most usage.
>
> Direct Sockets makes sense only in clustering (server farms) to
> reduce intra-farm communication overhead. It is *not* supposed to be used
> for the regular internet. Direct Sockets is also tough to implement across
> subnets with different topologies. Fabrics like Infiniband provide security
> in hardware, so there is no need to worry about it. The simple point is: if
> the hw supports TCP/IP, why do we need a software TCP/IP over it?

Because the hardware doesn't have the user's security context. All it can
see are addresses, socket numbers and protocol. Neither can it be extended
with that information (IPSec). Authentication of the connections is not
possible.

Now... If the server farm only runs one job at a time, it is irrelevant...

-
Jesse I Pollard, II
Email: [EMAIL PROTECTED]

Any opinions expressed are solely my own.



Re: [RFC] Direct Sockets Support??

2001-05-03 Thread Alan Cox

> subnets with different topologies. Fabrics like Infiniband provide security
> in hardware, so there is no need to worry about it. The simple point is: if
> the hw supports TCP/IP, why do we need a software TCP/IP over it?

For the case where the routing will be external. That's conveniently something
you can deduce in advance. In theory nothing stops you implementing this.
Conventionally you would do that with BSD sockets by implementing a new
socket family PF_INFINIBAND. You might then choose to have that selection
made either by the application or underneath it by C library overrides.

A network protocol stack is also not required to use sk_buffs, or to use
conventional dev_queue_foo() models, so you can write a fairly thin layer.
What I am not sure about would be the best way to implement read/write
operations if the hardware can support these without kernel calls - i.e.
via mmap and secure page access.

That bit is an interesting problem.

Alan






RE: [RFC] Direct Sockets Support??

2001-05-03 Thread Venkatesh Ramamurthy

> For the case where the routing will be external. That's conveniently
> something you can deduce in advance. In theory nothing stops you
> implementing this. Conventionally you would do that with BSD sockets by
> implementing a new socket family PF_INFINIBAND. You might then choose to
> have that selection made either by the application or underneath it by
> C library overrides.
 
That's exactly my point: we need to define a new protocol family to
support it. This means that all applications using PF_INET need to be
changed and recompiled. My basic argument goes like this: if hardware can
support the notion of connection, the sockets layer should be aware of this
and send all requests to the hw. I can assign an IPv4 address (for the sake
of backward compatibility) and get away without software TCP/IP. I get the
performance benefit of hardware TCP/IP (notion of connection).

The windoze 2000 DDK has an interesting section about WinSock
direct(r) that lets SAN hardware (like IB) still use traditional
PF_INET.

Also one interesting whitepaper:

http://servernet.himalaya.compaq.com/snet2/whitepapers/WSD_Perf_White_Paper_3-21-01.doc





RE: [RFC] Direct Sockets Support??

2001-05-03 Thread Venkatesh Ramamurthy

>> technology is Infiniband. In Infiniband, the hardware supports IPv6. For
>> this type of device there is no need for software TCP/IP. But
>> networking applications, which mostly use sockets, pay a performance
>> penalty when they run software TCP/IP over this hardware.
>
> IPv6 is only the bottom layer of the stack. TCP does a lot lot more.

Sorry to have confused you. IB supports the notion of connection
over IPv6, not exactly TCP. I was using "TCP" interchangeably with the notion
of connection provided by Infiniband. Infiniband is a cluster of technologies
like VI, IP, etc. So I felt that we can take advantage of this to do
networking. Because the speed of IB ranges from 2.5Gbps to 30Gbps, even a
slight overhead in software will affect performance very badly.
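
A quick back-of-the-envelope calculation shows why per-packet software cost matters at these rates. The link speeds below come from the paragraph above; the 1500-byte and 9000-byte packet sizes are assumptions picked for illustration, and `packet_time_ns` is a made-up helper name. Under those assumptions, a 1500-byte packet occupies a 2.5Gbps link for 4800 ns and a 30Gbps link for only 400 ns, so that is all the time the software has per packet if it is to keep the link full.

```c
#include <stdio.h>

/* Time one packet occupies the wire, in nanoseconds. */
static double packet_time_ns(double packet_bytes, double link_bits_per_sec)
{
    return packet_bytes * 8.0 * 1e9 / link_bits_per_sec;
}

/* Print the per-packet time budget for a few rate and size combinations. */
static void print_budgets(void)
{
    const double rates[] = { 2.5e9, 30e9 };     /* from the thread */
    const double sizes[] = { 1500.0, 9000.0 };  /* assumed packet sizes */

    for (int r = 0; r < 2; r++)
        for (int s = 0; s < 2; s++)
            printf("%.1f Gbps, %.0f bytes: %.0f ns per packet\n",
                   rates[r] / 1e9, sizes[s],
                   packet_time_ns(sizes[s], rates[r]));
}
```

The same arithmetic also shows why larger packets relax the budget: the time per packet grows in proportion to its size while the fixed per-packet software cost does not.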





Re: [RFC] Direct Sockets Support??

2001-05-03 Thread Alan Cox

> That's exactly my point: we need to define a new protocol family to
> support it. This means that all applications using PF_INET need to be
> changed and recompiled. My basic argument goes like this: if hardware can

Thanks to the magic of shared libraries and LD_PRELOAD, a library hook can
actually make the decision underneath the application.
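
A minimal sketch of such a hook, under loud assumptions: `PF_DIRECT_HYPOTHETICAL` is a placeholder number (no such family is registered anywhere), the `DIRECT_SOCKETS` environment variable and the `pick_family` policy are invented for this example, and the fallback to the original family is one possible design rather than anything proposed in the thread. The only real interfaces used are `dlsym(RTLD_NEXT, ...)` and `socket()`.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <sys/socket.h>

/* RTLD_NEXT is a GNU extension; this fallback (glibc's value) only
 * matters if _GNU_SOURCE was defined too late to take effect. */
#ifndef RTLD_NEXT
#define RTLD_NEXT ((void *) -1L)
#endif

/* Placeholder only: stands in for a family such as PF_INFINIBAND. */
#define PF_DIRECT_HYPOTHETICAL 1000

/* Hypothetical policy: redirect TCP-style PF_INET sockets when the
 * DIRECT_SOCKETS environment variable is set; leave the rest alone. */
static int pick_family(int domain, int type)
{
    int base_type = type & ~(SOCK_NONBLOCK | SOCK_CLOEXEC);

    if (domain == AF_INET && base_type == SOCK_STREAM &&
        getenv("DIRECT_SOCKETS") != NULL)
        return PF_DIRECT_HYPOTHETICAL;
    return domain;
}

/* Interposed socket(): when preloaded, the application's calls land here. */
int socket(int domain, int type, int protocol)
{
    static int (*real_socket)(int, int, int);
    int chosen, fd;

    if (real_socket == NULL)
        real_socket = (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");

    chosen = pick_family(domain, type);
    fd = real_socket(chosen, type, protocol);

    /* If the substituted family is unavailable, use what was asked for. */
    if (fd < 0 && chosen != domain)
        fd = real_socket(domain, type, protocol);
    return fd;
}
```

Built as a shared object (for example `gcc -shared -fPIC hook.c -o hook.so -ldl`) and loaded with `LD_PRELOAD=./hook.so`, this lets an unmodified PF_INET application be steered to another family without recompiling it. It does not address the address-format differences a real new family would bring to `bind()` and `connect()`.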