Re: FW: ixg(4) performances

2014-08-31 Thread Masanobu SAITOH
Hi, Emmanuel.

On 2014/09/01 11:10, Emmanuel Dreyfus wrote:
> Terry Moore  wrote:
> 
>> Since you did a dword read, the extra 0x9 is the device status register.
>> This makes me suspicious as the device status register is claiming that you
>> have "unsupported request detected)" [bit 3] and "correctable error
>> detected" [bit 0].  Further, this register is RW1C for all these bits -- so
>> when you write 94810, it should have cleared the 9 (so a subsequent read
>> should have returned 4810).
>>
>> Please check.
> 
> You are right;
> # pcictl /dev/pci5 read -d 0 -f 1 0xa8
> 00092810
> # pcictl /dev/pci5 write -d 0 -f 1 0xa8 0x00094810
> # pcictl /dev/pci5 read -d 0 -f 1 0xa8
> 4810
> 
>> Might be good to post a "pcictl dump" of your device, just to expose all the
>> details.
> 
> It explicitly says 2.5 Gb/s x 8 lanes.
> 
> # pcictl /dev/pci5 dump -d0 -f 1
> PCI configuration registers:
>Common header:
>  0x00: 0x10fb8086 0x00100107 0x0201 0x00800010
> 
>  Vendor Name: Intel (0x8086)
>  Device Name: 82599 (SFI/SFP+) 10 GbE Controller (0x10fb)
>  Command register: 0x0107
>I/O space accesses: on
>Memory space accesses: on
>Bus mastering: on
>Special cycles: off
>MWI transactions: off
>Palette snooping: off
>Parity error checking: off
>Address/data stepping: off
>System error (SERR): on
>Fast back-to-back transactions: off
>Interrupt disable: off
>  Status register: 0x0010
>Interrupt status: inactive
>Capability List support: on
>66 MHz capable: off
>User Definable Features (UDF) support: off
>Fast back-to-back capable: off
>Data parity error detected: off
>DEVSEL timing: fast (0x0)
>Slave signaled Target Abort: off
>Master received Target Abort: off
>Master received Master Abort: off
>Asserted System Error (SERR): off
>Parity error detected: off
>  Class Name: network (0x02)
>  Subclass Name: ethernet (0x00)
>  Interface: 0x00
>  Revision ID: 0x01
>  BIST: 0x00
>  Header Type: 0x00+multifunction (0x80)
>  Latency Timer: 0x00
>  Cache Line Size: 0x10
> 
>Type 0 ("normal" device) header:
>  0x10: 0xdfe8000c 0x 0xbc01 0x
>  0x20: 0xdfe7c00c 0x 0x 0x00038086
>  0x30: 0x 0x0040 0x 0x0209
> 
>  Base address register at 0x10
>type: 64-bit prefetchable memory
>base: 0xdfe8, not sized
>  Base address register at 0x18
>type: i/o
>base: 0xbc00, not sized
>  Base address register at 0x1c
>not implemented(?)
>  Base address register at 0x20
>type: 64-bit prefetchable memory
>base: 0xdfe7c000, not sized
>  Cardbus CIS Pointer: 0x
>  Subsystem vendor ID: 0x8086
>  Subsystem ID: 0x0003
>  Expansion ROM Base Address: 0x
>  Capability list pointer: 0x40
>  Reserved @ 0x38: 0x
>  Maximum Latency: 0x00
>  Minimum Grant: 0x00
>  Interrupt pin: 0x02 (pin B)
>  Interrupt line: 0x09
> 
>Capability register at 0x40
>  type: 0x01 (Power Management, rev. 1.0)
>Capability register at 0x50
>  type: 0x05 (MSI)
>Capability register at 0x70
>  type: 0x11 (MSI-X)
>Capability register at 0xa0
>  type: 0x10 (PCI Express)
> 
>PCI Message Signaled Interrupt
>  Message Control register: 0x0180
>MSI Enabled: no
>Multiple Message Capable: no (1 vector)
>Multiple Message Enabled: off (1 vector)
>64 Bit Address Capable: yes
>Per-Vector Masking Capable: yes
>  Message Address (lower) register: 0x
>  Message Address (upper) register: 0x
>  Message Data register: 0x
>  Vector Mask register: 0x
>  Vector Pending register: 0x
> 
>PCI Power Management Capabilities Register
>  Capabilities register: 0x4823
>Version: 1.2
>PME# clock: off
>Device specific initialization: on
>3.3V auxiliary current: self-powered
>D1 power management state support: off
>D2 power management state support: off
>PME# support: 0x09
>  Control/status register: 0x2000
>Power state: D0
>PCI Express reserved: off
>No soft reset: off
>PME# assertion disabled
>PME# status: off
> 
>PCI Express Capabilities Register
>  Capability version: 2
>  Device type: PCI Express Endpoint device
>  Interrupt Message Number: 0
>  Link Capabilities Register: 0x00027482
>Maximum Link Speed: unknown 2 value
>Maximum Link Width: x8 lanes
>Port Number: 0
>  Link Status Register: 0x1081
>Negotiated Link Speed: 2.5Gb/s

Which version of NetBSD are you using?

I committed some changes fixing the Gb/s output.
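
For reference, a minimal sketch of how the two link fields in the dump above
decode. The register values are copied from the pcictl output; the function
name, switch table and field masks are written out by hand from the PCIe spec
(this is not the committed code), so treat it only as an illustration: an
encoding of 2 in the Link Capabilities speed field is 5.0GT/s (which pcictl
printed as "unknown 2 value"), while the Link Status register decodes to
2.5GT/s x8.

/*
 * Illustration only: decode the link speed/width fields shown above.
 * Field layout per the PCIe spec; names here are made up.
 */
#include <stdio.h>
#include <stdint.h>

static const char *
linkspeed(uint32_t field)
{
        switch (field & 0xf) {          /* Link Speed, bits 3:0 */
        case 1:  return "2.5GT/s";
        case 2:  return "5.0GT/s";
        case 3:  return "8.0GT/s";
        default: return "unknown";
        }
}

int
main(void)
{
        uint32_t lcap   = 0x00027482;   /* Link Capabilities Register */
        uint16_t lnksta = 0x1081;       /* Link Status Register */

        printf("maximum:    %s x%u\n", linkspeed(lcap),
            (unsigned)((lcap >> 4) & 0x3f));
        printf("negotiated: %s x%u\n", linkspeed(lnksta),
            (unsigned)((lnksta >> 4) & 0x3f));
        return 0;
}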

Re: Unallocated inode

2014-08-31 Thread Paul Ripke
On Fri, 29 Aug 2014 09:43:39 +, Christos Zoulas wrote:
>On Fri, Aug 29, 2014 at 04:56:46PM +1000, Paul Ripke wrote:
>> I'm currently running kernel:
>> NetBSD slave 6.1_STABLE NetBSD 6.1_STABLE (SLAVE) #4: Fri May 23 23:42:30 
>> EST 2014
>> stix@slave:/home/netbsd/netbsd-6/obj.amd64/home/netbsd/netbsd-6/src/sys/arch/amd64/compile/SLAVE
>>  amd64
>> Built from netbsd-6 branch synced around the build time. Over the
>> last year, I have seen 2 instances where I've had cleared inodes,
>> causing obvious errors:
>> 
>> slave:ksh$ sudo find /home -xdev -ls > /dev/null
>> find: /home/netbsd/cvsroot/pkgsrc/japanese/p5-Jcode/pkg/Attic/PLIST,v: Bad 
>> file descriptor
>> find: 
>> /home/netbsd/cvsroot/pkgsrc/print/texlive-pdftools/patches/Attic/patch-ac,v: 
>> Bad file descriptor
>> 
>> fsdb tells me they're "unallocated inode"s, which I can easily fix,
>> but does anyone have any idea what might be causing them? This
>> appears similar to issues reported previously:
>> http://mail-index.netbsd.org/tech-kern/2013/10/19/msg015770.html
>> 
>> My filesystem is FFSv2 with wapbl, sitting on a raidframe mirror
>> over two SATA drives.
>
>Try unmounting it, and then running fsck -fn on it. Does it report
>errors?
>
>christos

Oh, yes, indeed. And fixes them fine:

** /dev/rraid0g
** File system is already clean
** Last Mounted on /home
** Phase 1 - Check Blocks and Sizes
PARTIALLY ALLOCATED INODE I=106999488
CLEAR? [yn] y

PARTIALLY ALLOCATED INODE I=106999489
CLEAR? [yn] y

** Phase 2 - Check Pathnames
UNALLOCATED  I=106999489  OWNER=0 MODE=0
SIZE=0 MTIME=Jan  1 10:00 1970  
NAME=/netbsd/cvsroot/pkgsrc/japanese/p5-Jcode/pkg/Attic/PLIST,v

REMOVE? [yn] y

UNALLOCATED  I=106999488  OWNER=0 MODE=0
SIZE=0 MTIME=Jan  1 10:00 1970  
NAME=/netbsd/cvsroot/pkgsrc/print/texlive-pdftools/patches/Attic/patch-ac,v

REMOVE? [yn] y

** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [yn] y

SUMMARY INFORMATION BAD
SALVAGE? [yn] y

BLK(S) MISSING IN BIT MAPS
SALVAGE? [yn] y

5197833 files, 528560982 used, 396039539 free (1740275 frags, 49287408 blocks, 
0.2% fragmentation)

* FILE SYSTEM WAS MODIFIED *

Running a second fsck pass comes up clean. What surprises me is that my
machine has been up ~100 days... I find it hard to believe that a power
loss or similar unclean shutdown would generate filesystem corruption
that could sit silent for that long before suddenly emerging.

Cheers,
-- 
Paul Ripke

"Great minds discuss ideas, average minds discuss events, small minds discuss 
people."
-- Disputed: Often attributed to Eleanor Roosevelt. 1948.


Re: FW: ixg(4) performances

2014-08-31 Thread Emmanuel Dreyfus
Terry Moore  wrote:

> Since you did a dword read, the extra 0x9 is the device status register.
> This makes me suspicious as the device status register is claiming that you
> have "unsupported request detected)" [bit 3] and "correctable error
> detected" [bit 0].  Further, this register is RW1C for all these bits -- so
> when you write 94810, it should have cleared the 9 (so a subsequent read
> should have returned 4810).
> 
> Please check.

You are right;
# pcictl /dev/pci5 read -d 0 -f 1 0xa8  
00092810
# pcictl /dev/pci5 write -d 0 -f 1 0xa8 0x00094810
# pcictl /dev/pci5 read -d 0 -f 1 0xa8 
4810
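
(For reference, a driver could clear those RW1C status bits itself at attach
time. The sketch below uses the standard pci_conf_read()/pci_conf_write()
accessors from pci(9); the offsets match the dump below (capability at 0xa0,
Device Control/Status at +0x08), but the mask and function names are invented
for illustration, not taken from ixg(4) or pcireg.h.)

/*
 * Sketch only: clear the RW1C error bits in the PCIe Device Status
 * register, i.e. the high half of the dword at capability + 0x08
 * (0xa8 on this device).  Writing the set bits back as 1s clears
 * them; the Device Control half is rewritten unchanged.
 */
#include <dev/pci/pcivar.h>

/* Device Status RW1C error bits: correctable, non-fatal, fatal,
 * unsupported request (bits 16..19 of the control/status dword). */
#define EXAMPLE_DEVSTAT_ERRBITS (0x000fU << 16)

static void
example_clear_pcie_errors(pci_chipset_tag_t pc, pcitag_t tag, int capoff)
{
        pcireg_t dcsr;

        dcsr = pci_conf_read(pc, tag, capoff + 0x08);
        if (dcsr & EXAMPLE_DEVSTAT_ERRBITS)
                pci_conf_write(pc, tag, capoff + 0x08, dcsr);
}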

> Might be good to post a "pcictl dump" of your device, just to expose all the
> details.

It explicitly says 2.5 Gb/s x 8 lanes.

# pcictl /dev/pci5 dump -d0 -f 1
PCI configuration registers:
  Common header:
0x00: 0x10fb8086 0x00100107 0x0201 0x00800010

Vendor Name: Intel (0x8086)
Device Name: 82599 (SFI/SFP+) 10 GbE Controller (0x10fb)
Command register: 0x0107
  I/O space accesses: on
  Memory space accesses: on
  Bus mastering: on
  Special cycles: off
  MWI transactions: off
  Palette snooping: off
  Parity error checking: off
  Address/data stepping: off
  System error (SERR): on
  Fast back-to-back transactions: off
  Interrupt disable: off
Status register: 0x0010
  Interrupt status: inactive
  Capability List support: on
  66 MHz capable: off
  User Definable Features (UDF) support: off
  Fast back-to-back capable: off
  Data parity error detected: off
  DEVSEL timing: fast (0x0)
  Slave signaled Target Abort: off
  Master received Target Abort: off
  Master received Master Abort: off
  Asserted System Error (SERR): off
  Parity error detected: off
Class Name: network (0x02)
Subclass Name: ethernet (0x00)
Interface: 0x00
Revision ID: 0x01
BIST: 0x00
Header Type: 0x00+multifunction (0x80)
Latency Timer: 0x00
Cache Line Size: 0x10

  Type 0 ("normal" device) header:
0x10: 0xdfe8000c 0x 0xbc01 0x
0x20: 0xdfe7c00c 0x 0x 0x00038086
0x30: 0x 0x0040 0x 0x0209

Base address register at 0x10
  type: 64-bit prefetchable memory
  base: 0xdfe8, not sized
Base address register at 0x18
  type: i/o
  base: 0xbc00, not sized
Base address register at 0x1c
  not implemented(?)
Base address register at 0x20
  type: 64-bit prefetchable memory
  base: 0xdfe7c000, not sized
Cardbus CIS Pointer: 0x
Subsystem vendor ID: 0x8086
Subsystem ID: 0x0003
Expansion ROM Base Address: 0x
Capability list pointer: 0x40
Reserved @ 0x38: 0x
Maximum Latency: 0x00
Minimum Grant: 0x00
Interrupt pin: 0x02 (pin B)
Interrupt line: 0x09

  Capability register at 0x40
type: 0x01 (Power Management, rev. 1.0)
  Capability register at 0x50
type: 0x05 (MSI)
  Capability register at 0x70
type: 0x11 (MSI-X)
  Capability register at 0xa0
type: 0x10 (PCI Express)

  PCI Message Signaled Interrupt
Message Control register: 0x0180
  MSI Enabled: no
  Multiple Message Capable: no (1 vector)
  Multiple Message Enabled: off (1 vector)
  64 Bit Address Capable: yes
  Per-Vector Masking Capable: yes
Message Address (lower) register: 0x
Message Address (upper) register: 0x
Message Data register: 0x
Vector Mask register: 0x
Vector Pending register: 0x

  PCI Power Management Capabilities Register
Capabilities register: 0x4823
  Version: 1.2
  PME# clock: off
  Device specific initialization: on
  3.3V auxiliary current: self-powered
  D1 power management state support: off
  D2 power management state support: off
  PME# support: 0x09
Control/status register: 0x2000
  Power state: D0
  PCI Express reserved: off
  No soft reset: off
  PME# assertion disabled
  PME# status: off

  PCI Express Capabilities Register
Capability version: 2
Device type: PCI Express Endpoint device
Interrupt Message Number: 0
Link Capabilities Register: 0x00027482
  Maximum Link Speed: unknown 2 value
  Maximum Link Width: x8 lanes
  Port Number: 0
Link Status Register: 0x1081
  Negotiated Link Speed: 2.5Gb/s
  Negotiated Link Width: x8 lanes

  Device-dependent header:
0x40: 0x48235001 0x2b002000 0x 0x
0x50: 0x01807005 0x 0x 0x
0x60: 0x 0x 0x 0x
0x70: 0x003fa011 0x0004 0x2004 0x
0x80: 0x 0x 0x 0x
0x90: 0x 0x 0x 0x
0xa0: 0x00020010 0x10008cc2 0x4810 0x00027482
0xb0: 0x1081 0x 0x 0x
0xc0: 0x 0x001f 0x000

RE: FW: ixg(4) performances

2014-08-31 Thread Hisashi T Fujinaka

Oh, and to answer the actual first, relevant question: I can try to find
out whether we (day job, 82599) can do line rate at 2.5GT/s. I think we
can get a lot closer than you're getting, but we don't test with NetBSD.

--
Hisashi T Fujinaka - ht...@twofifty.com
BSEE + BSChem + BAEnglish + MSCS + $2.50 = coffee


RE: FW: ixg(4) performances

2014-08-31 Thread Hisashi T Fujinaka

I may be wrong about transactions vs. transfers. However, I think you're
reading the page incorrectly. The signalling rate is the physical speed
of the link. On top of that sits the 8b/10b encoding (the Ethernet
controller we're talking about is only Gen 2), the framing, etc., and
the spec discusses the data rate in GT/s. Gb/s means nothing.

It's like talking about the frequency of the Ethernet link, which we
never do. We talk about how much data can be transferred.

I'm also not sure if you've looked at an actual trace before, but a PCIe
link is incredibly chatty, and every transfer only has a payload of
64/128/256 bytes (especially with the actual controller in question).

So those two things together (GT/s and a chatty link with small packets)
mean that quoting figures in Gb/s is not something people who talk about
PCIe every day (my day job) do. The signalling rate is not what you use
when talking about the maximum data transfer rate.
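
To put rough numbers on that, here is a back-of-the-envelope sketch assuming
gen1 (2.5GT/s) signalling, 8b/10b encoding, and roughly 24 bytes of TLP/DLLP
framing per transfer; the 24-byte overhead is an assumed typical figure, not
taken from the 82599 documentation, and real links also spend bandwidth on
flow control and read completions, so these are upper bounds.

/* Rough upper bounds for payload bandwidth on a 2.5GT/s x8 link
 * with small maximum payload sizes. */
#include <stdio.h>

int
main(void)
{
        const double lane_bytes = 2.5e9 * 8.0 / 10.0 / 8.0; /* 250 MB/s after 8b/10b */
        const int lanes = 8, overhead = 24;
        const int payload[] = { 64, 128, 256 };

        for (int i = 0; i < 3; i++) {
                double eff = (double)payload[i] / (payload[i] + overhead);
                printf("payload %3dB: ~%4.1f Gbit/s of payload\n",
                    payload[i], lane_bytes * lanes * eff * 8.0 / 1e9);
        }
        return 0;
}

Even the best case leaves only limited headroom over 10GbE line rate once
descriptor fetches and doorbell writes are added.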

--
Hisashi T Fujinaka - ht...@twofifty.com
BSEE + BSChem + BAEnglish + MSCS + $2.50 = coffee


RE: FW: ixg(4) performances

2014-08-31 Thread Terry Moore
> -Original Message-
> From: Hisashi T Fujinaka [mailto:ht...@twofifty.com]
> Sent: Saturday, August 30, 2014 21:29
> To: Terry Moore
> Cc: tech-kern@netbsd.org
> Subject: Re: FW: ixg(4) performances
> 
> Doesn't anyone read my posts or, more important, the PCIe spec?
> 
> 2.5 Giga TRANSFERS per second.

I'm not sure I understand what you're saying.

From the PCIe spec, page 40:

"Signaling rate - Once initialized, each Link must only operate at one of
the supported signaling
levels. For the first generation of PCI Express technology, there is only
one signaling rate
defined, which provides an effective 2.5 Gigabits/second/Lane/direction of
raw bandwidth.
The second generation provides an effective 5.0
Gigabits/second/Lane/direction of raw
bandwidth. The third generation provides an effective 8.0
Gigabits/second/Lane/direction of
10 raw bandwidth. The data rate is expected to increase with technology
advances in the future."

This is not 2.5G Transfers per second. PCIe talks about transactions rather
than transfers; one transaction requires either 12 bytes (for 32-bit
systems) or 16 bytes (for 64-bit systems) of overhead at the transaction
layer, plus 7 bytes at the link layer. 

Paradoxically, the workload that maximizes transactions per second moves
the fewest bytes; a 4K write takes 16+4096+5+2 byte times, so only about
60,000 such transactions are possible per second (moving about
248,000,000 bytes). [Real systems don't quite see this -- Wikipedia claims,
for example, that 95% efficiency is typical for storage controllers.]

A 4-byte write takes 16+4+5+2 byte times, and so roughly 9 million
transactions are possible per second, but those 9 million transactions can
only move 36 million bytes.

Multiple lanes scale things fairly linearly. But there has to be one byte
per lane; in an x8 configuration physical transfers are padded, so each
4-byte write (which takes 27 bytes on the bus) has to take 32 bytes.
Instead of getting 72 million transactions per second, you get 62.5
million transactions/second, so it doesn't scale as nicely.
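
Plugging those byte counts into figures for a single gen1 lane (2.5GT/s with
8b/10b encoding gives 250e6 bytes/second of symbol bandwidth) reproduces the
numbers above; the snippet below just redoes that arithmetic, nothing more.

/* Redo the transactions-per-second arithmetic for one gen1 lane. */
#include <stdio.h>

int
main(void)
{
        const double lane_bytes = 2.5e9 * 8.0 / 10.0 / 8.0;    /* 250e6 B/s */
        const int overhead = 16 + 5 + 2;        /* header + link-layer bytes */
        const int payload[] = { 4096, 4 };

        for (int i = 0; i < 2; i++) {
                double tps = lane_bytes / (payload[i] + overhead);
                printf("%4d-byte writes: %9.0f transactions/s, %11.0f payload bytes/s\n",
                    payload[i], tps, tps * payload[i]);
        }
        return 0;
}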

Reads are harder to analyze, because they depend on the speed and design of
both ends of the link. The reader sends a read request packet, and the
read-responder (some time later) sends back the response. 

As far as I can see, even at gen3 with lots of lanes, PCIe doesn't scale to
2.5 G transfers per second.

Best regards,
--Terry




Re: Making bpf MPSAFE (was Re: struct ifnet and ifaddr handling ...)

2014-08-31 Thread Ryota Ozaki
Hi darrenr and rmind,

Thank you for your replies, and I'm sorry for not responding yet.
I'm in the middle of a busy period for several weeks. I'll be back
next weekend.

  ozaki-r

On Tue, Aug 26, 2014 at 4:49 AM, Mindaugas Rasiukevicius
 wrote:
> Ryota Ozaki  wrote:
>> Hi,
>>
>> I thought I needed more experience with pserialize
>> (and the locking primitives) to tackle the ifnet work.
>> So I suspended that work and am now trying
>> another, easier task: bpf.
>>
>> http://www.netbsd.org/~ozaki-r/mpsafe-bpf.diff
>>
>
> As Darren mentioned, there are various bugs in the code (also, the malloc
> change to M_NOWAIT is unhandled).  You cannot drop the lock on the floor
> and expect the state to stay consistent.  Something has to preserve it.
>
> The following pattern applies both to the locks and pserialize(9):
>
>     s = pserialize_read_enter();
>     obj = lookup();
>     pserialize_read_exit(s);
>     // the object is volatile here; it might already be destroyed
>
> Nothing prevents obj from being destroyed after the critical path unless
> you acquire some form of a reference during the lookup.
>
>> BTW, I worry that there is no easy way to
>> know whether a function in a critical section
>> blocks/sleeps or not. So I wrote a patch to
>> detect that: http://www.netbsd.org/~ozaki-r/debug-pserialize.diff
>> Is it meaningful?
>
> Why per-LWP rather than per-CPU diagnostic counter?  On the other hand, if
> we start adding per-CPU counters, then we might as well start implementing
> RCU. :)
>
> --
> Mindaugas
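
As an illustration of the pattern Mindaugas describes above (take a reference
while still inside the read section so the object cannot go away once the
section is left), something along these lines. struct foo, foo_list and the
refcnt handling are hypothetical; only pserialize(9) and the atomic ops are
the real interfaces, and the writer side still needs pserialize_perform()
before it may free anything.

/*
 * Sketch only: look up an object on a pserialize-protected list and
 * take a reference before leaving the read section.
 */
#include <sys/param.h>
#include <sys/atomic.h>
#include <sys/pserialize.h>
#include <sys/queue.h>

struct foo {
        LIST_ENTRY(foo) f_list;
        unsigned int    f_refcnt;
        int             f_key;
};

static LIST_HEAD(, foo) foo_list = LIST_HEAD_INITIALIZER(foo_list);

static struct foo *
foo_lookup_ref(int key)
{
        struct foo *f;
        int s;

        s = pserialize_read_enter();
        LIST_FOREACH(f, &foo_list, f_list) {
                if (f->f_key == key) {
                        /* reference taken inside the read section */
                        atomic_inc_uint(&f->f_refcnt);
                        break;
                }
        }
        pserialize_read_exit(s);
        return f;       /* NULL if not found; valid until foo_unref() */
}

static void
foo_unref(struct foo *f)
{
        if (atomic_dec_uint_nv(&f->f_refcnt) == 0) {
                /* last reference gone: the remover may now free it */
        }
}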