Re: Testing 7.0 Beta: FFS still very slow when creating files

2014-08-26 Thread Christos Zoulas
In article <20140825213735.ga14...@britannica.bec.de>,
Joerg Sonnenberger   wrote:
>On Mon, Aug 25, 2014 at 09:09:24PM +, Taylor R Campbell wrote:
>>Date: Mon, 25 Aug 2014 20:02:44 +0200
>>From: "J. Hannken-Illjes" 
>> 
>>Short answer: it is -- reverting external/gpl3/gcc/dist/gcc/builtins.c
>>from Rev. 1.3 to 1.2 brings back the old times which are the same as
>>they were on NetBSD 6.
>> 
>>Given that this test has many calls to ufs_lookup/cache_lookup using
>>memcmp to check for equal filenames this is not a surprise.
>> 
>>A rather naive "implementation" of memcmp (see below) drops the running
>>time from ~15 sec to ~9 secs.  We should consider improving our memcmp.
>> 
>> Sounds reasonable to me, although it looks like GCC's old builtin
>> memcmp expansion actually failed to implement our specification: it
>> returns -1, 0, or +1, like your patch, rather than the difference of
>> the first differing bytes or zero as our man page specifies.  For most
>> uses it doesn't matter, of course, but we ought to make sure to follow
>> our own specification.
>
>memcmp is only supposed to provide the correct sign, not the difference.

Yes, according to TOG, not according to our documentation. Not that I advocate
keeping our documentation.

christos
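
To make the distinction above concrete, here is a minimal sketch (not NetBSD's actual memcmp; the names naive_memcmp and sign_of are made up for this example). The first function returns the difference of the first differing bytes, as the man page describes; the second reduces any result to -1, 0, or +1. Portable callers should depend only on the sign.

```c
#include <stddef.h>

/* Hypothetical byte-wise memcmp: returns the difference of the first
 * differing bytes (compared as unsigned char), or 0 if equal. */
static int
naive_memcmp(const void *a, const void *b, size_t n)
{
	const unsigned char *p = a, *q = b;

	for (; n > 0; n--, p++, q++) {
		if (*p != *q)
			return (int)*p - (int)*q;
	}
	return 0;
}

/* Reduce a comparison result to -1, 0, or +1. */
static int
sign_of(int r)
{
	return (r > 0) - (r < 0);
}
```

Code that tests only `result < 0`, `result == 0`, or `result > 0` behaves the same with either convention.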



Re: Testing 7.0 Beta: FFS still very slow when creating files

2014-08-26 Thread Alan Barrett

On Tue, 26 Aug 2014, Robert Elz wrote:

>  | > memcmp is only supposed to provide the correct sign, not
>  | > the difference.
>  | true, but that's not what memcmp(9) says.
>
> This is a "normal" problem with man pages - they're written to
> document what the code actually does, then interpreted as a
> specification of what the code is required to do.  Man pages
> should be the former, the latter is the job of standards docs.


Often, there are no standards docs, and the man page has to serve 
as both a specification of the parts of the interface that users 
can depend on, and documentation of what the code actually does. 
For example, it's possible to document "returns -ve, 0, or +ve" 
in one part of the man page, as an interface specification, and
"returns the difference" in another part of the man page, as an
implementation note.


> If anything needs changing, it would be to make it more clear
> that the man pages should not be interpreted as an interface
> specification, but as a statement of what the implementations
> actually do - not to be interpreted as a promise that they will
> always do that - for what can be relied upon a reference should
> be made to the relevant standard (which can be POSIX (or IEEE
> for C, or anyone else), or POSIX (etc) as amended by NetBSD, or
> a NetBSD private standard for stuff that either isn't documented
> by anyone else's standards doc, or where NetBSD's version has
> simply decided to be different).


In cases where there really is a standard that can be referred to, 
that might work, but I like to have all the information in one 
place.  If it's easy for the NetBSD man page to say both what's 
promised, and what is actually done, then I would like it to do 
so.  I think that this helps both people using the interface and 
people changing the implementation.


--apb (Alan Barrett)


ixg(4) performances

2014-08-26 Thread Emmanuel Dreyfus
Hi

ixgb(4) performs poorly, even on the latest -current. Here is the
dmesg output:
ixg1 at pci5 dev 0 function 1: Intel(R) PRO/10GbE PCI-Express Network Driver, 
Version - 2.3.10
ixg1: clearing prefetchable bit
ixg1: interrupting at ioapic0 pin 9
ixg1: PCI Express Bus: Speed 2.5Gb/s Width x8

The interface is configured with:
ifconfig ixg1 mtu 9000 tso4 ip4csum tcp4csum-tx udp4csum-tx

And sysctl:
kern.sbmax = 67108864
kern.somaxkva = 67108864
net.inet.udp.sendspace = 2097152
net.inet.udp.recvspace = 2097152
net.inet.tcp.sendspace = 2097152
net.inet.tcp.recvspace = 2097152
net.inet.tcp.recvbuf_auto = 0
net.inet.tcp.sendbuf_auto = 0

netperf shows a maximum throughput of 2.3 Gb/s. That leaves me with
the feeling that only one PCI lane is used. Is that possible?
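
As a rough sanity check on the single-lane hypothesis, the arithmetic can be sketched as follows. This assumes the 2.5Gb/s in the dmesg output is the raw per-lane signalling rate of a first-generation PCI Express link, which uses 8b/10b encoding (8 data bits for every 10 bits on the wire). The function name is invented for the example, and packet overhead (TLP headers, flow control) is ignored.

```c
/* Usable link bandwidth in Mb/s for a PCIe link that uses 8b/10b
 * encoding, before any packet overhead.  raw_mbps_per_lane is the
 * per-lane signalling rate (2500 for the "2.5Gb/s" in the dmesg). */
static unsigned long
pcie_8b10b_mbps(unsigned long raw_mbps_per_lane, unsigned lanes)
{
	return raw_mbps_per_lane * 8 / 10 * lanes;
}
```

Under those assumptions one lane carries at most 2000 Mb/s and the x8 link 16000 Mb/s per direction, so the measured 2.3 Gb/s is slightly more than a single lane could deliver and far below what eight lanes allow.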

I also found this page that tackles the same problem on Linux:
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe

They tweak the PCI MMRBC. Does anyone have an idea of how that could be
done on NetBSD? I thought about borrowing code from src/sys/dev/pci/if_dge.c
but I am not sure which pci_conf_read/pci_conf_write calls should be used.

Any other ideas on how to improve performance?

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: ixg(4) performances

2014-08-26 Thread Christos Zoulas
In article <20140826121728.gl23...@homeworld.netbsd.org>,
Emmanuel Dreyfus   wrote:
>Hi
>
>ixgb(4) has poor performances, even on latest -current. Here is the
>dmesg output:
>ixg1 at pci5 dev 0 function 1: Intel(R) PRO/10GbE PCI-Express Network
>Driver, Version - 2.3.10
>ixg1: clearing prefetchable bit
>ixg1: interrupting at ioapic0 pin 9
>ixg1: PCI Express Bus: Speed 2.5Gb/s Width x8
>
>The interface is configued with:
>ifconfig ixg1 mtu 9000 tso4 ip4csum tcp4csum-tx udp4csum-tx
>
>And sysctl:
>kern.sbmax = 67108864
>kern.somaxkva = 67108864
>net.inet.udp.sendspace = 2097152
>net.inet.udp.recvspace = 2097152
>net.inet.tcp.sendspace = 2097152
>net.inet.tcp.recvspace = 2097152
>net.inet.tcp.recvbuf_auto = 0
>net.inet.tcp.sendbuf_auto = 0
>
>netperfs shows a maximum performance of 2.3 Gb/s. That let me with
>the feeling that only a PCI lane is used. Is it possible?
>
>I also found this page that tackles the same problem on Linux:
>http://dak1n1.com/blog/7-performance-tuning-intel-10gbe
>
>They tweak the PCI MMRBC. Anyone has an idea of how it could be 
>done on NetBSD? I thought about borrowing code from src/sys/dec/pci/if_dge.c
>but I am not sure what pci_conf_read/pci_conf_write commands should be used.
>
>Any other idea on how to improve performance?

ftp://ftp.supermicro.com/CDR-C2_1.20_for_Intel_C2_platform/Intel/LAN/v15.5/PROXGB/DOCS/SERVER/prform10.htm#Setting_MMRBC



Re: ixg(4) performances

2014-08-26 Thread Emmanuel Dreyfus
On Tue, Aug 26, 2014 at 12:57:37PM +0000, Christos Zoulas wrote:
> ftp://ftp.supermicro.com/CDR-C2_1.20_for_Intel_C2_platform/Intel/LAN/v15.5/PROXGB/DOCS/SERVER/prform10.htm#Setting_MMRBC

Right, but NetBSD has no tool like Linux's setpci to tweak MMRBC, and if
the BIOS has no setting for it, NetBSD is screwed.

I see   has a PCI_IOC_CFGREAD / PCI_IOC_CFGWRITE ioctl,
does that mean Linux's setpci can be easily reproduced?

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: ixg(4) performances

2014-08-26 Thread Christos Zoulas
On Aug 26,  2:23pm, m...@netbsd.org (Emmanuel Dreyfus) wrote:
-- Subject: Re: ixg(4) performances

| On Tue, Aug 26, 2014 at 12:57:37PM +, Christos Zoulas wrote:
| > ftp://ftp.supermicro.com/CDR-C2_1.20_for_Intel_C2_platform/Intel/LAN/v15.5/PROXGB/DOCS/SERVER/prform10.htm#Setting_MMRBC
| 
| Right, but NetBSD has no tool like Linux's setpci to tweak MMRBC, and if
| the BIOS has no setting for it, NetBSD is screwed.
| 
| I see   has a PCI_IOC_CFGREAD / PCI_IOC_CFGWRITE ioctl,
| does that means Linux's setpci can be easily reproduced?

I would probably extend pcictl with cfgread and cfgwrite commands.

christos


Re: ixg(4) performances

2014-08-26 Thread Emmanuel Dreyfus
On Tue, Aug 26, 2014 at 10:25:52AM -0400, Christos Zoulas wrote:
> I would probably extend pcictl with cfgread and cfgwrite commands.

Sure, once it works I can do that, but a first attempt just
gets EINVAL. Any idea what could be wrong?

int fd;
struct pciio_bdf_cfgreg pbcr;

if ((fd = open("/dev/pci5", O_RDWR, 0)) == -1)
	err(EX_OSERR, "open /dev/pci5 failed");

pbcr.bus = 5;
pbcr.device = 0;
pbcr.function = 0;
pbcr.cfgreg.reg = 0xe6b;
pbcr.cfgreg.val = 0x2e;

if (ioctl(fd, PCI_IOC_BDF_CFGWRITE, &pbcr) == -1)
	err(EX_OSERR, "ioctl failed");

Inside the kernel, the only EINVAL is here:
	if (bdfr->bus > 255 || bdfr->device >= sc->sc_maxndevs ||
	    bdfr->function > 7)
		return EINVAL;

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: ixg(4) performances

2014-08-26 Thread Christos Zoulas
On Aug 26,  2:42pm, m...@netbsd.org (Emmanuel Dreyfus) wrote:
-- Subject: Re: ixg(4) performances

| On Tue, Aug 26, 2014 at 10:25:52AM -0400, Christos Zoulas wrote:
| > I would probably extend pcictl with cfgread and cfgwrite commands.
| 
| Sure, once it works I can do that, but a first attempt just
| gets EINVAL, any idea what can be wrong?
| 
| int fd;
| struct  pciio_bdf_cfgreg pbcr;
| 
| if ((fd = open("/dev/pci5", O_RDWR, 0)) == -1)
| err(EX_OSERR, "open /dev/pci5 failed");
| 
| pbcr.bus = 5;
| pbcr.device = 0;
| pbcr.function = 0;
| pbcr.cfgreg.reg = 0xe6b;
| pbcr.cfgreg.val = 0x2e;

I think in the example that was 0xe6. I think the .b means byte access
(I am guessing). I think that we are only doing word accesses, thus
we probably need to read, mask, modify, and write the byte. I have not
verified any of that, these are guesses... Look at the pcictl source
code.
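
A setpci-style register argument such as e6.b could be split into an offset and an access width along these lines. This is only a sketch: parse_regspec is a made-up name, not part of pcictl or pciutils, and defaulting to a 4-byte access when no suffix is given is an arbitrary choice made here.

```c
#include <stdlib.h>

/* Parse a register spec such as "e6.b": a hexadecimal offset,
 * optionally followed by .b, .w or .l for 1, 2 or 4 bytes.
 * Returns 0 on success, -1 on a malformed spec. */
static int
parse_regspec(const char *s, unsigned *offset, unsigned *width)
{
	char *end;
	unsigned long v = strtoul(s, &end, 16);

	if (end == s || v > 0xfff)
		return -1;
	*offset = (unsigned)v;
	*width = 4;		/* assumed default, see above */
	if (*end == '\0')
		return 0;
	if (end[0] != '.' || end[1] == '\0' || end[2] != '\0')
		return -1;
	switch (end[1]) {
	case 'b': case 'B': *width = 1; break;
	case 'w': case 'W': *width = 2; break;
	case 'l': case 'L': *width = 4; break;
	default: return -1;
	}
	return 0;
}
```

With this, "e6.b" yields offset 0xe6 and width 1, whereas the string "e6b" without the dot parses as the single offset 0xe6b.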

| 
| if (ioctl(fd, PCI_IOC_BDF_CFGWRITE, &pbcr) == -1)
| err(EX_OSERR, "ioctl failed");
| 
| Inside the kernel, the only EINVAL is here:
| if (bdfr->bus > 255 || bdfr->device >= sc->sc_maxndevs ||
| bdfr->function > 7)
| return EINVAL;
| 
| -- 
| Emmanuel Dreyfus
| m...@netbsd.org
-- End of excerpt from Emmanuel Dreyfus




Re: ixg(4) performances

2014-08-26 Thread Taylor R Campbell
   Date: Tue, 26 Aug 2014 10:25:52 -0400
   From: chris...@zoulas.com (Christos Zoulas)

   On Aug 26,  2:23pm, m...@netbsd.org (Emmanuel Dreyfus) wrote:
   -- Subject: Re: ixg(4) performances

   | I see   has a PCI_IOC_CFGREAD / PCI_IOC_CFGWRITE ioctl,
   | does that means Linux's setpci can be easily reproduced?

   I would probably extend pcictl with cfgread and cfgwrite commands.

How about the attached patch?  I've been sitting on this for months.
Index: usr.sbin/pcictl/pcictl.8
===
RCS file: /cvsroot/src/usr.sbin/pcictl/pcictl.8,v
retrieving revision 1.10
diff -p -u -r1.10 pcictl.8
--- usr.sbin/pcictl/pcictl.825 Feb 2011 21:40:48 -  1.10
+++ usr.sbin/pcictl/pcictl.826 Aug 2014 15:38:55 -
@@ -33,7 +33,7 @@
 .\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 .\" POSSIBILITY OF SUCH DAMAGE.
 .\"
-.Dd February 25, 2011
+.Dd June 12, 2014
 .Dt PCICTL 8
 .Os
 .Sh NAME
@@ -79,6 +79,31 @@ at the specified bus, device, and functi
 If the bus is not specified, it defaults to the bus number of the
 PCI bus specified on the command line.
 If the function is not specified, it defaults to 0.
+.Pp
+.Nm read
+.Op Fl b Ar bus
+.Fl d Ar device
+.Op Fl f Ar function
+.Ar reg
+.Pp
+Read the specified 32-bit aligned PCI configuration register and print
+it in hexadecimal to standard output.
+If the bus is not specified, it defaults to the bus number of the
+PCI bus specified on the command line.
+If the function is not specified, it defaults to 0.
+.Pp
+.Nm write
+.Op Fl b Ar bus
+.Fl d Ar device
+.Op Fl f Ar function
+.Ar reg
+.Ar value
+.Pp
+Write the specified value to the specified 32-bit aligned PCI
+configuration register.
+If the bus is not specified, it defaults to the bus number of the
+PCI bus specified on the command line.
+If the function is not specified, it defaults to 0.
 .Sh FILES
 .Pa /dev/pci*
 - PCI bus device nodes
Index: usr.sbin/pcictl/pcictl.c
===
RCS file: /cvsroot/src/usr.sbin/pcictl/pcictl.c,v
retrieving revision 1.18
diff -p -u -r1.18 pcictl.c
--- usr.sbin/pcictl/pcictl.c30 Aug 2011 20:08:38 -  1.18
+++ usr.sbin/pcictl/pcictl.c26 Aug 2014 15:38:55 -
@@ -76,6 +76,8 @@ static int	print_numbers = 0;
 
 static void	cmd_list(int, char *[]);
 static void	cmd_dump(int, char *[]);
+static void	cmd_read(int, char *[]);
+static void	cmd_write(int, char *[]);
 
 static const struct command commands[] = {
{ "list",
@@ -88,10 +90,21 @@ static const struct command commands[] =
  cmd_dump,
  O_RDONLY },
 
+   { "read",
+ "[-b bus] -d device [-f function] reg",
+ cmd_read,
+ O_RDONLY },
+
+   { "write",
+ "[-b bus] -d device [-f function] reg value",
+ cmd_write,
+ O_WRONLY },
+
{ 0, 0, 0, 0 },
 };
 
 static int parse_bdf(const char *);
+static u_int   parse_reg(const char *);
 
 static void	scan_pci(int, int, int, void (*)(u_int, u_int, u_int));
 
@@ -230,6 +243,87 @@ cmd_dump(int argc, char *argv[])
scan_pci(bus, dev, func, scan_pci_dump);
 }
 
+static void
+cmd_read(int argc, char *argv[])
+{
+   int bus, dev, func;
+   u_int reg;
+   pcireg_t value;
+   int ch;
+
+   bus = pci_businfo.busno;
+   func = 0;
+   dev = -1;
+
+   while ((ch = getopt(argc, argv, "b:d:f:")) != -1) {
+   switch (ch) {
+   case 'b':
+   bus = parse_bdf(optarg);
+   break;
+   case 'd':
+   dev = parse_bdf(optarg);
+   break;
+   case 'f':
+   func = parse_bdf(optarg);
+   break;
+   default:
+   usage();
+   }
+   }
+   argv += optind;
+   argc -= optind;
+
+   if (argc != 1)
+   usage();
+   reg = parse_reg(argv[0]);
+   if (pcibus_conf_read(pcifd, bus, dev, func, reg, &value) == -1)
+   err(EXIT_FAILURE, "pcibus_conf_read"
+   "(bus %d dev %d func %d reg %u)", bus, dev, func, reg);
+   if (printf("%08x\n", value) < 0)
+   err(EXIT_FAILURE, "printf");
+}
+
+static void
+cmd_write(int argc, char *argv[])
+{
+   int bus, dev, func;
+   u_int reg;
+   pcireg_t value;
+   int ch;
+
+   bus = pci_businfo.busno;
+   func = 0;
+   dev = -1;
+
+   while ((ch = getopt(argc, argv, "b:d:f:")) != -1) {
+   switch (ch) {
+   case 'b':
+   bus = parse_bdf(optarg);
+   break;
+   case 'd':
+   dev = parse_bdf(optarg);
+   break;
+   case 'f':
+   func = parse_bdf(optarg);
+   break;
+   default:
+ 

Re: ixg(4) performances

2014-08-26 Thread Emmanuel Dreyfus
On Tue, Aug 26, 2014 at 11:13:50AM -0400, Christos Zoulas wrote:
> I think in the example that was 0xe6. I think the .b means byte access
> (I am guessing). 

Yes, I came to that conclusion reading pciutils sources. I discovered
they also had a man page explaining that :-)

> I think that we are only doing word accesses, thus
> we probably need to read, mask modify write the byte. I have not
> verified any of that, these are guesses... Look at the pcictl source
> code.

I tried writing to register 0xe4, but when I read it back it is still 0.

if (pcibus_conf_read(fd, 5, 0, 1, 0x00e4, &val) != 0)
err(EX_OSERR, "pcibus_conf_read failed");

printf("reg = 0x00e4,  val = 0x%08x\n", val);

val = (val & 0xff00) | 0x002e;

if (pcibus_conf_write(fd, 5, 0, 1, 0x00e4, val) != 0)
err(EX_OSERR, "pcibus_conf_write failed");
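
For reference, PCI configuration space is little-endian, so the byte at offset 0xe6 sits in bits 16-23 of the 32-bit register at 0xe4; changing only the low byte of that register does not touch it. A hedged sketch of the read/modify/write computation follows. The helper names are invented here, and the actual pcibus_conf_read()/pcibus_conf_write() calls are left to the caller.

```c
#include <stdint.h>

/* 32-bit aligned register that contains the byte at 'offset'. */
static unsigned
cfg_aligned_reg(unsigned offset)
{
	return offset & ~3u;
}

/* Return 'word' with the byte at 'offset' replaced by 'byte'.
 * 'word' is the current value of the aligned register containing
 * 'offset'; all other bytes are preserved. */
static uint32_t
cfg_set_byte(uint32_t word, unsigned offset, uint8_t byte)
{
	unsigned shift = (offset & 3u) * 8;

	return (word & ~((uint32_t)0xff << shift)) | ((uint32_t)byte << shift);
}
```

The value to write back would then be cfg_set_byte(val, 0xe6, 0x2e), written to register cfg_aligned_reg(0xe6), i.e. 0xe4. Whether the device honours that write is a separate question.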


-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: ixg(4) performances

2014-08-26 Thread Taylor R Campbell
   Date: Tue, 26 Aug 2014 15:40:41 +0000
   From: Taylor R Campbell 

   How about the attached patch?  I've been sitting on this for months.

New version with some changes suggested by wiz@.
Index: usr.sbin/pcictl/pcictl.8
===
RCS file: /cvsroot/src/usr.sbin/pcictl/pcictl.8,v
retrieving revision 1.11
diff -p -u -r1.11 pcictl.8
--- usr.sbin/pcictl/pcictl.826 Aug 2014 16:21:15 -  1.11
+++ usr.sbin/pcictl/pcictl.826 Aug 2014 16:38:36 -
@@ -33,7 +33,7 @@
 .\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 .\" POSSIBILITY OF SUCH DAMAGE.
 .\"
-.Dd February 25, 2011
+.Dd June 12, 2014
 .Dt PCICTL 8
 .Os
 .Sh NAME
@@ -79,6 +79,31 @@ at the specified bus, device, and functi
 If the bus is not specified, it defaults to the bus number of the
 PCI bus specified on the command line.
 If the function is not specified, it defaults to 0.
+.Pp
+.Cm read
+.Op Fl b Ar bus
+.Fl d Ar device
+.Op Fl f Ar function
+.Ar reg
+.Pp
+Read the specified 32-bit aligned PCI configuration register and print
+it in hexadecimal to standard output.
+If the bus is not specified, it defaults to the bus number of the
+PCI bus specified on the command line.
+If the function is not specified, it defaults to 0.
+.Pp
+.Cm write
+.Op Fl b Ar bus
+.Fl d Ar device
+.Op Fl f Ar function
+.Ar reg
+.Ar value
+.Pp
+Write the specified value to the specified 32-bit aligned PCI
+configuration register.
+If the bus is not specified, it defaults to the bus number of the
+PCI bus specified on the command line.
+If the function is not specified, it defaults to 0.
 .Sh FILES
 .Pa /dev/pci*
 - PCI bus device nodes
Index: usr.sbin/pcictl/pcictl.c
===
RCS file: /cvsroot/src/usr.sbin/pcictl/pcictl.c,v
retrieving revision 1.18
diff -p -u -r1.18 pcictl.c
--- usr.sbin/pcictl/pcictl.c30 Aug 2011 20:08:38 -  1.18
+++ usr.sbin/pcictl/pcictl.c26 Aug 2014 16:38:36 -
@@ -76,6 +76,8 @@ static int	print_numbers = 0;
 
 static void	cmd_list(int, char *[]);
 static void	cmd_dump(int, char *[]);
+static void	cmd_read(int, char *[]);
+static void	cmd_write(int, char *[]);
 
 static const struct command commands[] = {
{ "list",
@@ -88,10 +90,21 @@ static const struct command commands[] =
  cmd_dump,
  O_RDONLY },
 
+   { "read",
+ "[-b bus] -d device [-f function] reg",
+ cmd_read,
+ O_RDONLY },
+
+   { "write",
+ "[-b bus] -d device [-f function] reg value",
+ cmd_write,
+ O_WRONLY },
+
{ 0, 0, 0, 0 },
 };
 
 static int parse_bdf(const char *);
+static u_int   parse_reg(const char *);
 
 static void	scan_pci(int, int, int, void (*)(u_int, u_int, u_int));
 
@@ -230,6 +243,91 @@ cmd_dump(int argc, char *argv[])
scan_pci(bus, dev, func, scan_pci_dump);
 }
 
+static void
+cmd_read(int argc, char *argv[])
+{
+   int bus, dev, func;
+   u_int reg;
+   pcireg_t value;
+   int ch;
+
+   bus = pci_businfo.busno;
+   func = 0;
+   dev = -1;
+
+   while ((ch = getopt(argc, argv, "b:d:f:")) != -1) {
+   switch (ch) {
+   case 'b':
+   bus = parse_bdf(optarg);
+   break;
+   case 'd':
+   dev = parse_bdf(optarg);
+   break;
+   case 'f':
+   func = parse_bdf(optarg);
+   break;
+   default:
+   usage();
+   }
+   }
+   argv += optind;
+   argc -= optind;
+
+   if (argc != 1)
+   usage();
+   if (dev == -1)
+   errx(EXIT_FAILURE, "read: must specify a device number");
+   reg = parse_reg(argv[0]);
+   if (pcibus_conf_read(pcifd, bus, dev, func, reg, &value) == -1)
+   err(EXIT_FAILURE, "pcibus_conf_read"
+   "(bus %d dev %d func %d reg %u)", bus, dev, func, reg);
+   if (printf("%08x\n", value) < 0)
+   err(EXIT_FAILURE, "printf");
+}
+
+static void
+cmd_write(int argc, char *argv[])
+{
+   int bus, dev, func;
+   u_int reg;
+   pcireg_t value;
+   int ch;
+
+   bus = pci_businfo.busno;
+   func = 0;
+   dev = -1;
+
+   while ((ch = getopt(argc, argv, "b:d:f:")) != -1) {
+   switch (ch) {
+   case 'b':
+   bus = parse_bdf(optarg);
+   break;
+   case 'd':
+   dev = parse_bdf(optarg);
+   break;
+   case 'f':
+   func = parse_bdf(optarg);
+   break;
+   default:
+   usage();
+   }
+   }
+   argv += optind;
+   argc -= optind;
+
+   if (argc != 2)
+   usage();
+   if (dev == -1)
+

Re: ixg(4) performances

2014-08-26 Thread Taylor R Campbell
   Date: Tue, 26 Aug 2014 14:42:55 +0000
   From: Emmanuel Dreyfus 

   On Tue, Aug 26, 2014 at 10:25:52AM -0400, Christos Zoulas wrote:
   > I would probably extend pcictl with cfgread and cfgwrite commands.

   Sure, once it works I can do that, but a first attempt just
   gets EINVAL, any idea what can be wrong?
   ...
   pbcr.bus = 5;
   pbcr.device = 0;
   pbcr.function = 0;
   pbcr.cfgreg.reg = 0xe6b;
   pbcr.cfgreg.val = 0x2e;

Can't do unaligned register reads/writes.  If you need other than
32-bit access, you need to select subwords for reads or do R/M/W for
writes.
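
Selecting a subword from a 32-bit read might look like the sketch below. cfg_extract is a name invented for this example; it assumes the caller has already read the aligned register that contains the field, and that configuration space is little-endian.

```c
#include <stdint.h>

/* Extract a 1- or 2-byte field at byte 'offset' from 'word', the value
 * read from the 32-bit aligned register containing it.  A 2-byte field
 * must not straddle the register, i.e. (offset & 3) must be 0 or 2. */
static uint32_t
cfg_extract(uint32_t word, unsigned offset, unsigned width)
{
	unsigned shift = (offset & 3u) * 8;
	uint32_t mask = (width == 1) ? 0xffu : 0xffffu;

	return (word >> shift) & mask;
}
```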

   Inside the kernel, the only EINVAL is here:
   if (bdfr->bus > 255 || bdfr->device >= sc->sc_maxndevs ||
   bdfr->function > 7)
   return EINVAL;

Old kernel sources?  I added a check recently for 32-bit alignment --
without which you'd hit a kassert or hardware trap shortly afterward.
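
Assuming the alignment check is a simple mask test (the kernel's exact code is not shown in this thread), it would reject the register value from the earlier attempt, since 0xe6b & 3 is 3. A minimal illustration, with a made-up function name:

```c
#include <errno.h>

/* Reject configuration register offsets that are not 32-bit aligned.
 * Returns 0 if acceptable, EINVAL otherwise.  Illustrative only; the
 * real check lives in the kernel's PCI ioctl path. */
static int
cfg_check_aligned(unsigned reg)
{
	return (reg & 3u) != 0 ? EINVAL : 0;
}
```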


Re: ixg(4) performances

2014-08-26 Thread David Young
On Tue, Aug 26, 2014 at 10:25:52AM -0400, Christos Zoulas wrote:
> On Aug 26,  2:23pm, m...@netbsd.org (Emmanuel Dreyfus) wrote:
> -- Subject: Re: ixg(4) performances
> 
> | On Tue, Aug 26, 2014 at 12:57:37PM +, Christos Zoulas wrote:
> | > 
> ftp://ftp.supermicro.com/CDR-C2_1.20_for_Intel_C2_platform/Intel/LAN/v15.5/PROXGB/DOCS/SERVER/prform10.htm#Setting_MMRBC
> | 
> | Right, but NetBSD has no tool like Linux's setpci to tweak MMRBC, and if
> | the BIOS has no setting for it, NetBSD is screwed.
> | 
> | I see   has a PCI_IOC_CFGREAD / PCI_IOC_CFGWRITE ioctl,
> | does that means Linux's setpci can be easily reproduced?
> 
> I would probably extend pcictl with cfgread and cfgwrite commands.

Emmanuel,

Most (all?) configuration registers are read/write.  Have you read the
MMRBC and found that it's improperly configured?

Are you sure that you don't have to program the MMRBC at every bus
bridge between the NIC and RAM?  I'm not too familiar with PCI Express,
so I really don't know.

Have you verified the information at
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe with the 82599
manual?  I have tried to corroborate the information both with my PCI
Express book and with the 82599 manual, but I cannot make a match.
PCI-X != PCI Express; maybe ixgb != ixgbe?  (It sure looks like they're
writing about an 82599, but maybe they don't know what they're writing
about!)


Finally, adding cfgread/cfgwrite commands to pcictl seems like a step in
the wrong direction.  I know that this is UNIX and we're duty-bound to
give everyone enough rope, but may we reconsider our assisted-suicide
policy just this one time? :-)

How well has blindly poking configuration registers worked for us in
the past?  I can think of a couple of instances where a knowledgeable
developer thought that they were writing a helpful value to a useful
register and getting a desirable result, but in the end it turned out to
be a no-op.  In one case, it was an Atheros WLAN adapter where somebody
added to Linux some code that wrote to a mysterious PCI configuration
register, and then some of the *BSDs copied it.  In the other case, I
think that somebody used pci_conf_write() to write a magic value to a
USB host controller register that wasn't on a 32-bit boundary.  ISTR
that some incorrect value was written, instead.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


re: ixg(4) performances

2014-08-26 Thread matthew green

> Finally, adding cfgread/cfgwrite commands to pcictl seems like a step in
> the wrong direction.  I know that this is UNIX and we're duty-bound to
> give everyone enough rope, but may we reconsider our assisted-suicide
> policy just this one time? :-)
> 
> How well has blindly poking configuration registers worked for us in
> the past?  I can think of a couple of instances where an knowledgeable
> developer thought that they were writing a helpful value to a useful
> register and getting a desirable result, but in the end it turned out to
> be a no-op.  In one case, it was an Atheros WLAN adapter where somebody
> added to Linux some code that wrote to a mysterious PCI configuration
> register, and then some of the *BSDs copied it.  In the other case, I
> think that somebody used pci_conf_write() to write a magic value to a
> USB host controller register that wasn't on a 32-bit boundary.  ISTR
> that some incorrect value was written, instead.

pciutils' "setpci" utility has exposed this for lots of systems for
years.  i don't see any value in keeping pcictl from being as usable
as other tools, and as you say, this is unix - rope and all.


.mrg.


Re: ixg(4) performances

2014-08-26 Thread Hisashi T Fujinaka

On Tue, 26 Aug 2014, David Young wrote:

> How well has blindly poking configuration registers worked for us in
> the past?

Well, with the part he's using (the 82599, I think) it shouldn't be that
blind. The datasheet has all the registers listed, which is the case for
most of Intel's Ethernet controllers.

--
Hisashi T Fujinaka - ht...@twofifty.com
BSEE(6/86) + BSChem(3/95) + BAEnglish(8/95) + MSCS(8/03) + $2.50 = latte


Re: RFC: IRQ affinity (aka interrupt routing)

2014-08-26 Thread Kengo NAKAHARA

Hi,

Thank you for reviewing.

(2014/08/26 5:15), Mindaugas Rasiukevicius wrote:
> Kengo NAKAHARA  wrote:
>> Sorry, I typo the patch URL.
>>
>> (2014/08/20 18:06), Kengo NAKAHARA wrote:
>>> and here is the patch
>>>   http://knakahara.github.io/patches/netbsd/irq-affinity-initctl.patch
>>
>>  http://knakahara.github.io/patches/netbsd/irq-affinity-intrctl.patch
>
> Have to admit that I did not read the patch carefully, but why
> io_interrupt_sources_lock is __cpu_simple_lock?  Why not to re-use
> cpu_lock?  The locking itself does not seem to be correct either.

Because I wanted to avoid lock contention between IRQ affinity
and process affinity (in particular sys__sched_setaffinity() in
sys/kern/sys_sched.c), but now I see that was the wrong idea.
I should delete the __cpu_simple_lock and re-use cpu_lock instead.

> How much of the IRQ affinity code (in x86/intr.c) is actually MD?
> It seems that a lot of that can be made MI (think of kern/subr_intr.c).

I think the MI part of the IRQ affinity code is small, but it surely
exists. So I will separate the MD part from the MI code as much as
possible, and then move the MI code to kern/subr_intr.c.

> Also, please do not forget to add the BSD license text for the newly
> created files.

Yes, I will add the BSD license text.

Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA 


Re: RFC: IRQ affinity (aka interrupt routing)

2014-08-26 Thread Matt Thomas

As I've been reading this discussion, it seems very x86 centric.

I've been thinking about adding

void intr_distribute(void *ih, const kcpuset_t *newset, kcpuset_t *oldset)

for my ports that can do MP.  This could be used to obtain the current
set of cpus setup to receive interrupt for  or set a new set of
cpus.  To set an interrupt across all CPUs, you could use

intr_distribute(ih, &kcpuset_running, NULL);

By default only the boot CPU would be setup to get interrupts when
the interrupt was established.
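
The get/set behaviour described above can be modelled in userland as follows. This is only a sketch of the proposed semantics, not kernel code: the model_ names are invented, a 32-bit mask stands in for kcpuset_t, and treating a NULL newset as "query only" is an assumption, since the proposal only shows the NULL oldset case.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t model_cpuset_t;	/* one bit per CPU; stand-in for kcpuset_t */

struct model_ih {
	model_cpuset_t cpus;		/* CPUs currently receiving this interrupt */
};

/* Report the old set if 'oldset' is non-NULL, then install 'newset'
 * if it is non-NULL. */
static void
model_intr_distribute(struct model_ih *ih, const model_cpuset_t *newset,
    model_cpuset_t *oldset)
{
	if (oldset != NULL)
		*oldset = ih->cpus;
	if (newset != NULL)
		ih->cpus = *newset;
}
```

Taking both pointers in one call lets a caller swap sets and learn the previous one in a single step.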




Re: ixg(4) performances

2014-08-26 Thread Thor Lancelot Simon
On Tue, Aug 26, 2014 at 12:17:28PM +0000, Emmanuel Dreyfus wrote:
> Hi
> 
> ixgb(4) has poor performances, even on latest -current. Here is the
> dmesg output:
> ixg1 at pci5 dev 0 function 1: Intel(R) PRO/10GbE PCI-Express Network Driver, 
> Version - 2.3.10
> ixg1: clearing prefetchable bit
> ixg1: interrupting at ioapic0 pin 9
> ixg1: PCI Express Bus: Speed 2.5Gb/s Width x8
> 
> The interface is configued with:
> ifconfig ixg1 mtu 9000 tso4 ip4csum tcp4csum-tx udp4csum-tx

MTU 9000 considered harmful.  Use something that fits in 8K with the headers.
It's a minor piece of the puzzle but nonetheless, it's a piece.

Thor


Re: ixg(4) performances

2014-08-26 Thread Thor Lancelot Simon
On Tue, Aug 26, 2014 at 07:03:06PM -0700, Jonathan Stone wrote:
> Thor,
> 
> The NetBSD  TCP stack can't handle 8K payload by page-flipping the payload 
> and prepending an mbuf for XDR/NFS/TCP/IP headers? Or is the issue the extra 
> page-mapping for the prepended mbuf?

The issue is allocating the extra page for a milligram of data.  It is almost
always a lose.  Better to choose the MTU so that the whole packet fits neatly
in 8192 bytes.

It is helpful to understand where MTU 9000 came from: SGI was trying to
optimise UDP NFS performance, for NFSv2 with 8K maximum RPC size, on
systems that had 16K pages.  You can't fit two of that kind of NFS request
in a 16K page, so you might as well allocate something a little bigger than
8K but that happens to leave your memory allocator some useful-sized chunks
to hand out to other callers.

I am a little hazy on the details, but I believe they ended up at MTU 9024
which is 8K + 768 + 64 (leaving a bunch of handy power-of-2 split sizes
as residuals: 4096 + 2048 + 1024 + 128 + 64) which just made no sense to
anyone else so everyone _else_ picked random sizes around 9000 that happened
to work for their hardware.  But at the end of the day, if you do not have
16K pages or are not optimizing for 8K NFSv2 requests on UDP, an MTU that
fits in 8K is almost always better.
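
The arithmetic in that recollection is at least self-consistent, as this small sketch checks. The helper names are invented; the 16K page size and the 9024 figure come from the paragraph above, which its author describes as hazy.

```c
/* Bytes left in a page of 'page_size' after one buffer of 'mtu' bytes. */
static unsigned
page_residual(unsigned page_size, unsigned mtu)
{
	return page_size - mtu;
}

/* Nonzero if 'n' equals the sum of the listed chunk sizes. */
static int
is_sum_of(unsigned n, const unsigned *chunks, unsigned count)
{
	unsigned sum = 0;

	for (unsigned i = 0; i < count; i++)
		sum += chunks[i];
	return sum == n;
}
```

With a 16384-byte page and a 9024-byte buffer, 7360 bytes remain, which is exactly 4096 + 2048 + 1024 + 128 + 64.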

Thor


Re: ixg(4) performances

2014-08-26 Thread Taylor R Campbell
   Date: Tue, 26 Aug 2014 12:44:43 -0500
   From: David Young 

   Finally, adding cfgread/cfgwrite commands to pcictl seems like a step in
   the wrong direction.  I know that this is UNIX and we're duty-bound to
   give everyone enough rope, but may we reconsider our assisted-suicide
   policy just this one time? :-)

It's certainly wrong to rely on pcictl to read and write config
registers, but it's useful as a debugging tool and for driver
development -- just like the rest of pcictl.


Re: ixg(4) performances

2014-08-26 Thread Jonathan Stone
Thor,

The NetBSD  TCP stack can't handle 8K payload by page-flipping the payload and 
prepending an mbuf for XDR/NFS/TCP/IP headers? Or is the issue the extra 
page-mapping for the prepended mbuf?


On Tue, 8/26/14, Thor Lancelot Simon  wrote:

 Subject: Re: ixg(4) performances
 To: "Emmanuel Dreyfus" 
 Cc: tech-kern@netbsd.org
 Date: Tuesday, August 26, 2014, 6:56 PM
 
[...]
 
 MTU 9000 considered harmful.  Use something that fits in 8K with the headers.
 It's a minor piece of the puzzle but nonetheless, it's a piece.
 
 Thor



Re: ixg(4) performances

2014-08-26 Thread Emmanuel Dreyfus
Thor Lancelot Simon  wrote:

> MTU 9000 considered harmful.  Use something that fits in 8K with the headers.
> It's a minor piece of the puzzle but nonetheless, it's a piece.

mtu 8192 or 8000 does not cause any improvement over mtu 9000.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: RFC: IRQ affinity (aka interrupt routing)

2014-08-26 Thread Kengo NAKAHARA

Hi,

Thank you for your idea.

(2014/08/27 10:09), Matt Thomas wrote:
> As I've been reading this discussion, it seems very x86 centric.
>
> I've thinking about adding
>
> void intr_distribute(void *ih, const kcpuset_t *newset, kcpuset_t *oldset)
>
> for my ports that can do MP.  This could be used to obtain the current
> set of cpus setup to receive interrupt for  or set a new sets of
> cpus.  To set an interrupt across all CPUs, you could use
>
> intr_distribute(ih, &kcpuset_running, NULL);
>
> By default only the boot CPU would be setup to get interrupts when
> the interrupt was established.


It looks good, except for the return value. Setting IRQ affinity may fail
(e.g. when all CPUs have the "nointr" flag set), so the return value should
not be void. I would use the API in MI code like this:

intrctl_ioctl(..., void *data, ...)
{
	switch (cmd) {
	case IOC_INTR_AFFINITY:
		ih = intr_handler(data->intrid);
		if (ih == NULL)
			return EINVAL;

		kcpuset_create(&intr_cpuset, true);
		kcpuset_set(intr_cpuset, data->cpuid);
		error = intr_distribute(ih, intr_cpuset, NULL);
		break;
	}

	return error;
}

Could you comment on this design?


BTW, what do you think about the MSI/MSI-X proposals?
- yours
  http://mail-index.netbsd.org/tech-kern/2011/08/05/msg011130.html
- dyoung's
  http://mail-index.netbsd.org/tech-kern/2014/06/06/msg017209.html
- mine
  http://mail-index.netbsd.org/tech-kern/2014/07/10/msg017336.html

Thanks,

--
//
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA 


Re: RFC: IRQ affinity (aka interrupt routing)

2014-08-26 Thread Matt Thomas

On Aug 26, 2014, at 11:16 PM, Kengo NAKAHARA  wrote:

> It seems good, except return value. IRQ affinity may fail (e.g. when
> all cpus are set "nointr" flag), so return value should not be void.

then we should have a kcpuset_interruptable which is kcpuset_running
minus those cpus which have nointr.

we also need a callback to the interrupt subsystem when intr changes
on a cpu.
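
A userland model of that idea, for illustration only: a 32-bit mask stands in for kcpuset_t, all model_ names are invented, and both the choice of EINVAL and the policy of silently restricting the requested set to interruptable CPUs are assumptions rather than anything decided in this thread.

```c
#include <errno.h>
#include <stdint.h>

typedef uint32_t model_cpuset_t;	/* one bit per CPU; stand-in for kcpuset_t */

/* CPUs that may take interrupts: running CPUs minus those marked nointr. */
static model_cpuset_t
model_interruptable(model_cpuset_t running, model_cpuset_t nointr)
{
	return running & ~nointr;
}

/* Variant of the distribute call that can fail: restrict 'newset' to
 * interruptable CPUs and return EINVAL if nothing is left. */
static int
model_intr_distribute(model_cpuset_t *cur, model_cpuset_t newset,
    model_cpuset_t interruptable)
{
	model_cpuset_t target = newset & interruptable;

	if (target == 0)
		return EINVAL;
	*cur = target;
	return 0;
}
```

The callback mentioned above would be the point at which the interruptable set is recomputed and existing assignments are revisited; that part is not modelled here.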