Brandon,

For curiosity's sake (I have a similar app and am seriously interested
in performance):

- What platform (OS/processor) are you on?

- How did you measure the time to call recvfrom()?  Or perhaps an even
  more relevant question: how do you use recvfrom()
  (what's the surrounding code like, do you select() first, etc.)?

I ask because I'm seeing substantially better numbers for recvfrom() (0.021ms).
Granted, that's for a fairly short message, but that doesn't explain the
1000X performance delta.  recvfrom() itself is relatively cheap; select()
is VERY expensive in comparison (<100K clock cycles versus >23M clock cycles
for select(), or ~0.021ms for recvfrom() versus ~9.7ms for select() on a
2.4GHz Xeon).  Note also that this is for an empty select set with a 0
timeout, so that is JUST the calling overhead; if it's checking fd's it will
be greater.  If you're on a slower platform (embedded perhaps, given your
company?) the numbers will obviously be worse; recvfrom() seems to scale
almost directly proportionally to clock speed.

Based on the numbers I've seen I would expect recvfrom() to be at LEAST as
fast as libpcap, if not faster, since libpcap (often) uses select() (on some
platforms :) to check whether the capture device has data ready.  libpcap
will benefit somewhat from its ability to bundle multiple packets into a
buffer (on platforms that support that), but not, I suspect, by enough to
make up a ~250X performance delta.

I'm also going to agree with Guy.  If you have checksum problems and are
still seeing the packets, you are doing something I would seriously like
to learn how you accomplished.  More likely you shouldn't see the
packets at all.

Below is a short sample of results and a short overview of my methodology.

This is on a single-CPU 2.4GHz Xeon running RH 9.0 with a stock 2.4.20 kernel.

rdtsc: 1009667286648918 - 1009667286584434 = 64484 ( / one_second) = 0.000021 size 36
rdtsc: 1009679499010456 - 1009679498987684 = 22772 ( / one_second) = 0.000007 size 36
rdtsc: 1009691711003018 - 1009691710985706 = 17312 ( / one_second) = 0.000006 size 36
rdtsc: 1009703923467532 - 1009703923448944 = 18588 ( / one_second) = 0.000006 size 36
rdtsc: 1009716135791420 - 1009716135773620 = 17800 ( / one_second) = 0.000006 size 36
rdtsc: 1009752772553556 - 1009752772447452 = 106104 ( / one_second) = 0.000035 size 36
rdtsc: 1009764984789778 - 1009764984748314 = 41464 ( / one_second) = 0.000014 size 36
rdtsc: 1009777197311290 - 1009777197270442 = 40848 ( / one_second) = 0.000013 size 36
rdtsc: 1009789410262486 - 1009789410220774 = 41712 ( / one_second) = 0.000014 size 36
rdtsc: 1009801622233478 - 1009801622192834 = 40644 ( / one_second) = 0.000013 size 36
rdtsc: 1009813840971578 - 1009813840931126 = 40452 ( / one_second) = 0.000013 size 36
rdtsc: 1009826046989322 - 1009826046959894 = 29428 ( / one_second) = 0.000010 size 36
rdtsc: 1009838259554966 - 1009838259526818 = 28148 ( / one_second) = 0.000009 size 36
rdtsc: 1009850472114698 - 1009850472085270 = 29428 ( / one_second) = 0.000010 size 36

Here is a code snippet that shows how I do my timings:

    while(1) {

        // We wouldn't actually use select() in the real app;
        // we use it here to make sure we're timing the recvfrom()
        // call on a live socket instead of counting the time
        // recvfrom() blocks waiting for a packet.
        FD_ZERO(&rfds);
        FD_SET(recv_s, &rfds);
        select(recv_s + 1, &rfds, NULL, NULL, NULL);

        // time the actual recvfrom() call
        rdtsc_ret[index] = rdtsc();
        index = !index;
        size = recvfrom(recv_s, buf, sizeof(buf), 0, NULL, NULL);
        rdtsc_ret[index] = rdtsc();

        printf("rdtsc: %lld - %lld = %lld ( / one_second) = %f size %d\n",
               rdtsc_ret[index], rdtsc_ret[!index],
               rdtsc_ret[index] - rdtsc_ret[!index],
               (double)(rdtsc_ret[index] - rdtsc_ret[!index]) / one_second,
               size);
    }

rdtsc() is defined as an assembly snippet that reads the processor clock
register on i386 architectures.  Other architectures are obviously different.
The overhead of calling { rdtsc(); index = !index; rdtsc(); } is 84-96 clock
cycles, so I just ignored it here since it's well below the noise.

extern __inline__ unsigned long long int rdtsc()
{
    unsigned long long int x;
    /* 0x0f 0x31 is the rdtsc opcode; "=A" reads the result from edx:eax */
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}

one_second is defined as the real processor clock speed in Hz.  You need to
figure this out yourself (dmesg, "cat /proc/cpuinfo" on Linux, etc.):

double one_second = 3050.905*1000000;

We use rdtsc() because gettimeofday() doesn't have enough resolution to
accurately measure a single call like this (gettimeofday() resolution on the
same machine as above is slightly worse than 1ms).

On Sun, May 23, 2004 at 06:37:40PM -0700, Brandon Stafford wrote:
> Hello,
> 
>     I'm writing a server that captures UDP packets and, after some manipulation, 
> sends the data out the serial port. Right now, I'm using recvfrom(), but it takes 20 
> ms to execute for each packet captured. I know that tcpdump can capture packets much 
> faster than 20 ms/packet on the same computer, so I know recvfrom() is running into 
> trouble, probably because of bad checksums on the packets.
> 
>     Is it a good idea to rewrite the server using pcap, or is this likely to slow me 
> down even more?
> 
> Thanks,
> Brandon
> 
> 
> -
> This is the tcpdump-workers list.
> Visit https://lists.sandelman.ca/ to unsubscribe.

-- 
>-=-=-=-=-=-=-<>-=-=-=-=-=-<>-=-=-=-=-=-<>-=-=-=-=-=-<>-=-=-=-=-=-=-<
Ryan Mooney                                      [EMAIL PROTECTED] 
<-=-=-=-=-=-=-><-=-=-=-=-=-><-=-=-=-=-=-><-=-=-=-=-=-><-=-=-=-=-=-=-> 