So, I ran perf on my host and the results looked far more like what I expected. The top consumers of time were all atomics and some function called sse3, which I believe is a very fast memcpy implementation provided for the arch. In addition, all of the highest time consumers are within my image: it stayed out of the kernel as designed, and it used additional extensions and features.
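For anyone following along, here is a minimal sketch of the kind of fast path discussed in this thread: check a shared atomic first, and only go to the OS semaphore when there is nothing to read. This is not the library under test. `struct channel`, `channel_init`, `channel_publish`, and `channel_claim` are hypothetical names, and a plain struct stands in for the shm region.

```c
#include <assert.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical channel state. In a real program this would live inside the
 * shm_open()'d region; here it is an ordinary struct for illustration. */
struct channel {
    atomic_size_t available; /* bytes the writer has published */
    sem_t wakeup;            /* posted by the writer after each publish */
};

/* shared != 0 makes the semaphore usable across processes. */
int channel_init(struct channel *ch, int shared)
{
    atomic_init(&ch->available, 0);
    return sem_init(&ch->wakeup, shared, 0);
}

/* Writer side: make n more bytes visible, then wake a sleeping reader. */
void channel_publish(struct channel *ch, size_t n)
{
    assert(n > 0); /* publishing nothing would only cause a useless wakeup */
    atomic_fetch_add_explicit(&ch->available, n, memory_order_release);
    sem_post(&ch->wakeup);
}

/* Reader side: returns the number of bytes claimed.
 * Fast path: one atomic exchange when data is already there.
 * Slow path: wait on the semaphore, then recheck the counter. */
size_t channel_claim(struct channel *ch)
{
    for (;;) {
        size_t n = atomic_exchange_explicit(&ch->available, 0,
                                            memory_order_acquire);
        if (n != 0)
            return n;
        sem_wait(&ch->wakeup); /* may also return early on a signal; we loop */
    }
}
```

Because the writer posts on every publish, the semaphore count can run ahead of the reader. The only cost is that the reader loops a few extra times and rechecks the counter before it really sleeps.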
I just thought of something: what if some kind of page size difference between my host and my Linux kernel is causing the performance problems?

On Nov 24, 2016 11:33 AM, "Kenneth Adam Miller" <kennethadammil...@gmail.com> wrote:
>
> On Thu, Nov 24, 2016 at 11:13 AM, Greg KH <g...@kroah.com> wrote:
> > On Thu, Nov 24, 2016 at 10:31:18AM -0500, Kenneth Adam Miller wrote:
> >> On Nov 24, 2016 2:18 AM, "Greg KH" <g...@kroah.com> wrote:
> >> >
> >> > On Thu, Nov 24, 2016 at 02:01:41AM -0500, Kenneth Adam Miller wrote:
> >> > > Hello,
> >> > >
> >> > >
> >> > > I have a scheduler issue in two different respects:
> >> > >
> >> > > 1) I have a process that is supposed to tight loop, and it is being
> >> > > given very very little time on the system. I don't want that - I want
> >> > > those who would use the processor to be given the resources to run as
> >> > > fast as they each can.
> >> >
> >> > What is causing it to give up its timeslice? Is it waiting for I/O?
> >> > Doing something else to sleep?
> >>
> >> It's multithreaded, so it reads in a loop in one thread and writes in
> >> another thread. What I saw when I ran strace on it is each process
> >> would run for too long- the program is designed to try and stay out of
> >> the kernel on each side, so it checks some shared variables before it
> >> ever goes.
> >
> > So locking/cpu contention for those "shared variables" perhaps?
>
> I don't think that could possibly be it, because the shared variables
> are controlled by atomics. It's just some memory operation to check to
> see if it needs to go to the kernel, as in is there more data in the
> shm region for me to read? If not, I'll go wait on this OS semaphore.
> It's lightening fast on my host machine.
>
> >
> >> > > 2) I am seeing with perf that the maximum overhead at each section
> >> > > does not sum up to be more than 15 percent.
> >> > > Total, probably something
> >> > > like 18% of cpu time is used, and my binary has rocketed in slowness
> >> > > from about 2 seconds or less total to several minutes.
> >> >
> >> > What changed to make things slower? Did you change kernel versions or
> >> > did you change something in your userspace program?
> >> >
> >>
> >> The kernel versions specifically couldnt have anything to do with it
> >> but it was different kernels. The test runs in less that 2 seconds on
> >> my host. When I copy it to our custom linux, it takes minutes for it
> >> to run. I think it's some extra setting that we're missing while
> >> building the kernel, and I don't know what that is. I got a huge
> >> improvement when I changed the multicore scheduling to allow
> >> preemption "(desktop)" but there's still a problem as I've described
> >> with one of the processes not using the core as it should.
> >
> > What do you mean by "custom linux"? Is this the exact same hardware as
> > your machine? Or different? If so, what is different? What is
> > different between the different kernel versions you are using? Does the
> > perf output look different from running on the two different machines?
> > If so, where?
>
> I am building with buildroot a linux that is meant to be really
> stripped down and only have the things we want. In my case, the what
> the bzImage sees is either what QEMU gives it or what it sees in our
> dedicated hardware, with is just off the shelf i7 and other stuff you
> get a market - nothing custom in the sense you are thinking. Custom as
> in, roll your own linux.
>
> The kernel versions between my host and the target are 3.13.x and
> 3.14.5x; they don't change so much, and certainly don't affect
> performance on their own. I'm missing some setting or something with
> how I'm configuring or building linux.
>
> I haven't had a chance to run perf on my host. I can't find what
> ubuntu package it is just yet, but I will search for it in a minute.
> I have to go somewhere and will be right back immediately.
>
> >
> > Have you changed the priority levels of your application at all? Have
> > you thought about just forcing your app to a specific CPU and getting
> > the kernel off of that CPU in order so that the kernel isn't even an
> > option here at all (Linux allows you to do this, details are somewhere
> > in the documentation, sorry, can't remember off the top of my head...)
> >
>
> No, that may be it or help though. I thought that binding an
> application to a particular cpu had something to do with affinity and
> that there was some C api for it or something. That would work for our
> particular scenario, and we've even talked about it, I just don't know
> how to do it yet.
>
> > But really, you should track down what the differences are between your
> > two machines/environments, as something is different that is causing the
> > slow down.
>
> True - the kernel configuration is most suspect based on everything I
> know. The hardware differences between my host to the target we're
> building for is each modern, and well supported by linux. I'm thinking
> it absolutely must have something to do with the way I've built linux.
>
> >
> > You haven't even said what kernel version you are using, and if you have
> > any of your own kernel patches in those kernels.
> >
>
> For the target hardware is 3.14.5x, and there aren't any kernel
> patches at this time; I've disabled grsec while in the process of
> narrowing down what the problem is.
>
> >> > > I think that
> >> > > the linux scheduler isn't scheduling it, because this process is just
> >> > > some unit tests that double as benchmarks in that they shm_open a file
> >> > > and write into it with memcpy's.
> >> >
> >> > Are you sure that I/O isn't happening here like through swap or
> >> > something else?
> >> >
> >>
> >> Well, we're using tmpfs and don't have a disk in the machine, but I
> >> will say this process is using all lot of the address space.
> >> One problem here is that the kernel has more ram than it thinks it does,
> >
> > What do you mean, is this a hardware issue?
>
> I don't think it's hardware; we're using this proprietary software
> beneath the linux kernel, but it's still ram of course. I can't say
> too too much, but what I can say is that while how much linux thinks
> it has could be affecting how it behaves, on our end we have the
> resources and can just change the configuration to make sure that
> linux sees and has enough ram. So that we can test on our end, and
> indeed we will.
>
> >
> >> but what I want to emphasize is that I haven't changed the program to
> >> allocate any more than it was previously. I'm not sure if that's a
> >> kernel change or some setting, but it went from 85% to 98%.
> >
> > What exactly went up by 17%?
>
> Consider the process that I was talking about that is meant to tight
> loop and burn on a core to be the "end product process". This is
> different from the test benchmarks that I was explaining run so
> poorly.
>
> >
> >> The reason
> >> why is that there is a large latency even without that big program in
> >> there; I can't run my standalone tests in qemu without it also taking
> >> minutes. I understand qemu has to emulate, and that's its not just a
> >> VM, but I'm going from host CPU to guest, and the settings are the
> >> same.
> >
> > That doesn't really make much sense, why is qemu even in the picture
> > here? And no, qemu doesn't always emulate things, that depends on the
> > hardware you are running it on, and what type of image you are running
> > on it.
>
> Well, when I'm not at work, I have to be able to run the bzImage on
> something, and I don't have a dedicated machine. So I run it in QEMU.
>
> >
> >> > What does perf say is taking all of your time?
> >>
> >> When I ran perf what it appeared to indicate is that the largest
> >> consumer of time was my library, which should be right in either
> >> scenario because it should use stay out of the kernel as I've designed
> >> it. In addition, the work takes place there anyway, so that's right.
> >> What's not right is the fact that the largest percent of time used is
> >> around 15%, and all the others combined don't add up to anything near
> >> 100.
> >
> > So perhaps you have other processes running on the machine that you are
> > not noticing that is taking up the time slices? Are you _sure_ nothing
> > else is running?
>
> I'm certain that there are other processes alive, but they are not
> using the CPU. This process is the only one running. I even gave qemu
> "-smp 4" because I want it to behave as close as possible to what it
> would if it were just on the host.
>
> >
> > Basically, you have a bunch of variables, and haven't been very specific
> > with what really is changing, or even being used here, so there's not
> > much specific that I can think of at the moment.
> >
> > thanks,
> >
> > greg k-h
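
Two points from this thread can be checked directly from C: the page size each kernel reports, and pinning the process to one CPU as Greg suggested. The following is a minimal sketch, assuming Linux with glibc. `page_size_bytes`, `first_allowed_cpu`, and `pin_to_cpu` are names made up for this example, not part of any existing API.

```c
#define _GNU_SOURCE /* needed for cpu_set_t helpers and sched_setaffinity() */
#include <assert.h>
#include <sched.h>
#include <unistd.h>

/* Page size reported by the running kernel, in bytes, or -1 on error. */
long page_size_bytes(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Lowest-numbered CPU the calling thread is currently allowed to run on,
 * or -1 if the affinity mask could not be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 with errno set on failure. */
int pin_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* pid 0 = caller */
}
```

Printing `page_size_bytes()` from the same binary on the host and on the target would confirm or rule out the page size theory. From a shell, `taskset -c 2 ./prog` sets the same kind of affinity without code changes, where `./prog` stands in for the test binary. Note that pinning only controls where this process runs; keeping other work off that CPU is a separate kernel configuration matter.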
_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies