Re: How to wake_up the wait_queue of a socket?

2013-01-15 Thread Valdis . Kletnieks
On Mon, 14 Jan 2013 17:50:03 +0800, horseriver said:

 When a datagram has arrived, how do I wake_up the wait_queue of that
 socket?

Please clarify your question - I'm not sure which of the following you mean:

1) How does the kernel wake up the waiting process when a datagram
arrives?

2) My kernel is failing to wake up the process, how do I fix it?

3) The kernel is waking the process up, but with high latency and I
want to speed it up.

4) I'm trying to wake up a process for some reason when a datagram arrives
(in which case, you're probably doing something wrong and we need to
discuss what you're trying to achieve)

Let us know in more detail what you want to know.


pgpxU_sHrUpPE.pgp
Description: PGP signature
___
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies


Re: Best way to configure Linux kernel for a machine

2013-01-17 Thread Valdis . Kletnieks
On Wed, 16 Jan 2013 17:47:08 +0530, Shraddha Kamat said:
 I normally do the kernel configuration on my machine like this -

 * copy the distro configuration file  to the kernel dir
 * make menuconfig (answer Y's/N's/M's) Normally keep return key pressed
 for default answers
 * then do the actual kernel compilation


 Now, I know that this is not a clean way to do the kernel compilation
 (although it has worked for me for thousands of times that I have
 compiled and successfully booted up with the kernel - without any issues
 - whatsoever !)

 But this time , I am bent upon coming up with a configuration
 specifically targeted to my machine. What is the best way to do this ?

Take your distro kernel, boot it up.  Make sure to insert any USB storage,
webcams, etc, at least long enough for udev to recognize them and load
their driver modules.

Then cd to your kernel source tree and 'make localmodconfig'.

That will generate a stripped-down config that only builds those modules that
are currently listed in 'lsmod' (which on my laptop is on the order of
1/3 the size of the full Fedora 'allmodconfig').  Which is why it's important
you get all the modules probed - if you don't plug in that USB storage,
the module won't be loaded, so it won't be in lsmod, and won't be included
in your new kernel - at which point you'll use some bad language as you
try to debug why it doesn't work. :)

Also, see the other reply that points at Greg KH's talk.

 Also, while creating a initrd image

 # mkinitrd /boot/initramfs.img 3.8.0-rc3+ -f
 ERROR: modinfo: could not find module ipt_MASQUERADE
 ERROR: modinfo: could not find module iptable_nat
 ERROR: modinfo: could not find module nf_nat
 ERROR: modinfo: could not find module snd_hda_codec_intelhdmi
 ERROR: modinfo: could not find module joydev

 I got the above errors - I know how to resolve these errors , but want
 to understand why in the first place mkinitrd should complain in the
 first place ??

Because if the module was for your keyboard or hard drive or video,
and you got an unbootable kernel as a result, you *really* want to
know at mkinitrd time, not at boot time... :)




Re: no error thrown with exit(0) in the child process of vfork()

2013-01-18 Thread Valdis . Kletnieks
On Fri, 18 Jan 2013 19:59:38 +0530, Niroj Pokhrel said:

 I have been trying to create a process using vfork(). And both of the child
 and the parent process execute it in the same address space. So, if I
 execute exit(0) in the child process, it should throw some error right.

Why do you think it should throw an error?

 Since the execution is happening in child process first and if I release
 all the resources by using exit(0) in the child process then parent should
 be deprived of the resources and should throw some errors right ??

No, because those resources that were shared across a fork() or vfork() were in
general *multiple references* to the same resource.

As an example - imagine a flagpole.  You grab it with your hand, you're
now holding it.  You invite your friend to come over and grab it with
his hand - now he's holding it too.

But either one of you can let go of the flagpole - and the other one is
still holding the flagpole until *they* let go.  And the order you let
go doesn't matter in this case - which is important because your example
code has a race condition.

Note that there are other cases where the order people let go *does* matter.
This is when you start having to worry about locking order and things like
that.
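The flagpole picture can be checked from userspace.  What follows is only a
minimal sketch, assuming fork() rather than vfork() and a pipe as the shared
resource; the function name pipe_survives_child_exit() is made up for
illustration.  The child lets go of its references by exiting, and the
parent's descriptors keep working.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if the parent can still use a pipe
 * after a forked child (which held its own references to both ends)
 * has exited, -1 on any failure.
 */
int pipe_survives_child_exit(void)
{
    int fds[2];
    char buf[8] = {0};
    pid_t pid;

    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);                   /* child lets go of the flagpole */

    if (waitpid(pid, NULL, 0) < 0)  /* the child is definitely gone now */
        return -1;

    /* The parent is still holding on: its descriptors remain valid. */
    if (write(fds[1], "flag", 4) != 4)
        return -1;
    if (read(fds[0], buf, 4) != 4)
        return -1;

    close(fds[0]);
    close(fds[1]);
    return strcmp(buf, "flag") == 0 ? 0 : -1;
}
```

Calling pipe_survives_child_exit() from main() should return 0: the child's
exit only dropped the child's references.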

 In the following code, however the process ran fine even though I have
 exit(0) in the child process 

 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/types.h>
 #include <unistd.h>
 int main()
 {
 int val,i=0;
 val=vfork();
 if(val==0)
 {
 printf("\nI am a child process.\n");

Note that printf() gets interesting due to stdio buffering.  You probably
want to call setvbuf() with _IOLBF and guarantee line-buffering of the output
if you're playing these sorts of games - the buffering can totally mask a real
race condition or other bug.

 printf(" %d ", i++);
 exit(0);
 }
 else
 {

/* race condition here - may want wait() or waitpid() to synchronize? */

 printf("\nI am a parent process.\n");
 printf(" %d ", i);
 }
 return 0;
 }
 // The program is running fine .
 But as I have read it should throw some error right ?? I don't know what I
 am missing . Please point out the point I'm missing. Thanking you in
 advance.

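You're also missing the fact that after a plain fork(), there's no real
guarantee of which will run first - which means that the parent can race
and output the 'printf("%d",i)' *before* the child process gets a chance
to do the i++.  (vfork() is the special case: the parent is suspended until
the child calls _exit() or exec, so the ordering you see here goes away the
moment the code moves to fork().)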

(Aside - for a while, there was a patch in place that ensured that the
child would run first, on the theory that the child would often do something
short that the parent was waiting on, so scheduling parent-first would just
result in the parent running, blocking to wait, and we end up running the
child anyhow before the parent could continue.  It broke an *amazing* amount
of stuff in userspace because often the child would exit() before the parent was
ready to deal with the child process's termination. Usual failure mode was
the parent would set a SIGCHLD handler, and wait for the signal which never
happened because the SIGCHLD actually fired *before* the handler was set up).

(And on non-cache-coherent systems, it's even possible that the i++ happens
on a different CPU first, and the CPU running the parent process never becomes
aware of it.  See 'Documentation/memory-barriers.txt' in the Linux source
for more info on how this works for data inside the kernel.  This example
is out in userspace, so other techniques are required instead to do cross-CPU
synchronization.)
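One way to take the ordering question off the table is to have the parent
wait explicitly.  A minimal sketch, assuming fork() instead of vfork() (so the
child works on its own copy of i); the function name run_child_then_parent()
and the exit status 7 are arbitrary choices for illustration.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child, let it print and exit with
 * status 7, and make the parent wait before it prints anything.
 * Returns the child's exit status, or -1 on error.
 */
int run_child_then_parent(void)
{
    int i = 0, status;
    pid_t pid;

    fflush(stdout);     /* don't let the child inherit unflushed output */

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        printf("child: i = %d\n", i++);
        fflush(stdout);
        _exit(7);       /* _exit(): no atexit handlers, no stdio flush */
    }

    /* Synchronize: block until the child has terminated. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    /* i is still 0 here: fork() gave the child its own copy. */
    printf("parent: i = %d\n", i);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In a real program you would also call setvbuf(stdout, NULL, _IOLBF, 0) at the
top of main(), before any output, instead of sprinkling fflush() around.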




Re: bitops or mutex

2013-01-21 Thread Valdis . Kletnieks
On Mon, 21 Jan 2013 19:16:47 +0530, Prashant Shah said:
 There is a bitmap that needs to be locked across many threads for test
 / set bit operations. Which one is faster - bitops or mutex ?

 1. Bitops :
 set_bit(5, (long unsigned *)tmp);

 2. Mutex :
 mutex_lock(m);
 *tmp = (*tmp) | (1 << 5);
 mutex_unlock(m);

Do you care about "faster" as in less latency, or fewer total cycles
consumed?  The two can be quite different.

One uses a mutex, the other a spinlock and irq save/restore.  "Faster" will
depend on the architecture (irqsave is more expensive on some archs than
others) and how heavily the lock is contended.  If the answer *really* matters,
you'd better go ahead and instrument the code and actually time it and do the
statistical analysis.

Also, double-check that you don't require *additional* locking. It's pretty
rare that the *entire* critical section is exactly one bit-set operation long.




Re: Can jiffies freeze?

2013-01-22 Thread Valdis . Kletnieks
On Tue, 22 Jan 2013 10:29:05 -0800, sandeep kumar said:

 I am seeing this problem very early in start_kernel -->
 mm_init --> free_highpages; at that time nothing is up and the kernel is
 running in a single thread.

If you build a kernel with printk timestamps, you'll see that they all
come out like this:

[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 3.8.0-rc3-next-20130117-dirty 
(val...@turing-police.cc.vt.edu) (gcc version 4.7.2 20121109 (Red Hat 4.7.2-9) 
(GCC) ) #49 SMP PREEMPT Thu Jan 17 13:25:28 EST 2013
[0.00] Command line: ro root=/dev/mapper/vg_blackice-root 
log_buf_len=2M vga=893 loglevel=4 threadirqs intel_iommu=off LANG=en_US.UTF-8
[0.00] KERNEL supported cpus:
[0.00]   Intel GenuineIntel
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009bbff] usable
[0.00] BIOS-e820: [mem 0x0009bc00-0x0009] reserved
(100 or so more lines with same timestamp)
(now we finish memory init)
[0.00] Dentry cache hash table entries: 524288 (order: 10, 4194304 
bytes)
[0.00] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
[0.00] __ex_table already sorted, skipping sort
[0.00] xsave: enabled xstate_bv 0x3, cntxt size 0x240
[0.00] Memory: 4015936k/4718592k available (6266k kernel code, 536744k 
absent, 165912k reserved, 7260k data, 576k init)
(more lines skipped)
[0.00]  memory used by lock dependency info: 5855 kB
[0.00]  per task-struct memory footprint: 1920 bytes
[0.00] hpet clockevent registered
[0.00] tsc: Fast TSC calibration using PIT
[0.00] tsc: Detected 2527.012 MHz processor
[0.001004] Calibrating delay loop (skipped), value calculated using timer 
frequency.. 5054.02 BogoMIPS (lpj=2527012)
[0.001009] pid_max: default: 32768 minimum: 301
[0.001100] Security Framework initialized

It may simply be that your code is running before the clock is started
by the kernel.




Re: Can jiffies freeze?

2013-01-22 Thread Valdis . Kletnieks
On Tue, 22 Jan 2013 11:32:19 -0800, sandeep kumar said:

 as you rightly mentioned,cat /proc/kmsg is showing the time stamps,
 according to that it is 0ms only.
 But when you see the same with UART there is 2sec delay in showing the next
 log. i caught this while i m observing the UART logs with
 Terminaliranicca.

Oh, I could believe there's 2 seconds of time used up there that doesn't
show in kernel timestamps because the timers aren't started yet.

 Since i m early in the mm_init, i cant use watchdog to detect it, hrtimers
 i cant use..i am really thinking how to analyse this delay..

Time for some lateral thinking.. :)

Can you give us some specs on the hardware (in particular, the CPU type/speed
and how much RAM is installed)?  2 seconds on a 2GHz CPU is about 4 billion
cycles.

Also, are you adding any code into the mm_init path? If so, what exactly
are you doing?

I wonder how early the kernel tracing and profiling stuff is enabled.  It may
be possible to boot a kernel that has function-call tracing enabled, which
would not have timing info, but if you see a function that's being called 500K
times that should only be called a dozen times, that's probably your problem :)
(You'd probably want it with 'init=/bin/bash' and dump the stuff, as running to
multiuser will almost certainly roll the buffers and lose the info.)





Re: Can jiffies freeze?

2013-01-23 Thread Valdis . Kletnieks
On Wed, 23 Jan 2013 14:05:25 +0800, bill4carson said:

 Hmmm, all the boot messages are routed into a buffer it first printed into
 console, here there is no delay, possible tick timer are not setup yet.
 But when it does get printed into the console, this process could be
 interrupted by other action as well, that's where you see a 2sec delay.

Unlikely, unless Sandeep is running an actual serial console at a very
low speed (which *can* cause fun on large NUMA machines that spew lots
of messages).  I'm pretty convinced that Sandeep is actually seeing a
2 second delay somewhere near mm_init that isn't reflected in the timestamps
because mm_init runs before the clocks are set up.

Of course, it may not be mm_init *itself* that's causing the delay - all
we *really* know is it's somewhere between a printk in mm_init and the
previous printk - there may be something *else* in between that's the
actual time sink.

Sandeep - I admit not having tried it, but can you see if booting with
'initcall_debug' narrows down where your problem is?  If the initcall
stuff is running early enough (I'm not sure when it starts relative to
mm_init), you'll get a message from each initcall as it is entered and
exited.  With any luck, that will help narrow down exactly where your
problem is.




Re: Intercepting a system call

2013-01-25 Thread Valdis . Kletnieks
On Fri, 25 Jan 2013 18:58:29 +0530, Paul Davies C said:

   [1] is the module I wrote for intercepting the system call fork().

Totally skipping over the details of actually doing it - it's usually
considered a Bad Idea to hook a system call, and 98% of the time there's
a much better way to achieve whatever goal you're trying to accomplish
by hooking the syscall.

In other words, why are you trying to do that in the first place?




Re: locking spinlocks during copy_to_user, copy_from_user

2013-01-25 Thread Valdis . Kletnieks
On Fri, 25 Jan 2013 09:58:42 -0300, Pablo Pessolani said:
 My question is: Is there any known consequence if I enable preemption before
 copy_to_user/copy_from_user (keeping the spinlock locked) and then disable
 preemption again after the copy?

Well, at that point, you potentially have a spinlock locked during operations
that can be preempted, which you noted is not recommended.

The generic problem is that while you're spinning, you can get hit with
a preempt, which ends up rescheduling or other fun stuff, and the preempting
thread ends up calling into the same code - at which point you'll possibly
deadlock because the second thread is now blocked on the spinlock that the
first thread holds...

You're much better off either restructuring your code so you don't do
anything that can preempt, or fixing your locking in other ways so the
problem can't arise.





Re: GRUB question

2013-01-28 Thread Valdis . Kletnieks
On Mon, 28 Jan 2013 06:10:36 +0800, horseriver said:
 On Mon, Jan 28, 2013 at 12:05:35PM +0530, Mandeep Sandhu wrote:
  On Mon, Jan 28, 2013 at 2:07 AM, horseriver horseriv...@gmail.com wrote:
   hi:)
  
 Is /boot/initrd.img a root filesystem? what is the filetype of it?
 
  Yes, it's a rootfs with minimal stuff needed for booting a workable
  system. why does this matter. doing 'file /boot/initrd.img' on my
  system shows its a gzip compressed file.
 
  
 Can I put initrd.img in a floppy to boot system ?
 
  I think you can. Provided you have the floppy driver compiled into your 
  kernel.

And assuming the initrd fits on a floppy (which is actually unlikely - even
without any kernel modules on it, the initrd to get LVM launched comes in
at around 8M.  A default Fedora initramfs is closer to 20M.  Good luck fitting
that on a floppy :)

Of course, an initrd on floppy is kind of silly, because you still need to
find someplace else to fit the actual kernel - which hasn't fit on a floppy
for quite some time.

   Thanks!

   Does this /boot/initrd.img file come out when building kernel ?
   how to build it?

Your system should have either 'mkinitrd' or 'dracut' to build the
initrd image. Debian-based systems will have 'mkinitramfs' instead.




Re: thread concurrent file operation

2013-01-29 Thread Valdis . Kletnieks
On Tue, 29 Jan 2013 16:56:02 +0100, Tobias Boege said:

 Look some lines above:

   struct fd f = fdget(fd);

That creates a reference, not a lock.  It basically assures that
the system doesn't reap and reclaim that fd out from under the code.
(In other words, it's managing lifetime, not concurrency).




Re: thread concurrent file operation

2013-01-29 Thread Valdis . Kletnieks
On Tue, 29 Jan 2013 18:25:19 +0100, Karaoui mohamed lamine said:

 This function is supposed to return the file reference; does it do the locking?

Refcounting only, no locking provided by fdget.

 It seems that I can't find the lock instruction (with all those rcu
 instructions, I am a little lost), can you guide me through?

Because it isn't there. Concurrent writes can happen - that's why lockf()
exists, so that multiple programs that want to scribble on the same file can do
their locking.





Re: Kernel Config for Chromium Browser?

2013-01-31 Thread Valdis . Kletnieks
On Thu, 31 Jan 2013 16:15:45 +0100, Martin Kepplinger said:
 I stripped down my .config for my kernel-compilation a bit, but thought
 that I really just removed unnecessary stuff. But really, the
 consequence was that the Chromium Browser didn't load _any_ page. Not
 even locally and no chrome:// page. It started, but just stayed at a
 white page.

 I didn't change the system whatsoever. I know it was the kernel. Does
 anyone by chance know what parameter caused that behaviour? What does
 chromium do differently? different than firefox.

There's too many possibilities to count, actually.  It's probably possible
to debug it and figure out *which* thing you missed, but this is probably
a lot faster and more accurate:

1) Boot your distro kernel, which is probably an 'allmodconfig' and will
end up loading a whole pile of modules.
2) Insert all your USB memory sticks, webcams, disk drives, and other
peripherals, at least long enough for udev to see them and load their
respective device drivers.
3) At this point, 'lsmod' should list pretty much every module you actually
use during normal use.
4) cd to wherever you have your kernel source tree, and 'make localmodconfig'.
This will take the output of 'lsmod' and customize the kernel for you.
5) Then proceed to make/make install/reboot and enjoy. :)

Note that in step 4, it *is* possible to miss a kernel module that you
may need in the future (that's why I said to insert all the peripherals,
so their modules get included).  It will usually show up as something
like "You add a new rule/option to iptables and it doesn't work" or
similar.  At that point, you just have to go enable that missing option.

(It's possible to strip down a distro kernel a *lot* - comparing
the current Fedora Rawhide kernel with the one I have booted now:

[~] grep '=[ym]' /boot/config-3.8.0-0.rc5.git1.1.fc19.x86_64 | wc -l
3741
[~] grep '=y' /boot/config-3.8.0-0.rc5.git1.1.fc19.x86_64 | wc -l
1490
[~] grep '=m' /boot/config-3.8.0-0.rc5.git1.1.fc19.x86_64 | wc -l
2251
[~] grep '=[ym]' /boot/config-3.8.0-rc3-next-20130117 | wc -l
1209
[~] grep '=y' /boot/config-3.8.0-rc3-next-20130117 | wc -l
924
[~] grep '=m' /boot/config-3.8.0-rc3-next-20130117 | wc -l
285

And I could get that 1209 down to well under 900 - there's a few parts
of the kernel (iptables, crypto, and some filesystems) that I mostly
build just to give them build/test coverage.

Yes, it builds 3 times faster than the Fedora kernel. ;)





Re: Android Kernel Compilation

2013-01-31 Thread Valdis . Kletnieks
On Thu, 31 Jan 2013 18:24:01 +0100, Matthias Brugger said:
 2013/1/30 Rahul Gandhi rahul.rahulg...@gmail.com:
  I am trying to compile Kernel for my Android device. I am using the NDK
  Toolchain (arm-linux-androideabi-4.4.3). When I use the defconfig, the
  kernel compiles without any errors but when I flash it onto my device, it
  either gets stuck on the HTC logo or continuously reboots.
  If I pull the config.gz from my device, it gives errors at the time of
  compilation.
 
  What could have possibly gone wrong?

 first of all, check the kernel logs. that will give you a clue where
 to start digging.

If it hangs on the HTC logo or reboots, his kernel isn't living long enough
for userspace to retrieve the dmesg buffer.

First thing I'd try is a combo of the 3 kernel parameters 'earlyprintk',
'ignore_loglevel', 'initcall_debug' and either serial console or netconsole.
Though it's quite possible that he's dying before even that infrastructure can
give a hint, in which case it gets a lot trickier (and will probably
require some help from the hardware platform in the form of either a JTAG
interface or enough infrastructure to use kgdb or similar tool...)






Re: kernel driver vs userspace program

2013-01-31 Thread Valdis . Kletnieks
On Thu, 31 Jan 2013 13:38:07 -0500, Simon said:
 Hi guys,
   I'm building an electrical device which will be controlled by
 computer.  It will have an embedded microcontroller and will use USB
 to communicate with the PC.  I believe this calls automatically for a
 device driver, correct?  And for using the machine from the PC,
 interacting with it, that calls for a userspace program, correct?  I
 mean, doing things differently, such as all in userspace or all
 in-kernel, would be bad form, right?

My *first* reaction would be "do it almost all in userspace" and
use libusb to talk to the device from userspace.  Unless there's
weird wonkiness or quirks that have to be handled by a kernel module.

   My question is in case the machine is used in an industrial context
 where there is really only one usage and one kind of interaction that
 follows a pre-determined procedure (therefore totally automated).

There's no reason that an embedded system can't fire up a /sbin/init
that isn't a standard 'init' but is a program to do the process control
needed - in fact, most no-MMU and many embedded systems do that.

 This could give extremely high priority of execution, I guess.

First, see if you're able to meet the timing constraints from a
regular userspace before worrying about going the RT and/or kernel route.
A lot of embedded controllers are amazingly fast and may not need
any extra assistance to meet the timing requirements.

 Similar to a factory robot controlled by a computer.  Would it make
 more sense to have everything in kernel space, while the userspace (if
 any) would only serve the purpose of reporting?

Especially in the embedded world, there really isn't one right answer.
You'll have to do some trial-and-error to see what balance of userspace
versus kernel is the proper fit for your application.  But in general,
you want to try to keep it in userspace (where things are more protected
in case of a stray pointer, etc) if at all possible.




Re: open image file

2013-02-05 Thread Valdis . Kletnieks
On Tue, 05 Feb 2013 04:59:42 +0800, horseriver said:
 hi:

   It is not a cpio archive , so that command can not work .

   its file system type is tmpfs.

Umm. No. It's not tmpfs.

tmpfs is a specific ram/swap based filesystem - basically, take enough
4K pages for the size= parameter and do it in memory.  Major user-visible
difference from the older 'ramfs' is that tmpfs pages can move to swap
space, and ramfs pages are nailed down in RAM.

mount -t tmpfs /dev/loop0 /mnt

This never actually looks at /dev/loop0 *at all*.  You could even say this:

mount -t tmpfs none /mnt

and it would work just fine.  Try leaving the '-t tmpfs' off entirely and
let the mount command figure out what type it is, and see if that works
any better for you.




Re: Process exit codes

2013-02-05 Thread Valdis . Kletnieks
On Tue, 05 Feb 2013 14:07:37 +0100, Grzegorz Dwornicki said:

 I guess that there may be a better API that why this thread was created in
 first place. My project goal is to make process checkpoints like cryopid
 had. This is for my thesis and will be GPL for everyone after my
 graduation. I am researching this subject at this point.

In that case, you want to go look at the checkpoint/restart patches that are
already in the kernel, and in progress.  Hint - it's a *lot* harder to do this
right than a thesis project (unless you want to only do a very restricted
subset, like no open files, no TCP connections, etc).  In fact, a lot of
the 'namespace' stuff was added to help support C/R.  For instance, the PID
namespace is there to deal with the fact that if you checkpoint a process with
PID 23974, you need to be able to guarantee that it gets 23974 on restart (as
otherwise you hit problems with getpid() and kill() not referring to the
process you thought they did). Of course, this majorly sucks if that PID is
already in use. The solution there is to spawn a new, empty PID namespace to
guarantee that number is available...

https://ckpt.wiki.kernel.org/index.php/Main_Page  is a good place to start.





Re: pr_info not printing message in /var/log/messages

2013-02-05 Thread Valdis . Kletnieks
On Wed, 06 Feb 2013 04:43:20 +0800, Jimmy Pan said:

 in fact, i've been always wondering what is the relationship between dmesg
 and /var/log/message. they diverse a lot...

What ends up in /var/log/messages is some subset (possibly 100%, possibly 0%)
of what's in dmesg.  Where your syslog daemon routes stuff is a local config
issue - if your syslogd supports it, there's no reason not to dump the iptables
messages into /var/log/firewall and the rest of it in /var/log/kernel, or
any other policy that makes sense for the sysadmin.




Re: When does the /dev/sda1 node comes into being ?

2013-02-05 Thread Valdis . Kletnieks
On Wed, 06 Feb 2013 01:26:44 +0800, horseriver said:
   During the boot period, every device gets a node in the /dev/ folder.
   What are the details of this procedure?

'man udev'.  Although the details are a tad murkier for kernels after
2.6.32 that include CONFIG_DEVTMPFS in the config.

Also, note that not all systems will have a /dev/sda1 - that assumes a
partition table on a particular type of disk handled by a specific device
driver.  If that disk has no recognizable partition table, it will just have a
/dev/sda entry.  If the disk is driven by a different driver, you'll see
/dev/hda entries instead.  And if your boot storage device is an SD card
or something, you may have /dev/mmcblk0 or other entries.

And this:

[~] ls -l /dev/sdre1
brw-rw---- 1 root disk 133, 385 2012-11-27 05:17 /dev/sdre1

is how I pay the rent. :)  (Bonus points if you can figure out why
my system reports that. :)
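For anyone who wants to check their answer to the bonus question (spoiler
ahead): the sketch below maps an sd disk name to a device number.  It is based
on my reading of how the sd driver numbers disks (bijective base-26 names, 16
block majors of 16 disks each, extra disks pushed into the high minor bits);
treat that layout and every function name here as an assumption and check
drivers/scsi/sd.c in your own tree.

```c
#include <stddef.h>

/* "sda" -> 0, "sdz" -> 25, "sdaa" -> 26, ...; -1 on malformed names. */
int sd_index_from_name(const char *name)
{
    int index = 0;
    size_t i;

    if (name[0] != 's' || name[1] != 'd' || name[2] == '\0')
        return -1;
    for (i = 2; name[i] != '\0'; i++) {
        if (name[i] < 'a' || name[i] > 'z')
            return -1;
        index = index * 26 + (name[i] - 'a' + 1);   /* bijective base 26 */
    }
    return index - 1;
}

/* Block majors assumed reserved for sd: 8, 65..71, 128..135. */
int sd_major_for_index(int index)
{
    int major_idx = (index & 0xf0) >> 4;

    if (major_idx == 0)
        return 8;
    if (major_idx < 8)
        return 65 + major_idx - 1;
    return 128 + major_idx - 8;
}

/* Minor of partition part (0 = whole disk, 1..15) on disk index. */
int sd_minor_for_index(int index, int part)
{
    return (((index & 0xf) << 4) | (index & 0xfff00)) + part;
}
```

Under those assumptions "sdre" is disk index 472, which lands on major 133
with minor 385 for partition 1, matching the ls output above.  In other words,
that box has several hundred SCSI disks (or LUNs) attached.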





Re: hard disk dirver

2013-02-05 Thread Valdis . Kletnieks
On Wed, 06 Feb 2013 02:53:11 +0800, horseriver said:
   At boot time, the bootloader loads the kernel from hard disk to memory.
   During this period, does it need the hd driver's support?

Think for a bit - at that point, the hd driver hasn't been loaded
yet, so it *can't* need the hd driver's support.  So instead, the
bootloader has a very dumb simplistic driver that's stripped down
(for instance, no queued command support, one I/O in flight at a
time, very little error handling, read operations only, etc etc)
that's just enough to load the kernel and initrd. (There's often
also a very stupid filesystem driver, just enough to read files.
So for instance 'grub' can find the files for the kernel and initrd.
Some bootloaders are too stupid for even that, and you have to
run a special program to tell the boot loader where all the blocks
of the file are - I'm looking at you, LILO. :) )

Once the bootloader gets the kernel and initrd loaded, *then* the
kernel can initialize the production driver with all the bells and
whistles needed.




Re: hard disk dirver

2013-02-05 Thread Valdis . Kletnieks
On Wed, 06 Feb 2013 05:37:41 +0800, horseriver said:
   After grub loads the kernel and initrd, it gets around to mounting the root
   filesystem, but fails because it cannot find the root device from which the
   kernel and initrd were loaded.

   ls /dev/ ;  there is no disk device node  .

   Why?

Any number of possible reasons, anything from an improperly configured
kernel, to a misbuilt initrd, to a root= parameter that points someplace
broken, to




Re: hard disk dirver

2013-02-06 Thread Valdis . Kletnieks
On Wed, 06 Feb 2013 12:30:37 +0800, horseriver said:

   root = ?   You mean the assignment at the grub command line?

For instance, the grub entry for the kernel I'm running right now:

title 3.8.0-rc6-next-20130206
kernel /vmlinuz-3.8.0-rc6-next-20130206 ro root=/dev/mapper/vg_blackice-root log_buf_len=2M vga=893 loglevel=4 threadirqs intel_iommu=off LANG=en_US.UTF-8
initrd /initramfs-3.8.0-rc6-next-20130206.img

(Strictly speaking, it's not the grub command line, it's the kernel command
line that is passed to the kernel by grub or lilo or grub2 or syslinux
or whatever boot loader happens to float your boat).
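The running kernel exposes the command line it was booted with in
/proc/cmdline, so you can check what root= actually arrived.  A minimal
sketch of picking one parameter out of such a string; cmdline_param() is a
hypothetical helper, and it assumes parameters are separated by spaces or
newlines with no quoting.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the value of "key=" from a kernel command
 * line into out (at most outlen-1 bytes, always NUL-terminated).
 * Returns 0 if the key was found, -1 otherwise.
 */
int cmdline_param(const char *cmdline, const char *key,
                  char *out, size_t outlen)
{
    size_t klen = strlen(key);
    const char *p = cmdline;

    if (outlen == 0)
        return -1;

    while (*p != '\0') {
        size_t toklen = strcspn(p, " \n");  /* length of this parameter */

        if (toklen > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
            size_t vlen = toklen - klen - 1;

            if (vlen >= outlen)
                vlen = outlen - 1;          /* truncate to fit */
            memcpy(out, p + klen + 1, vlen);
            out[vlen] = '\0';
            return 0;
        }
        p += toklen;
        p += strspn(p, " \n");              /* skip separators */
    }
    return -1;
}
```

Feed it one line read from /proc/cmdline with fgets() and ask for "root";
with the entry above it would hand back /dev/mapper/vg_blackice-root.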




Re: hard disk dirver

2013-02-06 Thread Valdis . Kletnieks
On Wed, 06 Feb 2013 13:21:17 +0800, horseriver said:
 At boot stage, the kernel needs to detect the hard disk before mounting it;
 does this work need PCI support?

That depends.  Is the controller for the hard drive a PCI-based controller? On
most x86-based boxes, it is (and I'm not sure it's even possible to build an
x86 kernel that doesn't have PCI as a =y in the config).  However, very old
units may still have ISA based disk controllers, and other archs may have other
I/O buses.

At the loading stage, the boot loader needs to move binaries from the hard disk
partition to RAM. Does this work need PCI support?

Same as above.




Re: Creating scheduler

2013-02-06 Thread Valdis . Kletnieks
On Wed, 06 Feb 2013 23:19:26 +0530, jeshkumar...@gmail.com said:
 Can anyone suggest a good tutorial to create our own scheduler ?

Doing an I/O scheduler is pretty trivial, and there's a number of
examples in-tree already to look at.

If you mean a CPU scheduler, the major reason why there's no tutorial
is because writing a non-toy scheduler is *hard*, and by and large
anybody who's a good enough kernel hacker to write a working scheduler
doesn't need a tutorial.

Why is it hard?  Lots of reasons.  Even on a single-core, single-thread
CPU, it's hard to do a good job of picking the next task to run, mostly
because tasks are so damned good at changing behavior.  You decide that
it would be good to run an I/O bound task, so you pick a task that went
into an I/O wait its last 12 times on the CPU - at which point the task
turns around and goes CPU bound crunching all the data it read in the
last 12 times. :)

You also have interactions with thermal issues and frequency governors
(usually, cranking to highest frequency and doing race-to-idle and then
dropping to lowest freq results in the lowest total energy use, but especially
for high-density applications, there may be an upper limit on watts per second
that you can cool, resulting in trade-offs being needed). Then there's cache
affinity issues, balancing load across cores on multi-socket systems, etc etc
etc...





Re: hard disk dirver

2013-02-06 Thread Valdis . Kletnieks
On Wed, 06 Feb 2013 13:20:13 -0500, Greg Freemyer said:

 Most new MB's have a SATA controller directly on the MB connected directly
 to either the North or South bridge (I don't know which).

 I don't think any PCI support is needed to talk to the boot disk.

Yes, but said SATA controller and north/southbridge are usually emulating
a PCI device:

% lspci
00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07)
00:01.0 PCI bridge: Intel Corporation Mobile 4 Series Chipset PCI Express Graphics Port (rev 07)

00:1f.2 RAID bus controller: Intel Corporation 82801 Mobile SATA Controller [RAID mode] (rev 03)

So no actual PCI slots involved there, but that PCI bridge is going to require
PCI support to program all the BARs and other stuff to talk to that 82801.




Re: Creating scheduler

2013-02-06 Thread Valdis . Kletnieks
On Wed, 06 Feb 2013 20:40:47 +0100, Jonathan Neuschäfer said:

 I'm sorry to ask, but don't you rather mean watts than watts per second?

There may indeed be a second order time component involved - for instance, a
cooling system that can handle 10 watts continuously, 20 watts for up to 30
seconds, or 40 watts for 10 seconds max. And of course, 40 watts steady for 10
seconds is different from averaging 40 watts but bouncing between 30 and 50
watts for 10 seconds etc etc..

And of course, there's usually a per-system limit, and per-chip limits, and
your power/cooling budget constraints may force you to go for a higher value
on one to make the budget for the other (burn an extra 0.5 watts in chip A
in order to get Chip B under 0.87 watts type stuff)





Re: hd controller

2013-02-07 Thread Valdis . Kletnieks
On Thu, 07 Feb 2013 16:19:33 +0800, horseriver said:
 hi:)

I am curious about how the hd controller works.
When a user is reading/writing the hd, it is implemented by sending commands
to the hd controller's special port. Then how does the controller know
a new command has been received?

In this procedure, what work does the hd driver do?

You may wish to get a copy of 'Linux Device Drivers, 3rd Edition'
and read it before posting lots of questions here.

A free version is available online, and last I checked it was the very
first hit if you google for 'Linux device drivers'.




Re: pr_info not printing message in /var/log/messages

2013-02-07 Thread Valdis . Kletnieks
On Thu, 07 Feb 2013 23:20:27 +0530, anish kumar said:

 Other interesting standard logs managed by syslog
 are /var/log/auth.log, /var/log/mail.log.

Other interesting *common* logs, as shipped pre-configured by some distros.

They are hardly a standard (unless the definitions of these
managed to sneak into Posix or the LSB or similar while I wasn't
looking).




Re: hd controller

2013-02-07 Thread Valdis . Kletnieks
On Fri, 08 Feb 2013 07:48:39 +0800, Peter Teoh said:

 So the drivers just literally concatenate these commands into a string and
 send it over to the device.

The reason that good disk drivers are hard to write is because it isn't
*just* literally concatenating the commands - it also has to do memory
management (make sure that everybody's data ends up in the right buffers),
command queue management, elevator management (if there's multiple I/O
requests pending from userspace, what order do we issue them in?), error
recovery, power management, and a ton of other stuff...




Re: MAX limit of file descriptor

2013-02-11 Thread Valdis . Kletnieks
On Sat, 09 Feb 2013 13:10:47 +0800, horseriver said:

In one process, what is the max number of open file descriptors?
Can it be set to infinite?

In network programming, what is the essential factor for the maximum number
of connections handled per second?

In general, you'll find that number of file descriptors isn't what ends up
killing you for high-performance network programming.  What usually gets you
are things like SYN floods (either an intentional DDoS or getting slashdotted),
because each time you do an accept() on an incoming connection you end up
using userspace resources to handle the connection.  So the *real* question
becomes how many times per second is your box able to fork() off an httpd,
do all the processing required, and close the connection?
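
For the first half of the question, a small sketch of how a process can
inspect its own descriptor limit.  This assumes a Unix-like system where
Python's resource module is available; the describe() helper is made up for
illustration:

```python
import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, spelling out the 'no limit' sentinel."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


print("soft limit:", describe(soft))
print("hard limit:", describe(hard))
```

The soft limit is what the process is actually held to; it can be raised up
to the hard limit without extra privileges.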

A secondary gotcha is that dying TCP connections end up stuck in FIN-WAIT and
FIN-WAIT-2,

And if you're trying to drive multiple 10G interfaces at line speed, it
gets even more fun.  Fortunately, for my application (high performance disk
servers) the connections are mostly persistent, so it's only a problem of
getting disks to move data that fast. :)





Re: MAX limit of file descriptor

2013-02-11 Thread Valdis . Kletnieks
On Mon, 11 Feb 2013 06:07:38 +0800, horseriver said:

  Actually, my question comes from network performance. I want to know the
  maximum number of TCP connections per second that my server can handle.

That will be *highly* dependent on what your server code does with each
connection.  A "hello world" reply and close socket will, of course, go lots
faster than something that has to go contact an enterprise-scale database,
do 3 SQL joins, and format the results.

  How can I do the test and calculate the connection number? Is it possible
  that my server can deal with 10k TCP connections per second?

10K/sec peaks can be achieved even on a laptop, assuming a dummy do-nothing
service.  Keeping that sustained for a real application will depend on the
service time needed - if you have 20 CPUs in the box, and spread the load
across all 20, you have to average under 2ms to service each request, which
will be a killer if you have to go to disk at all for a request.  At that
point, the guys at Foundry will be more than happy to sell you a load-balancer
so you can have a stack of 10 20-CPU servers each of which only handles 1K/sec
and thus has a 20ms time budget.
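
The arithmetic behind those two figures, as a rough sketch.  It assumes the
load is spread evenly and that each request ties up exactly one CPU for its
whole service time; the function name is made up for illustration:

```python
def service_time_budget_ms(cpus, requests_per_sec):
    """Average per-request service time (ms) that keeps `cpus` cores
    from falling behind, assuming load is spread evenly and each
    request occupies exactly one core while it is serviced."""
    return cpus * 1000.0 / requests_per_sec


print(service_time_budget_ms(20, 10_000))  # one 20-CPU box at 10K/sec
print(service_time_budget_ms(20, 1_000))   # ten such boxes, 1K/sec each
```

Each CPU has 1000 ms of work time per second, so 20 CPUs give 20000 ms to
divide among however many requests arrive in that second.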

  what is the relationship between this and throughput rate?

Lots of tiny connections will totally suck at aggregate throughput, if for no
other reason than TCP slow-start never gets a chance to really open the
transmit window up.  But in general, there is always a trade-off
between transaction rate and throughput.

  Is there document that tells the best optimization of this ?

"Best" is defined by what your application actually needs.  The best
settings for my NFS server will be totally different than what the HTTP
server 12 racks over needs...





Re: start address of the code segment of the program on x86-64

2013-02-14 Thread Valdis . Kletnieks
On Thu, 14 Feb 2013 15:33:48 +0200, Kevin Wilson said:
 Hi,

 The address 0x08048000 is the start address of the code segment of a
 program on x86-32.

More likely, it was the start address of *one particular run* of the
program.  In most kernel configurations, there's something called Address
Space Layout Randomization (ASLR) that makes the code land at different
places each time, to make it harder to write exploits because you can't
hardcode addresses.

 What is the start address of the code segment of the program  on x86-64 ?

 Is there a place in the kernel code where I can add a printk on a
 x86_64 machine to view the code segment
 start ?  How can it be done ?

'cat /proc/self/smaps' and ponder for a while.  Try it twice and compare
and see if you can see what ASLR does.
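
A sketch of doing the same comparison programmatically, assuming the simpler
/proc/<pid>/maps format (one mapping per line).  The helper name and the
sample line are made up for illustration:

```python
def first_executable_mapping(maps_text):
    """Return (start, end, path) of the first mapping with execute
    permission in text formatted like /proc/<pid>/maps, or None."""
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        addr_range, perms = fields[0], fields[1]
        if "x" in perms:
            start, end = (int(x, 16) for x in addr_range.split("-"))
            path = fields[5] if len(fields) > 5 else ""
            return start, end, path
    return None


# Illustrative (made-up) line in the /proc/<pid>/maps format.
sample = "00400000-0040b000 r-xp 00000000 fd:01 1234  /bin/cat\n"
start, end, path = first_executable_mapping(sample)
print(hex(start), hex(end), path)

# To look at a real process, feed it open("/proc/self/maps").read();
# running the script twice shows whether the addresses move.
```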

You may also want to think about *why* you want to know where the code
segment starts.  If you know what this address is, what do you plan to
use it for?  (In other words, there's probably a different, easier way
to do whatever it is you're trying to accomplish here)...





Re: using prefetch

2013-02-15 Thread Valdis . Kletnieks
On Fri, 15 Feb 2013 12:16:02 +0200, Kevin Wilson said:

 Is the prefetch operation synchronous ? I mean, after calling it, are
 we guaranteed that the variable is
 indeed in the cache ?

No, the whole *point* is that it's asynchronous.  You issue the prefetch
several lines of code before you need it to be in cache, so that you
can get several lines of hopefully not data-dependent code to run while
the cache line fetch happens, rather than take a stall when you reference
the variable.  The prefetch may in fact not complete in time, but at
worst you end up just stalling for a cache miss the same as you would have
otherwise.

 According to this logic, anywhere that we want to call skb_shinfo(skb)
 we better do a prefetch before.

No, because most references to skb will be cache-hot because you're in the
middle of the IP stack, which touches the skb struct all over the place, and
therefore it's probably in L2 already.

 In fact, if we prefetch any variable that we want to use then we end up
 with performance boost.

Nope. Not as true as you might think.  If you play around with the 'perf'
command you'll find out that on modern processors you'll see a 98% or so
hit rate on the L2 cache - so 98% of the time you'll *waste* a cycle
issuing the opcode needlessly.

If you look carefully at some of the other structs in the net/ subtree,
you'll see where they've put variables together so that once you reference
one field of the struct, all/most of the needed stuff gets sucked in on
the same cache line.  That's probably more productive than trying to add
prefetch calls all over the place.

 So - any hints, what are the guidelines for using prefetch()?

Only use it if you have good reason to believe that you *will* need
that variable (in other words, it's not in the unlikely half of an if
statement or something) *and* there's a good chance that the
variable/memory is cache-cold.




Re: process 0 (swapper)

2013-02-16 Thread Valdis . Kletnieks
On Sat, 16 Feb 2013 18:48:52 +0200, Kevin Wilson said:

  ~0U is not 0 but -1;

-ENOCAFFEINE.

You'd think that after having done kernel-level C programming since the days
of SunOS 3.1.5 and BSD 4.2 I'd k   know better. ;)




Re: Tracing SIGKILL, is that possible?

2013-02-18 Thread Valdis . Kletnieks
On Mon, 18 Feb 2013 15:46:58 -0300, Daniel. said:
 Is there a way to track signals, especially SIGKILL? I would like to
 know if some process dies because it reached some resource limit, because
 of an OOM error or something similar.

Depends on where you want the tracking to go.  But your first thing to try
would probably be:

echo 1 > /proc/sys/kernel/print-fatal-signals

which controls this code in kernel/signal.c:

static void print_fatal_signal(int signr)
{
	struct pt_regs *regs = signal_pt_regs();
	printk("%s/%d: potentially unexpected fatal signal %d.\n",
		current->comm, task_pid_nr(current), signr);

Bahh.  That's missing a KERN_INFO.  Patch submitted.





Re: Linux Kernel

2013-02-18 Thread Valdis . Kletnieks
On Tue, 19 Feb 2013 10:50:26 +0530, kapil agrawal said:

 How the linux kernel runs in the system after spawning the init and
 mounting the root FS.
 Does it run as some background process ?

No.  You probably want to get some basic knowledge about operating
systems in general.

http://en.wikipedia.org/wiki/Operating_system

as you appear to be confused regarding the basic concepts of an
operating system kernel.

 How it serves the system calls etc. ?

There's about 5 different answers to that, depending on how in-depth
you want the details, but I suspect that none of them will make any sense
to you until you get a better grasp on the basics




Re: Linux Kernel

2013-02-18 Thread Valdis . Kletnieks
On Tue, 19 Feb 2013 12:01:55 +0530, kapil agrawal said:

 Do you mean process with PID 0 is the one, which runs in the background and
 serves the request from userland and goes to cpu_idle() if nothing to run.

No.  Large parts of the kernel run in kernel mode, but using the 'struct
task_struct' of the related userspace process (in particular, most system
calls work this way).  Other large chunks borrow the 'struct task_struct' and
run under it just so there's *a* process running.  And parts aren't in process
context at all, but interrupt context (so they aren't running as process code
at all).




Re: SIGKILL and a sleeping kernel module

2013-02-19 Thread Valdis . Kletnieks
On Tue, 19 Feb 2013 10:37:28 +0200, Kevin Wilson said:
 Hi all,
 I am trying to send a SIGKILL to a kernel module which is sleeping.
 I added a printk after the sleep command.
 Sending a SIGKILL (by kill -9 SIGKILL pidOfKernelThread) does **not**
 yield the message from printk("calling do_exit\n");
 which is immediately after the msleep() command, as I expected.

Others have mentioned the various types of sleeping in the kernel, but
overlooked a minor detail.  If a task is in the kernel in a non-interruptible
state, signals are queued and delivered once that status is cleared (which
often doesn't happen until a syscall is about to return to userspace).

The reason this detail is important for would-be kernel hackers:

If one kernel thread manages to BUG() or oops() or otherwise die
or wedge up while holding a lock, other processes can end up blocking
while waiting for the lock.  The problem is that the other processes are
usually in non-interruptible state when they try to take the lock.  The
end result is that you end up with processes that are blocked in the
kernel, and you can't kill -9 them - you're basically stuck with them
until you reboot.  This is why your system will often limp along and
slowly become more and more wedged up after a BUG().

Also - the fact that /bin/ps shows a D or S does *not* in fact mean the
process is in a sleep state inside the kernel.  That's *usually* the case,
but it's quite possible for the code to be actively executing and burning
lots of CPU (often because it's stuck in a loop that's failing to make
forward progress).  The result there is that ps shows a D/S but your
CPU starts getting *very* warm







Re: cpu_relax(), rep: nop, and PAUSE

2013-02-19 Thread Valdis . Kletnieks
On Wed, 20 Feb 2013 01:58:17 +0700, Mulyadi Santosa said:
 On Tue, Feb 19, 2013 at 7:20 PM, David Shwatrz dshwa...@gmail.com wrote:
  Hi, kernel newbies,
 
  We have:
  #define cpu_relax() asm volatile("rep; nop")
  in arch/x86/boot/boot.h.
 
  Why don't we use the PAUSE assembler instruction here ?

 Just guessing, maybe rep+nop could do better power saving because
 processor is considered as idle.

The 'rep; nop' is actually a placeholder - for some CPUs, a different opcode
gets filled in during boot time.  See arch/x86/kernel/alternative.c and
arch/x86/include/asm/alternative.h for the gory details.




Re: unsubscibe

2013-02-21 Thread Valdis . Kletnieks
On Thu, 21 Feb 2013 15:57:46 +0530, Sandeep Sonawane said:

 Please remove my email id sandeep.sonaw...@gmail.com from this DL.

If your mail software supported RFC2369 mail headers, you would have
seen the following on every posting to the list:

List-id: Learn about the Linux kernel <kernelnewbies.kernelnewbies.org>
List-unsubscribe: <http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies>,
    <mailto:kernelnewbies-requ...@kernelnewbies.org?subject=unsubscribe>
List-archive: <http://lists.kernelnewbies.org/pipermail/kernelnewbies>
List-post: <mailto:kernelnewbies@kernelnewbies.org>
List-help: <mailto:kernelnewbies-requ...@kernelnewbies.org?subject=help>
List-subscribe: <http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies>

Nice clickable links.




Re: Sending an IP packet

2013-02-22 Thread Valdis . Kletnieks
On Fri, 22 Feb 2013 14:36:17 +0200, Adel Qodmani said:

 My question is quite simple, I have an sk_buff that I want to transmit, the
 sk_buff is an ICMP message and so far, I've built the headers and set up
 everything.

Others have given some details on how.  A better question is why.

Sending an ICMP message without the rest of the IP stack's knowledge is usually
a bad idea, because it can cause the remote end's concept of network state to
become desynchronized with the local concept.  As a quick example, consider a
spurious 'host/port unreachable' sent to the remote end - many IP stacks will
use that info to abort a TCP 3-packet handshake.  However, the rest of *your*
end thinks the connection is still trying to establish.

So what are you trying to accomplish by sending a forged ICMP packet from
within the kernel?  There may be better ways to approach it.  For example,
if you're trying to say "this port is closed", a better way is to use iptables
with a '-j REJECT --reject-with <type>' rule, which will (a) do all the heavy
lifting of sending the ICMP for you and (b) also prevent the packet from making
it to the rest of the local IP stack.




Re: Sending an IP packet

2013-02-22 Thread Valdis . Kletnieks
On Fri, 22 Feb 2013 17:15:35 +0200, you said:

 I am trying to implement a new protocol that we've designed which works on
 top of the IP layer, so I am using ICMP messages to carry control
 information for the protocol.
 Why using ICMP, it seemed natural since our protocol is a Network-layer
 protocol and ICMP is a control messages protocol.

In that case, you *really* want to go look at how TCP and SCTP and other
protocols handle ICMP integration.  You want an API that integrates your ICMP
handling with the rest of the protocol stack, because otherwise you'll
end up with an unmaintainable mess.  Also, it will be about 436 times easier
to extend your protocol to work correctly over IPv6. :)

Go look at net/ipv4/udp.c, functions __udp4_lib_err() and __udp4_lib_rcv(),
particularly the latter's use of icmp_send().  You'll want to extend icmp_send()
to handle your additional control information.




Re: atomic operations

2013-02-24 Thread Valdis . Kletnieks
On Sun, 24 Feb 2013 11:50:14 +0100, richard -rw- weinberger said:
 On Sun, Feb 24, 2013 at 10:42 AM, Shraddha Kamat sh200...@gmail.com wrote:
  what is the relation between atomic operations and memory alignment ?
 
  I read from UTLK that an unaligned memory access is not atomic
 
  please explain me , I am not able to get the relationship between
  memory alignment and atomicity of the operation.

 Not all CPUs support unaligned memory access, such an access may cause a fault
 which needs to be fixed by the kernel...

There's a more subtle issue - an unaligned access can be split across a cache
line boundary, requiring 2 separate memory accesses to do the read or write.
This can result in CPU A fetching the first half of the variable, CPU B
updating both halves, and then A fetching the second half of the now updated
variable.  This can bite you even on CPUs that support unaligned accesses.
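
The interleaving described above can be modelled without any real hardware.
This is a deterministic simulation, not a reproduction of an actual race: it
assumes a 64-bit variable read as two 32-bit halves, and all names are made
up for illustration:

```python
MASK32 = 0xFFFFFFFF


def torn_read(value_at_first_access, value_at_second_access):
    """Model a 64-bit load done as two 32-bit accesses: the low half is
    taken from memory as it was at the first access, the high half from
    memory as it was at the second."""
    low = value_at_first_access & MASK32
    high = (value_at_second_access >> 32) & MASK32
    return (high << 32) | low


old = 0x00000000FFFFFFFF
new = 0x0000000100000000   # old + 1, written by CPU B between the accesses

seen_by_a = torn_read(old, new)
print(hex(seen_by_a))
```

CPU A ends up with a value that is neither the old one nor the new one, which
is exactly what "not atomic" means here.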





Re: barrier()

2013-02-24 Thread Valdis . Kletnieks
On Mon, 25 Feb 2013 12:26:06 +0530, Shraddha Kamat said:
 #define barrier() asm volatile("" ::: "memory")

 What exactly is volatile("" ::: "memory") doing here?

You probably should read Documentation/memory-barriers.txt
in your kernel source tree, and let us know if you still have
questions after that...




Re: general_protection result to die

2013-02-26 Thread Valdis . Kletnieks
On Tue, 26 Feb 2013 06:23:34 +0800, horseriver said:
  does general_protection trap necessarily result to die ?

Think for a bit - what other actions can reasonably be taken?  You
hit a GPF, it's obvious that the variables you're working on have
been corrupted, so automatically continuing is probably a Really Bad
Idea.  If there's a debugger involved (gdb/kgdb), you can hand it to
the (presumed) person running the debugger and let *them* figure out
what to do, but that's about the only other realistic option.




Re: How to measure the RAM read/write performance

2013-02-26 Thread Valdis . Kletnieks
On Tue, 26 Feb 2013 22:35:35 +0700, Mulyadi Santosa said:

 let' see

 what if you do read and write pattern, in certain order so that it
 will be invalidated by the L1/L2/L3 cache everytime?

 AFAIK, one thing is for sure: reading data sequentially and then re-reading
 it will end up reading from the cache in the 2nd operation and so on.

 I think the most certain way to do it is to read data (or write) data
 bigger than total L1/L2/L3 cache.

Or you could just download a copy of memtest86+ and run that - I think that
provides some timing info in addition to actually testing your memory.




Re: Kernel freeze when writing e1000 driver

2013-02-26 Thread Valdis . Kletnieks
On Tue, 26 Feb 2013 11:19:18 -0500, Phani Vadrevu said:
 I am writing a network driver for the e1000 card. While doing the
 receive part, I saw that the kernel freezes whenever it reaches the
 netif_rx(skb) call. I was able to reproduce the same error when using
 a bare bones driver where I hard coded the skb data.

There's a known-working driver in the kernel source tree for this
device already.  Start by looking at what data it's placed in the
skb when it calls that routine, and how it differs from what you filled in.

For bonus points - lose the 'unsigned char t[]' array and replace
it with a bunch of explicit 'skb->foo = bar' statements.  In particular,
that assures that you haven't missed a 0x15 27 bytes into the array,
or failed to allow alignment padding bytes.




Re: How to measure the RAM read/write performance

2013-02-27 Thread Valdis . Kletnieks
On Wed, 27 Feb 2013 15:38:00 +0530, sandeep kumar said:

 In development phase of the board, we are trying to measure RAM performance
 gain while changing type of the RAM.
 The standard benchmark tools are giving us the Cache performance only. So
 we want to try some method to measure RAM performance.

The fact that you can't measure the effect of RAM speed because the L1/2/3
cache masks the effect should tell you something :)

If you are seeing a 98% hit rate or so, RAM speed will indeed not matter
much.  If you're seeing a poor cache hit ratio, you're most likely to get
better performance not by changing the RAM, but changing the application
to improve its cache usage.
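
A back-of-the-envelope sketch of why the hit rate dominates.  It assumes a
single cache level in front of RAM, and the latencies (4 ns cache, RAM going
from 100 ns to 80 ns) are hypothetical numbers chosen only for illustration:

```python
def avg_access_ns(hit_rate, cache_ns, ram_ns):
    """Average memory access time for a single cache level in front of RAM."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * ram_ns


def speedup_from_faster_ram(hit_rate, cache_ns, old_ram_ns, new_ram_ns):
    """Fractional reduction in average access time after the RAM swap."""
    before = avg_access_ns(hit_rate, cache_ns, old_ram_ns)
    after = avg_access_ns(hit_rate, cache_ns, new_ram_ns)
    return (before - after) / before


# Hypothetical latencies: 4 ns cache, RAM going from 100 ns to 80 ns.
for hit_rate in (0.98, 0.50):
    gain = speedup_from_faster_ram(hit_rate, 4.0, 100.0, 80.0)
    print(f"hit rate {hit_rate:.0%}: {gain:.1%} faster on average")
```

Under these assumptions the same RAM upgrade buys far less at a 98% hit rate
than at 50%, because only the misses ever see the RAM.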

And of course, if the application's design is one that is resistant to
improved cache hit ratios, it is important you measure RAM performance
*with that application running*, not a benchmark.  This is because if your
application is managing to thrash the cache, the resulting RAM access
patterns will be *highly* sensitive to actual program behavior, and any
corner cases in the hardware may or may not be hit by the benchmark the
same way the application does.




Re: [ARM_LINUX] ioremap() allowing to map system memory...

2013-03-01 Thread Valdis . Kletnieks
On Fri, 01 Mar 2013 16:48:12 +0530, sandeep kumar said:

 Don't you think it should throw panic() while calling the ioremap() itself?
 Because this sounds like a serious violation...

As you noted, it does give you a warning.

That's a kernel design philosophy - to reserve the panic() and BUG()
calls for cases where it is *known* that proceeding further is
unsafe or impossible.  So the kernel does a panic() if it can't start
/sbin/init at system boot-up - because without that, further progress
is impossible.  But once the system is up, we don't panic if PID 1 goes
away - because it's possible that the user has an open window, and can su
and at least do an orderly shutdown.

Similarly, if a device driver gets confused, the driver code may
do a BUG_ON() and end up locking up that device because to do anything
else may scramble the disk further.  But we don't panic() because that
will basically wedge the system - and the user loses any chance at dumping
the dmesg buffer for debugging or other attempts at an orderly shutdown (in
particular, panic() won't sync the filesystems).  So even though a BUG()
often kills a thread while it holds an important lock, which often leads
to the system eventually deadlocking one process at a time, it's still
a net win if it doesn't panic but lets the user at least try to run sync.

And even BUG_ON() is frowned upon if further progress in a degraded mode
is possible (for instance, a networking error that totally locks up one
TCP connection, but other connections are still working) - at that point,
WARN() is the correct thing to do.

As in this case - it *is* a serious violation, but the kernel (a) can
at least possibly keep going and (b) it's at least possible that the
user can recover from it.  There's a *very* good chance that if the
kernel just does a WARN(), the user will say "*facepalm* Stupid typo
in the address", fix the typo, and re-try with the correct address.

So that's the design philosophy of why it gives you a warning rather than
a panic.




Re: [filesystem] struct of m_inode

2013-03-01 Thread Valdis . Kletnieks
On Sat, 02 Mar 2013 10:28:26 +0800, lx said:

 if (block >= 7+512+512*512) because of the i_zone[9].
 But the question is why can i_zone[7] represent 512, and i_zone[8]
 represent 512*512?


Single, double, and triple indirect blocks...

http://en.wikipedia.org/wiki/Inode_pointer_structure
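
A sketch of the lookup implied by the 7+512+512*512 bound in the question.
It assumes 7 direct slots, one single-indirect slot (i_zone[7]) and one
double-indirect slot (i_zone[8]); the 512 would follow from, say, a 1 KiB
block holding 2-byte block numbers, which is an assumption here.  The
function name is made up for illustration:

```python
DIRECT = 7          # i_zone[0..6] point straight at data blocks
PER_BLOCK = 512     # block numbers held by one indirect block


def locate(block):
    """Say which i_zone slot reaches logical block `block` of a file,
    and the index path to follow from that slot."""
    if block < 0 or block >= DIRECT + PER_BLOCK + PER_BLOCK * PER_BLOCK:
        raise ValueError("block out of range")
    if block < DIRECT:
        return ("direct", block)
    block -= DIRECT
    if block < PER_BLOCK:
        return ("i_zone[7]", block)
    block -= PER_BLOCK
    return ("i_zone[8]", block // PER_BLOCK, block % PER_BLOCK)


print(locate(6))
print(locate(7))
print(locate(7 + 512))
print(locate(7 + 512 + 512 * 512 - 1))
```

i_zone[7] names one block full of 512 block numbers, so it reaches 512 data
blocks; i_zone[8] names a block of 512 pointers to such blocks, hence 512*512.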




Re: Why vmlinux.bin are changed from raw image to elf for x86 ?

2013-03-02 Thread Valdis . Kletnieks
On Sat, 02 Mar 2013 16:36:43 +0800, Jacky said:
 The -O binary is removed. And I don't find any changelog.

A quick course on researching kernel development history...

Step 1:  'git blame arch/x86/boot/compressed/Makefile'

That gives us the line:

099e1377 (Ian Campbell 2008-02-13 20:54:58 +0000 42) OBJCOPYFLAGS_vmlinux.bin := -R .comment -S

(Fortunately, this is the commit we wanted - figuring out how to get git to
trace through the history if a subsequent commit had touched this line is
left as an exercise for the reader :)

Step 2: 'git log 099e1377' gives us this:

commit 099e1377269a47ed30a00ee131001988e5bcaa9c
Author: Ian Campbell <i...@hellion.org.uk>
Date:   Wed Feb 13 20:54:58 2008 +0000

    x86: use ELF format in compressed images.

    Signed-off-by: Ian Campbell <i...@hellion.org.uk>
    Cc: Ian Campbell <i...@hellion.org.uk>
    Cc: Jeremy Fitzhardinge <jer...@goop.org>
    Cc: virtualizat...@lists.linux-foundation.org
    Cc: H. Peter Anvin <h...@zytor.com>
    Cc: Jeremy Fitzhardinge <jer...@goop.org>
    Cc: virtualizat...@lists.linux-foundation.org
    Signed-off-by: Ingo Molnar <mi...@elte.hu>
    Signed-off-by: Thomas Gleixner <t...@linutronix.de>

The one-liner summary matches exactly with what we're interested in,
so it's quite likely the commit we care about.

Step 3:  That's a pretty damned sparse Changelog. Fortunately, that's enough to
feed to Google, and in about 25 seconds, I find this message:

http://www.gossamer-threads.com/lists/linux/kernel/902407

 [PATCHv3 1/3] x86: use ELF format in compressed images.
This allows other boot loaders such as the Xen domain builder the
opportunity to extract the ELF file.


So there's the complete patch, including the things it touched
besides the Makefile, plus the reason for doing it.

Have a nice day.. ;)




Re: how to trace tcp protocol stack ?

2013-03-03 Thread Valdis . Kletnieks
On Sun, 03 Mar 2013 12:13:51 +0800, ishare said:

Is there a method to look up the call stack of tcp protocol solution?

ftrace and related functionality.

Note that there is a difference between "look up the call stack"
and "trace the flow of execution".  Consider the following code:

void a(void) { print("a"); }
void b(void) { print("b"); }
void c(void) { a(); b(); }
void d(void) { c(); b(); }

If you print the call stack in a(), you'll get a c d.

If you trace the flow, you get d c a b b  (plus some returns scattered
in between).

The difference is subtle, but often important.  If you're trying to
figure out how it works, you probably want to trace the flow.  If you're
trying to figure out how the code *got* to function foobar(), you're
looking at a stack trace.
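
To make the distinction concrete, here is a small userspace sketch of the
same four functions. It is only a toy: the enter()/leave() bookkeeping,
the buffers, and run_demo() are invented for this example and have nothing
to do with how ftrace or the kernel's stack dumper actually work.

```c
#include <string.h>

/* A fixed depth is plenty for this demo (the deepest chain is d -> c -> a). */
#define MAX_DEPTH 16

static const char *stack[MAX_DEPTH];  /* currently active functions */
static int depth;
static char trace_log[64];            /* every function entry, in order */
static char stack_in_a[64];           /* call stack as seen from inside a() */

static void enter(const char *name)
{
    stack[depth++] = name;
    if (trace_log[0])
        strcat(trace_log, " ");
    strcat(trace_log, name);
}

static void leave(void)
{
    depth--;
}

/* Record the active call chain, innermost function first. */
static void snapshot_stack(char *out)
{
    out[0] = '\0';
    for (int i = depth - 1; i >= 0; i--) {
        strcat(out, stack[i]);
        if (i > 0)
            strcat(out, " ");
    }
}

static void a(void) { enter("a"); snapshot_stack(stack_in_a); leave(); }
static void b(void) { enter("b"); leave(); }
static void c(void) { enter("c"); a(); b(); leave(); }
static void d(void) { enter("d"); c(); b(); leave(); }

/* Reset the bookkeeping and run the example once. */
static void run_demo(void)
{
    depth = 0;
    trace_log[0] = '\0';
    stack_in_a[0] = '\0';
    d();
}
```

After run_demo(), stack_in_a holds the stack view ("a c d") and trace_log
holds the flow view ("d c a b b"). The stack only knows who is still
active; the trace remembers everybody who ever ran.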

Also, being familiar with the RFCs that define TCP is helpful.  In
particular, the Linux TCP stack will make close to zero sense unless
you're familiar with the state machine defined in RFC793.




Re: Module compilation error

2013-03-04 Thread Valdis . Kletnieks
On Tue, 05 Mar 2013 09:07:51 +0700, Mulyadi Santosa said:
 On Tue, Mar 5, 2013 at 7:48 AM, Pietro Paolini pulsarpie...@aol.com wrote:

  echo >&2 "  Run 'make oldconfig && make prepare' on kernel src

 try the suggested above step. IIRC, those commands will do things like
 preparing the necessary object files, headers and so on, so it is
 ready for you to be used on your kernel programming.

In addition, note that 'make oldconfig' followed by 'make prepare' will only
do the right thing and result in a usable module if the source tree matches
your running kernel.  Doing 'make prepare' on a 3.7.2 source tree and then
building a module against it will result in a module that loads in a 3.7.2
kernel with the same .config - but with a different .config and/or release
you'll get anything from a module that simply won't load to one that blows up
the system for mysterious reasons.

It's *highly* recommended that you first learn how to build, install, and
boot a self-compiled kernel (and remember to keep your distro kernel
around), and then once you got that down, *then* start building external
modules against it.




Re: pthread_lock

2013-03-04 Thread Valdis . Kletnieks
On Tue, 05 Mar 2013 11:02:45 +0530, Mandeep Sandhu said:

 next schedule. I think the waiting threads (processes) will moved from
 the wait queue to the run queue from where they will be scheduled to
 run.

For bonus points, read source code and/or comments and figure out what
Linux does to prevent the 'thundering herd' problem (consider 100 threads
all waiting on the same mutex - if you blindly wake all 100 up, you'll schedule
them all, the first will find the mutex available and then re-take it, and
then the next 99 will get run only to find it contended and go back to
sleep).  So figure out what Linux does in that case. :)
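
To put numbers on the herd, here is a toy count of wakeups (a sketch only,
no real threads or locks involved). It compares blindly waking everybody on
each unlock against waking a single waiter per unlock. The wake-one policy
is just one plausible fix used here for comparison - finding out what the
kernel really does is still the exercise. Both function names are made up.

```c
/* Count how many times a thread is made runnable while `waiters`
 * threads take turns acquiring one mutex.  Toy model, not kernel code. */
static long wakeups_wake_all(int waiters)
{
    long wakeups = 0;

    while (waiters > 0) {
        wakeups += waiters;  /* every sleeper is woken on each unlock */
        waiters--;           /* one wins the mutex, the rest sleep again */
    }
    return wakeups;
}

static long wakeups_wake_one(int waiters)
{
    long wakeups = 0;

    while (waiters > 0) {
        wakeups += 1;        /* only one sleeper is woken per unlock */
        waiters--;
    }
    return wakeups;
}
```

For 100 waiters the wake-all version schedules threads 5050 times to get
100 useful acquisitions, while wake-one needs exactly 100.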




Re: Query on skb buffer

2013-03-06 Thread Valdis . Kletnieks
On Wed, 06 Mar 2013 10:39:13 -0800, Kumar amit mehta said:

 Now, if alloc_skb(4096, GFP_KERNEL) is the routine that gets called to 
 allocate
 the kernel buffer then, how does the kernel manages such prospective memory
 allocation failures and how kernel manages large packet requests from the
 application.

Did you actually look at the source for use of alloc_skb() and how it
handles error returns?

(Hint - the kernel doesn't do the same thing at every use of alloc_skb(),
because an allocation failure needs to be handled differently depending on
where it happens.  At some places, just bailing out and dropping the packet
on the floor without any notification to anybody is appropriate.  At other
places, we need to propagate an error condition to the caller).

Typical pattern (from net/core/sock.c):

/*
 * Allocate a skb from the socket's send buffer.
 */
struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
                             gfp_t priority)
{
        if (force || atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) {
                struct sk_buff *skb = alloc_skb(size, priority);
                if (skb) {
                        skb_set_owner_w(skb, sk);
                        return skb;
                }
        }
        return NULL;
}
EXPORT_SYMBOL(sock_wmalloc);

and then the caller does something like this (net/ipv4/ip_output.c,
in function __ip_append_data()):

        } else {
                skb = NULL;
                if (atomic_read(&sk->sk_wmem_alloc) <=
                    2 * sk->sk_sndbuf)
                        skb = sock_wmalloc(sk,
                                           alloclen + hh_len + 15, 1,
                                           sk->sk_allocation);
                if (unlikely(skb == NULL))
                        err = -ENOBUFS;
                else
                        /* only the initial fragment is
                           time stamped */
                        cork->tx_flags = 0;
        }
        if (skb == NULL)
                goto error;
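
The "same failure, different handling" idea can be sketched in plain
userspace C. Everything below is hypothetical: struct pool, pool_alloc(),
rx_packet() and tx_packet() are invented stand-ins, with malloc() playing
the part of alloc_skb(). Only the shape of the two error paths is the
point.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer allocator that can fail:
 * it refuses requests that would push `used` past `limit`. */
struct pool {
    size_t used;
    size_t limit;
};

static void *pool_alloc(struct pool *p, size_t size)
{
    void *buf;

    if (size > p->limit - p->used)
        return NULL;
    buf = malloc(size);
    if (buf)
        p->used += size;
    return buf;
}

/* Receive-style path: nobody is waiting for an answer, so on failure
 * the packet is dropped on the floor and only a counter notices. */
static void rx_packet(struct pool *p, size_t len, unsigned long *dropped)
{
    void *buf = pool_alloc(p, len);

    if (!buf) {
        (*dropped)++;
        return;
    }
    /* ... the packet would be processed here ... */
    p->used -= len;
    free(buf);
}

/* Send-style path: the caller asked for this, so the failure is
 * propagated back as an error code. */
static int tx_packet(struct pool *p, size_t len)
{
    void *buf = pool_alloc(p, len);

    if (!buf)
        return -ENOBUFS;
    /* ... the packet would be queued here ... */
    p->used -= len;
    free(buf);
    return 0;
}
```

With a 4096-byte pool, an 8192-byte rx_packet() just bumps the drop
counter, while the same size through tx_packet() hands -ENOBUFS back to
whoever called it.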






Re: Several unrelated beginner questions.

2013-03-06 Thread Valdis . Kletnieks
On Wed, 06 Mar 2013 18:19:09 -0500, Konstantin Kowalski said:

 1.) Currently, I am reading 2 books about Linux kernel: Linux Device
 Drivers (3rd edition) and Linux Kernel Development (3rd edition).

 I like both books and I am learning a lot from them.

 I heard that both of this books are outdated, but so far all the
 information in this books seems valid and applicable. Is there better
 books you would recommend?

They're both still mostly applicable.  The concepts listed are still
valid - certain things need to be locked at certain times, things have
lifetimes, and so on.  The outdated parts are mostly places where the API has
changed slightly - for instance, where api_foo(struct bar *a, struct baz *b)
is now api_quux(struct bar *a, struct baz *b, int blat).  So you can't
cut-n-paste the code and expect it to still work.

 2.) In Linux Device Drivers, it states that module_exit(function) is
 discarded if module is built directly into kernel or if kernel is
 compiled with option to disallow loadable modules. But what if the
 module still has to do something during shutdown? Releasing memory is
 unimportant since it does not persist over reboot, but what if the
 module has to write something to a disk file, or do some other action?

If your module has allocated 128M for a graphics buffer, you'll think
releasing memory is important. :)

Strictly speaking, a module *should* have already been quiesced and
taken care of business before module_exit() is called - there shouldn't
be much of anything left to do at that point.

(Hint - this is exactly the same question as "why is an empty ->release()
function considered a Bad Thing?" - it's because release() and similar are
supposed to do the clean-up before the module exits.)

 3.) What's the deal with different kernel versions? I heard back in the
 2.x days, even kernels were stable and odd versions were experimental,
 but with 2.6 it changed.

 So with 3.x kernels, are all of them experimental in the beginning and
 stable in the end? Also, with 3.x new versions seem to be released more
 often than in 2.1-2.5 days. Did the release cycle get smaller or is it
 just my imagination? Also, what does rc number mean?

The 3.x series is exactly the same policy as 2.6 was - Linus just decided
that 2.6.42 was too much and reset the counter, and he's been holding
to pretty close to every three months for releases for all that time.

And 2.1 got up to 2.1.142 or something insane like that in fewer years than it
took 2.6 to get to .42, so it isn't like releases are more frequent these days
:)

 4.) Currently, I am running linux-next, and it works great. Am I correct

Lucky you.  I manage to break at least 2-3 things in linux-next per release
cycle. ;)

 to assume that linux-next is supposed to have newest, shiniest and most
 unstable features? `uname -a` says that I am still running 3.8-next, but
 there is already 3.9 out. So which version is more experimental and
 least stable? Which one is the newest?

Do another pull of the linux-next tree, it will say you're on 3.9-rc1-next now.
And even when it said 3.8-next, that was already 3.8 plus all the patches
queued for 3.9.  Now that Linus's tree is at 3.9-rc1, (closing the merge
window for major additions for 3.9) people will be dumping 3.10 material into
the linux-next tree.

 5.) How exactly does make/.config work? When I run `make oldconfig`,
 does it use the everything from the previous .config and only ask how to
 configure new features?

Yes, that's what *should* happen.

  And when I run `make` does it re-use old object
 files if nothing was changed in the specific file, or does it re-compile
 everything from scratch?

Try it and see. :)  Note that sometimes, an apparently innocuous config change
can result in the rebuild of lots of files.  This is because some commonly used
.h file has a #ifdef CONFIG_FOO in it - and when you change FOO, then everybody
that includes that .h (even indirectly) ends up rebuilding.

But in general, if you touch only 1 or 2 .c files and no widely used .h files,
you'll just have to rebuild those .c's if they're modules.  If they're kernel
builtins, there's another 10 or 12 things that have to happen, but it's still
a lot faster than a full rebuild.




Re: zap_low_mappings

2013-03-06 Thread Valdis . Kletnieks
On Thu, 07 Mar 2013 10:33:18 +0800, ishare said:

   kernel halts because the page mapping has been modified by zap_low_mappings


   why we should do zap_low_mappings in init procedure ? this will disorder 
 the page mapping.

You might want to get yourself an up to date kernel, as the code you're
asking about was removed almost 2 1/2 years ago.

zap_low_mappings was removed in October 2010 by this commit:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/arch/x86/mm/init_32.c?id=b40827fa7268fda8a62490728a61c2856f33830b

x86-32, mm: Add an initial page table for core bootstrapping

This patch adds an initial page table with low mappings used exclusively for
booting APs/resuming after ACPI suspend/machine restart. After this, there's no
need to add low mappings to swapper_pg_dir and zap them later or create own
swsusp PGD page solely for ACPI sleep needs - we have initial_page_table for
that.

Signed-off-by: Borislav Petkov b...@alien8.de
LKML-Reference:20101020070526.ga9...@liondog.tnic
Signed-off-by: H. Peter Anvin h...@linux.intel.com






Re: zap_low_mappings

2013-03-06 Thread Valdis . Kletnieks
On Thu, 07 Mar 2013 11:43:43 +0800, ishare said:

   set_pgd(swapper_pg_dir+i, __pgd(0));

 If I have not define CONFIG_X86_PAE ,then the low mem will be invalided all .

And what makes you think that call invalidates *all* the page
mappings?




Re: Disabling interrupts and masking interrupts

2013-03-07 Thread Valdis . Kletnieks
On Thu, 07 Mar 2013 17:17:19 +0200, Kevin Wilson said:

 Does this mean that once you are disabling
 interrupts, these interrupts are lost ?  even later, when we will
 enable interrupts, the interrupts from the past that should have been
 created (but interrupts were disabled at that time interval) are in
 fact lost?

Level-triggered interrupts will go off once interrupts are re-enabled,
assuming that the device has kept the level set and not given up and timed
out.

Edge-triggered interrupts are gone.  That's part of why most hardware
doesn't use edge triggers - it's just too hard to guarantee proper device
driver operation.

Also, in common usage, "disabled interrupts" means that you're not listening
to *any* interrupts, while "masked" means we're not listening to *this*
interrupt source, even if we *are* accepting interrupts from other sources.

The difference is that sometimes the CPU is doing stuff that it would be
potentially screwed if *any* interrupt happened, so we disable them.  Other
times we're busy inside a device driver, and we're in a critical section
for that device - but it's safe for other devices to interrupt.  So to improve
latency we mask off just the one interrupt, not all of them.
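
Here is a toy model of that description, in case it helps to see it as
code. It is a sketch under loud assumptions: struct toy_pic and all the
function names are made up, the edge source has no latch at all, and real
interrupt controllers differ a lot (many of them do latch edges).

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy interrupt controller; every name here is invented for illustration. */
struct toy_pic {
    bool     cpu_enabled;    /* global "interrupts enabled" flag */
    uint32_t mask;           /* bit n set: source n is masked */
    uint32_t level_lines;    /* bit n set: device n holds its line asserted */
    unsigned delivered[32];  /* handler invocations per source */
};

static bool deliverable(const struct toy_pic *pic, int src)
{
    return pic->cpu_enabled && !(pic->mask & (UINT32_C(1) << src));
}

/* Edge-triggered source with no latch: seen only if deliverable right now. */
static void edge_pulse(struct toy_pic *pic, int src)
{
    if (deliverable(pic, src))
        pic->delivered[src]++;
}

/* Level-triggered source: the device keeps the line asserted. */
static void level_assert(struct toy_pic *pic, int src)
{
    pic->level_lines |= UINT32_C(1) << src;
}

/* Run the handler for every asserted line that is currently deliverable;
 * the handler is assumed to make the device drop its line. */
static void service_levels(struct toy_pic *pic)
{
    for (int src = 0; src < 32; src++) {
        uint32_t bit = UINT32_C(1) << src;

        if ((pic->level_lines & bit) && deliverable(pic, src)) {
            pic->delivered[src]++;
            pic->level_lines &= ~bit;
        }
    }
}
```

With cpu_enabled false, an edge_pulse() is simply lost, while a
level_assert() stays pending until cpu_enabled is set again and
service_levels() runs. Setting one bit in mask blocks only that source.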




Re: Disabling interrupts and masking interrupts

2013-03-07 Thread Valdis . Kletnieks
On Thu, 07 Mar 2013 09:28:58 -0800, Dave Hylands said:

 In my experience, edges triggered interrupts are always latched by the HW
 when they arrive. If another edge comes along between the initial edge and
 the time that the interrupt is cleared, then this second edge is lost. The
 fact that an interrupt is pending will still be retained though, and as
 soon as interrupts are enabled, then the interrupt handler will fire.

Actually, what you're describing there is hardware that converts edge
triggered to level triggered precisely because edge triggered stuff
sucks otherwise. ;)

  Also, in common usage, disabled interrupts means that you're not
  listening
  to *any* interrupts, while masked means we're not listening to *this*
  interrupt source, even if we *are* accepting interrupts from other
  sources.

 Normally disabling interrupts is just another form of masking, it just
 happens to mask all of the interrupts rather than one particular one. Even
 when you disable interrupts, you typically still have access to the
 unmasked interrupt state.

Yes, but it's still useful to distinguish between the two cases.  Also,
on many hardware architectures, the actual code to 'disable all' and 'disable
one' is very different (on x86, 'cli' disables all of them very fast, while
ignoring exactly one takes some more doing).




Re: Userspace interception of locally valid memory location

2013-03-10 Thread Valdis . Kletnieks
On Sun, 10 Mar 2013 19:27:52 +0530, harish badrinath said:
 Is it possible to intercept (both read and write) a locally valid
 address of a process and replace it with our own values (it is for a
 transparent distributed shared memory project).

Go look at how gdb traces variables.





Re: Userspace interception of locally valid memory location

2013-03-10 Thread Valdis . Kletnieks
On Sun, 10 Mar 2013 19:27:52 +0530, harish badrinath said:
 Hello,
 Is it possible to intercept (both read and write) a locally valid
 address of a process and replace it with our own values (it is for a
 transparent distributed shared memory project).

(Damn, hit send too soon)

Go look at how gdb traces variables.  Note that method pretty much only
works for writes to a variable, and has some performance implications.

Tracing reads is more difficult, and will probably end up being dependent
on exactly how good the hardware debugging support is - the S/390 architecture
has had the Program Event Recording feature since the 70s, and recent x86
chipsets have had similar features - details such as how many tracepoints
you can have active, how much memory each one can cover, and whether you can
intercept an event before it completes will be dependent on the arch and CPU -
what's true for an old Pentium 4 won't be true for an i7, and ARM is a whole
different beast.




Re: 64bit MMIO access

2013-03-10 Thread Valdis . Kletnieks
On Sun, 10 Mar 2013 20:35:37 +0100, Jagath Weerasinghe said:
 readq and writeq do the job.

Please double-check how those are implemented on your architecture. I seem
to remember that on some systems, readq and writeq may not be atomic and
may become two bus cycles.  And some hardware cares about that.

It was a discussion on linux-kernel maybe a year or so ago...
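
A userspace sketch of why that matters. Nothing here touches real MMIO:
struct toy_reg is a made-up stand-in for a 64-bit device register, and
read64_lo_hi() is a hypothetical helper that does the 64-bit read as two
32-bit accesses, low word first, with a hook to model the device changing
state between them.

```c
#include <stdint.h>

/* Simulated 64-bit device register that is really two 32-bit words.
 * Hypothetical; it stands in for something like a free-running counter. */
struct toy_reg {
    volatile uint32_t lo;
    volatile uint32_t hi;
};

static void toy_reg_set(struct toy_reg *r, uint64_t v)
{
    r->lo = (uint32_t)v;
    r->hi = (uint32_t)(v >> 32);
}

/* 64-bit read done as two 32-bit bus cycles, low word first.
 * `between` (may be NULL) runs after the first cycle to model the
 * device updating the register in the gap. */
static uint64_t read64_lo_hi(struct toy_reg *r,
                             void (*between)(struct toy_reg *))
{
    uint32_t lo = r->lo;
    uint32_t hi;

    if (between)
        between(r);
    hi = r->hi;
    return ((uint64_t)hi << 32) | lo;
}

/* Device-side update: the counter rolls over into the high word. */
static void tick_over(struct toy_reg *r)
{
    toy_reg_set(r, UINT64_C(0x100000000));
}
```

If the register holds 0xffffffff and tick_over() fires between the two
halves, the reader gets 0x1ffffffff - a value the device never held. That
is the sort of thing "some hardware cares about".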




Re: 64bit MMIO access

2013-03-10 Thread Valdis . Kletnieks
On Sun, 10 Mar 2013 20:35:37 +0100, Jagath Weerasinghe said:
 Hi,

 readq and writeq do the job.

(hit send too soon)  Also, the read/write [bwlq] functions refer to the width
of the *data*, not the address.





Re: User space memory

2013-03-12 Thread Valdis . Kletnieks
On Tue, 12 Mar 2013 18:38:05 +0530, Prabhu nath said:

 On Sun, Mar 10, 2013 at 11:30 PM, Christoph Seitz c.se...@tu-bs.de wrote:

  I use a char device for reading and writing from/to a pcie dma card.
  Especially the read function makes me some headache. The user allocates
  some memory with posix_memalign and call the read function on the
  device, so that the devices knows where to write to. My driver now uses
  get_user_pages() to pin the user pages. The memory has never been
  written or read by the user, so it's not yet in the RAM, right? And
  get_user_pages returns a valid number of pages, but for every page the
  same struct. (respectively the same pointer). Is there any way to ensure
  that the user pages are in the ram and get_user_pages returns a valid
  page array?
 

 If you know the RAM physical address range you can figure out by doing the
 following
 page_to_pfn(page_ptr) << 12;
 where page_ptr is a struct page * returned by get_user_pages().
 page_to_pfn() will return the pfn of the corresponding page frame and
 left shifting by 12 bits will give you page frame base address.

Unfortunately, that doesn't actually tell you what Christoph was
worried about - is the page *currently* in RAM?  For that, you need
to check some bits in the pfn once you find it.

Also, note the following:

It's not always 12, because not everything uses a 4K page - consider hugepage
support, or Power and Itanium where the pages are bigger and often several
different sizes are supported.  There's an API for the current page size. Use
it. :)
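
Inside the kernel that means PAGE_SHIFT and PAGE_SIZE rather than a literal
12 or 4096. As a userspace illustration of the same habit (a sketch only;
page_shift() and pfn_to_addr() are names made up for this example), the
shift can be derived from what sysconf() reports:

```c
#include <stdint.h>
#include <unistd.h>

/* Number of bits to shift a page frame number by, derived from the
 * page size the system reports instead of a hard-coded 12. */
static unsigned page_shift(void)
{
    long size = sysconf(_SC_PAGESIZE);
    unsigned shift = 0;

    while ((1L << shift) < size)
        shift++;
    return shift;
}

/* Base address of the page frame with the given frame number. */
static uint64_t pfn_to_addr(uint64_t pfn)
{
    return pfn << page_shift();
}
```

On a 4K-page x86 box page_shift() comes out as 12, but the same code keeps
working on a machine with 16K or 64K base pages.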

Also, there's an API for pinning pages so they *stay* in RAM so you can target
them for I/O.  Use that. ;)




Re: User space memory

2013-03-12 Thread Valdis . Kletnieks
On Tue, 12 Mar 2013 15:03:53 +0100, Christoph Seitz said:

 I found out, if I use the force flag with get_user_pages, the pages get
 faulted, but there has to be a nicer way than using the force flag.

Why does there have to be a nicer way?  Maybe you already got the nice way.

(Hint - why does the force flag even exist? :)




Re: block mailing list?

2013-03-15 Thread Valdis . Kletnieks
On Fri, 15 Mar 2013 06:22:14 -0700, Raymond Jennings said:
 Is there a kernel list dedicated to discussion of block devices?

What's to discuss?  There probably isn't enough ongoing traffic to
support a separate mailing list (we got too many of them as it is :)

MAINTAINERS says:

BLOCK LAYER
M:  Jens Axboe ax...@kernel.dk
T:  git git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
S:  Maintained
F:  block/

Which basically boils down to "ask linux-kernel and/or kernelnewbies if
you don't understand, and cc: Jens if you managed to break it" :)




Re: signals handling: kill() successful, but nothing delivered

2013-03-18 Thread Valdis . Kletnieks
On Mon, 18 Mar 2013 06:50:25 +0100, mic...@michaelblizek.twilightparadox.com said:
 Hi!

 On 06:52 Fri 08 Mar , mic...@michaelblizek.twilightparadox.com wrote:
 ./a.out `ps a|grep wget|grep -v grep

To save the double grep, you can do something like this:

ps a | grep '[w]get' | ...

Figuring out why that works is left as an exercise for the reader...
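
For readers who would rather poke at it than stare at it (spoiler ahead),
here is a small sketch using the POSIX regex functions, which compile the
same kind of basic regular expression grep uses by default.
line_matches() is a made-up helper, and the sample ps lines in the note
below are invented.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if `pattern` (a POSIX basic regular expression)
 * matches anywhere in `line`. */
static bool line_matches(const char *pattern, const char *line)
{
    regex_t re;
    bool hit;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return false;
    hit = (regexec(&re, line, 0, NULL, 0) == 0);
    regfree(&re);
    return hit;
}
```

The pattern [w]get matches the text "wget", so it hits a line such as
"1234 pts/0 S 0:00 wget http://example.org/". The grep process's own
command line contains the literal characters "[w]get", which do not
include the string "wget", so that line no longer matches itself.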




Re: programme header

2013-03-18 Thread Valdis . Kletnieks
On Tue, 19 Mar 2013 09:38:49 +0800, ishare said:

   I am linking my kernel by a link script. its contens is as below:

  I think it will work ,but ld report that No enough room for programme 
 header,what is the reason?
  what should I do ?

The first thing you do is ask yourself why you're using a link script
of your own, when most architectures come with a working link script already.

The second thing you do is *read the script* - there's a big comment in there:

  /* This linker script is used both with -r and with -shared.
 For the layouts to match, we need to skip more than enough
 space for the dynamic symbol table et al.  If this amount
 is insufficient, ld -shared will barf.  Just increase it here.  */

Hope that helps.




Re: programme header

2013-03-18 Thread Valdis . Kletnieks
On Tue, 19 Mar 2013 12:44:36 +0800, ishare said:

   because I need to generate a .so for sysenter used

And that solves what problem for you, exactly?

Consider that most architectures that use sysenter manage to do so
without having to worry about a .so for it (or if they really do need
one, they already create a .so).  So what problem are you trying to
solve by using a .so?




Re: Memory allocations in linux for processes

2013-03-19 Thread Valdis . Kletnieks
On Tue, 19 Mar 2013 20:41:55 +0530, Niroj Pokhrel said:

 #include <stdio.h>
 int main()
 {
     while(1)
     {
     }
     return 0;
 }

 I don't understand where does mmap or malloc come in to play in this code.

Unless you linked it statically, a lot of stuff happens before you ever
get to main() - namely, any shared library linking and mapping.  Run
strace on your binary and see how many system calls happen before you hit
the infinite loop.




Re: kernel build error

2013-03-20 Thread Valdis . Kletnieks
On Wed, 20 Mar 2013 00:07:57 -0700, Kumar amit mehta said:

 I forgot that 'uname -m' will return me the kernel version and _not_ the CPU
 architecture. The CPU on my machine seem to be 64 bit (/proc/cpuinfo|grep 
 flags
 shows 'lm'). So my understanding is that I've a 32 bit kernel running on a 64
 bit machine.

Or more correctly, you have a kernel actually running in 32-bit mode on
a machine that is 64-bit capable.




Re: BFQ: simple elevator

2013-03-20 Thread Valdis . Kletnieks
On Thu, 21 Mar 2013 02:24:23 +0700, Mulyadi Santosa said:

 pardon me for any possible sillyness, but what happen if there are
 incoming I/O operation at very nearby sectors (or perhaps at the same
 sector?)? I suppose, the elevator will prioritize them first over the
 rest? (i.e starving will happen...)

And this, my friends, is why elevators aren't as easy to do as the average
undergrad might hope - it's a lot harder to balance fairness and throughput
across all the corner cases than you might think.  It gets really fun
when you have (for example) a 'find' command moving the heads all over
the disk while another process is trying to do large amounts of streaming
I/O.  And then you'll get some idiot process that insists on doing the
occasional fsync() or syncfs() call.  Yes, it's almost always *all*
corner cases, it's very rare (unless you're an embedded system like a Tivo)
that all your I/O is one flavor that is easily handled by a simple elevator.






Re: BFQ: simple elevator

2013-03-20 Thread Valdis . Kletnieks
On Wed, 20 Mar 2013 14:41:31 -0700, Raymond Jennings said:

 Suppose you have requests at sectors 1, 4, 5, and 6

 You dispatch sectors 1, 4, and 5, leaving the head parked at 5 and the
 direction as ascending.

 But suddenly, just before you get a chance to dispatch for sector 6,
 sector 4 gets busy again.

 I'm not proposing going back to sector 4.  It's behind us and (as you
 indicated) we could starve sector 6 indefinitely.

 So instead, because sector 4 is on the wrong side of our present head
 position, it is ignored and we keep marching forward, and then we hit
 sector 6 and dispatch it.

 Once we hit sector 6 and dispatch it, we do a u-turn and start
 descending.  That's when we pick up sector 4 again.

The problem is that not all seeks are created equal.

Consider the requests are at 1, 4, 5, and 199343245.  If as we're servicing
5, another request for 4 comes in, we may well be *much* better off doing a
short seek to 4 and then one long seek to the boonies, rather than 2 long
seeks.

My laptop has a 160G Western Digital drive in it (WD1600BJKT).  The minimum
track-to-track seek time is 2ms, the average time is 12ms, and the maximum is
probably on the order of 36ms. So by replacing 2 max-length seeks with a
track-to-track seek and 1 max-length, you can almost halve the delay waiting
for seeks (38ms versus 72ms). (And even better if the target block is logically
before the current one, but still on the same track, so you only take a
rotational latency hit and no seek hit.)

(The maximum is not given in the spec sheets, but is almost always 3 times the
average.  For a discussion of the math behind that, and a lot of other issues,
see the following.)

http://pages.cs.wisc.edu/~remzi/OSFEP/file-disks.pdf
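
The 38ms-versus-72ms arithmetic can be written down as a tiny cost model.
This is a sketch with crude assumptions: seek_ms() knows only two costs
(2ms for a short hop, 36ms for a long one, the figures quoted above), and
the 1000-sector threshold between "short" and "long" is arbitrary. Real
drives have a smooth seek curve, plus rotational latency on top.

```c
#include <stdlib.h>

/* Crude two-step seek model: 0 ms if the head is already there,
 * 2 ms for a short hop, 36 ms for a long one.  The 1000-sector
 * cut-off is an arbitrary assumption. */
static int seek_ms(long from, long to)
{
    long dist = labs(to - from);

    if (dist == 0)
        return 0;
    return dist < 1000 ? 2 : 36;
}

/* Total seek time to visit order[0..n-1], starting with the head at `head`. */
static int total_seek_ms(long head, const long *order, int n)
{
    int total = 0;

    for (int i = 0; i < n; i++) {
        total += seek_ms(head, order[i]);
        head = order[i];
    }
    return total;
}
```

Starting with the head at 5, visiting {199343245, 4} (the strict one-way
sweep) costs 36 + 36 = 72ms, while visiting {4, 199343245} costs
2 + 36 = 38ms.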

And of course, this interacts in very mysterious ways with the firmware
on the drive, which can do its own re-ordering of I/O requests and/or
manage the use of the disk's onboard read/write cache - this is why
command queueing is useful for throughput, because if the disk has the
option of re-ordering 32 requests, it can do more than if it only has 1 or
2 requests in the queue.  Of course, very deep command queues have their
own issues - most notably that at some point you need to use barriers or
something to ensure that the metadata writes aren't being re-ordered into
a pattern that could cause corruption if the disk lost its mind before
completing all the writes...

 In my case I'm just concerned with raw total system throughput.

See the above discussion.




Re: Linux elevators (Re: BFQ: simple elevator)

2013-03-20 Thread Valdis . Kletnieks
On Wed, 20 Mar 2013 16:05:09 -0700, Arlie Stephens said:
 The ongoing thread reminds me of a simple question I've had since I
 first read about linux' mutiple I/O schedulers. Why is the choice of
 I/O scheduler global to the whole kernel, rather than per-device or
 similar?

They aren't global to the kernel.

On my laptop:

# find /sys/devices/pci* -name 'scheduler' | xargs grep .
/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/scheduler:noop deadline [cfq]
/sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sr0/queue/scheduler:noop deadline [cfq]
# echo noop > /sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sr0/queue/schedule
# find /sys/devices/pci* -name 'scheduler' | xargs grep .
/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/scheduler:noop deadline [cfq]
/sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sr0/queue/scheduler:[noop] deadline cfq

I just changed the scheduler for the CD-ROM.






Re: BFQ: simple elevator

2013-03-21 Thread Valdis . Kletnieks
On Wed, 20 Mar 2013 16:37:41 -0700, Raymond Jennings said:

 Hmm...Maybe a hybrid approach that allows a finite number of reverse
 seeks, or as I suspect deadline does a finite delay before abandoning
 the close stuff to march to the boonies.

Maybe. Maybe not.  It's going to depend on the workload - look how many times
we've had to tweak something as obvious as cache writeback to get it to behave
for corner cases.  You'll think you got the algorithm right, and then the next
guy to test-drive it will do something only 5% different and ends up cratering
the disk. :)

Now of course, the flip side of "a disk's average seek time is between 5ms and
12ms depending how much you paid for it" is that there's no spinning disk on
the planet that can do much more than 200 seeks per second (oh, and before you
knee-jerk and say "SSD to the rescue", that's got its own issues). Right now,
you should be thinking "so *that* is why xfs and ext4 do extents - so we can
keep file I/O as sequential as possible with as few seeks as possible". Other
things you start doing if you want *real* throughput: you start looking at
striped and parallel filesystems, self-defragmenting filesystems,
multipath-capable disk controllers, and other stuff like that to spread the I/O
across lots of disks fronted by lots of servers. "Lots" as in hundreds.   As in
imagine 2 racks, each with 10 4U shelves with 60 drives per shelf, with some
beefy DDN or NetApp E-series heads in front, talking to a dozen or so servers
in front of it with multiple 10GE and Infiniband links to client machines.
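
The "200 seeks per second" ceiling is just the reciprocal of the seek time.
A back-of-the-envelope sketch (function names are made up, and it ignores
rotational latency and transfer time, so real numbers come out lower):

```c
/* Upper bound on random requests per second: one request per average
 * seek.  Ignores rotational latency and transfer time. */
static double max_seeks_per_sec(double avg_seek_ms)
{
    return 1000.0 / avg_seek_ms;
}

/* Throughput in KiB/s if every request of `request_kib` needs a seek. */
static double random_io_kib_per_sec(double avg_seek_ms, double request_kib)
{
    return max_seeks_per_sec(avg_seek_ms) * request_kib;
}
```

A 5ms drive tops out at 200 seeks per second, which for 4 KiB random
requests is only 800 KiB/s, and a 12ms drive manages about 83 seeks per
second. That gap between random and sequential I/O is why keeping files
in extents pays off.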

In other words, if you're *serious* about throughput, you're gonna need a
lot more than just a better elevator.

(For the record, a big chunk of my day job is maintaining several
petabytes of storage for HPC users, where moving data at 3 gigabytes/second
is considered sluggish...)





Re: optimization in kernel compile

2013-03-22 Thread Valdis . Kletnieks
On Fri, 22 Mar 2013 13:41:25 +0800, ishare said:
   Is it  needed or must to compile fs and driver with -O2 option when 
 compiling kernel ?

It's not strictly mandatory to use -O2 (for a while, -Os was the default). There
are a few places where, for correctness, you *cannot* use -O0. For instance, a
few places use __builtin_return_address() inside an inline (-O0
won't inline, so __builtin_return_address() ends up returning a pointer into
the function that called the inline when we want that function's parent).

Since gdb and friends are able to deal with -O2 compiled code just fine,
there's really no reason *not* to optimize the kernel.




Re: optimization in kernel compile

2013-03-22 Thread Valdis . Kletnieks
On Fri, 22 Mar 2013 22:32:40 +0800, ishare said:

  are a few places that for correctness, you *cannot* use -O0. For instance,
a
  few places where we use builtin_return_address() inside an inline (-O0
  won't inline so builtin_return_address() ends up returning a pointer to
  a function when we want the function's parent).

   So it will cause an error ?

Yes, there are places where failing to optimize causes errors.

Consider this code:

static inline void *foo(void) { return __builtin_return_address(0); }

void *bar(void) { void *x = foo(); return x; }

If you don't optimize, x ends up with a pointer into bar.  If it
gets inlined because you're optimizing, x ends up pointing to bar's caller.
This breaks stuff like the function tracer.
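
If you want to see the difference without fiddling with -O levels, here is a small userspace sketch that forces both behaviours with GCC attributes. The names ra_inlined, ra_called and probe are invented for the demo; always_inline and noinline stand in for "optimized" and "not optimized":

```c
#include <stddef.h>

/* Forced inline: the builtin is evaluated in the frame of whoever inlines it. */
static inline __attribute__((always_inline)) void *ra_inlined(void)
{
	return __builtin_return_address(0);
}

/* Forced out of line: the builtin sees this function's own frame. */
static __attribute__((noinline)) void *ra_called(void)
{
	return __builtin_return_address(0);
}

struct ra_pair {
	void *inlined;
	void *called;
};

__attribute__((noinline)) struct ra_pair probe(void)
{
	struct ra_pair p;

	p.inlined = ra_inlined();	/* an address in probe()'s caller */
	p.called = ra_called();		/* an address inside probe() itself */
	return p;
}
```

The two pointers land in different functions, which is exactly the mismatch that confuses anything relying on the return address.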

  Since gdb and friends are able to deal with -O2 compiled code just fine,
  there's really no reason *not* to optimize the kernel.

   the debug information will be stripped  by  -O2 ,for example ,you can not 
 touch

No debug information is stripped by -O2.  Debug information isn't emitted if
you don't compile with -g.  At one time, long ago (quite possibly literally
before you were born for some of the younger readers on the list), gcc was
unable to generate -g output if the optimizer was invoked.  But that was
last century (gcc 2.95 era).

   the value of  some varibles at stack , and debugging will not run line by 
 line,
   instead , the source jump in unexpectable order .

I'm probably going to piss a bunch of people off by saying this, but:

If your C skills aren't up to debugging code that's been compiled with
-O2, maybe you shouldn't be poking around inside the kernel.




Re: optimization in kernel compile

2013-03-22 Thread Valdis . Kletnieks
On Fri, 22 Mar 2013 10:52:56 -0400, valdis.kletni...@vt.edu said:

 No debug information is stripped by -O2.  Debug information isn't emitted if
 you don't compile with -g.  At one time, long ago (quite possibly literally
 before you were born for some of the younger readers on the list), gcc was
 unable to generate -g output if the optimizer was invoked.  But that was
 last century (gcc 2.95 era).

GCC 4.8 was officially released today (since I sent the previous note).

From the release notes:

A new general optimization level, -Og, has been introduced. It addresses the
need for fast compilation and a superior debugging experience while providing a
reasonable level of runtime performance. Overall experience for development
should be better than the default optimization level -O0.

The current Linus tree does build with 4.8.  I do *not* know if earlier
releases build correctly (or how far back), nor if -Og is sufficient
optimization to allow correct kernel functioning.  But it's something to
look at.




Re: BFQ: simple elevator

2013-03-22 Thread Valdis . Kletnieks
On Fri, 22 Mar 2013 13:53:45 -0700, Raymond Jennings said:

 The first heap would be synchronous requests such as reads and syncs
 that someone in userspace is blocking on.

 The second is background I/O like writeback and readahead.

 The same distinction that CFQ completely makes.

Again, this may or may not be a win, depending on the exact workload.

If you are about to block on a userspace read, it may make sense to go ahead
and tack a readahead on the request for free - at 100MB/sec transfer and 10ms
seeks, reading 1M costs the same as a seek.  If you read 2M ahead and save 3
seeks later, you're winning.  Of course, the *real* problem here is that how
much readahead to actually do needs help from the VFS and filesystem levels -
if there's only 600K more data before the end of the current file extent, doing
more than 600K of read-ahead is a loss.
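
The break-even sum is simple enough to sketch. The helper names here are hypothetical, and the model assumes a fixed transfer rate and a fixed seek cost, which real disks only approximate:

```c
/* Milliseconds needed to transfer `bytes` at `mb_per_sec` (1 MB = 1e6 bytes). */
double transfer_ms(double bytes, double mb_per_sec)
{
	return bytes * 1000.0 / (mb_per_sec * 1e6);
}

/*
 * Net time saved (in ms) by a readahead of ra_bytes that avoids
 * seeks_saved later seeks.  Negative means the readahead was a loss.
 */
double readahead_gain_ms(double ra_bytes, double mb_per_sec,
			 double seek_ms, int seeks_saved)
{
	return seeks_saved * seek_ms - transfer_ms(ra_bytes, mb_per_sec);
}
```

At 100MB/sec and 10ms seeks, 1M of transfer costs 10ms, and a 2M readahead that saves 3 seeks comes out about 10ms ahead. Reading 2M that saves no seeks at all is 20ms thrown away.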

Meanwhile, over on the write side of the fence, unless a program is
specifically using O_DIRECT, userspace writes will get dropped into the cache
and become writeback requests later on.  So the vast majority of writes will
usually be writebacks rather than synchronous writes.

So in many cases, it's unclear how much performance CFQ gets from making
the distinction (and I'm positive that given a sufficient supply of pizza and
caffeine, I could cook up a realistic scenario where CFQ's behavior makes
things worse)...

Did I mention this stuff is tricky? :)





Re: initramfs.cpio

2013-03-24 Thread Valdis . Kletnieks
On Sun, 24 Mar 2013 16:27:27 +0800, ishare said:

  Hi :
 I find that my initramfs_data.cpio generated by gcc does not contain init
 files ,which should be
 executed by terminal initialization.

 My initramfs_data.cpio only contains these :  /dev  /dev/consol  /root .

 where to search the init file?

You want to use mkinitramfs or mkinitrd or dracut or whatever your distro calls
it to create one.  gcc has no *clue* what needs to go in the initramfs (for
example, if your root file system is on a LVM partition on a LUKS-encrypted
disk, the initramfs has to do a 'cryptsetup luksOpen' and then an 'lvm vgchange -ay'
before the mount will succeed).  Plus any modprobes that may or may not be
needed, etc etc etc.





Re: cap on writeback?

2013-03-25 Thread Valdis . Kletnieks
On Mon, 25 Mar 2013 16:33:48 -0700, Raymond Jennings said:
 Just curious, is there a cap on how much data can be in writeback at
 the same time?

 I'm asking because I have over a gigabyte of data in dirty, but
 during flush, only about 60k or so is in writeback at any one time.

Only a gigabyte? :)  (I've got a box across the hall that has 2.6T of RAM,
and yes, it's pretty sad when it decides it's time for writeback across
an NFS or GPFS mount, even though it's a 10GE connection.)

For the record, writeback is one of those things that's really hard to
get right, because there's always corner cases.  Probably why we seem to
end up screwing around with it every 2-3 releases. :)




Re: cap on writeback?

2013-03-25 Thread Valdis . Kletnieks
On Mon, 25 Mar 2013 17:23:40 -0700, Raymond Jennings said:

 Is there some sort of mechanism that throttles the size of the writeback pool?

There's a lot of tunables in /proc/sys/vm - everything from drop_caches
to swappiness to vfs_cache_pressure.  Note that they all interact in mystical
and hard-to-understand ways. ;)

 it's somewhat related to my brainfuck queue, since I would like to
 stress test it digesting a huge pile of outbound data and seeing if it
 can make writeback less seeky.

The biggest challenge here is that there's a bit of a layering violation
to be resolved - when the VM is choosing what pages get written out first,
it really has no clue where on disk the pages are going. Consider a 16M
file that's fragged into 16 1M extents - they'll almost certainly hit
the writeback queue in logical block order, not physical address order.
The only really good choices here are either to allow the writeback queue
to get deep enough that an elevator can do something useful (if you only
have 2-3 IOs queued, you can do less than if you have 20-30 of them you
can sort into some useful order), or to use filesystems that are less prone
to fragmentation issues.
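
As a toy illustration of why queue depth matters, here is a sketch that measures head travel for a batch of requests before and after sorting by physical sector. The struct and function names are invented, and a real elevator does far more than a single qsort():

```c
#include <stdlib.h>

struct io_req {
	unsigned long long sector;	/* physical location on disk */
};

static int by_sector(const void *a, const void *b)
{
	const struct io_req *x = a, *y = b;

	return (x->sector > y->sector) - (x->sector < y->sector);
}

/* Total head travel (in sectors) when servicing reqs[] in array order. */
unsigned long long head_travel(const struct io_req *reqs, size_t n,
			       unsigned long long start)
{
	unsigned long long pos = start, total = 0;

	for (size_t i = 0; i < n; i++) {
		total += reqs[i].sector > pos ? reqs[i].sector - pos
					      : pos - reqs[i].sector;
		pos = reqs[i].sector;
	}
	return total;
}

/* One-way elevator pass: order the queued requests by physical sector. */
void elevator_sort(struct io_req *reqs, size_t n)
{
	qsort(reqs, n, sizeof(*reqs), by_sector);
}
```

With only two or three requests in the array there is very little for the sort to improve. With dozens, the savings in head travel start to add up.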

Just for the record, most of my high-performance stuff runs best with
the noop scheduler - when you're striping I/O across several hundred disks,
the last thing you want is some single-minded disk scheduler re-arranging
the I/Os and creating latency issues for your striping.

Might want to think about why there's lots of man-hours spent doing
new filesystems and stuff like zcache and kernel shared memory, but the
only IO schedulers in tree are noop, deadline, and cfq :)




Re: initramfs_list

2013-03-27 Thread Valdis . Kletnieks
On Wed, 27 Mar 2013 20:38:54 +0800, ishare said:

  I am do some test  on kernel 2.6.0 and encountering an problem about 
 initramfs .

  I find my initramfs generated without a initramfs_list file ,which describes
  the list of files that will be created into the initramfs file . such as 
 /sbin/init /etc ...

What the kernel tree creates by default is a very small stub, basically only
what's needed to make sure that *some* sort of initramfs gets created so the
kernel doesn't panic on a stray pointer trying to access an uninitialized file
system.

To create something that will actually boot your system, please see
'man mkinitramfs' or 'man mkinitrd' or 'man dracut' or similar for
whatever tool your actual distro uses to build a functional initramfs.




Re: Creating mkfs for my custom filesystem

2013-03-29 Thread Valdis . Kletnieks
On Fri, 29 Mar 2013 15:44:49 +0530, Sankar P said:

 I have decided on a simple layout for my filesystem where the first
 block will be the super block and   will contain the version
 information etc. The second block will contain the list of inodes.
 Third block onwards will be data blocks. Each file can grow only up to
 a single block size. Thrid block will represent the first file, fourth
 block for the second file and so on. Directories will not be
 supported.

You *will* have to support at least the top-level directory, because
you'll need at least directory entries for '.' and '..'.  If your second
block is list of inodes, then you have directories (and adding subdir
support isn't *that* hard).

You'll also want either a list or bitmap structure or some other way to
determine if a given block is allocated - trying to write a new file without
having a freelist to get blocks from is hard.  Oh, and don't forget to
add locking around the freelist operations and similar things - having
two processes both grab block 27 for the file they just created can suck :)
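
A rough userspace sketch of the bitmap-plus-lock idea follows. Everything here (NBLOCKS, block_map, alloc_block, free_block) is a made-up name, the bitmap lives only in memory, and a pthread mutex stands in for whatever lock the in-kernel filesystem code would actually use:

```c
#include <limits.h>
#include <pthread.h>

#define NBLOCKS 64	/* hypothetical filesystem size in blocks */

static unsigned char block_map[NBLOCKS / CHAR_BIT];	/* 1 bit per block */
static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns a newly allocated block number, or -1 if no block is free. */
int alloc_block(void)
{
	int found = -1;

	pthread_mutex_lock(&map_lock);
	for (int i = 0; i < NBLOCKS; i++) {
		unsigned char bit = (unsigned char)(1u << (i % CHAR_BIT));

		if (!(block_map[i / CHAR_BIT] & bit)) {
			block_map[i / CHAR_BIT] |= bit;	/* test and set under the lock */
			found = i;
			break;
		}
	}
	pthread_mutex_unlock(&map_lock);
	return found;
}

/* Marks a block as free again. */
void free_block(int blk)
{
	unsigned char bit = (unsigned char)(1u << (blk % CHAR_BIT));

	pthread_mutex_lock(&map_lock);
	block_map[blk / CHAR_BIT] &= (unsigned char)~bit;
	pthread_mutex_unlock(&map_lock);
}
```

Because the test and the set happen under the same lock, two callers can never both walk away with the same block number.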

 oh okay. But how do I create the superblock ? What are the APIs
 available to do these block level operations from a user space
 application (my mkfs program ) ?

struct foobar_super {
	int version;
	int num_files;
	int free_blocks;
	char padding[512-3*sizeof(int)];
};

struct foobar_super sb;

int disk;

memset(&sb, 0, sizeof(struct foobar_super));
sb.version = 1;
sb.num_files = 0;
sb.free_blocks = 999; /* should probably set to actual size of partition/file */

disk = open(diskorfilename, O_RDWR);  /* testing on loop mounts is useful */
lseek(disk, 0, SEEK_SET);
write(disk, &sb, sizeof(sb)); /* congrats, you just wrote a superblock */

Yes, it's that simple :)  If you want to write some empty inodes, add a
'struct inode' variable, initialize it, lseek to where the inode goes and
write it out.

Just open, lseek, write, close.  ;)  And yes, those operations *do* work
just fine both on files you then use with 'mount -o loop' and on /dev/sd*
or /dev/mapper/* LVM devices.
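
Pulling that together, a minimal mkfs might look something like the sketch below. The 512-byte block size, the foobar_inode layout and the function name foobar_mkfs are all assumptions made up for the example - your on-disk format is whatever you decide it is:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define FOOBAR_BLOCK 512	/* hypothetical block size */

struct foobar_super {
	int version;
	int num_files;
	int free_blocks;
	char padding[FOOBAR_BLOCK - 3 * sizeof(int)];
};

/* Hypothetical on-disk inode, 16 bytes, so 32 of them fill one block. */
struct foobar_inode {
	int in_use;
	int size;
	char name[8];
};

/* Writes block 0 (superblock) and block 1 (empty inode table); 0 on success. */
int foobar_mkfs(const char *path, int total_blocks)
{
	struct foobar_super sb;
	struct foobar_inode table[FOOBAR_BLOCK / sizeof(struct foobar_inode)];
	int rc = -1;
	int disk = open(path, O_WRONLY | O_CREAT, 0600);

	if (disk < 0)
		return -1;

	memset(&sb, 0, sizeof(sb));
	sb.version = 1;
	sb.free_blocks = total_blocks - 2;	/* minus superblock and inode block */
	memset(table, 0, sizeof(table));	/* every inode starts out unused */

	if (lseek(disk, 0, SEEK_SET) == 0 &&
	    write(disk, &sb, sizeof(sb)) == (ssize_t)sizeof(sb) &&
	    lseek(disk, FOOBAR_BLOCK, SEEK_SET) == FOOBAR_BLOCK &&
	    write(disk, table, sizeof(table)) == (ssize_t)sizeof(table))
		rc = 0;

	close(disk);
	return rc;
}
```

Point it at a scratch file first (something like foobar_mkfs("test.img", 100)) and hexdump the result before you let it anywhere near a real partition.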

You might want to look at the source for mkfs.vfat (part of dosfstools package)
for additional details.




Re: Online migration of arbitrary filesystems, possible?

2013-03-29 Thread Valdis . Kletnieks
On Fri, 29 Mar 2013 17:09:14 -0300, Daniel Hilst said:

 The idea is, mount both filesystems together, and make write/read
 operations go on this way
 Read operations:
  1. See if data is already on dest fs,
  2. If is then read data and bright back to caller (lets call this
 cold read)
  3. If is not, then read file from source fs, put it on page cache,
 and change the backstorage of that page..
  3.1 So when this page get dirty or too old, it will be writed to

Any reason you can't just 'rsync /source-fs /dest-fs'?





Re: why not choose another way to define the _IOC_xxxMASK related to the ioctl

2013-03-30 Thread Valdis . Kletnieks
On Sat, 30 Mar 2013 18:01:32 +0800, RS said:

 Now I think this will spend more time than the kernel code when executed.

Have you actually examined the generated code on several popular
architectures to see what gcc actually does?

(hint - many things can be constant-folded at compile time.  So if
the 3 values are #defined to constants, the expression

((_IOC_NRSHIFT << _IOC_NRBITS) - _IOC_NRSHIFT)

will generate no actual shift or subtract instructions, merely another
constant.)
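
One way to convince yourself without reading assembler is to use the mask somewhere the language demands a compile-time constant. This sketch copies the 8-bit/shift-0 layout of the generic ioctl header, but the DEMO_ names are invented so nothing clashes with the real ones:

```c
/* Values mirror the generic Linux ioctl "nr" field; DEMO_ names are ours. */
#define DEMO_IOC_NRBITS   8
#define DEMO_IOC_NRSHIFT  0
#define DEMO_IOC_NRMASK   ((1 << DEMO_IOC_NRBITS) - 1)

/* Only an integer constant expression is legal here, so it was folded. */
_Static_assert(DEMO_IOC_NRMASK == 0xff, "mask is a compile-time constant");

/* Same story: enumerators must be constant expressions. */
enum { DEMO_MASK_AS_ENUM = DEMO_IOC_NRMASK << DEMO_IOC_NRSHIFT };

/* At run time only the shift/and on `cmd` itself remains. */
unsigned int demo_ioc_nr(unsigned int cmd)
{
	return (cmd >> DEMO_IOC_NRSHIFT) & DEMO_IOC_NRMASK;
}
```

If the compiler had to compute the mask at run time, neither the _Static_assert nor the enum would compile.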




Re: what does it use two !!

2013-04-01 Thread Valdis . Kletnieks
On Mon, 01 Apr 2013 15:10:46 +0800, Ben Wu said:

 1 I found some placeuse two !!, what's means 
    if(button->gpio != INVALID_GPIO)
        state = !!((gpio_get_value(button->gpio) ? 1 : 0) ^ button->active_low);
    else

Gaah. That line of code fell out of the ugly tree and hit every branch
on the way down.

Use of !!  *and*  ? 1 :0  in the same line of code to do the same thing.
Ouch.
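
For anybody who hasn't met the idiom: !! just squashes any non-zero value down to 1. A quick sketch, with made-up function names, and assuming active_low is only ever used as a 0/1 flag:

```c
/* !!x maps any non-zero value to 1 and zero to 0. */
int to_bool(int x)
{
	return !!x;
}

/*
 * Hypothetical tidier version of the quoted line: normalise each input
 * once, then XOR.  Assumes active_low is meant as a yes/no flag.
 */
int button_state(int gpio_value, int active_low)
{
	return !!gpio_value ^ !!active_low;
}
```

So to_bool(42) and to_bool(-7) both give 1, and one normalisation per operand is all the original line ever needed.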




Re: Reading linux boot args

2013-04-02 Thread Valdis . Kletnieks
On Tue, 02 Apr 2013 12:12:09 +0900, manty kuma said:

 Is there any way i could read the reason for reboot.
 I want to read it so that i can get the reason that is stored.
 like 0xABADBABE is watchdog 0xCODEDEAD is panic. Etc..

 Please suggest an alternative approach.


See the 'pstore' persistent storage filesystem in fs/pstore.
That lets you store more than just an int (you can even get it
to stash your dmesg buffer).

'more fs/pstore/Kconfig' would be a good place to start.




Re: cgroup.procs versus tasks (cgroups)

2013-04-02 Thread Valdis . Kletnieks
On Tue, 02 Apr 2013 16:46:24 +0300, Kevin Wilson said:
 Hi,
 Thanks  a lot Vlad. This explains it.
 - Does anybody know of a ps command (or a filter to  ps command)
 which will display only multithreaded
 processes (list processes by TGID) ?  (I know now about the option of
 displaying cgroup.procs , but is something parallel can be done with ps ? )

Have you tried 'ps -m' and friends?  Though it doesn't do exactly
what you wanted and *only* display multithreaded, you need to do some
post-processing:

$ ps max
...
   928 ?-  0:00 /sbin/auditd -n
 - -Ssl   0:00 -
 - -Ssl   0:00 -
   940 ?-  0:00 /sbin/audispd
 - -Ssl   0:00 -
 - -Ssl   0:00 -
   951 ?-  0:00 /usr/sbin/abrtd -d -s
 - -Ss 0:00 -
   960 ?-  0:00 /usr/bin/abrt-watch-log -F Backtrace /var/log/Xorg.0.log -- /usr/bin/abrt-dump-xorg -xD
 - -Ss 0:00 -

If there's 2 or more '- -' after the process entry, it's multi-threaded.

Note however that as far as the kernel is concerned, a single-threaded
process is handled by the code as a multi-threaded process that happens to
have only one thread at the moment.  In other words, thinking that single and
multi threaded are different in some mystical way will probably end up
causing trouble for you...




Re: Online migration of arbitrary filesystems, possible?

2013-04-02 Thread Valdis . Kletnieks
On Mon, 01 Apr 2013 17:50:43 -0300, Daniel Hilst said:

  Any reason you can't just 'rsync /source-fs /dest-fs'?
 
 because I can't use dest-fs while rsynching

Sure you can. You just have to remember to pay attention to race
conditions - if you create foo/bar.dat on the dest and then rsync
wants to copy over a foo/bar.tar from the source, things will go
poorly.

However, if you wanted to write to the dest while doing your sync,
you'll have that issue no matter *what* method you use to do it.

 Read operations:
   1. See if data is already on dest fs,
   2. If is then read data and bright back to caller (lets call this
 cold read)
   3. If is not, then read file from source fs, put it on page cache,
 and change the backstorage of that page..
   3.1 So when this page get dirty or too old, it will be writed to

You may want to look for 'overlayfs' and 'unionfs', which may provide
you the function you need. (Note there's several different patchsets
calling themselves 'unionfs').





Re: Method to calculate user space thread size

2013-04-03 Thread Valdis . Kletnieks
On Wed, 03 Apr 2013 14:03:40 +0530, naveen yadav said:

 I have code written, and I cannot modify. I want to fix user stack size for
 all threads in glibc,

'man ulimit'?




Re: NMIs, Locks, linked lists, barriers, and rculists, oh my!

2013-04-03 Thread Valdis . Kletnieks
On Wed, 03 Apr 2013 19:08:45 -0700, Arlie Stephens said:

 - I've got a linked list, with entries occassionally added from normal
 contexts.
 - Entries are never deleted from the list.

This is already busted - you *will* eventually OOM the system this way.

 This would be simple, except that the routines which READ the list may
 get called from panic(), or inside an oops, or from an NMI. It's
 important that they succeed in doing their jobs in that case, even if
 they managed to interrupt a list addition in progress.

At that point, you need to be *really* specific on what the semantics
of "succeed" really are.  In particular, do they merely need to
succeed at reading a single specific entry, or the entire list?

You should also be clear on why panic() and oops need to poke the list
(hint - if you're in panic(), you're there because the kernel is already
convinced things are too screwed up to continue, so why should you
trust your list enough to walk across it during panic()?)

 1) Use an rculist, and hope that it really does cover this
 situation.

That's probably safe/sufficient, as long as the read semantics of
an RCU-protected region are acceptable. In particular, check the
behavior of when an update happens, compared to when it becomes
visible to readers.
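
To make the "when does it become visible" point concrete, here is a userspace analogue using C11 atomics rather than the kernel's RCU primitives. It is only a sketch of the publish/read ordering for an add-only list; the names are invented, and in the kernel you would reach for list_add_rcu() and list_for_each_entry_rcu() instead:

```c
#include <stdatomic.h>
#include <stddef.h>

struct entry {
	int value;
	struct entry *next;	/* written once, before the entry is published */
};

static _Atomic(struct entry *) list_head;

/* Writer side.  Writers are assumed to serialise among themselves. */
void list_publish(struct entry *e)
{
	e->next = atomic_load_explicit(&list_head, memory_order_relaxed);
	/* Release: the entry's fields become visible before the pointer does. */
	atomic_store_explicit(&list_head, e, memory_order_release);
}

/* Reader side: takes no locks, so interrupting a writer cannot deadlock it. */
int list_sum(void)
{
	int sum = 0;
	struct entry *e = atomic_load_explicit(&list_head, memory_order_acquire);

	for (; e != NULL; e = e->next)
		sum += e->value;
	return sum;
}
```

A reader that interrupts an addition in progress sees either the old head or the new one, never a half-linked entry. What it cannot promise is that the entry being added right then is included, which is the semantics question you need to settle first.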




Re: simple question about the function memcmp in kernel

2013-04-07 Thread Valdis . Kletnieks
On Mon, 08 Apr 2013 08:57:01 +0800, Ben Wu said:

 int memcmp(const void *cs, const void *ct, size_t count)
 {

 I want to know why it use the temp pointer su1, su2? why it doesn't directly
 use the cs and ct pointer?

This is a C 101 question, not a kernel question.  But anyhow..

They're declared const, so the compiler will whine about ++'ing them.





Re: simple question about the function memcmp in kernel

2013-04-07 Thread Valdis . Kletnieks
On Mon, 08 Apr 2013 05:56:29 +0400, Max Filippov said:

 const is the the object they point to, not the pointers themselves
 (that would be
 void * const cs).

 memcmp compares bytes at which cs and ct point, but these are void pointers,
 and the expression res = *cs - *ct is thus meaningless.

Max is right, and I'm obviously under-caffeinated or something. :)
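
For reference, a byte-wise compare in the style Max describes looks roughly like this. It is a sketch under the name demo_memcmp, not a copy of any particular kernel version:

```c
#include <stddef.h>

int demo_memcmp(const void *cs, const void *ct, size_t count)
{
	/* void pointers can't be dereferenced, so view the memory as bytes. */
	const unsigned char *su1 = cs;
	const unsigned char *su2 = ct;
	int res = 0;

	for (; count > 0; ++su1, ++su2, count--) {
		res = *su1 - *su2;
		if (res != 0)
			break;
	}
	return res;
}
```

The unsigned char pointers are what give *su1 - *su2 a meaning. Incrementing cs and ct themselves would have been legal as far as const goes, but there would still be nothing to dereference.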




Re: How will Linux community support the coming Intel new chip Bay Trail?

2013-04-11 Thread Valdis . Kletnieks
On Thu, 11 Apr 2013 23:08:21 +0800, Peter Xu said:
 Hi, all,

 It seems that Intel will publish a nice chip called Bay Trail (or plus,
 I don't quick sure, which is for smartphones/tablets, also some lower
 ends of laptops in the future). It was said publically that Intel will
 support Linux platform on that chip. I just want to know something from
 the communitiy side, that is there a plan or something related to that
 chip?

Over the past few years, Intel has been quite good about contributing
drivers for coming chips, often before official numeric part numbers have
even been assigned.  I don't know about this particular chip, but there's
a good chance that there will be in-tree support before the chip officially
releases for general use.




Re: net_device: limit rate ot tx packets

2013-04-14 Thread Valdis . Kletnieks
On Sun, 14 Apr 2013 10:09:54 +0200, mic...@michaelblizek.twilightparadox.com 
said:

 This is not what I meant. When the qdisc has a size of say 256KB and the
 socket memory is, say 128kb, the socket memory limit will be reached before
 the qdisc limit and the socket will sleep. But when the socket memory limit
 is greater than the qdisc limit, it will be interesting whether the socket
 still sleeps or starts dropping packets.

How to figure this out for yourself:

Look at net/sched/sch_plug.c, which is a pretty simple qdisc (transmit packets
until a plug request is received, then queue until unplugged). In particular,
look at plug_enqueue() to see what happens when q->limit is exceeded, and
plug_init() to see where q->limit is set.

Then look at the definition of qdisc_reshape_fail() in
include/net/sch_generic.h to figure out what the qdisc returns if q->limit is
exceeded.

Then go look at net/core/dev.c, in function __dev_xmit_skb(), and
watch the variable 'rc'.

Now go find the caller of __dev_xmit_skb() and see what it does with
that return value.

Hope that helps...





Re: So I want to get some kernel routine called....

2013-04-16 Thread Valdis . Kletnieks
On Tue, 16 Apr 2013 16:33:17 -0700, Arlie Stephens said:

 I have some kernel routine I'd like to get called, with the decision
 to call it made in user space.

The proper answer here is *highly* dependent on exactly what this routine
has to do once it's called.

Can you explain the problem the routine is trying to solve?  Quite often,
by the time you get to the "I need to call a routine" stage, you've stopped
seeing the forest for the trees, and stepping back and looking at the actual
problem to be solved rather than a proposed solution will provide insight.




Re: Building kernel modules with debuginfo and printing line numbers in kernel oops message / coredump

2013-04-19 Thread Valdis . Kletnieks
On Fri, 19 Apr 2013 23:55:49 +0530, Sankar P said:

 myfunctionname +0x2507 +5679

That function is too honking big and needs to be refactored. :)




Re: oops in a kernel module

2013-04-28 Thread Valdis . Kletnieks
On Sat, 27 Apr 2013 19:34:00 +0300, Kevin Wilson said:
 Hello,

 static int __init init_zeromib(void)

This is your init routine...

 {
 int ret = 0;
 printk("in %s\n",__func__);

Missing KERN_DEBUG or similar here.  This can cause it to fail
to appear in dmesg output, causing much confusion.


 #define SNMP_ZERO_STATS(mib, field) \
 	this_cpu_add(mib[0]->mibs[field], -(mib[0]->mibs[field]))

You *do* realize that this doesn't in fact zero the statistics, right?

If you have a 32-core machine, this will zero 1/32 of the statistics.

this_cpu_add and friends are there specifically so that on multi-core systems
there's a lockless way to update the statistics values - to actually find
the values, you need to walk across all the per_cpu areas and sum them
up.

And why for the love of all that is good did you do this bletcherous thing
with this_cpu_add() instead of using 'this_cpu_write(whatever, 0)'? Or at
least use this_cpu_sub()? ;)
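
Here's a plain userspace model of why that macro only does part of the job. The struct and the DEMO_NR_CPUS constant are invented for the example, and an ordinary array stands in for the real per-CPU areas:

```c
#define DEMO_NR_CPUS 4	/* hypothetical core count */

struct percpu_counter_demo {
	long slot[DEMO_NR_CPUS];	/* one private slot per CPU */
};

/* Update path: each CPU only ever touches its own slot, so no lock is needed. */
void counter_add(struct percpu_counter_demo *c, int cpu, long delta)
{
	c->slot[cpu] += delta;
}

/* Reading the statistic means summing every CPU's slot. */
long counter_sum(const struct percpu_counter_demo *c)
{
	long total = 0;

	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		total += c->slot[cpu];
	return total;
}

/* What the quoted macro effectively does: clear the current CPU's slot only. */
void counter_zero_this_cpu(struct percpu_counter_demo *c, int cpu)
{
	c->slot[cpu] = 0;
}

/* Actually clearing the statistic needs a walk over all the CPUs. */
void counter_zero_all(struct percpu_counter_demo *c)
{
	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		c->slot[cpu] = 0;
}
```

Zeroing one slot out of four leaves three quarters of the count behind, and the same thing happens on the 32-core box, just with a smaller fraction cleared.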




