Re: [PATCH -tip 23/32] sched: Add a per-thread core scheduling interface

2020-12-15 Thread Dhaval Giani
On 12/14/20 3:25 PM, Joel Fernandes wrote:

>> No problem. That was there primarily for debugging.
> Ok. I squashed Josh's changes into this patch and several of my fixups. So
> there'll be 3 patches:
> 1. CGroup + prctl  (single patch as it is hell to split it)

Please don't do that. I am not sure we have thought the cgroup interface
through (looking at all the discussions). IMHO, it would be better for us to
get a simpler interface (prctl) right first and then, once we have learned
the lessons from using it, apply them to the cgroup interface. I think we
all agree we don't want to maintain a messy interface forever.

Dhaval



Re: [RFC] Design proposal for upstream core-scheduling interface

2020-08-24 Thread Dhaval Giani
On Mon, Aug 24, 2020 at 4:32 AM Vineeth Pillai  wrote:
>
> > Let me know your thoughts and looking forward to a good LPC MC discussion!
> >
>
> Nice write up Joel, thanks for taking time to compile this with great detail!
>
> After going through the details of the interface proposal using cgroup v2
> controllers, and based on our discussion offline, I would like to note down
> this idea about a new pseudo filesystem interface for core scheduling. We
> could also include this in the API discussion during the core scheduler MC.
>
> coreschedfs: pseudo filesystem interface for Core Scheduling
> ------------------------------------------------------------
>
> The basic requirement of core scheduling is simple - we need to group a set
> of tasks into a trust group that can share a core. So we don't really need
> a nested hierarchy for the trust groups. Cgroup v2 follows a unified nested
> hierarchy model that causes considerable confusion if the trusted tasks are
> at different levels of the hierarchy and we need to allow them to share the
> core. Cgroup v2's single-hierarchy model makes it difficult to regroup
> tasks at different levels of nesting for core scheduling. As noted in this
> mail, we could use a multi-file approach and other interfaces like prctl to
> overcome this limitation.
>
> The idea proposed here to overcome the above limitation is to come up with
> a new pseudo filesystem - “coreschedfs”. This filesystem is basically a
> flat filesystem with a maximum nesting level of 1. That means the root
> directory can have sub-directories for sub-groups, but those
> sub-directories cannot have further sub-directories representing trust
> groups. The root directory represents the system-wide trust group and the
> sub-directories represent trusted groups. Each directory, including the
> root directory, has the following set of files/directories:
>
> - cookie_id: User-exposed id for a cookie. This can be compared to a file
>   descriptor. It could be used in a programmatic API to join/leave a group.
>
> - properties: An interface to specify how child tasks of this group should
>   behave. Can be used for specifying future flag requirements as well.
>   The current list of properties includes:
>     NEW_COOKIE_FOR_CHILD: every fork() by a task in this group results
>       in the creation of a new trust group
>     SAME_COOKIE_FOR_CHILD: every fork() by a task in this group ends up
>       in this same group
>     ROOT_COOKIE_FOR_CHILD: every fork() by a task in this group goes to
>       the root group
>
> - tasks: Lists the tasks in this group. The main interface for adding and
>   removing tasks in a group.
>
> - : A directory per task that is a member of this trust group.
> - /properties: This file is the same as the parent properties file, but it
>   overrides the group setting.
>
> This pseudo filesystem can be mounted anywhere in the root filesystem; I
> propose the default to be “/sys/kernel/coresched”.
>
> When coresched is enabled, the kernel internally creates the framework for
> this filesystem. The filesystem gets mounted at the default location, and
> the admin can change this if needed. All tasks are in the root group by
> default. The admin or programs can then create trusted groups on top of
> this filesystem.
>
> Hooks will be placed in fork() and exit() to make sure that the
> filesystem's view of tasks stays up-to-date with the system. APIs that
> manipulate core scheduling trust groups should also make sure that the
> filesystem's view is updated.
>
> Note: The above idea is very similar to cgroup v1. Since there is no
> unified hierarchy in cgroup v1, most of the features of coreschedfs could
> be implemented as a cgroup v1 controller. As no new v1 controllers are
> allowed, I feel the best alternative for a simple API is to come up with a
> new filesystem - coreschedfs.
>
> The advantages of this approach are:
>
> - Detached from the cgroup unified hierarchy, so the very simple
>   requirement of core scheduling can easily be materialized.
> - Admins can have fine-grained control of groups using the shell and
>   scripting.
> - Programmatic access is possible using existing APIs like mkdir, rmdir,
>   write, and read. Alternatively, new APIs could be built around the
>   cookie_id, wrapping the above Linux APIs, or a new system call for core
>   scheduling could be added.
> - Fine-grained permission control using Linux filesystem permissions and
>   ACLs.
>
> The disadvantages are:
> - Yet another pseudo filesystem.
> - Very similar to cgroup v1, and might re-implement features that are
>   already provided by cgroup v1.
>
> Use Cases
> ---------
>
> Usecase 1: Google cloud
> -----------------------
>
> Since we no longer depend on cgroup v2 hierarchies, there will not be any
> issue of nesting and sharing. The m

Re: [RFC] Design proposal for upstream core-scheduling interface

2020-08-24 Thread Dhaval Giani
On Fri, Aug 21, 2020 at 8:01 PM Joel Fernandes  wrote:
>
> Hello!
> Core-scheduling aims to make it safe for more than one task that trust
> each other to share hyperthreads within a CPU core [1]. This results in a
> performance improvement for workloads that can benefit from using
> hyperthreading safely, while limiting core-sharing when it is not safe.
>
> Currently no universally agreed-upon interface exists, and companies have
> been hacking up their own interfaces to make use of the patches. This post
> aims to list the usecases I gathered after talking to various people at
> Google and Oracle, after which actual development of code to add
> interfaces can follow.
>
> The below text uses the terms cookie and tag interchangeably. Further, cookie
> of 0 is assumed to indicate a trusted process - such as kernel threads or
> system daemons. By default, if nothing is tagged then everything is
> considered trusted since the scheduler assumes all tasks are a match for each
> other.
>
> Usecase 1: Google's cloud group tags CGroups with a 32-bit integer. This
> int32 is split into 2 parts, the color and the id. The color can only be set
> by privileged processes and the id can be set by anyone. The CGroup structure
> looks like:
>
>A B
>   / \  / \ \
>  C   DE  F  G
>
> Here A and B are container CGroups for 2 jobs and are assigned a color by a
> privileged daemon. The job itself has more sub-CGroups within (for ex, B has
> E, F and G). When these sub-CGroups are spawned, they inherit the color from
> the parent. An unprivileged user can then set an id for the sub-CGroup
> without the knowledge of the privileged daemon if it desires to add further
> isolation. This setting of id can be an unprivileged operation because the
> root daemon has already isolated A and B.
>
> Usecase 2: Chrome browser - tagging renderers. In Chrome, each tab opened
> spawns a renderer. A renderer is a sandboxed process and it is assumed it
> could run arbitrary code (Javascript etc). When a renderer is created, a
> prctl call is made to tag the renderer. Every thread that is spawned by the
> renderer is also tagged. Essentially this turns SMT off for the renderer, but
> still gives a performance boost due to privileged system threads being able
> to share a core. The tagging also forbids the renderer from sharing a core
> with privileged system processes. In the future, we plan to allow threads to
> share a core as well (especially once we get syscall-isolation upstreamed.
> Patches were posted recently for the same [2]).
>
> Usecase 3: ChromeOS VMs - each vCPU thread that is created by the VMM is
> tagged thus disallowing core sharing between the vCPU thread and any other
> thread on the system. This is because such VMs may run arbitrary user code
> and attack both the guest and the host systems sharing the core.
>
> Usecase 4: Oracle - Setting a sub-CGroup as trusted (cookie 0). Chris Hyser
> mentioned to me on IRC that in a CGroup hierarchy, some CGroups should be
> allowed to not share their parent CGroup's tag. In fact, it should be
> possible to untag the child CGroup if needed, thus allowing it to share a
> core with trusted tasks. Others have had similar requirements.
>

Just to augment this. This doesn't necessarily need to be cgroup
based. We do have a need where certain processes want to be tagged
separately from others, which are in the same cgroup hierarchy. The
standard mechanism for this is nested cgroups. With a unified
hierarchy, and with cgroup tagging, I am unsure what this really
means. Consider

root
|- A
|- A1
|- A2

If A is tagged, can processes in A1 and A2 share a core? Should they
share a core? In some cases we might be OK with them sharing cores
just to get some of the performance back. Core scheduling really needs
to be limited to just the processes that we want to protect.

> Proposal for tagging
> --------------------
> We have to support both CGroup and non-CGroup users. CGroup may be overkill
> for some, and the CGroup v2 unified hierarchy may be too inflexible.
> Regardless, we must support CGroup due to its ease of use and existing
> users.
>
> For Usecase #1
> --------------
> Usecase #1 requires a 2-level tagging mechanism. I propose 2 new files
> to the CPU controller:
> - tag : a boolean (0/1). If set, this CGroup and all sub-CGroups will be
>   tagged. (In the kernel, the cookie will be derived from the pointer value
>   of a ref-counted cookie object.) If reset, the CGroup will inherit the
>   parent CGroup's cookie if there is one.
>
> - color : The ref-counted object will be aligned say to a 256-byte boundary
>   (for example), then the lower 8 bits of the pointer can be used to specify
>   color. Together, the pointer with the color will form a cookie used by the
>   scheduler.
>
> Note that if 2 CGroups belong to 2 different tagged hierarchies, then setting
> their color to be the same does not imply that the 2 groups will share a
> core. This is key.

Why?

[CFP LPC 2020] Scheduler Microconference

2020-07-29 Thread Dhaval Giani
Hi all,

We are pleased to announce the Scheduler Microconference has been
accepted at LPC this year.

Please submit your proposals on the LPC website at:

https://www.linuxplumbersconf.org/event/7/abstracts/#submit-abstract

And be sure to select "Scheduler MC" in the Track pulldown menu.


Topics we are interested in, but certainly not limited to, this year are,

- Load Balancer Rework
- Idle Balance optimizations
- Flattening the group scheduling hierarchy
- Core scheduling
- Proxy Execution for CFS
- What was formerly known as latency nice

Please get your submissions in by Aug 7th!

Thanks!
The organizers of the Scheduler Microconference



Re: CFP: LPC Testing and Fuzzing microconference.

2019-07-24 Thread Dhaval Giani
On Tue, Jul 2, 2019 at 1:12 PM Dhaval Giani  wrote:
>
> Hi folks,
>
> I am pleased to announce the Testing Microconference has been accepted
> at LPC this year.
>
> The CfP process is now open, and please submit your talks on the LPC
> website. It can be found at
> https://linuxplumbersconf.org/event/4/abstracts/
>
> Potential topics include, but are not limited to
> - Defragmentation of testing infrastructure: how can we combine
> testing infrastructure to avoid duplication.
> - Better sanitizers: Tag-based KASAN, making KTSAN usable, etc
> - Better hardware testing, hardware sanitizers.
> - Are fuzzers "solved"?
> - Improving RT testing.
> - Using clang for better testing coverage.
> - Unit test framework.
> - The future of kernelCI
>

Hi all,

Just a reminder, the CfP is open for the microconference and proposals
are being accepted. We plan to start selecting topics on Aug 11, and we
can assure you that if you don't get your topic in, it will not be
selected!

https://linuxplumbersconf.org/event/4/abstracts/

Thanks!
Dhaval and Sasha


CFP: LPC Testing and Fuzzing microconference.

2019-07-02 Thread Dhaval Giani
Hi folks,

I am pleased to announce the Testing Microconference has been accepted
at LPC this year.

The CfP process is now open, and please submit your talks on the LPC
website. It can be found at
https://linuxplumbersconf.org/event/4/abstracts/

Potential topics include, but are not limited to
- Defragmentation of testing infrastructure: how can we combine
testing infrastructure to avoid duplication.
- Better sanitizers: Tag-based KASAN, making KTSAN usable, etc
- Better hardware testing, hardware sanitizers.
- Are fuzzers "solved"?
- Improving RT testing.
- Using clang for better testing coverage.
- Unit test framework.
- The future of kernelCI

Thanks!
Dhaval and Sasha


Re: Linux Testing Microconference at LPC

2019-05-22 Thread Dhaval Giani
> Please let us know what topics you believe should be a part of the
> micro conference this year.

At OSPM right now, Douglas and Ionela were talking about their scheduler
behavioral testing framework using LISA and rt-app. This is an interesting
topic, and I think it has a lot of scope for making scheduler
testing/behaviour more predictable, as well as for analyzing and validating
scheduler behavior. I am hoping they are able to make it to LPC this year.

Dhaval


Re: Linux Testing Microconference at LPC

2019-05-22 Thread Dhaval Giani
On Wed, May 22, 2019 at 6:04 PM Dmitry Vyukov  wrote:
>
> On Thu, May 16, 2019 at 2:51 AM  wrote:
> > > -----Original Message-----
> > > From: Sasha Levin
> > >
> > > On Fri, Apr 26, 2019 at 02:02:53PM -0700, Tim Bird wrote:
> > ...
> > > >
> > > >With regards to the Testing microconference at Plumbers, I would like
> > > >to do a presentation on the current status of test standards and test
> > > >framework interoperability.  We recently had some good meetings
> > > >between the LAVA and Fuego people at Linaro Connect
> > > >on this topic.
> > >
> > > Hi Tim,
> > >
> > > Sorry for the delayed response, this mail got marked as read as a result
> > > of fat fingers :(
> > >
> > > I'd want to avoid having an 'overview' talk as part of the MC. We have
> > > quite a few discussion topics this year and in the spirit of LPC I'd
> > > prefer to avoid presentations.
> >
> > OK.  Sounds good.
> >
> > > Maybe it's more appropriate for the refereed track?
> > I'll consider submitting it there, but there's a certain "fun" aspect
> > to attending a conference that I don't have to prepare a talk for. :-)
> >
> > Thanks for getting back to me.  I'm already registered for Plumbers,
> > so I'll see you there.
> >  -- Tim
>
>
> I would like to give an update on syzkaller/syzbot and discuss:
>  - testability of kernel components in this context
>  - test coverage and what's still not tested
>  - discussion of the process (again): what works, what doesn't work, feedback
>

This sounds good to me.

> I also submitted a refereed track talk called "Reflections on kernel
> quality, development process and testing". If it's not accepted, I
> would like to do it on Testing MC.

I don't think refereed talks fit in the MC.


Linux Testing Microconference at LPC

2019-04-11 Thread Dhaval Giani
Hi Folks,

This is a call for participation for the Linux Testing microconference
at LPC this year.

For those who were at LPC last year, as the closing panel mentioned,
testing is probably the next big push needed to improve quality. From
getting more selftests in, to regression testing to ensure we don't
break realtime as more of PREEMPT_RT comes in, to more stable distros,
we need more testing around the kernel.

We have talked about different efforts around testing, such as fuzzing
(using syzkaller and trinity), automating fuzzing with syzbot, 0day
testing, test frameworks such as ktests, smatch to find bugs in the
past. We want to push this discussion further this year and are
interested in hearing from you what you want to talk about, and where
kernel testing needs to go next.

Please let us know what topics you believe should be a part of the
micro conference this year.

Thanks!
Sasha and Dhaval


Re: [PATCH v4 00/10] steal tasks to improve CPU utilization

2019-01-31 Thread Dhaval Giani


> 
> On 12/6/2018 4:28 PM, Steve Sistare wrote:
>> When a CPU has no more CFS tasks to run, and idle_balance() fails to
>> find a task, then attempt to steal a task from an overloaded CPU in the
>> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
>> identify candidates.  To minimize search time, steal the first migratable
>> task that is found when the bitmap is traversed.  For fairness, search
>> for migratable tasks on an overloaded CPU in order of next to run.
>>
>> This simple stealing yields a higher CPU utilization than idle_balance()
>> alone, because the search is cheap, so it may be called every time the CPU
>> is about to go idle.  idle_balance() does more work because it searches
>> widely for the busiest queue, so to limit its CPU consumption, it declines
>> to search if the system is too busy.  Simple stealing does not offload the
>> globally busiest queue, but it is much better than running nothing at all.
>>
>> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
>> reduce cache contention vs the usual bitmap when many threads concurrently
>> set, clear, and visit elements.
>>
>> Patch 1 defines the sparsemask type and its operations.
>>
>> Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
>>
>> Patches 5 and 6 refactor existing code for a cleaner merge of later
>>   patches.
>>
>> Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
>>
>> Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
>> time being because of performance regressions that are not due to stealing
>> per-se.  See the patch description for details.
>>
>> Patch 10 adds schedstats for comparing the new behavior to the old, and
>>   provided as a convenience for developers only, not for integration.
>>
>> The patch series is based on kernel 4.20.0-rc1.  It compiles, boots, and
>> runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
>> and CONFIG_PREEMPT.  It runs without error with CONFIG_DEBUG_PREEMPT +
>> CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
>> CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP.  CPU hot plug and CPU
>> bandwidth control were tested.
>>
>> Stealing improves utilization with only a modest CPU overhead in scheduler
>> code.  In the following experiment, hackbench is run with varying numbers
>> of groups (40 tasks per group), and the delta in /proc/schedstat is shown
>> for each run, averaged per CPU, augmented with these non-standard stats:
>>
>>   %find - percent of time spent in old and new functions that search for
>> idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
>>
>>   steal - number of times a task is stolen from another CPU.
>>
>> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>> hackbench  process 10
>> sched_wakeup_granularity_ns=1500
>>
>>   baseline
>>   grps  time    %busy  slice   sched    idle    wake  %find  steal
>>   1      8.084  75.02   0.10  105476   46291   59183  0.31       0
>>   2     13.892  85.33   0.10  190225   70958  119264  0.45       0
>>   3     19.668  89.04   0.10  263896   87047  176850  0.49       0
>>   4     25.279  91.28   0.10  322171   94691  227474  0.51       0
>>   8     47.832  94.86   0.09  630636  144141  486322  0.56       0
>>
>>   new
>>   grps  time    %busy  slice   sched    idle    wake  %find  steal  %speedup
>>   1      5.938  96.80   0.24   31255    7190   24061  0.63    7433      36.1
>>   2     11.491  99.23   0.16   74097    4578   69512  0.84   19463      20.9
>>   3     16.987  99.66   0.15  115824    1985  113826  0.77   24707      15.8
>>   4     22.504  99.80   0.14  167188    2385  164786  0.75   29353      12.3
>>   8     44.441  99.86   0.11  389153    1616  387401  0.67   38190       7.6
>>
>> Elapsed time improves by 8 to 36%, and CPU busy utilization is up
>> by 5 to 22% hitting 99% for 2 or more groups (80 or more tasks).
>> The cost is at most 0.4% more find time.
>>
>> Additional performance results follow.  A negative "speedup" is a
>> regression.  Note: for all hackbench runs, sched_wakeup_granularity_ns
>> is set to 15 msec.  Otherwise, preemptions increase at higher loads and
>> distort the comparison between baseline and new.
>>
>> -- 1 Socket Results --
>>
>> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>> Average of 10 runs of: hackbench  process 10
>>
>>             --- base ---    --- new ---
>>   groups    time  %stdev    time  %stdev  %speedup
>>        1   8.008     0.1   5.905     0.2      35.6
>>        2  13.814     0.2  11.438     0.1      20.7
>>        3  19.488     0.2  16.919     0.1      15.1
>>        4  25.059     0.1  22.409     0.1      11.8
>>        8  47.478     0.1  44.221     0.1       7.3
>>
>> X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
>> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
>> Average of 10 runs of: hackbench  process 10
>>
>> 

Re: [Announce] LPC 2018: Testing and Fuzzing Microconference

2018-11-08 Thread Dhaval Giani
On Mon, Nov 5, 2018 at 10:05 AM Gustavo Padovan
 wrote:
>
> Hi Dhaval,
>
> On 9/19/18 7:13 PM, Dhaval Giani wrote:
> > Hi folks,
> >
> > Sasha and I are pleased to announce the Testing and Fuzzing track at
> > LPC [ 1 ]. We are planning to continue the discussions from last
> > year's microconference [2]. Many discussions from the Automated
> > Testing Summit [3] will also continue, and a final agenda will come up
> > only soon after that.
> >
> > Suggested Topics
> >
> > - Syzbot/syzkaller
> > - ATS
> > - Distro/stable testing
> > - kernelci
> > - kernelci auto bisection
>
> Having 2 kernelci talks doesn't make too much sense. I discussed with
> Kevin, and we think it would be a good idea to merge them together. Could
> you do that?
>

OK, we can make that happen. 45 minutes for the 2 combined topics?

> Thanks,
>
> Gustavo
>
>
> Gustavo Padovan
> Collabora Ltd
>


Re: [Announce] LPC 2018: Testing and Fuzzing Microconference

2018-10-10 Thread Dhaval Giani
On Mon, Oct 8, 2018 at 11:23 AM Steven Rostedt  wrote:
>
> On Mon, 8 Oct 2018 19:02:51 +0200
> Dmitry Vyukov  wrote:
>
> > On Wed, Sep 19, 2018 at 7:13 PM, Dhaval Giani  
> > wrote:
> > > Hi folks,
> > >
> > > Sasha and I are pleased to announce the Testing and Fuzzing track at
> > > LPC [ 1 ]. We are planning to continue the discussions from last
> > > year's microconference [2]. Many discussions from the Automated
> > > Testing Summit [3] will also continue, and a final agenda will come up
> > > only soon after that.
> > >
> > > Suggested Topics
> > >
> > > - Syzbot/syzkaller
> > > - ATS
> > > - Distro/stable testing
> > > - kernelci
> > > - kernelci auto bisection
> > > - Unit testing framework
> > >
> > > We look forward to other interesting topics for this microconference
> > > as a reply to this email.
> >
> > Hi Dhaval and Sasha,
> >
> > My syzbot talk wasn't accepted to main track, so I would like to do
> > more or less full-fledged talk on the microconf. Is it possible?
>
> Hi Dmitry,
>
> Note, microconfs are not for full-fledged talks. They are to be
> discussion focused. You can have a 5-10 minute presentation that leads
> up to discussion of future work, but we like to refrain from any talks
> about what was done if there's nothing to go forward with.

Dmitry,

Can you clarify the scope of what you want to discuss during the
microconference? Further to what Steven said, we don't want presentations
(so 3, maybe 4 slides); we want discussions about future work.

Thanks!
Dhaval


Re: [Announce] LPC 2018: Testing and Fuzzing Microconference

2018-10-03 Thread Dhaval Giani
On Tue, Oct 2, 2018 at 2:03 PM Sasha Levin  wrote:
>
> On Tue, Oct 2, 2018 at 4:44 PM Liam R. Howlett  
> wrote:
> >
> > * Dhaval Giani  [180919 13:15]:
> > > Hi folks,
> > >
> > > Sasha and I are pleased to announce the Testing and Fuzzing track at
> > > LPC [ 1 ]. We are planning to continue the discussions from last
> > > year's microconference [2]. Many discussions from the Automated
> > > Testing Summit [3] will also continue, and a final agenda will come up
> > > only soon after that.
> > >
> > > Suggested Topics
> > >
> > > - Syzbot/syzkaller
> > > - ATS
> > > - Distro/stable testing
> > > - kernelci
> > > - kernelci auto bisection
> > > - Unit testing framework
> > >
> > > We look forward to other interesting topics for this microconference
> > > as a reply to this email.
> > >
> > > Thanks!
> > > Dhaval and Sasha
> > >
> > > [1] https://blog.linuxplumbersconf.org/2018/testing-and-fuzzing-mc/
> > > [2] https://lwn.net/Articles/735034/
> > > [3] https://elinux.org/Automated_Testing_Summit
> >
> >
> > Hello,
> >
> > I have a new way to analyze binaries to detect specific calls without
> > the need for source.  I would like to discuss Machine Code Trace
> > (MCTrace) at the Testing and Fuzzing LPC track.  MCTrace intercepts the
> > application prior to execution and does not rely on a specific user
> > input. It then decodes the machine instructions to follow all control
> > flows to their natural conclusions.  This includes control flows that go
> > beyond the boundaries of the static executable code into shared
> > libraries. This new technique avoids false positives which could be
> > produced by static analysis and includes paths that could be missed by
> > dynamic tracing.  This type of analysis could be useful in both testing
> > and fuzzing by providing a call graph to a given function.
> >
> > MCTrace was initially designed to help generate the seccomp() filter
> > list, which is a whitelist/blacklist of system calls for a specific
> > application. Seccomp filters easily become outdated when the application
> > or shared library is updated. This can cause failures or security
> > issues [ 1 ].  Other potential uses including examining binary blobs,
> > vulnerability analysis, and debugging.
>
> Hi Liam,
>
> Is MCTrace available anywhere?
>

Sasha,

MCTrace is an early prototype that really needs a lot of feedback. I
will let Liam send more details (somehow he got dropped from the cc).

Dhaval

>
> --
> Thanks,
> Sasha


[Announce] LPC 2018: Testing and Fuzzing Microconference

2018-09-19 Thread Dhaval Giani
Hi folks,

Sasha and I are pleased to announce the Testing and Fuzzing track at
LPC [ 1 ]. We are planning to continue the discussions from last
year's microconference [2]. Many discussions from the Automated
Testing Summit [3] will also continue, and a final agenda will come
together soon after that.

Suggested Topics

- Syzbot/syzkaller
- ATS
- Distro/stable testing
- kernelci
- kernelci auto bisection
- Unit testing framework

We look forward to other interesting topics for this microconference
as a reply to this email.

Thanks!
Dhaval and Sasha

[1] https://blog.linuxplumbersconf.org/2018/testing-and-fuzzing-mc/
[2] https://lwn.net/Articles/735034/
[3] https://elinux.org/Automated_Testing_Summit


Re: [PATCH v3 0/4] Ktest: add email support

2018-04-03 Thread Dhaval Giani
On 2018-03-26 04:08 PM, Tim Tianyang Chen wrote:
> This patch set will let users define a mailer, an email address, and when
> to receive notifications during automated testings. Users need to set up
> the specified mailer prior to using this feature.
> 
> Tim Tianyang Chen (4):
>   Ktest: add email support
>   Ktest: add SigInt handling
>   Ktest: use dodie for critical falures
>   Ktest: add email options to sample.config
> 
>  ktest.pl    | 125 +---
>  sample.conf |  22 +++
>  2 files changed, 117 insertions(+), 30 deletions(-)
> 

Steve,

Any thoughts?

Thanks!
Dhaval


Re: [PATCH] lockdep: Show up to three levels for a deadlock scenario

2017-12-19 Thread Dhaval Giani
On 2017-12-19 11:52 AM, Steven Rostedt wrote:
> On Tue, 19 Dec 2017 17:46:19 +0100
> Peter Zijlstra  wrote:
> 
> 
>> It really isn't that hard. It's mostly a question of TL;DR.
>>
>> #0 is useless and should be thrown out
>> #1 shows where we take #1 while holding #0
>> ..
>> #n shows where we take #n while holding #n-1
>>
>> And the bottom callstack shows where we take #0 while holding #n. Which
>> gets you a nice circle in your graph, which spells deadlock.
>>
>> Plenty people have shown they get this stuff.
> 
> 
> Then I suggest that you can either take my patch to improve the
> visual or remove the visual completely, as nobody cares about it.
> 

I prefer the former. As Steven has mentioned elsewhere, people find
lockdep output hard to follow (enough that he has given talks :) )

Dhaval


Re: [PATCH] lockdep: Show up to three levels for a deadlock scenario

2017-12-19 Thread Dhaval Giani
On 2017-12-14 12:59 PM, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 12:38:52PM -0500, Steven Rostedt wrote:
>>
>> Currently, when lockdep detects a possible deadlock scenario that involves 3
>> or more levels, it just shows the chain, and a CPU sequence order of the
>> first and last part of the scenario, leaving out the middle level and this
>> can take a bit of effort to understand. By adding a third level, it becomes
>> easier to see where the deadlock is.
> 
> So is anybody actually using this? This (together with the callchain for
> #0) is always the first thing of the lockdep output I throw away.
> 

Yes :-). The other stuff is unreadable to people who are not you.

Dhaval


Re: [PATCH 0/2] [RFC] Ktest: add email support

2017-12-14 Thread Dhaval Giani
On 2017-12-06 04:40 PM, Steven Rostedt wrote:
> Hi,
> 
> Currently traveling and now I have very poor connectivity. I won't be
> able to do anything this week.
> 

ping! :)

Dhaval


Re: [PATCH 0/2] [RFC] Ktest: add email support

2017-12-06 Thread Dhaval Giani
On 2017-12-01 06:55 PM, Steven Rostedt wrote:
> On Tue, 21 Nov 2017 10:53:27 -0800
> Tim Tianyang Chen  wrote:
> 
>> This patch series will let users define a mailer and email address for
>> receiving notifications during automated testings. Users need to set up
>> the specified mailer prior to using this feature.
>>
>> Emails will be sent when the script completes, is aborted due to errors,
>> or is interrupted by SigInt.
>>
> 
> Hi Tim,
> 
> I was hoping to get to these this week, but unfortunately I wasn't able
> to finish my current workload. I leave tomorrow for Germany, and
> hopefully I can spend some time looking at these on that trip.
> 
> Feel free to send me a ping if you don't hear from me next week.
> 

Ping!

Dhaval


Re: cgroups and nice

2016-11-28 Thread Dhaval Giani
[Resending because gmail doesn't understand when to go plaintext :-) ]
[Added a few other folks who might have something to say about it]

On Fri, Nov 25, 2016 at 9:34 AM, Marat Khalili  wrote:
> I have a question as a cgroup cpu limits user: how does it interact with
> nice? Documentation creates the impression that, as long as the number of
> processes demanding cpu time exceeds the number of available cores, the
> time allocated will be proportional to the configured cpu.shares. However,
> in practice I observe that groups with niced processes significantly
> underperform.
>
> For example, suppose on a 6-core box /cgroup/cpu/group1/cpu.shares is 400,
> and /cgroup/cpu/group2/cpu.shares is 200.
> 1) If I run `stress -c 6` in both groups, I should see approximately 400% of
> cpu time in group1 and 200% in group2 in top output, regardless of their
> relative nice value.
> 2) If I run `nice -n 19 stress -c 1` in cgroup1 and `stress -c 24` in
> group2, I should see at least 100% of cpu time in group1.
>
> What I see is significantly less cpu time in group1 if group1 processes
> happen to have greater nice value, and especially if group2 have greater
> number of processes involved: cpu load of group1 in example 2 can be as low
> as 20%. It may create tensions among users in my case; how can this be
> avoided except by renicing all processes to the same value?
>
>> $ uname -a
>> Linux redacted 2.6.32-642.11.1.el6.x86_64 #1 SMP Fri Nov 18 19:25:05 UTC
>> 2016 x86_64 x86_64 x86_64 GNU/Linux
>

This is an old version of the kernel. Do you see the same behavior on
a newer version of the kernel? (4.8 is the latest stable kernel)

>
>> $ lsb_release -a
>> LSB Version:
>> :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
>> Distributor ID: CentOS
>> Description:CentOS release 6.8 (Final)
>> Release:6.8
>> Codename:   Final
>
>
> (My apologies if I'm posting to incorrect list.)
>
> --
>
> With Best Regards,
> Marat Khalili
> --

Thanks,
Dhaval
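The proportional-share arithmetic behind Marat's expectation can be stated in a few lines. The sketch below models only what the cpu.shares documentation promises (shares decide the inter-group split; nice should only matter within a group); the function name is illustrative, not a kernel interface.

```python
def expected_cpu_percent(shares, nr_cpus):
    """Split nr_cpus * 100% of CPU time proportionally to cpu.shares.

    Assumes every group has enough runnable tasks to consume its
    allocation; nice values inside a group do not appear here at all.
    """
    total = sum(shares.values())
    return {g: 100.0 * nr_cpus * s / total for g, s in shares.items()}

# Example 1 from the question: 6 cores, shares 400 vs 200.
split = expected_cpu_percent({"group1": 400, "group2": 200}, nr_cpus=6)
print(split)  # group1 ~400%, group2 ~200%
```

The ~20% actually observed for group1 is the gap between this model and the behaviour seen on the 2.6.32 kernel in question.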


Re: [PATCH v2 tip/core/rcu 05/13] decnet: Apply rcu_access_pointer() to avoid sparse false positive

2013-10-09 Thread Dhaval Giani
On Wed, Oct 9, 2013 at 5:29 PM, Paul E. McKenney
 wrote:
>
> From: "Paul E. McKenney" 
>
> The sparse checking for rcu_assign_pointer() was recently upgraded
> to reject non-__kernel address spaces.  This also rejects __rcu,
> which is almost always the right thing to do.  However, the use in
> dn_insert_route() is legitimate: It is assigning a pointer to an element
> from an RCU-protected list, and all elements of this list are already
> visible to caller.
>
> This commit therefore silences this false positive by laundering the
> pointer using rcu_access_pointer() as suggested by Josh Triplett.
>
> Reported-by: kbuild test robot 


I did not realize that we were allowed to rename people :-)

Thanks!
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] ftrace: Fixup !CONFIG_TRACING trace_dump_stack

2013-08-02 Thread Dhaval Giani
Hi Steve,

And since gmail will mangle this up, I have attached it as well.

Thanks!
Dhaval

commit 6379b752b4c9e5f9edf9894723be7520a987d2b5
Author: Dhaval Giani 
Date:   Fri Aug 2 14:42:53 2013 -0400

ftrace: Fixup !CONFIG_TRACING trace_dump_stack

The !CONFIG_TRACING stub of trace_dump_stack() does not take the new skip argument. Fix it to match.

Signed-off-by: Dhaval Giani 

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index e9ef6d6..4b7cc46 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -631,7 +631,7 @@ extern void ftrace_dump(enum ftrace_dump_mode
oops_dump_mode);
 static inline void tracing_start(void) { }
 static inline void tracing_stop(void) { }
 static inline void ftrace_off_permanent(void) { }
-static inline void trace_dump_stack(void) { }
+static inline void trace_dump_stack(int skip) { }

 static inline void tracing_on(void) { }
 static inline void tracing_off(void) { }


trace.patch
Description: Binary data


Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-07-25 Thread Dhaval Giani

On 2013-07-25 1:53 PM, Jörn Engel wrote:

On Thu, 25 July 2013 09:42:18 -0700, Taras Glek wrote:

Footprint wins are useful on android, but it's the
increased IO throughput on crappy storage devices that makes this
most attractive.

All the world used to be a PC.  Seems to be Android these days.

The biggest problem with compression support in the past was the
physical properties of hard drives (the spinning type, if you can
still remember those).  A random seek is surprisingly expensive, of a
similar cost to 1MB or more of linear read.  So anything that
introduces more random seeks will kill the preciously little
performance you had to begin with.

As long as files are write-once and read-only from that point on, you
can just append a bunch of compressed chunks on the disk and nothing
bad happens.  But if you have a read-write file with random overwrites
somewhere in the middle, those overwrites will change the size of the
compressed data.  You have to free the old physical blocks on disk and
allocate new ones.  In effect, you have auto-fragmentation.

So if you want any kind of support for your approach, I suspect you
should either limit it to write-once files or prepare for a mob of
gray-haired oldtimers with rainbow suspenders complaining about
performance on their antiquated hardware.  And the mob may be larger
than you think.


Yes, we plan to limit it to write-once. In order to write, you have to
replace the file.

Dhaval


Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-07-25 Thread Dhaval Giani

On 2013-07-25 2:15 PM, Vyacheslav Dubeyko wrote:

On Jul 25, 2013, at 8:42 PM, Taras Glek wrote:

[snip]

To introduce transparent decompression. Let someone else do the compression for 
us, and supply decompressed data on demand  (in this case a read call). Reduces 
the complexity which would otherwise have to be brought into the filesystem.

The main use for file compression for Firefox(it's useful on Linux desktop too) 
is to improve IO-throughput and reduce startup latency. In order for 
compression to be a net win an application should be aware of what is being 
compressed and what isn't. For example patterns for IO on large libraries (eg 
30mb libxul.so) are well suited to compression, but SQLite databases are not.  
Similarly for our disk cache: images should not be compressed, but javascript 
should be. Footprint wins are useful on android, but it's the increased IO 
throughput on crappy storage devices that makes this most attractive.

In addition of being aware of which files should be compressed, Firefox is 
aware of patterns of usage of various files it could schedule compression at 
the most optimal time.

Above needs tie in nicely with the simplification of not implementing 
compression at fs-level.

There are many filesystems that uses compression as internal technique. And, as 
I understand, implementation
of compression in different Linux kernel filesystem drivers has similar code 
patterns. So, from my point of view,
it makes sense to generalize compression/decompression code in the form of 
library. The API of such generalized
compression kernel library can be used in drivers of different file systems. 
Also such generalized compression
library will simplify support of compression in file system drivers that don't 
support compression feature currently.

Moreover, I think that it is possible to implement compression support on VFS 
level. Such feature gives
opportunity to have compression support for filesystems that don't support 
compression feature as
internal technique.


I am not sure it is a very good idea at this stage.

[snip]

This transparent decompression idea is based on our experience with HFS+. Apple 
uses the fs-attribute approach. OSX is able to compress application libraries 
at installation-time, apps remain blissfully unaware but get an extra boost in 
startup perf.


HFS+ supports compression as internal filesystem technique. It means that HFS+ 
volume layout has
metadata structures for compression support (compressed xattrs or compressed 
resource forks).
So, compression is supported on FS level. As I know, Mac OS X has native 
decompression support
for compressed files but you need to use special tool for compression of files 
on HFS+. Maybe
Mac OS X has internal library that give opportunity to compress application 
libraries at installation
time. But I suppose that it is simply user-space tool or library that uses HFS+ 
compression support
on kernel-space and volume layout levels.
In addition to what Taras mentioned, there is a similar approach being 
followed here. There is a compression tool to compress files at 
https://github.com/glandium/faulty.lib/blob/master/linker/szip.cpp .




Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-07-25 Thread Dhaval Giani

On 07/24/2013 07:36 PM, Jörn Engel wrote:

On Wed, 24 July 2013 17:03:53 -0400, Dhaval Giani wrote:

I am posting this series early in its development phase to solicit some
feedback.

At this state, a good description of the format would be nice.


Sure. The format is quite simple. There is a 20-byte header followed by
an offset table giving us the offsets of the 16k zlib-compressed chunks
(16k is the default chunk size; it can be changed with the szip tool, and
the kernel should still decompress it as that information is in the
header). I am not tied to the format; I used it because that is what is
being used here. My final goal is to have the filesystem agnostic of the
compression format, as long as it is seekable.
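Pieced together from struct szip_struct in patch 1, the 20-byte header can be decoded as below. The field order and widths come from the patch; the little-endian layout is my assumption (the tool targets x86/ARM), not something the patch states.

```python
import struct

SZIP_MAGIC = 0x7a5a6553
# u32 magic, u32 total_size, u16 chunk_size, u16 dict_size,
# u32 nr_chunks, u16 last_chunk_size, s8 window_bits, s8 filter
HEADER = struct.Struct("<IIHHIHbb")  # 20 bytes, matching SZIP_HEADER_SIZE

def parse_szip_header(buf):
    """Decode the fixed-size szip header from the start of a file."""
    (magic, total_size, chunk_size, dict_size,
     nr_chunks, last_chunk_size, window_bits, filt) = HEADER.unpack_from(buf)
    if magic != SZIP_MAGIC:
        raise ValueError("not an szip file")
    return {"total_size": total_size, "chunk_size": chunk_size,
            "dict_size": dict_size, "nr_chunks": nr_chunks,
            "last_chunk_size": last_chunk_size,
            "window_bits": window_bits, "filter": filt}
```

The offset table that follows the header (one u32 per chunk, per szip_offset_table_size() in the patch) is what makes the format seekable: chunk index = offset // chunk_size.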





We are implementing transparent decompression with a focus on ext4. One
of the main usecases is that of Firefox on Android. Currently libxul.so
is compressed and it is loaded into memory by a custom linker on
demand. With the use of transparent decompression, we can make do
without the custom linker. More details (i.e. code) about the linker can
be found at https://github.com/glandium/faulty.lib

It is not quite clear what you want to achieve here.


To introduce transparent decompression. Let someone else do the 
compression for us, and supply decompressed data on demand  (in this 
case a read call). Reduces the complexity which would otherwise have to 
be brought into the filesystem.



   One approach is
to create an empty file, chattr it to enable compression, then write
uncompressed data to it.  Nothing in userspace will ever know the file
is compressed, unless you explicitly call lsattr.

If you want to follow some other approach where userspace has one
interface to write the compressed data to a file and some other
interface to read the file uncompressed, you are likely in a world of
pain.
Why? If it is only going to be a few applications that know the file is 
compressed, and read it to get decompressed data, why would it be 
painful? What about introducing a new flag, O_COMPR, which tells the 
kernel we want this file to be decompressed if it can be? It could fall 
back to O_RDONLY or something like that. That gets rid of the chattr 
ugliness.



Assuming you use the chattr approach, that pretty much comes down to
adding compression support to ext4.  There have been old patches for
ext2 around that never got merged.  Reading up on the problems
encountered by those patches might be instructive.


Do you have subjects for these? When I googled for ext4 compression, I 
found http://code.google.com/p/e4z/ which doesn't seem to exist, and 
checking in my LKML archives gives too many false positives.


Thanks!
Dhaval


[RFC/PATCH 1/2] szip: Add seekable zip format

2013-07-24 Thread Dhaval Giani

Add support for inflating seekable zip format. This uses zlib
underneath. In order to create a seekable zip file, use the
szip utility which can be obtained from

https://github.com/glandium/faulty.lib

We shall use this to implement transparent decompression on
ext4. The use would be very similar to that used by the faulty.lib
linker.

Cc: Theodore Ts'o 
Cc: Taras Glek 
Cc: Vladan Djeric 
Cc: linux-ext4 
Cc: LKML 
Cc: linux-fsdevel 
Cc: Mike Hommey 
Signed-off-by: Dhaval Giani 
---
 include/linux/szip.h |  32 
 lib/Kconfig  |   8 ++
 lib/Makefile |   1 +
 lib/szip.c   | 217 +++
 4 files changed, 258 insertions(+)
 create mode 100644 include/linux/szip.h
 create mode 100644 lib/szip.c

diff --git a/include/linux/szip.h b/include/linux/szip.h
new file mode 100644
index 000..1d4421e
--- /dev/null
+++ b/include/linux/szip.h
@@ -0,0 +1,32 @@
+#ifndef __SZIP_H
+#define __SZIP_H
+
+#include 
+#include 
+
+#define SZIP_HEADER_SIZE (20)
+
+struct szip_struct {
+   u32 magic;
+   u32 total_size;
+   u16 chunk_size;
+   u16 dict_size;
+   u32 nr_chunks;
+   u16 last_chunk_size;
+   signed char window_bits;
+   signed char filter;
+   unsigned *offset_table;
+   unsigned *dictionary;
+   char *buffer;
+   void *workspace;
+};
+
+extern int szip_decompress(struct szip_struct *, char *, size_t);
+extern int szip_seekable_decompress(struct szip_struct *, size_t,
+   size_t, char *, size_t);
+extern size_t szip_uncompressed_size(struct szip_struct *);
+extern int szip_init(struct szip_struct *, char *);
+extern void szip_init_offset_table(struct szip_struct *szip, char *buf);
+extern size_t szip_offset_table_size(struct szip_struct *szip);
+
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index fe01d41..0903693 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -213,6 +213,14 @@ config DECOMPRESS_LZO
select LZO_DECOMPRESS
tristate
 
+config SZIP
+   select ZLIB_INFLATE
+   tristate
+   help
+ Use this to provide szip decompression support. szip is a seekable
+ zlib format. Check https://github.com/glandium/faulty.lib for the
+ szip tool. This is required for transparent ext4 decompression.
+
 #
 # Generic allocator support is selected if needed
 #
diff --git a/lib/Makefile b/lib/Makefile
index c55a037..86a5d4b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -77,6 +77,7 @@ obj-$(CONFIG_LZO_COMPRESS) += lzo/
 obj-$(CONFIG_LZO_DECOMPRESS) += lzo/
 obj-$(CONFIG_XZ_DEC) += xz/
 obj-$(CONFIG_RAID6_PQ) += raid6/
+obj-${CONFIG_SZIP} += szip.o
 
 lib-$(CONFIG_DECOMPRESS_GZIP) += decompress_inflate.o
 lib-$(CONFIG_DECOMPRESS_BZIP2) += decompress_bunzip2.o
diff --git a/lib/szip.c b/lib/szip.c
new file mode 100644
index 000..d610e62
--- /dev/null
+++ b/lib/szip.c
@@ -0,0 +1,217 @@
+/*
+ * lib/szip.c
+ *
+ * This is a seekable zip file, the format of which is based on
+ * code available at https://github.com/glandium/faulty.lib
+ *
+ * Copyright: Mozilla
+ * Author: Dhaval Giani 
+ *
+ * Based on code written by Mike Hommey  as
+ * part of faulty.lib .
+ *
+ * This code is available under the MPL v2.0 which is explicitly
+ * compatible with GPL v2.
+ */
+
+#include 
+#include 
+#include 
+
+#include 
+
+#define SZIP_MAGIC 0x7a5a6553
+
+static int szip_decompress_seekable_chunk(struct szip_struct *szip,
+   char *output, size_t offset, size_t chunk, size_t length)
+{
+   int is_last_chunk = (chunk == szip->nr_chunks - 1);
+   size_t chunk_len = is_last_chunk ? szip->last_chunk_size
+   : szip->chunk_size;
+   z_stream zstream;
+   int ret = 0;
+   int flush;
+   int success;
+
+   memset(&zstream, 0, sizeof(zstream));
+
+   if (length == 0 || length > chunk_len)
+   length = chunk_len;
+
+   if (is_last_chunk)
+   zstream.avail_in = szip->total_size;
+   else
+   zstream.avail_in = szip->offset_table[chunk + 1]
+   - szip->offset_table[chunk];
+
+   zstream.next_in = szip->buffer + offset;
+   zstream.avail_out = length;
+   zstream.next_out = output;
+   if (!szip->workspace)
+   szip->workspace = vzalloc(zlib_inflate_workspacesize());
+   zstream.workspace = szip->workspace;
+   if (!zstream.workspace) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   /* Decompress Chunk */
+   /* **TODO: Correct return value for bad zlib format** */
+   if (zlib_inflateInit2(&zstream, (int) szip->window_bits) != Z_OK) {
+   ret = -EMEDIUMTYPE;
+   goto out;
+   }
+
+   /* We don't have dictionary logic yet */
+   if (length == chunk_len) {
+   flush = Z_FINISH;
+   success 

[RFC/PATCH 2/2] Add rudimentary transparent decompression support to ext4

2013-07-24 Thread Dhaval Giani

Adds basic support for transparently reading compressed
files in ext4.

Lots of issues in this patch
1. It requires the file to be read fully from disk; no seeking is allowed
2. Compressed files give their compressed sizes and not uncompressed
sizes. Therefore cat will return truncated data (since the buffer
isn't big enough)
3. It adds a new file operation. That will be *removed*.
4. Doesn't mmap decompressed data

Cc: Theodore Ts'o 
Cc: Taras Glek 
Cc: Vladan Djeric 
Cc: linux-ext4 
Cc: LKML 
Cc: linux-fsdevel 
Cc: Mike Hommey 
Signed-off-by: Dhaval Giani 
---
 fs/ext4/file.c | 66 ++
 fs/read_write.c|  3 +++
 include/linux/fs.h |  1 +
 3 files changed, 70 insertions(+)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b1b4d51..5c9db04 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -31,6 +31,9 @@
 #include "xattr.h"
 #include "acl.h"
 
+#include 
+#include 
+
 /*
  * Called when an inode is released. Note that this is different
  * from ext4_file_open: open gets called at every open, but release
@@ -623,6 +626,68 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int 
whence)
return -EINVAL;
 }
 
+static int ext4_is_file_compressed(struct file *file)
+{
+   struct inode *inode = file->f_mapping->host;
+   return ext4_test_inode_flag(inode, EXT4_INODE_COMPR);
+}
+
+static int _ext4_decompress(char __user *buf, int sz)
+{
+   /*
+* We can really cheat here since we have the full buffer already read
+* and made available
+*/
+   struct szip_struct szip;
+   char *temp;
+   size_t uncom_size;
+
+   int ret = szip_init(&szip, buf);
+   if (ret) {
+   ret = -1;
+   goto out;
+   }
+
+   uncom_size = szip_uncompressed_size(&szip);
+   temp = kmalloc(uncom_size, GFP_NOFS);
+   if (!temp) {
+   ret = -2;
+   goto out;
+   }
+
+   ret = szip_decompress(&szip, temp, 0);
+   if (ret) {
+   ret = -3;
+   goto out_free;
+   }
+
+   sz = min_t(int, sz, uncom_size);
+
+   memset(buf, 0, sz);
+   memcpy(buf, temp, sz);
+out_free:
+   kfree(temp);
+
+out:
+   return ret;
+
+}
+
+int ext4_decompress(struct file *file, char __user *buf, size_t len)
+{
+   int ret = 0;
+
+   if (!ext4_is_file_compressed(file))
+   return 0;
+
+   ret = _ext4_decompress(buf, len);
+   if (ret) {
+   goto out;
+   }
+out:
+   return ret;
+}
+
 const struct file_operations ext4_file_operations = {
.llseek = ext4_llseek,
.read   = do_sync_read,
@@ -640,6 +705,7 @@ const struct file_operations ext4_file_operations = {
.splice_read= generic_file_splice_read,
.splice_write   = generic_file_splice_write,
.fallocate  = ext4_fallocate,
+   .decompress = ext4_decompress,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
diff --git a/fs/read_write.c b/fs/read_write.c
index 2cefa41..44d2523 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -330,6 +330,7 @@ int rw_verify_area(int read_write, struct file *file, 
loff_t *ppos, size_t count
return count > MAX_RW_COUNT ? MAX_RW_COUNT : count;
 }
 
+
 ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t 
*ppos)
 {
struct iovec iov = { .iov_base = buf, .iov_len = len };
@@ -345,6 +346,8 @@ ssize_t do_sync_read(struct file *filp, char __user *buf, 
size_t len, loff_t *pp
if (-EIOCBQUEUED == ret)
ret = wait_on_sync_kiocb(&kiocb);
*ppos = kiocb.ki_pos;
+   if (filp->f_op->decompress)
+   filp->f_op->decompress(filp, buf, len);
return ret;
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 65c2be2..ce43e82 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1543,6 +1543,7 @@ struct file_operations {
long (*fallocate)(struct file *file, int mode, loff_t offset,
  loff_t len);
int (*show_fdinfo)(struct seq_file *m, struct file *f);
+   int (*decompress)(struct file *, char *, size_t);
 };
 
 struct inode_operations {
-- 
1.8.1.4




[RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-07-24 Thread Dhaval Giani

Hi there!

I am posting this series early in its development phase to solicit some
feedback.

We are implementing transparent decompression with a focus on ext4. One
of the main usecases is that of Firefox on Android. Currently libxul.so
is compressed and it is loaded into memory by a custom linker on
demand. With the use of transparent decompression, we can make do
without the custom linker. More details (i.e. code) about the linker can
be found at https://github.com/glandium/faulty.lib

Patch 1 introduces the seekable zip format to the kernel. The tool to
create the szip file can be found in the git repository mentioned
earlier. Patch 2 introduces transparent decompression to ext4. This
patch is really ugly, but it gives an idea of what I am upto right now.

Now let's move on the interesting bits.

There are a few flaws with the current approach (though most are easily
fixable)
1. The decompression takes place very late. We probably want to be
decompressing soon after we get the data off disk.
2. No seek support. This is for simplicity as I was experimenting with
filesystems for the first time. I have a patch that does it, but it is
too ugly to see the world. I will fix it up in time for the next set.
3. No mmap support. For a similar reason as 1. There is no reason it
cannot be done, it just has not been done correctly.
4. stat still returns the compressed size. We need to modify
compressed files to return uncompressed size.
5. Implementation is tied to the szip format. However it is quite easy
to decouple the compression scheme from the filesystem. I will probably
get to that in another 2 rounds (first goal is to get seek support
working fine, and mmap in place)
6. Introduction of an additional file_operation to decompress the
buffer. This will be *removed* in the next posting once I have seek
support implemented properly.
7. The compressed file is read only. In order to write to the file, it
shall have to be replaced.
8. The kernel learns that the file is compressed with the use of the
chattr tool. For now I am abusing the +c flag. Please let me know if
that should not be used.

In order to try this patch out, please create an szip file using the
szip tool. Then, read the file. Just ensure that the buffer you provide
to the kernel is big enough to fit the uncompressed file (and that you
read the whole file in one go.)

Thanks!
Dhaval

--
Dhaval Giani (2):
  szip: Add seekable zip format
  Add rudimentary transparent decompression support to ext4

 fs/ext4/file.c   |  66 
 fs/read_write.c  |   3 +
 include/linux/fs.h   |   1 +
 include/linux/szip.h |  32 
 lib/Kconfig  |   8 ++
 lib/Makefile |   1 +
 lib/szip.c   | 217 +++
 7 files changed, 328 insertions(+)
 create mode 100644 include/linux/szip.h
 create mode 100644 lib/szip.c

-- 
1.8.1.4





Re: [PATCH 5/8] vrange: Add new vrange(2) system call

2013-06-20 Thread Dhaval Giani

On 2013-06-12 12:22 AM, John Stultz wrote:

From: Minchan Kim 

This patch adds new system call sys_vrange.

NAME
vrange - Mark or unmark range of memory as volatile

SYNOPSIS
int vrange(unsigned_long start, size_t length, int mode,
 int *purged);

DESCRIPTION
Applications can use vrange(2) to advise the kernel how it should
handle paging I/O in this VM area.  The idea is to help the kernel
discard pages of vrange instead of reclaiming when memory pressure
happens. It means kernel doesn't discard any pages of vrange if
there is no memory pressure.

mode:
VRANGE_VOLATILE
hint to kernel so VM can discard in vrange pages when
memory pressure happens.
VRANGE_NONVOLATILE
hint to kernel so VM doesn't discard vrange pages
any more.

If user try to access purged memory without VRANGE_NOVOLATILE call,
he can encounter SIGBUS if the page was discarded by kernel.


I wonder if it would be possible to provide additional information here, 
for example "purge range at a time" as opposed to "purge page at a 
time". There are some valid use cases for both approaches and it doesn't 
make sense to deny one use case.


Thanks!
Dhaval
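Historical note: vrange(2) was never merged; the closest analogue that did land is madvise(MADV_FREE) (Linux 4.5), which gives the lazy-reclaim semantics of VRANGE_VOLATILE but hands back zero-filled pages after a purge instead of raising SIGBUS. A minimal userspace sketch of that analogue, not of the proposed syscall itself:

```python
import mmap

def mark_volatile(length=mmap.PAGESIZE):
    """Mark an anonymous mapping as expendable and read it back.

    With MADV_FREE the kernel may reclaim the page lazily under memory
    pressure, so a later read sees either the old data or zeroes.
    """
    buf = mmap.mmap(-1, length)  # anonymous private mapping
    buf[:4] = b"data"
    if hasattr(mmap, "MADV_FREE") and hasattr(buf, "madvise"):
        buf.madvise(mmap.MADV_FREE)  # Python 3.8+ on Linux
    first = bytes(buf[0:4])
    buf.close()
    # Without a VRANGE_NONVOLATILE-style un-mark, both outcomes are valid.
    return first in (b"data", b"\x00\x00\x00\x00")
```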


Re: [PATCH 0/8] Volatile Ranges (v8?)

2013-06-19 Thread Dhaval Giani

On 2013-06-19 12:41 AM, Minchan Kim wrote:

Hello Dhaval,

On Tue, Jun 18, 2013 at 12:59:02PM -0400, Dhaval Giani wrote:

On 2013-06-18 12:11 AM, Minchan Kim wrote:

Hello Dhaval,

On Mon, Jun 17, 2013 at 12:24:07PM -0400, Dhaval Giani wrote:

Hi John,

I have been giving your git tree a whirl, and in order to simulate a
limited memory environment, I was using memory cgroups.

The program I was using to test is attached here. It is your test
code, with some changes (changing the syscall interface, reducing
the memory pressure to be generated).

I trapped it in a memory cgroup with 1MB memory.limit_in_bytes and hit this,

[  406.207612] [ cut here ]
[  406.207621] kernel BUG at mm/vrange.c:523!
[  406.207626] invalid opcode:  [#1] SMP
[  406.207631] Modules linked in:
[  406.207637] CPU: 0 PID: 1579 Comm: volatile-test Not tainted

Thanks for the testing!
Does below patch fix your problem?

Yes it does! Thank you very much for the patch.

Thaks for the confirming.
While I tested it, I found several problems so I just sent fixes as reply
of each [7/8] and [8/8].
Could you test it?


Great! These patches (seem to) fix another issue I noticed yesterday 
with signal handling. I have pushed out my code for testing this stuff 
at https://github.com/volatile-ranges-test/vranges-test . The code and 
the scripts are still unpolished (as in you don't get a pass or fail) 
but they seem to work just fine.




FYI: John, Dhaval

I am working to clean purging mess up so maybe it would need not a few
change for purging part.


Great, I will also take a look at the code.

Dhaval


Re: [PATCH 0/8] Volatile Ranges (v8?)

2013-06-18 Thread Dhaval Giani

On 2013-06-18 12:11 AM, Minchan Kim wrote:

Hello Dhaval,

On Mon, Jun 17, 2013 at 12:24:07PM -0400, Dhaval Giani wrote:

Hi John,

I have been giving your git tree a whirl, and in order to simulate a
limited memory environment, I was using memory cgroups.

The program I was using to test is attached here. It is your test
code, with some changes (changing the syscall interface, reducing
the memory pressure to be generated).

I trapped it in a memory cgroup with 1MB memory.limit_in_bytes and hit this,

[  406.207612] [ cut here ]
[  406.207621] kernel BUG at mm/vrange.c:523!
[  406.207626] invalid opcode:  [#1] SMP
[  406.207631] Modules linked in:
[  406.207637] CPU: 0 PID: 1579 Comm: volatile-test Not tainted

Thanks for the testing!
Does below patch fix your problem?


Yes it does! Thank you very much for the patch.

Thanks!
Dhaval


Re: [PATCH 0/8] Volatile Ranges (v8?)

2013-06-17 Thread Dhaval Giani

Hi John,

I have been giving your git tree a whirl, and in order to simulate a 
limited memory environment, I was using memory cgroups.


The program I was using to test is attached here. It is your test code, 
with some changes (changing the syscall interface, reducing the memory 
pressure to be generated).


I trapped it in a memory cgroup with 1MB memory.limit_in_bytes and hit this,

[  406.207612] [ cut here ]
[  406.207621] kernel BUG at mm/vrange.c:523!
[  406.207626] invalid opcode:  [#1] SMP
[  406.207631] Modules linked in:
[  406.207637] CPU: 0 PID: 1579 Comm: volatile-test Not tainted 
3.10.0-rc5+ #2
[  406.207650] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS 
VirtualBox 12/01/2006
[  406.207655] task: 880006fe ti: 88001c8b task.ti: 
88001c8b
[  406.207659] RIP: 0010:[] [] 
try_to_discard_one+0x1f8/0x210

[  406.207667] RSP: :88001c8b1598  EFLAGS: 00010246
[  406.207671] RAX:  RBX: 7fde082c RCX: 
88001f199600
[  406.207675] RDX: 0006 RSI: 0007 RDI: 

[  406.207679] RBP: 88001c8b15f8 R08: 0591 R09: 
0055
[  406.207683] R10:  R11:  R12: 
ea2ae2c0
[  406.207687] R13: 88001ef9e540 R14: 88001ef9e5e0 R15: 
88000b7cfda8
[  406.207692] FS:  7fde08320740() GS:88001fc0() 
knlGS:

[  406.207696] CS:  0010 DS:  ES:  CR0: 8005003b
[  406.207700] CR2: 7fde082c CR3: 1f131000 CR4: 
06f0
[  406.207707] DR0:  DR1:  DR2: 

[  406.207711] DR3:  DR6: 0ff0 DR7: 
0400

[  406.207715] Stack:
[  406.207719]  0006 88001f199600 88001ef9e5d8 
81154f16
[  406.207724]  8801 ea7c6670 88001c8b15f8 
ea2ae2c0
[  406.207729]  88001f1386c0 88001ef9e5d8 88000b7cfda8 
880005110a10

[  406.207734] Call Trace:
[  406.207743]  [] discard_vpage+0x3c2/0x410
[  406.207753]  [] ? page_referenced+0x241/0x2c0
[  406.207762]  [] shrink_page_list+0x397/0x950
[  406.207770]  [] shrink_inactive_list+0x14f/0x400
[  406.207778]  [] shrink_lruvec+0x229/0x4e0
[  406.207787]  [] ? wake_up_process+0x27/0x50
[  406.207795]  [] shrink_zone+0x66/0x1a0
[  406.207803]  [] do_try_to_free_pages+0x110/0x5a0
[  406.207812]  [] try_to_free_mem_cgroup_pages+0xbf/0x140
[  406.207821]  [] mem_cgroup_reclaim+0x4e/0xe0
[  406.207829]  [] __mem_cgroup_try_charge+0x4ef/0xbb0
[  406.207837]  [] mem_cgroup_charge_common+0x6d/0xd0
[  406.207846]  [] mem_cgroup_newpage_charge+0x3b/0x50
[  406.207854]  [] do_wp_page+0x150/0x720
[  406.207862]  [] handle_pte_fault+0x98d/0xae0
[  406.207871]  [] handle_mm_fault+0x264/0x5e0
[  406.207880]  [] __do_page_fault+0x171/0x4e0
[  406.207888]  [] ? do_page_fault+0xe/0x10
[  406.207896]  [] ? page_fault+0x22/0x30
[  406.207905]  [] do_page_fault+0xe/0x10
[  406.207913]  [] page_fault+0x22/0x30
[  406.207917] Code: c1 e7 39 48 09 c7 f0 49 ff 8d e8 02 00 00 48 89 55 
a0 48 89 4d a8 e8 78 42 00 00 85 c0 48 8b 55 a0 48 8b 4d a8 0f 85 50 ff 
ff ff <0f> 0b 66 0f 1f 44 00 00 31 db e9 7a fe ff ff 0f 0b e8 c1 aa 4b

[  406.207937] RIP  [] try_to_discard_one+0x1f8/0x210
[  406.207941]  RSP 
[  406.207946] ---[ end trace fe9729b910a78aff ]---
[  406.207951] [ cut here ]
[  406.207957] WARNING: at kernel/exit.c:715 do_exit+0x55/0xa30()
[  406.207960] Modules linked in:
[  406.207965] CPU: 0 PID: 1579 Comm: volatile-test Tainted: G D  
3.10.0-rc5+ #2
[  406.207969] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS 
VirtualBox 12/01/2006
[  406.207973]  0009 88001c8b1288 81612a03 
88001c8b12c8
[  406.207978]  81049bb0 88001c8b14e8 000b 
88001c8b14e8
[  406.207983]  0246  880006fe 
88001c8b12d8

[  406.207988] Call Trace:
[  406.207997]  [] dump_stack+0x19/0x1b
[  406.208189]  [] warn_slowpath_common+0x70/0xa0
[  406.208207]  [] warn_slowpath_null+0x1a/0x20
[  406.208222]  [] do_exit+0x55/0xa30
[  406.208238]  [] ? printk+0x61/0x63
[  406.208253]  [] oops_end+0x9b/0xe0
[  406.208269]  [] die+0x58/0x90
[  406.208285]  [] do_trap+0x6b/0x170
[  406.208298]  [] ? 
__atomic_notifier_call_chain+0x12/0x20

[  406.208309]  [] do_invalid_op+0x95/0xb0
[  406.208317]  [] ? try_to_discard_one+0x1f8/0x210
[  406.208328]  [] ? blk_queue_bio+0x32e/0x3b0
[  406.208338]  [] invalid_op+0x18/0x20
[  406.208348]  [] ? try_to_discard_one+0x1f8/0x210
[  406.208360]  [] ? try_to_discard_one+0x1e8/0x210
[  406.208370]  [] discard_vpage+0x3c2/0x410
[  406.208383]  [] ? page_referenced+0x241/0x2c0
[  406.208394]  [] shrink_page_list+0x397/0x950
[  406.208405]  [] shrink_inactive_list+0x14f/0x400
[  406.208417]  [] shrink_lruvec+0x229/0x4e0
[  406.208429]  [] ? wake_up_process+0x27/0x50
[  406

Re: [BUG] perf report: different reports when run on terminal as opposed to script

2012-10-31 Thread Dhaval Giani
On Wed, Oct 31, 2012 at 3:12 AM, Namhyung Kim  wrote:
> On Tue, 30 Oct 2012 08:05:45 -0400, Dhaval Giani wrote:
>> On Tue, Oct 30, 2012 at 3:42 AM, Namhyung Kim  wrote:
>>> Hi Dhaval,
>>>
>>> On Mon, 29 Oct 2012 12:45:53 -0400, Dhaval Giani wrote:
>>>> On Mon, Oct 29, 2012 at 12:01 PM, Dhaval Giani  
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> As part of a class assignment I have to collect some performance
>>>>> statistics. In order to do so I run
>>>>>
>>>>> perf record -g 
>>>>>
>>>>> And in another window, I start 200 threads of the load generator
>>>>> (which is not recorded by perf)
>>>>>
>>>>> This generates me statistics that I expect to see, and I am happy. As
>>>>> this is academia and a class assignment, I need to collect information
>>>>> and analyze it across different setups. Which of course meant I script
>>>>> this whole thing, which basically is
>>>>>
>>>>> for i in all possibilities
>>>>> do
>>>>> perf record -g  &
>>>>> WAITPID=$!
>>>>> for j in NR_THREADS
>>>>> do
>>>>>  &
>>>>> KILLPID=$!
>>>>> done
>>>>> wait $PID
>>>
>>> You meant $WAITPID, right?
>>>
>>
>> yes. grrr. I changed the name here to WAITPID for it to be clear and
>> that was a fail. (I blame the cold)
>>
>>>
>>>>> kill $KILLPID
>>>
>>> Doesn't it kill the last load generator only?
>>>
>>>
>>
>> Well, this was a bug in me typing the pseudo code. the actual script
>> does "$KILLPID $!"
>
> Okay, so I suspect that it might be affected by the autogroup scheduling
> feature since you said running load generators in another window - I
> guess it's a terminal.  How about running them with setsid?
>

Why would that affect the data collection for the program being
profiled? The time spent (since it is a compute intensive program) in
various functions shouldn't change, correct? (Unless I am missing
something).

/me goes and tries it out

Hmm. OK, so that doesn't help. Still the same.

Thanks!
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] perf report: different reports when run on terminal as opposed to script

2012-10-30 Thread Dhaval Giani
On Tue, Oct 30, 2012 at 3:42 AM, Namhyung Kim  wrote:
> Hi Dhaval,
>
> On Mon, 29 Oct 2012 12:45:53 -0400, Dhaval Giani wrote:
>> On Mon, Oct 29, 2012 at 12:01 PM, Dhaval Giani  
>> wrote:
>>> Hi,
>>>
>>> As part of a class assignment I have to collect some performance
>>> statistics. In order to do so I run
>>>
>>> perf record -g 
>>>
>>> And in another window, I start 200 threads of the load generator
>>> (which is not recorded by perf)
>>>
>>> This generates me statistics that I expect to see, and I am happy. As
>>> this is academia and a class assignment, I need to collect information
>>> and analyze it across different setups. Which of course meant I script
>>> this whole thing, which basically is
>>>
>>> for i in all possibilities
>>> do
>>> perf record -g  &
>>> WAITPID=$!
>>> for j in NR_THREADS
>>> do
>>>  &
>>> KILLPID=$!
>>> done
>>> wait $PID
>
> You meant $WAITPID, right?
>

yes. grrr. I changed the name here to WAITPID for it to be clear and
that was a fail. (I blame the cold)

>
>>> kill $KILLPID
>
> Doesn't it kill the last load generator only?
>
>

Well, this was a bug in me typing the pseudo code. the actual script
does "$KILLPID $!"

Dhaval


Re: [BUG] perf report: different reports when run on terminal as opposed to script

2012-10-29 Thread Dhaval Giani
On Mon, Oct 29, 2012 at 12:01 PM, Dhaval Giani  wrote:
> Hi,
>
> As part of a class assignment I have to collect some performance
> statistics. In order to do so I run
>
> perf record -g 
>
> And in another window, I start 200 threads of the load generator
> (which is not recorded by perf)
>
> This generates me statistics that I expect to see, and I am happy. As
> this is academia and a class assignment, I need to collect information
> and analyze it across different setups. Which of course meant I script
> this whole thing, which basically is
>
> for i in all possibilities
> do
> perf record -g  &
> WAITPID=$!
> for j in NR_THREADS
> do
>  &
> KILLPID=$!
> done
> wait $PID
> kill $KILLPID
> mv perf.data results/perf.data.$i
> done
>
> (This is basic pseudo script of what I am doing), which results me
> having my profile being topped by _vscanf() and the function which I
> was seeing dominating in the older report dropping down to something
> like 5% (as opposed to 16-17%)
>
> Have I misunderstood how perf works? Something deeper? I am currently
> on 3.6.3. I can update to the latest upstream and report back. Any
> debug code is very welcome. I can also make my toy program and the
> scripts available for you to try out.

I just updated to 6b0cb4eef7bdaa27b8021ea81813fba330a2d94d and I still
see this happen.

Thanks!
Dhaval


[BUG] perf report: different reports when run on terminal as opposed to script

2012-10-29 Thread Dhaval Giani
Hi,

As part of a class assignment I have to collect some performance
statistics. In order to do so I run

perf record -g 

And in another window, I start 200 threads of the load generator
(which is not recorded by perf)

This generates me statistics that I expect to see, and I am happy. As
this is academia and a class assignment, I need to collect information
and analyze it across different setups. Which of course meant I script
this whole thing, which basically is

for i in all possibilities
do
perf record -g  &
WAITPID=$!
for j in NR_THREADS
do
 &
KILLPID=$!
done
wait $PID
kill $KILLPID
mv perf.data results/perf.data.$i
done

(This is a basic pseudo-script of what I am doing.) It results in my
profile being topped by _vscanf(), with the function that previously
dominated the older report dropping down to something like 5% (as
opposed to 16-17%).

Have I misunderstood how perf works? Something deeper? I am currently
on 3.6.3. I can update to the latest upstream and report back. Any
debug code is very welcome. I can also make my toy program and the
scripts available for you to try out.
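As the replies above point out, the pseudo-script has two bugs: `wait
$PID` waits on an unset variable instead of `$WAITPID`, and `KILLPID`
is overwritten on every loop iteration, so only the last load generator
gets killed. A hedged sketch of the intended loop follows; `run_workload`
and `run_loadgen` are hypothetical stand-ins for the real programs:

```shell
#!/bin/sh
# Sketch of the intended measurement loop. The two functions below are
# placeholders for the profiled program and one load-generator thread.
NR_THREADS=4

run_workload() { sleep 1; }   # stand-in for "perf record -g <workload>"
run_loadgen()  { sleep 60; }  # stand-in for one load-generator thread

run_workload &
WAITPID=$!

KILLPIDS=""
j=1
while [ "$j" -le "$NR_THREADS" ]; do
    run_loadgen &
    KILLPIDS="$KILLPIDS $!"   # append every PID instead of overwriting
    j=$((j + 1))
done

wait "$WAITPID"               # wait on the recorded workload, not $PID
kill $KILLPIDS                # kills every generator, not only the last
```

This keeps each generator's PID, so the cleanup step actually removes
all of the background load before the next configuration is recorded.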

Thanks!
Dhaval


[PATCH] RCU documentation: Correct the name of a reference

2012-10-17 Thread Dhaval Giani
[Attaching the patch as gmail breaks the patches]

Trying to go through the history of RCU (not for the weak-minded)
led me to search for a non-existent paper.

Correct it to the actual reference.

Signed-off-by: Dhaval Giani 
Cc: Paul McKenney 
Cc: Peter Zijlstra 

Index: linux/Documentation/RCU/RTFP.txt
===
--- linux.orig/Documentation/RCU/RTFP.txt
+++ linux/Documentation/RCU/RTFP.txt
@@ -186,7 +186,7 @@ Bibtex Entries

 @article{Kung80
 ,author="H. T. Kung and Q. Lehman"
-,title="Concurrent Maintenance of Binary Search Trees"
+,title="Concurrent Manipulation of Binary Search Trees"
 ,Year="1980"
 ,Month="September"
 ,journal="ACM Transactions on Database Systems"


rcu-doc-fix.patch
Description: Binary data


Re: [RFC] cgroup TODOs

2012-09-14 Thread Dhaval Giani
>
>   * Sort & unique when listing tasks.  Even the documentation says it
> doesn't happen but we have a good hunk of code doing it in
> cgroup.c.  I'm gonna rip it out at some point.  Again, if you
> don't like it, scream.
>

I think some userspace tools do assume the uniq bit. So if we can
preserve that, great!

Thanks
Dhaval


Re: [RFC 0/5] forced comounts for cgroups.

2012-09-08 Thread Dhaval Giani
On Thu, Sep 6, 2012 at 5:11 PM, Paul Turner  wrote:
> On Thu, Sep 6, 2012 at 1:46 PM, Tejun Heo  wrote:
>> Hello,
>>
>> cc'ing Dhaval and Frederic.  They were interested in the subject
>> before and Dhaval was pretty vocal about cpuacct having a separate
>> hierarchy (or at least granularity).
>
> Really?  Time just has _not_ borne out this use-case.  I'll let Dhaval
> make a case for this but he should expect violent objection.
>

I am not objecting directly! I am aware of a few users who are (or at
least were) using cpu and cpuacct separately because they want to be
able to account without control. Having said that, there are tons of
flaws in the current approach, because the accounting without control
is just plain wrong. I have copied a few other folks who might be able
to shed light on those users and if we should still consider them.

[And the lesser number of controllers, the better it is!]

Thanks!
Dhaval


Re: [PATCH] sched: revert load_balance_monitor()

2008-02-25 Thread Dhaval Giani
On Mon, Feb 25, 2008 at 03:29:59PM +0100, Mike Galbraith wrote:
> 
> On Mon, 2008-02-25 at 13:22 +0100, Peter Zijlstra wrote:
> > Subject: sched: revert load_balance_monitor()
> > 
> > The following commit causes a number of serious regressions:
> > 
> >   commit 6b2d7700266b9402e12824e11e0099ae6a4a6a79
> >   Author: Srivatsa Vaddagiri <[EMAIL PROTECTED]>
> >   Date:   Fri Jan 25 21:08:00 2008 +0100
> >   sched: group scheduler, fix fairness of cpu bandwidth allocation for task 
> > groups
> > 
> > Namely:
> >  - very frequent wakeups on SMP, reported by PowerTop users.
> >  - cacheline trashing on (large) SMP
> >  - some latencies larger than 500ms
> > 
> > While there is a mergeable patch to fix the latter, the former issues
> > are IMHO not fixable in a manner suitable for .25 (we're at -rc3 now).
> > Hence I propose to revert this patch and try again for .26.
> > 
> > ( minimal revert - leaves most of the code present, just removes the 
> > activation
> >   and sysctl interface ).
> 
> top - 14:05:56 up 3 min, 16 users,  load average: 4.31, 2.14, 0.85
> Tasks: 218 total,   5 running, 213 sleeping,   0 stopped,   0 zombie
> Cpu(s): 35.5%us, 64.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> 
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  P COMMAND
>  5294 mikeg 20   0  1464  364  304 R   99  0.0   1:00.08 0 chew-max
>  5278 root  20   0  1464  364  304 R   32  0.0   0:27.86 1 chew-max
>  5279 root  20   0  1464  360  304 R   32  0.0   0:35.53 1 chew-max
>  5290 root  20   0  1464  364  304 R   31  0.0   0:29.00 1 chew-max
> 
> The minimal revert seems to leave group fairness in a worse state than
> what the original patch meant to fix.  Maybe a full revert would be
> better?
> 

This is funny. The thread should not start. Did the full revert that I
sent you sometime back work better?

Thanks,
-- 
regards,
Dhaval


[RFC, PATCH 1/2] sched: allow the CFS group scheduler to have multiple levels

2008-02-25 Thread Dhaval Giani
This patch makes the group scheduler multi-hierarchy aware.

Signed-off-by: Dhaval Giani <[EMAIL PROTECTED]>

---
 include/linux/sched.h |2 +-
 kernel/sched.c|   41 -
 2 files changed, 25 insertions(+), 18 deletions(-)

Index: linux-2.6.25-rc2/include/linux/sched.h
===
--- linux-2.6.25-rc2.orig/include/linux/sched.h
+++ linux-2.6.25-rc2/include/linux/sched.h
@@ -2031,7 +2031,7 @@ extern void normalize_rt_tasks(void);
 
 extern struct task_group init_task_group;
 
-extern struct task_group *sched_create_group(void);
+extern struct task_group *sched_create_group(struct task_group *parent);
 extern void sched_destroy_group(struct task_group *tg);
 extern void sched_move_task(struct task_struct *tsk);
 #ifdef CONFIG_FAIR_GROUP_SCHED
Index: linux-2.6.25-rc2/kernel/sched.c
===
--- linux-2.6.25-rc2.orig/kernel/sched.c
+++ linux-2.6.25-rc2/kernel/sched.c
@@ -7155,10 +7155,11 @@ static void init_rt_rq(struct rt_rq *rt_
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static void init_tg_cfs_entry(struct rq *rq, struct task_group *tg,
-   struct cfs_rq *cfs_rq, struct sched_entity *se,
-   int cpu, int add)
+static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
+   struct sched_entity *se, int cpu, int add,
+   struct sched_entity *parent)
 {
+   struct rq *rq = cpu_rq(cpu);
tg->cfs_rq[cpu] = cfs_rq;
init_cfs_rq(cfs_rq, rq);
cfs_rq->tg = tg;
@@ -7170,7 +7171,11 @@ static void init_tg_cfs_entry(struct rq 
if (!se)
return;
 
-   se->cfs_rq = &rq->cfs;
+   if (parent == NULL)
+   se->cfs_rq = &rq->cfs;
+   else
+   se->cfs_rq = parent->my_q;
+
se->my_q = cfs_rq;
se->load.weight = tg->shares;
se->load.inv_weight = div64_64(1ULL<<32, se->load.weight);
@@ -7244,7 +7249,8 @@ void __init sched_init(void)
 * We achieve this by letting init_task_group's tasks sit
 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
 */
-   init_tg_cfs_entry(rq, &init_task_group, &rq->cfs, NULL, i, 1);
+   init_tg_cfs_entry(&init_task_group, &rq->cfs,
+   NULL, i, 1, NULL);
init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);
 #elif defined CONFIG_USER_SCHED
/*
@@ -7260,7 +7266,7 @@ void __init sched_init(void)
 */
init_tg_cfs_entry(rq, &init_task_group,
&per_cpu(init_cfs_rq, i),
-   &per_cpu(init_sched_entity, i), i, 1);
+   &per_cpu(init_sched_entity, i), i, 1, NULL);
 
 #endif
 #endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -7630,7 +7636,8 @@ static void free_fair_sched_group(struct
kfree(tg->se);
 }
 
-static int alloc_fair_sched_group(struct task_group *tg)
+static int alloc_fair_sched_group(struct task_group *tg,
+   struct task_group *parent)
 {
struct cfs_rq *cfs_rq;
struct sched_entity *se;
@@ -7658,8 +7665,11 @@ static int alloc_fair_sched_group(struct
GFP_KERNEL|__GFP_ZERO, cpu_to_node(i));
if (!se)
goto err;
-
-   init_tg_cfs_entry(rq, tg, cfs_rq, se, i, 0);
+   if (!parent) {
+   init_tg_cfs_entry(tg, cfs_rq, se, i, 0, NULL);
+   } else {
+   init_tg_cfs_entry(tg, cfs_rq, se, i, 0, parent->se[i]);
+   }
}
 
return 1;
@@ -7788,7 +7798,7 @@ static void free_sched_group(struct task
 }
 
 /* allocate runqueue etc for a new task group */
-struct task_group *sched_create_group(void)
+struct task_group *sched_create_group(struct task_group *parent)
 {
struct task_group *tg;
unsigned long flags;
@@ -7798,7 +7808,7 @@ struct task_group *sched_create_group(vo
if (!tg)
return ERR_PTR(-ENOMEM);
 
-   if (!alloc_fair_sched_group(tg))
+   if (!alloc_fair_sched_group(tg, parent))
goto err;
 
if (!alloc_rt_sched_group(tg))
@@ -8049,7 +8059,7 @@ static inline struct task_group *cgroup_
 static struct cgroup_subsys_state *
 cpu_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
-   struct task_group *tg;
+   struct task_group *tg, *parent;
 
if (!cgrp->parent) {
/* This is early initialization for the top cgroup */
@@ -8057,11 +8067,8 @@ cpu_cgroup_create(struct cgroup_subsys *
return &ini

Re: [RFC, PATCH 1/2] sched: allow the CFS group scheduler to have multiple levels

2008-02-25 Thread Dhaval Giani
Meant 2/2 in $subject.
-- 
regards,
Dhaval


[RFC, PATCH 1/2] sched: change the fairness model of the CFS group scheduler

2008-02-25 Thread Dhaval Giani
This patch allows tasks and groups to exist in the same cfs_rq. With this
change, CFS group scheduling follows a 1/(M+N) fairness model instead of
the previous 1/(1+N) model, where M tasks and N groups exist at the
cfs_rq level.

Signed-off-by: Dhaval Giani <[EMAIL PROTECTED]>
Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>
---
 kernel/sched.c  |   46 +
 kernel/sched_fair.c |  113 +---
 2 files changed, 137 insertions(+), 22 deletions(-)

Index: linux-2.6.25-rc2/kernel/sched.c
===
--- linux-2.6.25-rc2.orig/kernel/sched.c
+++ linux-2.6.25-rc2/kernel/sched.c
@@ -224,10 +224,13 @@ struct task_group {
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+
+#ifdef CONFIG_USER_SCHED
 /* Default task group's sched entity on each cpu */
 static DEFINE_PER_CPU(struct sched_entity, init_sched_entity);
 /* Default task group's cfs_rq on each cpu */
 static DEFINE_PER_CPU(struct cfs_rq, init_cfs_rq) cacheline_aligned_in_smp;
+#endif
 
 static struct sched_entity *init_sched_entity_p[NR_CPUS];
 static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
@@ -7163,6 +7166,10 @@ static void init_tg_cfs_entry(struct rq 
list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
 
tg->se[cpu] = se;
+   /* se could be NULL for init_task_group */
+   if (!se)
+   return;
+
se->cfs_rq = &rq->cfs;
se->my_q = cfs_rq;
se->load.weight = tg->shares;
@@ -7217,11 +7224,46 @@ void __init sched_init(void)
 #ifdef CONFIG_FAIR_GROUP_SCHED
init_task_group.shares = init_task_group_load;
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
+#ifdef CONFIG_CGROUP_SCHED
+   /*
+* How much cpu bandwidth does init_task_group get?
+*
+* In case of task-groups formed thr' the cgroup filesystem, it
+* gets 100% of the cpu resources in the system. This overall
+* system cpu resource is divided among the tasks of
+* init_task_group and its child task-groups in a fair manner,
+* based on each entity's (task or task-group's) weight
+* (se->load.weight).
+*
+* In other words, if init_task_group has 10 tasks (each of weight
+* 1024) and two child groups A0 and A1 (of weight 1024 each),
+* then A0's share of the cpu resource is:
+*
+*  A0's bandwidth = 1024 / (10*1024 + 1024 + 1024) = 8.33%
+*
+* We achieve this by letting init_task_group's tasks sit
+* directly in rq->cfs (i.e init_task_group->se[] = NULL).
+*/
+   init_tg_cfs_entry(rq, &init_task_group, &rq->cfs, NULL, i, 1);
+   init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);
+#elif defined CONFIG_USER_SCHED
+   /*
+* In case of task-groups formed thr' the user id of tasks,
+* init_task_group represents tasks belonging to root user.
+* Hence it forms a sibling of all subsequent groups formed.
+* In this case, init_task_group gets only a fraction of overall
+* system cpu resource, based on the weight assigned to root
+* user's cpu share (INIT_TASK_GROUP_LOAD). This is accomplished
+* by letting tasks of init_task_group sit in a separate cfs_rq
+* (init_cfs_rq) and having one entity represent this group of
+* tasks in rq->cfs (i.e init_task_group->se[] != NULL).
+*/
init_tg_cfs_entry(rq, &init_task_group,
&per_cpu(init_cfs_rq, i),
&per_cpu(init_sched_entity, i), i, 1);
 
 #endif
+#endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
init_task_group.rt_runtime =
sysctl_sched_rt_runtime * NSEC_PER_USEC;
@@ -7435,6 +7477,10 @@ static int rebalance_shares(struct sched
unsigned long total_load = 0, total_shares;
struct task_group *tg = cfs_rq->tg;
 
+   /* Skip this group if there is no associated group entity */
+   if (unlikely(!tg->se[this_cpu]))
+   continue;
+
/* Gather total task load of this group across cpus */
for_each_cpu_mask(i, sdspan)
total_load += tg->cfs_rq[i]->load.weight;
Index: linux-2.6.25-rc2/kernel/sched_fair.c
===
--- linux-2.6.25-rc2.orig/kernel/sched_fair.c
+++ linux-2.6.25-rc2/kernel/sched_fair.c
@@ -732,6 +73
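Under the 1/(M+N) model, an entity's bandwidth is simply its weight
divided by the total weight on the cfs_rq. A quick shell check of the
8.33% figure from the init_task_group comment in the patch above (the
numbers come straight from that comment; nothing else is assumed):

```shell
# Recompute the example from the comment: 10 tasks of weight 1024 plus
# two child groups A0/A1 of weight 1024 each on the same cfs_rq.
NTASKS=10; TASK_W=1024; GROUP_W=1024; NGROUPS=2
TOTAL=$((NTASKS * TASK_W + NGROUPS * GROUP_W))
# A0's share = its weight over the total weight on the cfs_rq
awk -v w="$GROUP_W" -v t="$TOTAL" \
    'BEGIN { printf "A0 bandwidth = %.2f%%\n", 100 * w / t }'
# prints: A0 bandwidth = 8.33%
```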

[RFC, PATCH 0/2] sched: add multiple hierarchy support to the CFS group scheduler

2008-02-25 Thread Dhaval Giani
Hi Ingo,

These patches change the fairness model as discussed in
http://lkml.org/lkml/2008/1/30/634

Patch 1 -> Changes the fairness model
Patch 2 -> Allows one to create multiple levels of cgroups

The second patch does not yet handle SMP well; that is the next TODO.
It also changes the behaviour of the fair-user scheduler: the root task
group becomes the parent task group, and the other users are its children.

Thanks,
-- 
regards,
Dhaval


Re: ftrace causing panics.

2008-02-20 Thread Dhaval Giani
On Wed, Feb 20, 2008 at 10:02:18AM -0500, Steven Rostedt wrote:
> Dhaval Giani wrote:
>> Hi Ingo,
>>
>> ftrace-cmd in -w option when being run for sometime cause this.
>>
>>
>> llm11.in.ibm.com login: [ 1002.937490] BUG: unable to handle kernel paging 
>> request at 285b0010
>> [ 1002.947087] IP: [] find_next_entry+0x4f/0x84
>>
>
> Dhaval,
>
> First, thanks for testing
>

If it helps solve difficult problems, it's the best tool ever! :)

> Are you running the -mm kernel or sched-devel?  This will let me know which 
> version you have.  I'm working on a queue of fixes for Ingo now, to 
> incorporate into sched-devel (and later pass to Andrew for -mm).  I'm not 
> sure if the new fixes will help you, but we need to get in sync, so that we 
> are both looking at the same version of the code.
>

sched-devel as of yesterday. (I don't think anything new has gone in
today).

[sorry, not had enough time to get to the bottom of this the last few
days]

-- 
regards,
Dhaval


ftrace causing panics.

2008-02-19 Thread Dhaval Giani
Hi Ingo,

ftrace-cmd with the -w option, when run for some time, causes this.


llm11.in.ibm.com login: [ 1002.937490] BUG: unable to handle kernel paging 
request at 285b0010
[ 1002.947087] IP: [] find_next_entry+0x4f/0x84
[ 1002.955091] *pdpt = 2d589001 *pde =  
[ 1002.963651] Oops:  [#1] SMP 
[ 1002.967082] Modules linked in:
[ 1002.967082] 
[ 1002.967082] Pid: 16350, comm: cat Not tainted (2.6.25-rc2-sched-devel #9)
[ 1002.967082] EIP: 0060:[] EFLAGS: 00010206 CPU: 0
[ 1002.967082] EIP is at find_next_entry+0x4f/0x84
[ 1002.967082] EAX: f6db2c60 EBX: 0001 ECX: 0001 EDX: f6db2c60
[ 1002.967082] ESI: 285b EDI: c0850e00 EBP: eed23f04 ESP: eed23eec
[ 1002.967082]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 1002.967082] Process cat (pid: 16350, ti=eed22000 task=f0164620 
task.ti=eed22000)
[ 1002.967082] Stack:  eed23f0c f61ba550 f61ba550 f61ba550 eed23f54 
eed23f18 c015f806 
[ 1002.967082] f61ba550 f61ba550 eed23f38 c015f8ae 7334 
f6dccfc0 f61ba5e8 
[ 1002.967082]c0582970 f61ba5e8 f61ba550 eed23f70 c01998a8 0f96 
 1000 
[ 1002.967082] Call Trace:
[ 1002.967082]  [] ? find_next_entry_inc+0x1c/0x80
[ 1002.967082]  [] ? s_next+0x44/0x7e
[ 1002.967082]  [] ? seq_read+0x176/0x252
[ 1002.967082]  [] ? vfs_read+0x90/0x108
[ 1002.967082]  [] ? sys_read+0x40/0x65
[ 1002.967082]  [] ? sysenter_past_esp+0x5f/0x99
[ 1002.967082]  ===
[ 1002.967082] Code: e8 2d b2 0e 00 83 f8 07 89 c3 7f 3c 8b 54 87 14 83 3a 00 
74 25 50 8b 4d f0 89 f8 e8 41 ff ff ff 59 85 c0 89 c2 74 13 85 f6 74 0a <8b> 46 
10 2b 42 10 8 
[ 1002.967082] EIP: [] find_next_entry+0x4f/0x84 SS:ESP 0068:eed23eec
[ 1002.967200] Kernel panic - not syncing: Fatal exception

I can send you the complete dmesg off-list as it has only trace data. The
.config is the same as before (except CONFIG_SMP is now on).

-- 
regards,
Dhaval


Re: ftrace and kexec

2008-02-19 Thread Dhaval Giani
On Tue, Feb 19, 2008 at 03:22:39PM +0100, Ingo Molnar wrote:
> 
> * Dhaval Giani <[EMAIL PROTECTED]> wrote:
> 
> > Hi,
> > 
> > I've been running ftrace on the sched-devel tree. I just built a 
> > kernel and tried rebooting using kexec and I get this,
> 
> hm, it's not a good idea to keep using the data structures of the tracer 
> while we kexec. Does the patch below resolve the problem?
> 

It boots, but I get this now


[0.296073] [ cut here ]
[0.300018] WARNING: at kernel/lockdep.c:2689 check_flags+0xf6/0x10b()
[0.300018] Modules linked in:
[0.300018] Pid: 1, comm: swapper Not tainted 2.6.25-rc2-sched-devel #8
[0.300018]  [] warn_on_slowpath+0x46/0x60
[0.300018]  [] ? very_verbose+0x8/0xc
[0.300018]  [] ? check_chain_key+0xe/0x1a3
[0.300018]  [] ? very_verbose+0x8/0xc
[0.300018]  [] ? very_verbose+0x8/0xc
[0.300018]  [] ? check_chain_key+0xe/0x1a3
[0.300018]  [] ? mcount_call+0x5/0x9
[0.300018]  [] ? check_chain_key+0xe/0x1a3
[0.300019]  [] ? __lock_acquire+0x614/0x668
[0.300019]  [] ? check_chain_key+0xe/0x1a3
[0.300019]  [] ? ftrace_record_ip+0x124/0x130
[0.300019]  [] ? mcount_call+0x5/0x9
[0.300019]  [] ? debug_locks_off+0x8/0x40
[0.300019]  [] ? mcount_call+0x5/0x9
[0.300019]  [] check_flags+0xf6/0x10b
[0.300019]  [] lock_acquire+0x34/0x80
[0.300019]  [] _spin_lock_irqsave+0x27/0x37
[0.300019]  [] ? ftrace_record_ip+0xac/0x130
[0.300019]  [] ? do_softirq+0x2f/0x47
[0.300019]  [] ftrace_record_ip+0xac/0x130
[0.300019]  [] ? trace_softirqs_off+0x8/0xaa
[0.300019]  [] ? do_softirq+0x2f/0x47
[0.300019]  [] mcount_call+0x5/0x9
[0.300019]  [] ? do_softirq+0x2f/0x47
[0.300019]  [] ? trace_softirqs_off+0x8/0xaa
[0.300019]  [] __local_bh_disable+0x7d/0x83
[0.300019]  [] __do_softirq+0x1e/0x97
[0.300019]  [] do_softirq+0x2f/0x47
[0.300019]  [] irq_exit+0x3c/0x3e
[0.300019]  [] smp_apic_timer_interrupt+0x32/0x3b
[0.300019]  [] apic_timer_interrupt+0x2d/0x34
[0.300019]  [] ? release_console_sem+0xc0/0xda
[0.300019]  [] ? sys_unshare+0x9c/0x2a8
[0.300019]  [] ? vprintk+0x24b/0x256
[0.300019]  [] ? trace_hardirqs_on+0xb/0xd
[0.300019]  [] ? ftrace_record_ip+0x124/0x130
[0.300019]  [] ? printk+0x8/0x16
[0.300019]  [] printk+0x14/0x16
[0.300019]  [] sock_register+0x61/0x6a
[0.300019]  [] netlink_proto_init+0xf4/0x11a
[0.300019]  [] ? kernel_init+0x0/0x6c
[0.300019]  [] do_initcalls+0x7a/0x192
[0.300019]  [] ? create_proc_entry+0x6c/0x80
[0.300019]  [] ? mcount_call+0x5/0x9
[0.300019]  [] ? register_irq_proc+0xe/0x8b
[0.300019]  [] ? load_elf_binary+0x818/0x9e0
[0.300019]  [] ? kernel_init+0x0/0x6c
[0.300019]  [] do_basic_setup+0x21/0x23
[0.300019]  [] kernel_init+0x31/0x6c
[0.300019]  [] kernel_thread_helper+0x7/0x10
[0.300019]  ===
[0.300019] ---[ end trace ca143223eefdc828 ]---
[0.300019] irq event stamp: 1894
[0.300019] hardirqs last  enabled at (1893): [] 
trace_hardirqs_on+0xb/0xd
[0.300019] hardirqs last disabled at (1894): [] 
trace_hardirqs_off+0xb/0xd
[0.300019] softirqs last  enabled at (1184): [] 
__do_softirq+0x92/0x97
[0.300019] softirqs last disabled at (1175): [] 
do_softirq+0x2f/0x47

-- 
regards,
Dhaval


ftrace and kexec

2008-02-19 Thread Dhaval Giani
Hi,

I've been running ftrace on the sched-devel tree. I just built a kernel
and tried rebooting using kexec and I get this,

Please stand by while rebooting the system...
[11756.528997] Starting new kernel
[11741.142898] BUG: unable to handle kernel paging request at 8d2ed42c
[11741.142898] IP: [] ftrace_record_ip+0x2b/0x14f
[11741.142898] *pdpt = 29829001 *pde =  
[11741.142898] Oops: 0002 [#1] SMP 
[11741.142898] Modules linked in:
[11741.142898] 
[11741.142898] Pid: 16765, comm: kexec Not tainted (2.6.25-rc2-sched-devel #5)
[11741.142898] EIP: 0060:[] EFLAGS: 00010002 CPU: 0
[11741.142898] EIP is at ftrace_record_ip+0x2b/0x14f
[11741.142898] EAX: c0620760 EBX: f68a3470 ECX:  EDX: 
[11741.142898] ESI: c0117000 EDI: 238ef000 EBP: eb02be20 ESP: eb02be0c
[11741.142898]  DS: 0068 ES: 0068 FS: 0068 GS: 0068 SS: 0068
[11741.142898] Process kexec (pid: 16765, ti=eb02a000 task=f6ea0520 
task.ti=eb02a000)
[11741.142898] Stack: c0620760 c011541b f68a3470 c0117000 238ef000 eb02be48 
c0105980  
[11741.142898] c000 c011541b   e38ef000 
c0115430 eb02be90 
[11741.142898]c01154f1 c0620760 238ef000 c0116000 00634000 c0634000 
00637000 c0637000 
[11741.142898] Call Trace:
[11741.142898]  [] ? set_gdt+0xb/0x18
[11741.142898]  [] ? handle_vm86_fault+0x2ce/0x75f
[11741.142898]  [] ? mcount_call+0x5/0x9
[11741.142898]  [] ? set_gdt+0xb/0x18
[11741.142898]  [] ? load_segments+0x8/0x20
[11741.142898]  [] ? machine_kexec+0x93/0xaf
[11741.142898]  [] ? relocate_kernel+0x0/0x94
[11741.142898]  [] ? kernel_kexec+0x32/0x37
[11741.142898]  [] ? sys_reboot+0x13f/0x15e
[11741.142898]  [] ? unlock_page+0x2a/0x2d
[11741.142898]  [] ? __do_fault+0x33e/0x37e
[11741.142898]  [] ? poison_obj+0x23/0x40
[11741.142898]  [] ? do_linear_fault+0x42/0x49
[11741.142898]  [] ? handle_mm_fault+0x142/0x29c
[11741.142898]  [] ? do_page_fault+0x20a/0x453
[11741.142898]  [] ? __lock_release+0x23/0x56
[11741.142898]  [] ? do_page_fault+0x20a/0x453
[11741.142898]  [] ? up_read+0x1b/0x2e
[11741.142898]  [] ? do_page_fault+0x20a/0x453
[11741.142898]  [] ? trace_hardirqs_on_thunk+0xc/0x10
[11741.142898]  [] ? sysenter_past_esp+0x5f/0x99
[11741.142898]  ===
[11741.142898] Code: 55 89 e5 57 56 53 51 51 83 3d 80 fd 84 c0 00 89 45 f0 0f 
84 30 01 00 00 c7 45 ec 60 07 62 c0 64 8b 15 10 f1 61 c0 b8 60 07 62 c0  04 
10 64 8b 15 1 
[11741.142898] EIP: [] ftrace_record_ip+0x2b/0x14f SS:ESP 
0068:eb02be0c
[11741.142898] Kernel panic - not syncing: Fatal exception

.config is here.

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25-rc2
# Tue Feb 19 13:59:40 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=16
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_NS is not set
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y
# CONFIG_CGROUP_CPUACCT is not set
# CONFIG_RESOURCE_COUNTERS is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID1

Re: sched-devel latencies

2008-02-18 Thread Dhaval Giani
On Mon, Feb 18, 2008 at 04:19:33PM +0530, Dhaval Giani wrote:
> Hi Ingo,
> 
> I am running the sched-devel tree (at HEAD
> 44e770a8750abc7e876076cda718b413bad9e654) and it is not looking good.
> 
> I am running two "make -j"s for the kernel in two different cgroups and
> interactivity is going for a toss. I can see noticable lags in
> keypresses.
> 
> Will get down to debugging it further a bit later on.
> 

Some more numbers, with exact scenario

1. Mount the cgroup
2. Make 3 groups
3. Start kernbench in each group
4. Start chew.

This is the chew output from the root cgroup 

[EMAIL PROTECTED] dhaval]# ./chew2 
pid 29345 preempted 544115 us after 1560 us
pid 29345 preempted 588109 us after 3935 us
pid 29345 preempted 632122 us after 3941 us
pid 29345 preempted 794259 us after 3954 us
pid 29345 preempted 972163 us after 3963 us
pid 29345 preempted 1024219 us after 3942 us

From within one of the groups
[EMAIL PROTECTED] dhaval]# ./chew2 
pid 27961 preempted 4028 us after 1708 us
pid 27961 preempted 28090 us after 5466 us
pid 27961 preempted 52021 us after 6505 us
pid 27961 preempted 56100 us after 7183 us
pid 27961 preempted 61850 us after 7505 us
pid 27961 preempted 131892 us after 59 us
pid 27961 preempted 332112 us after 7607 us

-- 
regards,
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


sched-devel latencies

2008-02-18 Thread Dhaval Giani
Hi Ingo,

I am running the sched-devel tree (at HEAD
44e770a8750abc7e876076cda718b413bad9e654) and it is not looking good.

I am running two "make -j"s for the kernel in two different cgroups and
interactivity is going for a toss. I can see noticeable lags in
keypresses.

Will get down to debugging it further a bit later on.

-- 
regards,
Dhaval


Re: 2.6.25-rc2: Reported regressions from 2.6.24

2008-02-18 Thread Dhaval Giani
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=9982
> Subject   : 2.6.25-rc1 panics on boot
> Submitter : Dhaval Giani <[EMAIL PROTECTED]>
> Date  : 2008-02-13 18:03
> References: http://lkml.org/lkml/2008/2/13/363
> Handled-By: Chris Snook <[EMAIL PROTECTED]>

Hi Rafael,

A fix was proposed and accepted at
http://bugzilla.kernel.org/attachment.cgi?id=14832&action=view .

The bug has been marked as resolved. (You might want to modify your
script to handle such cases.)

-- 
regards,
Dhaval


Re: [RFC][PATCH 0/2] reworking load_balance_monitor

2008-02-18 Thread Dhaval Giani
On Thu, Feb 14, 2008 at 04:57:24PM +0100, Peter Zijlstra wrote:
> Hi,
> 
> Here the current patches that rework load_balance_monitor.
> 
> The main reason for doing this is to eliminate the wakeups the thing 
> generates,
> esp. on an idle system. The bonus is that it removes a kernel thread.
> 

Hi Peter,

The changes look really good to me. I will give it a run sometime and
give some more feedback.

-- 
regards,
Dhaval


Re: 2.6.25-rc1 panics on boot

2008-02-13 Thread Dhaval Giani
On Thu, Feb 14, 2008 at 12:06:31PM +0530, Dhaval Giani wrote:
> On Wed, Feb 13, 2008 at 10:32:02PM -0800, Yinghai Lu wrote:
> > On Wed, Feb 13, 2008 at 10:20 PM, Dhaval Giani
> > <[EMAIL PROTECTED]> wrote:
> > > On Wed, Feb 13, 2008 at 01:08:42PM -0500, Chris Snook wrote:
> > >  > Dhaval Giani wrote:
> > >  >> I am getting the following oops on bootup on 2.6.25-rc1
> > >  > ...
> > >  >> I am booting using kexec with maxcpus=1. It does not have any problems
> > >  >> with maxcpus=2 or higher.
> > >  >
> > >  > Sounds like another (the same?) kexec cpu numbering bug.  Can you 
> > > post/link
> > >  > the entire dmesg from both a cold boot and a kexec boot so we can 
> > > compare?
> > >  >
> > >
> > >  Don't think its a kexec bug. Get the same on cold boot. dmesg from kexec 
> > > boot.
> > 
> > how about without "[EMAIL PROTECTED] nmi_watchdog=2"
> > 
> > also does intel cpu support nmi_watchdog=2?
> > 
> 
> Yes it does. I've used it to get some useful debug information. I will try
> that out.
> 

Panics at same point.

-- 
regards,
Dhaval


Re: 2.6.25-rc1 panics on boot

2008-02-13 Thread Dhaval Giani
On Wed, Feb 13, 2008 at 10:32:02PM -0800, Yinghai Lu wrote:
> On Wed, Feb 13, 2008 at 10:20 PM, Dhaval Giani
> <[EMAIL PROTECTED]> wrote:
> > On Wed, Feb 13, 2008 at 01:08:42PM -0500, Chris Snook wrote:
> >  > Dhaval Giani wrote:
> >  >> I am getting the following oops on bootup on 2.6.25-rc1
> >  > ...
> >  >> I am booting using kexec with maxcpus=1. It does not have any problems
> >  >> with maxcpus=2 or higher.
> >  >
> >  > Sounds like another (the same?) kexec cpu numbering bug.  Can you 
> > post/link
> >  > the entire dmesg from both a cold boot and a kexec boot so we can 
> > compare?
> >  >
> >
> >  Don't think its a kexec bug. Get the same on cold boot. dmesg from kexec 
> > boot.
> 
> how about without "[EMAIL PROTECTED] nmi_watchdog=2"
> 
> also does intel cpu support nmi_watchdog=2?
> 

Yes it does. I've used it to get some useful debug information. I will try
that out.

> YH

-- 
regards,
Dhaval


Re: 2.6.25-rc1 panics on boot

2008-02-13 Thread Dhaval Giani
On Wed, Feb 13, 2008 at 01:08:42PM -0500, Chris Snook wrote:
> Dhaval Giani wrote:
>> I am getting the following oops on bootup on 2.6.25-rc1
> ...
>> I am booting using kexec with maxcpus=1. It does not have any problems
>> with maxcpus=2 or higher.
>
> Sounds like another (the same?) kexec cpu numbering bug.  Can you post/link 
> the entire dmesg from both a cold boot and a kexec boot so we can compare?
>

I don't think it's a kexec bug; I get the same on a cold boot. dmesg from the kexec boot follows.

[0.00] Linux version 2.6.25-rc1 ([EMAIL PROTECTED]) (gcc version 3.4.4 
20050721 (Red Hat 3.4.4-2)) #5 SMP Thu Feb 14 06:46:02 IST 2008
[0.00] BIOS-provided physical RAM map:
[0.00]  BIOS-e820: 0100 - 0009dc00 (usable)
[0.00]  BIOS-e820: 0009dc00 - 000a (reserved)
[0.00]  BIOS-e820: 0010 - e97f5f00 (usable)
[0.00]  BIOS-e820: e97f5f00 - e97ff800 (ACPI data)
[0.00]  BIOS-e820: e97ff800 - e980 (reserved)
[0.00]  BIOS-e820: fec0 - 0001 (reserved)
[0.00]  BIOS-e820: 0001 - 00014000 (usable)
[0.00] 4224MB HIGHMEM available.
[0.00] 896MB LOWMEM available.
[0.00] Scan SMP from c000 for 1024 bytes.
[0.00] Scan SMP from c009fc00 for 1024 bytes.
[0.00] Scan SMP from c00f for 65536 bytes.
[0.00] Scan SMP from c009dc00 for 1024 bytes.
[0.00] found SMP MP-table at [c009dd40] 0009dd40
[0.00] Reserving 64MB of memory at 16MB for crashkernel (System RAM: 
5111MB)
[0.00] Zone PFN ranges:
[0.00]   DMA 0 -> 4096
[0.00]   Normal   4096 ->   229376
[0.00]   HighMem229376 ->  1310720
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[1] active PFN ranges
[0.00] 0:0 ->  1310720
[0.00] DMI 2.3 present.
[0.00] Using APIC driver default
[0.00] ACPI: RSDP 000FDD90, 0014 (r0 IBM   )
[0.00] ACPI: RSDT E97FF780, 0030 (r1 IBMSERONYXP 1000 IBM  
45444F43)
[0.00] ACPI: FACP E97FF700, 0074 (r1 IBMSERONYXP 1000 IBM  
45444F43)
[0.00] ACPI: DSDT E97F5F00, 962E (r1 IBMSERAVATR 1000 MSFT  
10B)
[0.00] ACPI: FACS E97FF5C0, 0040
[0.00] ACPI: APIC E97FF600, 00CA (r1 IBMSERONYXP 1000 IBM  
45444F43)
[0.00] ACPI: ASF! E97FF540, 004B (r16 IBMSERONYXP1 IBM  
45444F43)
[0.00] ACPI: PM-Timer IO Port: 0x488
[0.00] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[0.00] Processor #0 15:2 APIC version 20
[0.00] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x02] enabled)
[0.00] Processor #2 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x04] enabled)
[0.00] Processor #4 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x06] enabled)
[0.00] Processor #6 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x01] enabled)
[0.00] Processor #1 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC (acpi_id[0x05] lapic_id[0x03] enabled)
[0.00] Processor #3 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC (acpi_id[0x06] lapic_id[0x05] enabled)
[0.00] Processor #5 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC (acpi_id[0x07] lapic_id[0x07] enabled)
[0.00] Processor #7 15:2 APIC version 20
[0.00] WARNING: maxcpus limit of 1 reached. Processor ignored.
[0.00] ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x04] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x06] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x03] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x05] dfl dfl lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x07] dfl dfl lint[0x1])
[0.00] ACPI: IOAPIC (id[0x0e] address[0xfec0] gsi_base[0])
[0.00] IOAPIC[0]: apic_id 14, version 17, address 0xfec0, GSI 0-15
[0.00] ACPI: IOAPIC (id[0x0d] address[0xfec01000] gsi_base[16])
[0.00] IOAPIC[1]: apic_id 13, version 17, address 0xfec01000, GSI 16-31
[0.00] ACPI: IOAPIC (id[0x0c] address[0xfec02000] gsi_base[32])
[0.00] IOAPIC[2]: 

2.6.25-rc1 panics on boot

2008-02-13 Thread Dhaval Giani
Hi,

I am getting the following oops on bootup on 2.6.25-rc1

[2.376187] BUG: unable to handle kernel NULL pointer dereference at 010c
[2.388180] IP: [] sysfs_remove_link+0x1/0xd
[2.396182] *pdpt = 005fd001 *pde =  
[2.404751] Oops:  [#1] SMP 
[2.408179] Modules linked in:
[2.408179] 
[2.408179] Pid: 1, comm: swapper Not tainted (2.6.25-rc1 #3)
[2.408179] EIP: 0060:[] EFLAGS: 00010206 CPU: 0
[2.408179] EIP is at sysfs_remove_link+0x1/0xd
[2.408179] EAX: 00f0 EBX: f7202cc8 ECX: f789eaf0 EDX: c0533e87
[2.408179] ESI: f793c970 EDI: ffed EBP: f78a1ea0 ESP: f78a1e90
[2.408179]  DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
[2.408179] Process swapper (pid: 1, ti=f78a task=f789eaf0 
task.ti=f78a)
[2.408179] Stack: f78a1ea0 c02709e4 c0568920 f793c970 f78a1eb4 c0269fb5 
f793c970 f793cb7c 
[2.408179]c0568920 f78a1ecc c0269b25  f793cb7c  
c05689f0 f78a1ee0 
[2.408179]c02b6460 c05689f0 f793cb7c  f78a1ef4 c02b6528 
f793cb7c f78c332c 
[2.408179] Call Trace:
[2.408179]  [] ? acpi_processor_remove+0x82/0xb4
[2.408179]  [] ? acpi_start_single_object+0x3a/0x41
[2.408179]  [] ? acpi_device_probe+0x3b/0x79
[2.408179]  [] ? really_probe+0x74/0xf2
[2.408179]  [] ? driver_probe_device+0x37/0x40
[2.408179]  [] ? __driver_attach+0x76/0xaf
[2.408179]  [] ? bus_for_each_dev+0x38/0x5d
[2.408179]  [] ? kobject_init_and_add+0x20/0x22
[2.408179]  [] ? driver_attach+0x14/0x16
[2.408179]  [] ? __driver_attach+0x0/0xaf
[2.408179]  [] ? bus_add_driver+0x99/0x149
[2.408179]  [] ? driver_register+0x43/0x69
[2.408179]  [] ? acpi_bus_register_driver+0x3a/0x3c
[2.408179]  [] ? acpi_processor_init+0x70/0xa6
[2.408179]  [] ? kernel_init+0x0/0x88
[2.408179]  [] ? do_initcalls+0x75/0x18d
[2.408179]  [] ? create_proc_entry+0x67/0x7b
[2.408179]  [] ? register_irq_proc+0xa4/0xba
[2.408179]  [] ? pagemap_read+0x13a/0x1c2
[2.408179]  [] ? kernel_init+0x0/0x88
[2.408179]  [] ? do_basic_setup+0x1c/0x1e
[2.408179]  [] ? kernel_init+0x4d/0x88
[2.408179]  [] ? kernel_thread_helper+0x7/0x10
[2.408179]  ===
[2.408179] Code: c0 74 07 89 f0 e8 57 f4 ff ff 85 ff 74 11 90 ff 0f 0f 94 
c0 84 c0 74 07 89 f8 e8 42 f4 ff ff 8b 45 d8 83 c4 1c 5b 5e 5f 5d c3 55 <8b> 40 
1c 89 e5 e8 c 
[2.408179] EIP: [] sysfs_remove_link+0x1/0xd SS:ESP 0068:f78a1e90
[2.408191] ---[ end trace 778e504de7e3b1e3 ]---
[2.412183] Kernel panic - not syncing: Attempted to kill init!

I am booting using kexec with maxcpus=1. It does not have any problems
with maxcpus=2 or higher.

config

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25-rc1
# Wed Feb 13 17:30:43 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=16
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_NS is not set
# CONFIG_CPUSETS is not set
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_FAIR_USER_SCHED=y
# CONFIG_FAIR_CGROUP_SCHED is not set
# CONFIG_CGROUP_CPUACCT is not set
# CONFIG_RESOURCE_COUNTERS is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y
CONFIG_NAMESPAC

Re: Regression in latest sched-git

2008-02-13 Thread Dhaval Giani
On Wed, Feb 13, 2008 at 10:04:44PM +0530, Dhaval Giani wrote:
> > > On the same lines, I cant understand how we can be seeing 700ms latency
> > > (below) unless we had: large number of active groups/users and large 
> > > number of 
> > > tasks within each group/user.
> > 
> > All I can say it that its trivial to reproduce these horrid latencies.
> > 
> 
> Hi Peter,
> 
> I've been trying to reproduce the latencies, and the worst I have
> managed only 80ms. At an average I am getting around 60 ms. This is with
> a make -j4 as root, and dhaval running other programs. (with maxcpus=1).
> 

I'm completely missing those latencies here. Any more hints to reproduce?

-- 
regards,
Dhaval


Re: Regression in latest sched-git

2008-02-13 Thread Dhaval Giani
On Wed, Feb 13, 2008 at 01:51:18PM +0100, Peter Zijlstra wrote:
> 
> On Wed, 2008-02-13 at 08:30 +0530, Srivatsa Vaddagiri wrote:
> > On Tue, Feb 12, 2008 at 08:40:08PM +0100, Peter Zijlstra wrote:
> > > Yes, latency isolation is the one thing I had to sacrifice in order to
> > > get the normal latencies under control.
> > 
> > Hi Peter,
> > I don't have easy solution in mind either to meet both fairness
> > and latency goals in a acceptable way.
> 
> Ah, do be careful with 'fairness' here. The single RQ is fair wrt cpu
> time, just not quite as 'fair' wrt to latency.
> 
> > But I am puzzled at the max latency numbers you have provided below:
> > 
> > > The problem with the old code is that under light load: a kernel make
> > > -j2 as root, under an otherwise idle X session, generates latencies up
> > > to 120ms on my UP laptop. (uid grouping; two active users: peter, root).
> > 
> > If it was just two active users, then max latency should be:
> > 
> > latency to schedule user entity (~10ms?) +
> > latency to schedule task within that user 
> > 
> > 20-30 ms seems more reaonable max latency to expect in this scenario.
> > 120ms seems abnormal, unless the user had large number of tasks.
> > 
> > On the same lines, I cant understand how we can be seeing 700ms latency
> > (below) unless we had: large number of active groups/users and large number 
> > of 
> > tasks within each group/user.
> 
> All I can say it that its trivial to reproduce these horrid latencies.
> 

Hi Peter,

I've been trying to reproduce the latencies, and the worst I have
managed only 80ms. At an average I am getting around 60 ms. This is with
a make -j4 as root, and dhaval running other programs. (with maxcpus=1).

> As for Ingo's setup, the worst that he does is run distcc with (32?)
> instances on that machine - and I assume he has that user niced waay
> down.
> 
> > > Others have reported latencies up to 300ms, and Ingo found a 700ms
> > > latency on his machine.
> > > 
> > > The source for this problem is I think the vruntime driven wakeup
> > > preemption (but I'm not quite sure). The other things that rely on
> > > global vruntime are sleeper fairness and yield. Now while I can't
> > > possibly care less about yield, the loss of sleeper fairness is somewhat
> > > sad (NB. turning it off with the old group scheduling does improve life
> > > somewhat).
> > > 
> > > So my first attempt at getting a global vruntime was flattening the
> > > whole RQ structure, you can see that patch in sched.git (I really ought
> > > to have posted that, will do so tomorrow).
> > 
> > We will do some exhaustive testing with this approach. My main concern
> > with this is that it may compromise the level of isolation between two
> > groups (imagine one group does a fork-bomb and how it would affect
> > fairness for other groups).
> 
> Again, be careful with the fairness issue. CPU time should still be
> fair, but yes, other groups might experience some latencies.
> 

I know I am missing something, but aren't we trying to reduce latencies
here?

-- 
regards,
Dhaval


Regression in latest sched-git

2008-02-12 Thread Dhaval Giani
Hi Ingo,

I've been running the latest sched-git through some tests. Here is
essentially what I am doing,

1. Mount the control group
2. Create 3-4 groups
3. Start kernbench inside each group
4. Run cpu hogs in each group

Essentially the idea is to see how the system responds under extreme CPU
load.

This is what I get (and this is in a shell which belongs to the root
group)
[EMAIL PROTECTED] ~]# time sleep 1

real0m1.212s
user0m0.004s
sys 0m0.000s
[EMAIL PROTECTED] ~]# time sleep 1

real0m1.200s
user0m0.000s
sys 0m0.004s
[EMAIL PROTECTED] ~]# time sleep 1

real0m1.266s
user0m0.000s
sys 0m0.000s
[EMAIL PROTECTED] ~]# time sleep 1

real0m1.113s
user0m0.000s
sys 0m0.000s
[EMAIL PROTECTED] ~]# 

On the sched-devel tree that I have, the same gives me following
results.

[EMAIL PROTECTED] ~]# time sleep 1

real0m1.057s
user0m0.000s
sys 0m0.004s
[EMAIL PROTECTED] ~]# time sleep 1

real0m1.038s
user0m0.000s
sys 0m0.004s
[EMAIL PROTECTED] ~]# time sleep 1

real0m1.075s
user0m0.000s
sys 0m0.000s
[EMAIL PROTECTED] ~]# time sleep 1

real0m1.071s
user0m0.000s
sys 0m0.000s
[EMAIL PROTECTED] ~]# time sleep 1

real0m1.073s
user0m0.000s
sys 0m0.004s
[EMAIL PROTECTED] ~]# time sleep 1

real0m1.055s
user0m0.000s
sys 0m0.004s

I agree this is not a great test. It's getting a bit late here. I
will get some better test case tomorrow morning (and if you have some, I
can try those as well). I just did not want the tree to get merged in
without further discussion.

-- 
regards,
Dhaval


Re: [PATCH 12/11] sched: rt-group: uid-group interface

2008-02-06 Thread Dhaval Giani
On Wed, Jan 09, 2008 at 04:05:31PM -0800, Greg KH wrote:
> > > > Ingo, Greg,
> > > > 
> > > > What would be the easiest way to carry this forward? sched-devel and
> > > > greg's tree would intersect at this point and leave poor akpm with the
> > > > resulting mess. Should I just make an incremental patch akpm can carry
> > > > and push? Or can we base one tree off the other?
> > > 
> > > If it's just a single patch for this, I'd be glad to take it.  But by 
> > > looking at the [11/12] above, I doubt this is so...
> > > 
> > > If it's not that rough (12 patches is not a big deal), I'd be glad to 
> > > take these through my tree, after you fix up Kay's requests above :)
> > 
> > hm, i'd really like to see this tested and go through sched.git. It's 
> > only the few sysfs bits which interfere, right?
> 
> Yes, that should be it.
> 
> So why not put the majority of this through sched.git, then when my
> sysfs changes go in at the beginning of the .25 merge cycle, you can
> then add the sysfs changes through your tree or anywhere else.
> 

Hi,

I was wondering where these changes are right now. I don't see the sysfs
interface for rt-group-sched in mainline right now.

Thanks,
-- 
regards,
Dhaval


Re: {2.6.22.y} quicklists must keep even off node pages on the quicklists until the TLB flush has been completed.

2008-02-05 Thread Dhaval Giani
On Tue, Feb 05, 2008 at 10:06:02PM +0100, Oliver Pinter wrote:
> it is already im queue for 2.6.23,
> 
> 8<-
> >From [EMAIL PROTECTED] Sat Dec 22 14:04:08 2007
> From: Christoph Lameter <[EMAIL PROTECTED]>
> Date: Sat, 22 Dec 2007 14:03:23 -0800
> Subject: quicklists: do not release off node pages early
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED], [EMAIL PROTECTED],
> [EMAIL PROTECTED], [EMAIL PROTECTED]
> Message-ID: <[EMAIL PROTECTED]>
> 
> 
> From: Christoph Lameter <[EMAIL PROTECTED]>
> 
> patch ed367fc3a7349b17354c7acef55157764859 in mainline.
> 
> quicklists must keep even off node pages on the quicklists until the TLB
> flush has been completed.
> 
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> Cc: Dhaval Giani <[EMAIL PROTECTED]>
> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
> Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
> Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]>
> 
> ---
>  include/linux/quicklist.h |8 
>  1 file changed, 8 deletions(-)
> 
> --- a/include/linux/quicklist.h
> +++ b/include/linux/quicklist.h
> @@ -56,14 +56,6 @@ static inline void __quicklist_free(int
>   struct page *page)
>  {
>   struct quicklist *q;
> - int nid = page_to_nid(page);
> -
> - if (unlikely(nid != numa_node_id())) {
> - if (dtor)
> - dtor(p);
> - __free_page(page);
> - return;
> - }
> 
>   q = &get_cpu_var(quicklist)[nr];
>   *(void **)p = q->page;
> 
> >8--
> Tested-by: Oliver Pinter <[EMAIL PROTECTED]> (on i386)
> 

Christoph,

Is this one also supposed to be backported?

-- 
regards,
Dhaval


Re: OOM killer on idle machine

2008-02-05 Thread Dhaval Giani
usd:  47
> Feb  4 16:35:56 kernel: CPU2: Hot: hi:  186, btch:  31 usd:  25   Cold: 
> hi:   62, btch:  15 usd:  50
> Feb  4 16:35:56 kernel: CPU3: Hot: hi:  186, btch:  31 usd:   7   Cold: 
> hi:   62, btch:  15 usd:  59
> Feb  4 16:35:56 kernel: HighMem per-cpu:
> Feb  4 16:35:56 kernel: CPU0: Hot: hi:  186, btch:  31 usd:   6   Cold: 
> hi:   62, btch:  15 usd:   6
> Feb  4 16:35:56 kernel: CPU1: Hot: hi:  186, btch:  31 usd:  15   Cold: 
> hi:   62, btch:  15 usd:  14
> Feb  4 16:35:56 kernel: CPU2: Hot: hi:  186, btch:  31 usd:  98   Cold: 
> hi:   62, btch:  15 usd:   4
> Feb  4 16:35:56 kernel: CPU3: Hot: hi:  186, btch:  31 usd:  47   Cold: 
> hi:   62, btch:  15 usd:   3
> Feb  4 16:35:56 kernel: Active:13145 inactive:7163 dirty:0 writeback:0 
> unstable:0
> Feb  4 16:35:56 kernel:  free:1847537 slab:4426 mapped:3787 pagetables:360 
> bounce:0
> Feb  4 16:35:56 kernel: DMA free:3560kB min:68kB low:84kB high:100kB 
> active:8kB inactive:0kB present:16256kB pages_scanned:108 all_unreclaimable? 
> yes
> Feb  4 16:35:56 kernel: lowmem_reserve[]: 0 873 9636 9636
> Feb  4 16:35:56 kernel: Normal free:3980kB min:3744kB low:4680kB high:5616kB 
> active:5396kB inactive:5008kB present:894080kB pages_scanned:20561 
> all_unreclaimable? yes
> Feb  4 16:35:56 kernel: lowmem_reserve[]: 0 0 70104 70104
> Feb  4 16:35:56 kernel: HighMem free:7382608kB min:512kB low:9912kB 
> high:19316kB active:47176kB inactive:23644kB present:8973312kB 
> pages_scanned:0 all_unreclaimable? no
> Feb  4 16:35:56 kernel: lowmem_reserve[]: 0 0 0 0
> Feb  4 16:35:56 kernel: DMA: 2*4kB 4*8kB 2*16kB 1*32kB 0*64kB 1*128kB 1*256kB 
> 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3560kB
> Feb  4 16:35:56 kernel: Normal: 75*4kB 10*8kB 21*16kB 4*32kB 3*64kB 1*128kB 
> 1*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 3980kB
> Feb  4 16:35:56 kernel: HighMem: 1036*4kB 2516*8kB 1914*16kB 1571*32kB 
> 1176*64kB 785*128kB 533*256kB 336*512kB 230*1024kB 148*2048kB 1527*4096kB = 
> 7382608kB
> Feb  4 16:35:56 kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
> Feb  4 16:35:56 kernel: Free swap  = 16777208kB
> Feb  4 16:35:56 kernel: Total swap = 16777208kB
> Feb  4 16:35:56 kernel: Free swap:   16777208kB
> Feb  4 16:35:56 kernel: 2490368 pages of RAM
> Feb  4 16:35:56 kernel: 2260992 pages of HIGHMEM
> Feb  4 16:35:56 kernel: 415332 reserved pages
> Feb  4 16:35:56 kernel: 25392 pages shared
> Feb  4 16:35:56 kernel: 0 pages swap cached
> Feb  4 16:35:56 kernel: 0 pages dirty
> Feb  4 16:35:56 kernel: 0 pages writeback
> Feb  4 16:35:56 kernel: 3787 pages mapped
> Feb  4 16:35:56 kernel: 4426 pages slab
> Feb  4 16:35:56 kernel: 360 pages pagetables
> 

Looks like quicklists again. Can you try patching your kernel with the
following patch? It should be in the next stable release.


commit 96990a4ae979df9e235d01097d6175759331e88c
Author: Christoph Lameter <[EMAIL PROTECTED]>
Date:   Mon Jan 14 00:55:14 2008 -0800

quicklists: Only consider memory that can be used with GFP_KERNEL

Quicklists calculates the size of the quicklists based on the number of
free pages.  This must be the number of free pages that can be allocated
with GFP_KERNEL.  node_page_state() includes the pages in ZONE_HIGHMEM and
ZONE_MOVABLE which may lead the quicklists to become too large causing OOM.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Tested-by: Dhaval Giani <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

diff --git a/mm/quicklist.c b/mm/quicklist.c
index ae8189c..3f703f7 100644
--- a/mm/quicklist.c
+++ b/mm/quicklist.c
@@ -26,9 +26,17 @@ DEFINE_PER_CPU(struct quicklist, quicklist)[CONFIG_NR_QUICK];
 static unsigned long max_pages(unsigned long min_pages)
 {
unsigned long node_free_pages, max;
+   struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+
+   node_free_pages =
+#ifdef CONFIG_ZONE_DMA
+   zone_page_state(&zones[ZONE_DMA], NR_FREE_PAGES) +
+#endif
+#ifdef CONFIG_ZONE_DMA32
+   zone_page_state(&zones[ZONE_DMA32], NR_FREE_PAGES) +
+#endif
+   zone_page_state(&zones[ZONE_NORMAL], NR_FREE_PAGES);
 
-   node_free_pages = node_page_state(numa_node_id(),
-   NR_FREE_PAGES);
max = node_free_pages / FRACTION_OF_NODE_MEM;
return max(max, min_pages);
 }

Thanks,
-- 
regards,
Dhaval


Re: OOM-killer invoked but why ?

2008-02-05 Thread Dhaval Giani
On Tue, Feb 05, 2008 at 02:07:37AM -0800, Andrew Morton wrote:
> On Thu, 31 Jan 2008 13:53:05 +0100 Claude Frantz <[EMAIL PROTECTED]> wrote:
> 
> > Hello !
> > 
> > I'm faced to a problem where the OOM-killer is invoked but I cannot find
> > the reason why. The machine is rather powerfull, the load is very moderate,
> > the disk swap space is nearly unused. The only strange observation which
> > appears to me is the slow but progressive decreasing of kbbuffers during
> > many hours.
> > 
> > Can you help me to diagnose the problem and to find a good solution ?
> > 
> > ...
> >
> > Jan 28 03:50:49 toaster kernel: 177466 pages slab
> > Jan 28 03:50:49 toaster kernel: 1915 pages pagetables
> > Jan 28 03:50:49 toaster kernel: Out of memory: kill process 10859 (amavisd) 
> > score 36218 or a child
> > Jan 28 03:50:49 toaster kernel: Killed process 19146 (amavisd)
> 
> slab.  Maybe you've been bitten by the quicklist leak.  If you're able to
> patch your kernel then please try this fix:
> 
> commit 96990a4ae979df9e235d01097d6175759331e88c
> Author: Christoph Lameter <[EMAIL PROTECTED]>
> Date:   Mon Jan 14 00:55:14 2008 -0800
> 
> quicklists: Only consider memory that can be used with GFP_KERNEL
> 
> Quicklists calculates the size of the quicklists based on the number of
> free pages.  This must be the number of free pages that can be allocated
> with GFP_KERNEL.  node_page_state() includes the pages in ZONE_HIGHMEM and
>     ZONE_MOVABLE which may lead the quicklists to become too large causing 
> OOM.
> 
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> Tested-by: Dhaval Giani <[EMAIL PROTECTED]>
> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
> Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
> 
> diff --git a/mm/quicklist.c b/mm/quicklist.c
> index ae8189c..3f703f7 100644
> --- a/mm/quicklist.c
> +++ b/mm/quicklist.c
> @@ -26,9 +26,17 @@ DEFINE_PER_CPU(struct quicklist, 
> quicklist)[CONFIG_NR_QUICK];
>  static unsigned long max_pages(unsigned long min_pages)
>  {
>   unsigned long node_free_pages, max;
> + struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
> +
> + node_free_pages =
> +#ifdef CONFIG_ZONE_DMA
> + zone_page_state(&zones[ZONE_DMA], NR_FREE_PAGES) +
> +#endif
> +#ifdef CONFIG_ZONE_DMA32
> + zone_page_state(&zones[ZONE_DMA32], NR_FREE_PAGES) +
> +#endif
> + zone_page_state(&zones[ZONE_NORMAL], NR_FREE_PAGES);
> 
> - node_free_pages = node_page_state(numa_node_id(),
> - NR_FREE_PAGES);
>   max = node_free_pages / FRACTION_OF_NODE_MEM;
>   return max(max, min_pages);
>  }
> 
> 
> I note that this didn't have the [EMAIL PROTECTED] cc.  Christoph, did we
> deliberately decide not to backport?
> 

According to
http://archive.netbsd.se/?ml=linux-stable-commits&a=2008-01&m=6134301 ,
it's been added to the stable tree. I remember asking Greg to add it.

Thanks
-- 
regards,
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Default child of a cgroup

2008-01-31 Thread Dhaval Giani
On Thu, Jan 31, 2008 at 09:37:42PM +0100, Peter Zijlstra wrote:
> 
> On Thu, 2008-01-31 at 23:39 +0530, Balbir Singh wrote:
> > Srivatsa Vaddagiri wrote:
> > > Hi,
> > >   As we were implementing multiple-hierarchy support for CPU
> > > controller, we hit some oddities in its implementation, partly related
> > > to current cgroups implementation. Peter and I have been debating on the 
> > > exact solution and I thought of bringing that discussion to lkml.
> > > 
> > > Consider the cgroup filesystem structure for managing cpu resource.
> > > 
> > >   # mount -t cgroup -ocpu,cpuacct none /cgroup
> > >   # mkdir /cgroup/A
> > >   # mkdir /cgroup/B
> > >   # mkdir /cgroup/A/a1
> > > 
> > > will result in:
> > > 
> > >   /cgroup
> > >  |--
> > >  |--
> > >  |--
> > >  |
> > >  |[A]
> > >  | |
> > >  | |
> > >  | |
> > >  | |
> > >  | |---[a1]
> > >  |   |
> > >  |   |
> > >  |   |
> > >  |   |
> > >  |
> > >  |[B]
> > >  | |
> > >  | |
> > >  | |
> > >  | 
> > > 
> > > 
> > > Here are some questions that arise in this picture:
> > > 
> > > 1. What is the relationship of the task-group in A/tasks with the
> > >task-group in A/a1/tasks? In otherwords do they form siblings
> > >of the same parent A?
> > > 
> > 
> > I consider them to be the same relationship between directories and files.
> > A/tasks are siblings of A/a1 and A/other children, *but* the entities of
> > interest are A and A/a1.
> > 
> > > 2. Somewhat related to the above question, how much resource should the 
> > >task-group A/a1/tasks get in relation to A/tasks? Is it 1/2 of parent
> > >A's share or 1/(1 + N) of parent A's share (where N = number of tasks
> > >in A/tasks)?
> > > 
> > 
> > I propose that it gets 1/2 of the bandwidth, here is why
> > 
> > 1. Assume that a task in A/tasks forks 1000 children, what happens to the
> > bandwidth of A/a1's tasks then? We have no control over how many tasks can 
> > be
> > created on A/tasks as a consequence of moving one task to A/tasks. Doing it 
> > the
> > other way would mean, that A/a1/tasks will get 1/1001 of the bandwidth 
> > (sounds
> > very unfair and prone to Denial of Service/Fairness)
> 
> And I oppose this, it means not all siblings are treated equal. Also, I
> miss the story of the 'hidden' group here. The biggest objection is this
> hidden group with no direct controls.
> 
> My proposal is to make it a hard constraint, either a group has task
> children or a group has group children, but not mixed. That keeps the
> interface explicit and doesn't hide the tricks we play.
> 

That is one solution. Otherwise you provide the controls for the hidden
group. (Namely the shares and the rt_ratio). I've been experimenting
with this approach recently.



> > > Note that user cannot create subdirectories under def_child with this
> > > scheme! I am also not sure what impact this will have on other resources
> > > like cpusets ..
> > > 

I'm not sure why it would affect other resources? The def_child is not
exposed to the cgroup filesystem. Could someone please explain it to me?

> > 
> > Which means we'll need special logic in the cgroup filesystem to handle
> > def_child. Not a very good idea.
> 
> agreed.

-- 
regards,
Dhaval


Re: [RFC] Default child of a cgroup

2008-01-31 Thread Dhaval Giani
On Thu, Jan 31, 2008 at 11:39:12PM +0530, Balbir Singh wrote:
> Srivatsa Vaddagiri wrote:
> > Hi,
> > As we were implementing multiple-hierarchy support for CPU
> > controller, we hit some oddities in its implementation, partly related
> > to current cgroups implementation. Peter and I have been debating on the 
> > exact solution and I thought of bringing that discussion to lkml.
> > 
> > Consider the cgroup filesystem structure for managing cpu resource.
> > 
> > # mount -t cgroup -ocpu,cpuacct none /cgroup
> > # mkdir /cgroup/A
> > # mkdir /cgroup/B
> > # mkdir /cgroup/A/a1
> > 
> > will result in:
> > 
> > /cgroup
> >|--
> >|--
> >|--
> >|
> >|[A]
> >| |
> >| |
> >| |
> >| |
> >| |---[a1]
> >|   |
> >|   |
> >|   |
> >|   |
> >|
> >|[B]
> >| |
> >| |
> >| |
> >| 
> > 
> > 
> > Here are some questions that arise in this picture:
> > 
> > 1. What is the relationship of the task-group in A/tasks with the
> >task-group in A/a1/tasks? In otherwords do they form siblings
> >of the same parent A?
> > 
> 
> I consider them to be the same relationship between directories and files.
> A/tasks are siblings of A/a1 and A/other children, *but* the entities of
> interest are A and A/a1.
> 
> > 2. Somewhat related to the above question, how much resource should the 
> >task-group A/a1/tasks get in relation to A/tasks? Is it 1/2 of parent
> >A's share or 1/(1 + N) of parent A's share (where N = number of tasks
> >in A/tasks)?
> > 
> 
> I propose that it gets 1/2 of the bandwidth, here is why
> 
> 1. Assume that a task in A/tasks forks 1000 children, what happens to the
> bandwidth of A/a1's tasks then? We have no control over how many tasks can be
> created on A/tasks as a consequence of moving one task to A/tasks. Doing it 
> the
> other way would mean, that A/a1/tasks will get 1/1001 of the bandwidth (sounds
> very unfair and prone to Denial of Service/Fairness)
> 
> 
> > 3. What should A/cpuacct.usage reflect? CPU usage of A/tasks? Or CPU usage
> >of all siblings put together? It can reflect only one, in which case
> >user has to manually derive the other component of the statistics.
> > 
> 
> It should reflect the accumulated usage of A's children and the tasks in A.
> 

I've been taking the root group as an example, and extending it. The
root group does not reflect the usage of all the tasks in it. (IIRC,
can't seem to find the stats file there now, please correct me if I am
wrong)

> > It seems to me that tasks in A/tasks form what can be called the
> > "default" child group of A, in which case:
> > 
> > 4. Modifications to A/cpu.shares should affect the parent or its default
> >child group (A/tasks)?
> > 
> > To avoid these ambiguities, it may be good if cgroup create this
> > "default child group" automatically whenever a cgroup is created?
> > Something like below (not the absence of tasks file in some directories
> > now):
> > 
> 
> I think the concept makes sense, but creating a default child is going to be
> confusing, as it is not really a child of A.
> 

For all practical purposes, it is the same as the init_task_group which
is at the parent level.

> > 
> > /cgroup
> >|
> >|--
> >|--
> >|
> >|---[def_child]
> >| |
> >| |
> >| |
> >| |
> >|
> >|[A]
> >| |
> >| |
> >| |
> >| |
> >| |---[def_child]
> >| | |
> >| | |
> >| | |
> >| | |
> >| | 
> >| |---[a1]
> >|   |
> >|   |
> >|   |
> >|   |
> >|   |---[def_child]
> >|   |   |---
> >|   |   |---
> >|   |   |---
> >|   |   |
> >|
> >|[B]
> >| |
> >| |
> >| |
> >| | 
> >| |---[def_child]
> >| | |
> >| | |
> >| | |
> >| | |
> > 
> > Note that user cannot create subdirectories under def_child with this
> > scheme! I am also not sure what impact this will have on other resources
> > like cpusets ..
> > 
> 
> Which means we'll need special logic in the cgroup filesystem to handle
> def_child. Not a very good idea.
> 

Not really. That issue would come into play if every task group was
assigned a control group. The task group is not exposed to the outside
world. (That's why it's a hidden task group.)

Thanks,
-- 

Re: [RFC] Default child of a cgroup

2008-01-31 Thread Dhaval Giani
On Thu, Jan 31, 2008 at 06:39:56PM -0800, Paul Menage wrote:
> On Jan 30, 2008 6:40 PM, Srivatsa Vaddagiri <[EMAIL PROTECTED]> wrote:
> >
> > Here are some questions that arise in this picture:
> >
> > 1. What is the relationship of the task-group in A/tasks with the
> >task-group in A/a1/tasks? In otherwords do they form siblings
> >of the same parent A?
> 
> I'd argue the same as Balbir - tasks in A/tasks are are children of A
> and are siblings of a1, a2, etc.
> 
> >
> > 2. Somewhat related to the above question, how much resource should the
> >task-group A/a1/tasks get in relation to A/tasks? Is it 1/2 of parent
> >A's share or 1/(1 + N) of parent A's share (where N = number of tasks
> >in A/tasks)?
> 
> Each process in A should have a scheduler weight that's derived from
> its static_prio field. Similarly each subgroup of A will have a
> scheduler weight that's determined by its cpu.shares value. So the cpu
> share of any child (be it a task or a subgroup) would be equal to its
> own weight divided by the sum of weights of all children.
> 
> So yes, if a task in A forks lots of children, those children could
> end up getting a disproportionate amount of the CPU compared to tasks
> in A/a1 - but that's the same as the situation without cgroups. If you
> want to control cpu usage between different sets of processes in A,
> they should be in sibling cgroups, not directly in A.
> 
> Is there a restriction in CFS that stops a given group from
> simultaneously holding tasks and sub-groups? If so, couldn't we change
> CFS to make it possible rather than enforcing awkward restructions on
> cgroups?
> 
> If we really can't change CFS in that way, then an alternative would
> be similar to Peter's suggestion - make cpu_cgroup_can_attach() fail
> if the cgroup has children, and make cpu_cgroup_create() fail if the
> cgroup has any tasks - that way you limit the restriction to just the
> hierarchy that has CFS attached to it, rather than generically for all
> cgroups
> 
> BTW, I noticed this code in cpu_cgroup_create():
> 
> /* we support only 1-level deep hierarchical scheduler atm */
> if (cgrp->parent->parent)
> return ERR_PTR(-EINVAL);
> 
> Is anyone working on allowing more levels?
> 

Yes, I am looking at it.

> Paul

-- 
regards,
Dhaval


Re: Regression with idle cpu cycle handling in 2.6.24 (compared to 2.6.22)

2008-01-19 Thread Dhaval Giani
On Sun, Jan 20, 2008 at 09:03:38AM +0530, Dhaval Giani wrote:
> On Sat, Jan 19, 2008 at 03:52:44PM +0100, dAniel hAhler wrote:
> > Hello,
> > 
> > I've now found the reason and a workaround for this. Apparently, it's
> > related to CONFIG_FAIR_USER_SCHED and can be worked around by
> > assigning a really small value to the boinc users cpu_share (125 is
> > the uid of "boinc"):
> > $ echo 2 | sudo tee /sys/kernel/uids/125/cpu_share
> > 
> 
> Correct, that is the way to go about it.
> 
> > While looking around, I've found the following patch, which seems to
> > address this:
> > http://www.ussg.iu.edu/hypermail/linux/kernel/0710.3/3849.html
> > 
> > It has been posted here, but without any response.
> > 
> > btw: writing 1 into "cpu_share" totally locks up the computer!
> > 
> 
> Can you please provide some more details. Can you go into another
> console (try ctrl-alt-f1) and try to reproduce the issue there. Could
> you take a photo of the oops/panic and upload it somewhere so that we
> can see it?
> 

I've not been able to reproduce this problem on my system here. Do you
try to change the cpu_share of the current user?

-- 
regards,
Dhaval


Re: Regression with idle cpu cycle handling in 2.6.24 (compared to 2.6.22)

2008-01-19 Thread Dhaval Giani
On Sat, Jan 19, 2008 at 03:52:44PM +0100, dAniel hAhler wrote:
> Hello,
> 
> I've now found the reason and a workaround for this. Apparently, it's
> related to CONFIG_FAIR_USER_SCHED and can be worked around by
> assigning a really small value to the boinc users cpu_share (125 is
> the uid of "boinc"):
> $ echo 2 | sudo tee /sys/kernel/uids/125/cpu_share
> 

Correct, that is the way to go about it.

> While looking around, I've found the following patch, which seems to
> address this:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0710.3/3849.html
> 
> It has been posted here, but without any response.
> 
> btw: writing 1 into "cpu_share" totally locks up the computer!
> 

Can you please provide some more details. Can you go into another
console (try ctrl-alt-f1) and try to reproduce the issue there. Could
you take a photo of the oops/panic and upload it somewhere so that we
can see it?

Thanks
-- 
regards,
Dhaval


Re: [PATCH] cgroup: limit block I/O bandwidth

2008-01-18 Thread Dhaval Giani
On Fri, Jan 18, 2008 at 12:41:03PM +0100, Andrea Righi wrote:
> Allow to limit the block I/O bandwidth for specific process containers
> (cgroups) imposing additional delays on I/O requests for those processes
> that exceed the limits defined in the control group filesystem.
> 
> Example:
>   # mkdir /dev/cgroup
>   # mount -t cgroup -oio-throttle io-throttle /dev/cgroup

Just a minor nit, can't we name it as io, keeping in mind that other
controllers are known as cpu and memory?

Will try it out and give some more feedback.

Thanks,
-- 
regards,
Dhaval


Re: x86 refuses to build [Re: 2.6.24-rc8-mm1]

2008-01-17 Thread Dhaval Giani
On Thu, Jan 17, 2008 at 10:58:57PM +0530, Dhaval Giani wrote:
> On Thu, Jan 17, 2008 at 02:35:14AM -0800, Andrew Morton wrote:
> > 
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc8/2.6.24-rc8-mm1/
> > 
> 
> Hi Ingo, Thomas,
> 
> x86 fails to build with 
> 
> arch/x86/mm/discontig_32.c:39:23: bios_ebda.h:
> No such file or directory
> make[1]: *** [arch/x86/mm/discontig_32.o] Error 1
> make: *** [arch/x86/mm] Error 2
> 
> Applying a trivial fix like,
> 
> ---
>  arch/x86/mm/discontig_32.c |2 +-
>  1 files changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6.24-rc8/arch/x86/mm/discontig_32.c
> ===
> --- linux-2.6.24-rc8.orig/arch/x86/mm/discontig_32.c
> +++ linux-2.6.24-rc8/arch/x86/mm/discontig_32.c
> @@ -36,7 +36,7 @@
>  #include 
>  #include 
>  #include 
> -#include 
> +#include 
>  
>  struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
>  EXPORT_SYMBOL(node_data);
> 
> 
> causes,
> 
> kernel/built-in.o(.text+0x5131): In function `move_task_off_dead_cpu':
> include/asm/topology.h:43: undefined reference to
> `x86_cpu_to_node_map_early_ptr'
> kernel/built-in.o(.text+0x5156):include/asm/topology.h:48: undefined
> reference to `per_cpu__x86_cpu_to_node_map'
> kernel/built-in.o(.text+0x5c0d): In function `cpu_to_allnodes_group':
> include/asm/topology.h:43: undefined reference to
> `x86_cpu_to_node_map_early_ptr'
> kernel/built-in.o(.text+0x5c2e):include/asm/topology.h:48: undefined
> reference to `per_cpu__x86_cpu_to_node_map'
> kernel/built-in.o(.text+0x5e84): In function `build_sched_domains':
> include/asm/topology.h:43: undefined reference to
> `x86_cpu_to_node_map_early_ptr'



grepping around and looking through the code, I notice it is because
these variables just do not exist for 32 bit NUMA. I am not sure how to
go about it, and will just leave it to folks who know what they are
doing there :).

-- 
regards,
Dhaval


x86 refuses to build [Re: 2.6.24-rc8-mm1]

2008-01-17 Thread Dhaval Giani
On Thu, Jan 17, 2008 at 02:35:14AM -0800, Andrew Morton wrote:
> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc8/2.6.24-rc8-mm1/
> 

Hi Ingo, Thomas,

x86 fails to build with 

arch/x86/mm/discontig_32.c:39:23: bios_ebda.h:
No such file or directory
make[1]: *** [arch/x86/mm/discontig_32.o] Error 1
make: *** [arch/x86/mm] Error 2

Applying a trivial fix like,

---
 arch/x86/mm/discontig_32.c |2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.24-rc8/arch/x86/mm/discontig_32.c
===
--- linux-2.6.24-rc8.orig/arch/x86/mm/discontig_32.c
+++ linux-2.6.24-rc8/arch/x86/mm/discontig_32.c
@@ -36,7 +36,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);


causes,

kernel/built-in.o(.text+0x5131): In function `move_task_off_dead_cpu':
include/asm/topology.h:43: undefined reference to
`x86_cpu_to_node_map_early_ptr'
kernel/built-in.o(.text+0x5156):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
kernel/built-in.o(.text+0x5c0d): In function `cpu_to_allnodes_group':
include/asm/topology.h:43: undefined reference to
`x86_cpu_to_node_map_early_ptr'
kernel/built-in.o(.text+0x5c2e):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
kernel/built-in.o(.text+0x5e84): In function `build_sched_domains':
include/asm/topology.h:43: undefined reference to
`x86_cpu_to_node_map_early_ptr'
kernel/built-in.o(.text+0x5ea3):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
kernel/built-in.o(.text+0x60ab):include/asm/topology.h:43: undefined
reference to `x86_cpu_to_node_map_early_ptr'
kernel/built-in.o(.text+0x60ca):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
kernel/built-in.o(.text+0x7239): In function `sched_create_group':
include/asm/topology.h:43: undefined reference to
`x86_cpu_to_node_map_early_ptr'
kernel/built-in.o(.text+0x7251):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
kernel/built-in.o(.text+0x7274):include/asm/topology.h:43: undefined
reference to `x86_cpu_to_node_map_early_ptr'
kernel/built-in.o(.text+0x7293):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
kernel/built-in.o(.text+0x72b6):include/asm/topology.h:43: undefined
reference to `x86_cpu_to_node_map_early_ptr'
kernel/built-in.o(.text+0x72d5):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
kernel/built-in.o(.text+0x72f8):include/asm/topology.h:43: undefined
reference to `x86_cpu_to_node_map_early_ptr'
kernel/built-in.o(.text+0x7317):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
kernel/built-in.o(.text+0xb5d4): In function `profile_cpu_callback':
include/asm/topology.h:43: undefined reference to
`x86_cpu_to_node_map_early_ptr'
kernel/built-in.o(.text+0xb5f3):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
kernel/built-in.o(.text+0x12722): In function `init_timers_cpu':
include/asm/topology.h:43: undefined reference to
`x86_cpu_to_node_map_early_ptr'
kernel/built-in.o(.text+0x12741):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
kernel/built-in.o(.text+0x17bd9): In function `sys_getcpu':
include/asm/topology.h:43: undefined reference to
`x86_cpu_to_node_map_early_ptr'
kernel/built-in.o(.text+0x17bf8):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
kernel/built-in.o(.init.text+0x94a): In function `create_hash_tables':
include/asm/topology.h:43: undefined reference to
`x86_cpu_to_node_map_early_ptr'
kernel/built-in.o(.init.text+0x96b):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
mm/built-in.o(.text+0x4ea1): In function `nr_free_zone_pages':
include/asm/topology.h:43: undefined reference to
`x86_cpu_to_node_map_early_ptr'
mm/built-in.o(.text+0x4ec8):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
mm/built-in.o(.text+0x5c29): In function `process_zones':
include/asm/topology.h:43: undefined reference to
`x86_cpu_to_node_map_early_ptr'
mm/built-in.o(.text+0x5c50):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
mm/built-in.o(.text+0x7c0a): In function `max_sane_readahead':
include/asm/topology.h:43: undefined reference to
`x86_cpu_to_node_map_early_ptr'
mm/built-in.o(.text+0x7c2f):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
mm/built-in.o(.text+0x7c96):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'
mm/built-in.o(.text+0xadcc): In function `zone_reclaim':
include/asm/topology.h:43: undefined reference to
`x86_cpu_to_node_map_early_ptr'
mm/built-in.o(.text+0xadf2):include/asm/topology.h:48: undefined
reference to `per_cpu__x86_cpu_to_node_map'

Re: Updatedb hangs Kernel 2.6.22.9-cfs-v22

2008-01-16 Thread Dhaval Giani
On Wed, Jan 16, 2008 at 02:40:53PM -0200, Renato S. Yamane wrote:
> Ray Lee escreveu:
>> On Jan 14, 2008 7:28 AM, Renato S. Yamane <[EMAIL PROTECTED]> wrote:
>>> Ray Lee escreveu:
 On Jan 12, 2008 10:03 AM, Renato S. Yamane wrote:
> I can't use updatedb in Debian Etch (stable) using customized Kernel
> 2.6.22.9-cfs-v22.
>
> When I ran updatedb, after ~1 minute my system hangs and "caps lock" LED
> is blinking. No log is registered.
 Please switch out of X11 to a text mode console (CTRL-ALT-F1), and run
 updatedb there. Capture the oops with a digital camera, and post a
 link to the picture (not the picture itself!) to the list. Or, write
 down the oops carefully, and post the text.
>>> I see a infinite loop and is very fast, so I can't capture errors messages.
>>> None is registered in /var/log/*
>>
>> Well, that's very odd. Please also try the latest CFS backport to see
>> if that solves the problem.
>
> This problem still occour in 2.6.23.13 with CFS 24.1
> I see a part of error message: Is something about ipw2200.
> I need find a Camera with good focus, because my cam is not so good to take 
> a picture in distance <=20cm.
>

Please see if you can reproduce it without the CFS backport.

[Please keep Ingo Molnar in the cc for CFS bugs, he is the maintainer.]

-- 
regards,
Dhaval


Re: HPET timer broken using 2.6.23.13 / nanosleep() hangs

2008-01-13 Thread Dhaval Giani
On Sun, Jan 13, 2008 at 08:10:46AM -0500, Andrew Paprocki wrote:
> I applied the patch to my 2.6.23.13 tree and upon reboot it stopped right 
> after:
> 
> Clocksource tsc unstable (delta = ... ns)
> Time: hpet clocksource has been installed.
> 
> It locked up hard.. cursor stopped blinking and SysRq isn't working either.
> 

It obviously is the wrong fix then :).

Adding a few cc's. Hopefully they will know what to do better than me.

> -Andrew
> 
> On Jan 13, 2008 7:03 AM, Dhaval Giani <[EMAIL PROTECTED]> wrote:
> >
> > On Sun, Jan 13, 2008 at 06:10:52AM -0500, Andrew Paprocki wrote:
> > > I started debugging a problem I was having with my sky2 network driver
> > > under 2.6.23.13. The investigation led me to find that the HPET timer
> > > wasn't working at all, causing the sky2 driver to not work properly.
> > > Simple example:
> > >
> > > am2:/sys/devices/system/clocksource/clocksource0# cat current_clocksource
> > > jiffies
> > > am2:/sys/devices/system/clocksource/clocksource0# time sleep 1
> > > real0m1.000s
> > > user0m0.000s
> > > sys 0m0.000s
> > > am2:/sys/devices/system/clocksource/clocksource0# echo tsc > 
> > > current_clocksource
> > > am2:/sys/devices/system/clocksource/clocksource0# time sleep 1
> > > real0m1.005s
> > > user0m0.004s
> > > sys 0m0.000s
> > > am2:/sys/devices/system/clocksource/clocksource0# echo hpet >
> > > current_clocksource
> > > am2:/sys/devices/system/clocksource/clocksource0# time sleep 1
> > > 
> > >
> > > Running strace shows it blocked on nanosleep(). I'm building the
> > > kernel with the processor type set to Athalon64. I've built it with
> > > and without SMP and high-res timers enabled and I get the same result.
> > > My previous 2.6.18-4 kernel works because it does not install HPET as
> > > the default timer. The same behavior occurs in 2.6.24-rc7 git head.
> > > I've attached the config/dmesg below.
> > >
> >
> > It seems the HPET timer was not being assigned any IRQs at all. Can you
> > try the patch at http://lkml.org/lkml/2008/1/12/128 ?
> >
> > Thanks,
> > --
> > regards,
> > Dhaval
> >
> >

-- 
regards,
Dhaval


Re: HPET timer broken using 2.6.23.13 / nanosleep() hangs

2008-01-13 Thread Dhaval Giani
On Sun, Jan 13, 2008 at 06:10:52AM -0500, Andrew Paprocki wrote:
> I started debugging a problem I was having with my sky2 network driver
> under 2.6.23.13. The investigation led me to find that the HPET timer
> wasn't working at all, causing the sky2 driver to not work properly.
> Simple example:
> 
> am2:/sys/devices/system/clocksource/clocksource0# cat current_clocksource
> jiffies
> am2:/sys/devices/system/clocksource/clocksource0# time sleep 1
> real0m1.000s
> user0m0.000s
> sys 0m0.000s
> am2:/sys/devices/system/clocksource/clocksource0# echo tsc > 
> current_clocksource
> am2:/sys/devices/system/clocksource/clocksource0# time sleep 1
> real0m1.005s
> user0m0.004s
> sys 0m0.000s
> am2:/sys/devices/system/clocksource/clocksource0# echo hpet >
> current_clocksource
> am2:/sys/devices/system/clocksource/clocksource0# time sleep 1
> 
> 
> Running strace shows it blocked on nanosleep(). I'm building the
> kernel with the processor type set to Athalon64. I've built it with
> and without SMP and high-res timers enabled and I get the same result.
> My previous 2.6.18-4 kernel works because it does not install HPET as
> the default timer. The same behavior occurs in 2.6.24-rc7 git head.
> I've attached the config/dmesg below.
> 

It seems the HPET timer was not being assigned any IRQs at all. Can you
try the patch at http://lkml.org/lkml/2008/1/12/128 ?

Thanks,
-- 
regards,
Dhaval


Re: Updatedb hangs Kernel 2.6.22.9-cfs-v22

2008-01-12 Thread Dhaval Giani
On Sat, Jan 12, 2008 at 04:03:43PM -0200, Renato S. Yamane wrote:
> Hi,
> I can't use updatedb in Debian Etch (stable) using customized Kernel 
> 2.6.22.9-cfs-v22.
>

Hi,

Can you see if it happens with the latest CFS backport? It's been updated
quite a bit since then. You can find it at
http://people.redhat.com/mingo/cfs-scheduler/sched-cfs-v2.6.22.15-v24.1.patch

> When I ran updatedb, after ~1 minute my system hangs and "caps lock" LED is 
> blinking. No log is registered.
>

-- 
regards,
Dhaval


Re: [PATCH 12/11] sched: rt-group: uid-group interface

2008-01-08 Thread Dhaval Giani
On Mon, Jan 07, 2008 at 05:57:42PM +0100, Peter Zijlstra wrote:
> 
> Subject: sched: rt-group: add uid-group interface
> 
> Extend the /sys/kernel/uids// interface to allow setting
> the group's rt_period and rt_runtime.
> 

Hi Peter,

Cool stuff! I will try out these patches and try to give you some
feedback.

One request though, could you please add some documentation to
Documentation/ABI/testing/sysfs-kernel-uids?

Thanks,
-- 
regards,
Dhaval


Re: [PATCH] i386: handle an initrd in highmem (version 2)

2008-01-07 Thread Dhaval Giani
On Mon, Jan 07, 2008 at 09:02:53PM -0800, H. Peter Anvin wrote:
> The boot protocol has until now required that the initrd be located in
> lowmem, which makes the lowmem/highmem boundary visible to the boot
> loader.  This was exported to the bootloader via a compile-time
> field.  Unfortunately, the vmalloc= command-line option breaks this
> part of the protocol; instead of adding yet another hack that affects
> the bootloader, have the kernel relocate the initrd down below the
> lowmem boundary inside the kernel itself.
> 
> Note that this does not rely on HIGHMEM being enabled in the kernel.
> 
> Signed-off-by: H. Peter Anvin <[EMAIL PROTECTED]>
> ---
> Fix crash on NUMA reported by Dhaval Giani (reported as being a kexec issue.)
> 

Yep, it does that. Just tested that on top of the x86 git tree (the mm
queue). It boots.

Tested-by: Dhaval Giani <[EMAIL PROTECTED]>

-- 
regards,
Dhaval


Re: 2.6.22-stable causes oomkiller to be invoked

2008-01-07 Thread Dhaval Giani
On Mon, Jan 07, 2008 at 12:04:06PM -0800, Christoph Lameter wrote:
> Here is the cleaned version of the patch. Dhaval is testing it.
> 
> 
> quicklists: Only consider memory that can be used with GFP_KERNEL
> 
> Quicklists calculates the size of the quicklists based on the number
> of free pages. This must be the number of free pages that can be
> allocated with GFP_KERNEL. node_page_state() includes the pages in
> ZONE_HIGHMEM and ZONE_MOVABLE which may lead the quicklists to
> become too large causing OOM.
> 
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Does the job here for me.

Tested-by: Dhaval Giani <[EMAIL PROTECTED]>

> 
> Index: linux-2.6/mm/quicklist.c
> ===
> --- linux-2.6.orig/mm/quicklist.c 2008-01-07 10:38:13.0 -0800
> +++ linux-2.6/mm/quicklist.c  2008-01-07 10:38:44.0 -0800
> @@ -26,9 +26,17 @@ DEFINE_PER_CPU(struct quicklist, quickli
>  static unsigned long max_pages(unsigned long min_pages)
>  {
>   unsigned long node_free_pages, max;
> + struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
> +
> + node_free_pages =
> +#ifdef CONFIG_ZONE_DMA
> + zone_page_state(&zones[ZONE_DMA], NR_FREE_PAGES) +
> +#endif
> +#ifdef CONFIG_ZONE_DMA32
> + zone_page_state(&zones[ZONE_DMA32], NR_FREE_PAGES) +
> +#endif
> + zone_page_state(&zones[ZONE_NORMAL], NR_FREE_PAGES);
> 
> - node_free_pages = node_page_state(numa_node_id(),
> - NR_FREE_PAGES);
>   max = node_free_pages / FRACTION_OF_NODE_MEM;
>   return max(max, min_pages);
>  }

-- 
regards,
Dhaval
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kexec refuses to boot latest -mm

2008-01-07 Thread Dhaval Giani
On Mon, Jan 07, 2008 at 09:43:38AM -0800, H. Peter Anvin wrote:
> Dhaval Giani wrote:
>> Hi Vivek,
>>
>> I can't seem to get the latest -mm (2.6.24-rc6-mm1) to boot with
>> kexec. It just gets stuck at "Starting new kernel".
>>
>> It does boot normally when booted as the first kernel.
>>
>> Any hints debugging? (x86 architecture)
>>
>
> Hi Dhaval,
>
> Could you give more details of your setup?  I'm trying to figure out 
> exactly what it is in kexec that goes goofy, but there seems to be an 
> enormous number of codepaths, all different, depending on things like which 
> kind of kernel.  The more closely I can replicate your failure, the quicker 
> I should be able to locate the problem.
>

Hi,

sure,

[EMAIL PROTECTED] ~]# cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 2
model name  : Intel(R) Xeon(TM) MP CPU 2.50GHz
stepping: 5
cpu MHz : 2500.000
cache size  : 1024 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 1
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 2
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs
bts sync_rdtsc cid xtpr
bogomips: 4983.50
clflush size: 64

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 15
model   : 2
model name  : Intel(R) Xeon(TM) MP CPU 2.50GHz
stepping: 5
cpu MHz : 2500.000
cache size  : 1024 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 1
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 2
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs
bts sync_rdtsc cid xtpr
bogomips: 4976.76
clflush size: 64

processor   : 2
vendor_id   : GenuineIntel
cpu family  : 15
model   : 2
model name  : Intel(R) Xeon(TM) MP CPU 2.50GHz
stepping: 5
cpu MHz : 2500.000
cache size  : 1024 KB
physical id : 1
siblings: 2
core id : 0
cpu cores   : 1
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 2
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs
bts sync_rdtsc cid xtpr
bogomips: 4976.89
clflush size: 64

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 15
model   : 2
model name  : Intel(R) Xeon(TM) MP CPU 2.50GHz
stepping: 5
cpu MHz : 2500.000
cache size  : 1024 KB
physical id : 1
siblings: 2
core id : 0
cpu cores   : 1
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 2
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs
bts sync_rdtsc cid xtpr
bogomips: 4976.68
clflush size: 64

processor   : 4
vendor_id   : GenuineIntel
cpu family  : 15
model   : 2
model name  : Intel(R) Xeon(TM) MP CPU 2.50GHz
stepping: 5
cpu MHz : 2500.000
cache size  : 1024 KB
physical id : 2
siblings: 2
core id : 0
cpu cores   : 1
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 2
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs
bts sync_rdtsc cid xtpr
bogomips: 4976.79
clflush size: 64

processor   : 5
vendor_id   : GenuineIntel
cpu family  : 15
model   : 2
model name  : Intel(R) Xeon(TM) MP CPU 2.50GHz
stepping: 5
cpu MHz : 2500.000
cache size  : 1024 KB
physical id : 2
siblings: 2
core id : 0
cpu cores   : 1
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 2
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs
bts sync_rdtsc cid xtpr
bogomips: 4976.60
clflush size: 64

processor   : 6
vendor_id   : GenuineIn

Re: [PATCH -mm/x86] revert i386: handle an initrd in highmem (Was Re: 2.6.24-rc6-mm1)

2008-01-07 Thread Dhaval Giani
On Mon, Jan 07, 2008 at 08:22:37AM -0800, Randy Dunlap wrote:
> On Mon, 7 Jan 2008 15:56:09 +0100 Ingo Molnar wrote:
> 
> > 
> > * Thomas Gleixner <[EMAIL PROTECTED]> wrote:
> > 
> > > On Mon, 7 Jan 2008, Dhaval Giani wrote:
> > > 
> > > > Hi Andrew, Ingo, Thomas, Peter,
> > > > 
> > > > x86: revert i386: handle an initrd in highmem
> > > > 
> > > > The patch caused a failure while booting a kexec kernel.
> > > > (http://lkml.org/lkml/2008/1/7/42 has the bisect details.)
> > > > 
> > > > The following patch reverts it.
> > > > 
> > > > Signed-off-by: Dhaval Giani <[EMAIL PROTECTED]>
> > > 
> > > Thanks for tracking this down. I'll pull the patch from the x86 git 
> > > tree as well.
> > 
> > Dhaval, how about the other problem you had - do you have any guess 
> > what it might be related to?
> > 
> > i'm also wondering - what would be the easiest way to integrate kexec 
> > into an automated test environment. If i have a bzImage kernel, is kexec 
> > still supposed to work? Could i for example do a reboot into a new 
> > (kexec-enabled) kernel via kexec in essence?
> 
> My daily/nightly kernel test runs use kexec to boot the test kernel.
> Well, did thru 2.6.24-rc6-git9, but they fail after that.
> Hopefully this patch fixes things.
> 

Hmmm, I don't think so. This revert is from the x86 git tree (-mm) (I think
targeted for 2.6.25). Probably a bisect might help there.

-- 
regards,
Dhaval


Re: [PATCH -mm/x86] revert i386: handle an initrd in highmem (Was Re: 2.6.24-rc6-mm1)

2008-01-07 Thread Dhaval Giani
On Mon, Jan 07, 2008 at 03:56:09PM +0100, Ingo Molnar wrote:
> 
> * Thomas Gleixner <[EMAIL PROTECTED]> wrote:
> 
> > On Mon, 7 Jan 2008, Dhaval Giani wrote:
> > 
> > > Hi Andrew, Ingo, Thomas, Peter,
> > > 
> > > x86: revert i386: handle an initrd in highmem
> > > 
> > > The patch caused a failure while booting a kexec kernel.
> > > (http://lkml.org/lkml/2008/1/7/42 has the bisect details.)
> > > 
> > > The following patch reverts it.
> > > 
> > > Signed-off-by: Dhaval Giani <[EMAIL PROTECTED]>
> > 
> > Thanks for tracking this down. I'll pull the patch from the x86 git 
> > tree as well.
> 
> Dhaval, how about the other problem you had - do you have any guess 
> what it might be related to?
> 

other problem? The load_balance_monitor one? (We are still looking into
that one, just seems that se (as usual :) ) is turning out to be null).

> i'm also wondering - what would be the easiest way to integrate kexec 
> into an automated test environment. If i have a bzImage kernel, is kexec 
> still supposed to work? Could i for example do a reboot into a new 
> (kexec-enabled) kernel via kexec in essence?
> 

Yes, I use a bzImage kernel to reboot using kexec. I use a script which
just sets it up for me. (I can send it to you separately).
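The script itself was not posted; a minimal sketch of what such a kexec-reboot wrapper usually looks like (the kernel/initrd paths and the KEXEC_ECHO dry-run guard are illustrative, not from the thread):

```sh
#!/bin/sh
# Sketch of a kexec reboot helper. With KEXEC_ECHO=echo the commands
# are only printed; unset it to really stage and boot the new kernel.
KEXEC_ECHO=echo

kexec_reboot() {
    kernel=$1
    initrd=$2
    # Reuse the running kernel's command line when available.
    append=$(cat /proc/cmdline 2>/dev/null || echo "root=/dev/sda1 ro")
    $KEXEC_ECHO kexec -l "$kernel" --initrd="$initrd" --append="$append"
    $KEXEC_ECHO kexec -e    # jump straight into the staged kernel
}

kexec_reboot /boot/bzImage-test /boot/initrd-test.img
```

`kexec -e` skips firmware POST entirely, which is why it makes a convenient fast-reboot path for automated kernel testing.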

-- 
regards,
Dhaval


[PATCH -mm/x86] revert i386: handle an initrd in highmem (Was Re: 2.6.24-rc6-mm1)

2008-01-07 Thread Dhaval Giani
Hi Andrew, Ingo, Thomas, Peter,

x86: revert i386: handle an initrd in highmem

The patch caused a failure while booting a kexec kernel.
(http://lkml.org/lkml/2008/1/7/42 has the bisect details.)

The following patch reverts it.

Signed-off-by: Dhaval Giani <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>
Cc: Thomas Gleixner <[EMAIL PROTECTED]>
Cc: H. Peter Anvin <[EMAIL PROTECTED]>
---
 arch/x86/boot/header.S |5 -
 arch/x86/kernel/setup_32.c |  113 +++--
 2 files changed, 19 insertions(+), 99 deletions(-)

Index: linux-2.6.24-rc6/arch/x86/boot/header.S
===
--- linux-2.6.24-rc6.orig/arch/x86/boot/header.S
+++ linux-2.6.24-rc6/arch/x86/boot/header.S
@@ -195,13 +195,10 @@ cmd_line_ptr: .long   0   # (Header version
# can be located anywhere in
# low memory 0x1 or higher.
 
-ramdisk_max:   .long 0x7fff
+ramdisk_max:   .long (-__PAGE_OFFSET-(512 << 20)-1) & 0x7fff
# (Header version 0x0203 or later)
# The highest safe address for
# the contents of an initrd
-   # The current kernel allows up to 4 GB,
-   # but leave it at 2 GB to avoid
-   # possible bootloader bugs.
 
 kernel_alignment:  .long CONFIG_PHYSICAL_ALIGN #physical addr alignment
#required for protected mode
Index: linux-2.6.24-rc6/arch/x86/kernel/setup_32.c
===
--- linux-2.6.24-rc6.orig/arch/x86/kernel/setup_32.c
+++ linux-2.6.24-rc6/arch/x86/kernel/setup_32.c
@@ -583,95 +583,6 @@ static void __init relocate_initrd(void)
 
 #endif /* CONFIG_BLK_DEV_INITRD */
 
-#ifdef CONFIG_BLK_DEV_INITRD
-
-static bool do_relocate_initrd = false;
-
-static void __init reserve_initrd(void)
-{
-   unsigned long ramdisk_image = boot_params.hdr.ramdisk_image;
-   unsigned long ramdisk_size  = boot_params.hdr.ramdisk_size;
-   unsigned long ramdisk_end   = ramdisk_image + ramdisk_size;
-   unsigned long end_of_lowmem = max_low_pfn << PAGE_SHIFT;
-   unsigned long ramdisk_here;
-
-   if (ramdisk_end < ramdisk_image) {
-   printk(KERN_ERR "initrd wraps around end of memory\n"
-  KERN_ERR "disabling initrd\n");
-   initrd_start = 0;
-   return;
-   }
-   if (ramdisk_size >= end_of_lowmem/2) {
-   printk(KERN_ERR "initrd too large to handle\n"
-  KERN_ERR "disabling initrd\n");
-   initrd_start = 0;
-   return;
-   }
-   if (ramdisk_end <= end_of_lowmem) {
-   /* All in lowmem, easy case */
-   reserve_bootmem(ramdisk_image, ramdisk_size);
-   initrd_start = ramdisk_image + PAGE_OFFSET;
-   initrd_end = initrd_start+ramdisk_size;
-   return;
-   }
-
-   /* We need to move the initrd down into lowmem */
-   ramdisk_here = (end_of_lowmem - ramdisk_size) & PAGE_MASK;
-
-   /* Note: this includes all the lowmem currently occupied by
-  the initrd, we rely on that fact to keep the data intact. */
-   reserve_bootmem(ramdisk_here, ramdisk_size);
-   initrd_start = ramdisk_here + PAGE_OFFSET;
-   initrd_end   = initrd_start + ramdisk_size;
-
-   do_relocate_initrd = true;
-}
-
-#define MAX_MAP_CHUNK  (NR_FIX_BTMAPS << PAGE_SHIFT)
-
-static void __init relocate_initrd(void)
-{
-   unsigned long ramdisk_image = boot_params.hdr.ramdisk_image;
-   unsigned long ramdisk_size  = boot_params.hdr.ramdisk_size;
-   unsigned long end_of_lowmem = max_low_pfn << PAGE_SHIFT;
-   unsigned long ramdisk_here;
-   unsigned long slop, clen, mapaddr;
-   char *p, *q;
-
-   if (!do_relocate_initrd)
-   return; /* Nothing to do here... */
-
-   ramdisk_here = initrd_start - PAGE_OFFSET;
-   q = (char *)initrd_start;
-
-   /* Copy any lowmem portion of the initrd */
-   if (ramdisk_image < end_of_lowmem) {
-   clen = end_of_lowmem - ramdisk_image;
-   p = (char *)__va(ramdisk_image);
-   memcpy(q, p, clen);
-   q += clen;
-   ramdisk_image += clen;
-   ramdisk_size  -= clen;
-   }
-
-   /* Copy the highmem portion of the initrd */
-   while (ramdisk_size) {
-   slop = ramdisk_image & ~PAGE_MASK;
-   clen = ramdisk_size;
-   if (clen > MAX_MAP_CHUNK-slop)
-   clen = MAX_MAP

Re: kexec refuses to boot latest -mm

2008-01-07 Thread Dhaval Giani
> commit 7dd838ea7afa42a8840cf0e262d5892346ecf379
> Author: H. Peter Anvin <[EMAIL PROTECTED]>
> Date:   Sat Jan 5 13:27:04 2008 +0100
> 

sorry, forgot the subject, it is

i386: handle an initrd in highmem

-- 
regards,
Dhaval


Re: kexec refuses to boot latest -mm

2008-01-07 Thread Dhaval Giani
On Sun, Jan 06, 2008 at 02:37:29PM +0100, Ingo Molnar wrote:
> 
> * Dhaval Giani <[EMAIL PROTECTED]> wrote:
> 
> > So I went ahead and bisected -mm, and the culprit is git-x86. It boots 
> > fine before it, but with git-x86 applied, it fails to boot.
> > 
> > Ingo/Thomas, could you please point me to the git-x86 tree so that I 
> > can bisect it? (with instructions on how to pull the -mm branch, I 
> > managed to pull the master branch, but not the -mm branch)
> 
> sure:
> 
> --{ x86.git instructions }-->
> 
>  git-clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6.git
>  cd linux-2.6.git
>  git-branch x86
>  git-checkout x86
>  git-pull git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git mm
>  git-log [EMAIL PROTECTED] # see what's in #mm
> 

Thanks for the instructions. I went ahead and did the bisect, and

commit 7dd838ea7afa42a8840cf0e262d5892346ecf379
Author: H. Peter Anvin <[EMAIL PROTECTED]>
Date:   Sat Jan 5 13:27:04 2008 +0100

is the first bad commit.

git bisect log

[EMAIL PROTECTED]:~/mm$ git bisect log
git-bisect start
# good: [439f61b9f9ebbf84fb7e6b3539fc3794e046bbb9] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
git-bisect good 439f61b9f9ebbf84fb7e6b3539fc3794e046bbb9
# bad: [5c9b01f1566f4989bb732a46c9a9c86f3e7ef9ae] x86: unregister PIT clocksource when PIT is disabled
git-bisect bad 5c9b01f1566f4989bb732a46c9a9c86f3e7ef9ae
# good: [203bcdc3d2725751e4dc0e3749edbab3c8bf9132] move switch_to macro to system.h
git-bisect good 203bcdc3d2725751e4dc0e3749edbab3c8bf9132
# good: [81886a1b16eec7a27f3ec94f06f585685ad08bf2] x86: i387 renaming
git-bisect good 81886a1b16eec7a27f3ec94f06f585685ad08bf2
# bad: [fc0ed2251da6ef8ad3295a02b39bed9859dc0cfc] x86: unify arch/x86/kernel/Makefile(s)
git-bisect bad fc0ed2251da6ef8ad3295a02b39bed9859dc0cfc
# good: [6fa16686ee2b0cc6b02a59671ecb9d4c85beae64] change assembly definition of paravirt_patch_site
git-bisect good 6fa16686ee2b0cc6b02a59671ecb9d4c85beae64
# bad: [1184caa02fcc2ecd780b145f4c69cc4484a8cde4] git-x86: arch/x86/math-emu/errors.c: fix printk warnings
git-bisect bad 1184caa02fcc2ecd780b145f4c69cc4484a8cde4
# good: [e8189d0ca1970578f4b28f92928581313633be5a] x86_64 patching functions
git-bisect good e8189d0ca1970578f4b28f92928581313633be5a
# bad: [7dd838ea7afa42a8840cf0e262d5892346ecf379] i386: handle an initrd in highmem
git-bisect bad 7dd838ea7afa42a8840cf0e262d5892346ecf379
# good: [8cff7d1301d37b15add1dabe6ac8a92e6c0a1c7f] i386 EFI runtime service support: fixes in sync with x86_64 support
git-bisect good 8cff7d1301d37b15add1dabe6ac8a92e6c0a1c7f
# good: [ac3836d03c718a553634ff21c1157bb823567f4f] x86: arch/x86/kernel/setup_32.c whitespace fixes
git-bisect good ac3836d03c718a553634ff21c1157bb823567f4f
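Sessions like the log above can also be automated with `git bisect run` once the pass/fail check is scriptable. A sketch that only prints the commands (`boot-test.sh` is a hypothetical checker that builds the kernel, kexecs into it, and reports whether it came up; the good/bad hashes are taken from the log):

```sh
# Prints the commands for an automated bisect; nothing is executed here.
bisect_demo() {
    good=439f61b9f9ebbf84fb7e6b3539fc3794e046bbb9
    bad=5c9b01f1566f4989bb732a46c9a9c86f3e7ef9ae
    echo "git bisect start"
    echo "git bisect bad $bad"
    echo "git bisect good $good"
    # git bisect run re-invokes the script at each step: exit 0 marks
    # the revision good, 1-127 (except 125) bad, 125 skips it as
    # untestable.
    echo "git bisect run ./boot-test.sh"
}
bisect_demo
```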

-- 
regards,
Dhaval


Re: kexec refuses to boot latest -mm

2008-01-04 Thread Dhaval Giani
On Fri, Jan 04, 2008 at 05:58:16PM +0530, Dhaval Giani wrote:
> On Thu, Jan 03, 2008 at 10:42:00PM +0100, Rafael J. Wysocki wrote:
> > On Thursday, 3 of January 2008, Dhaval Giani wrote:
> > > On Mon, Dec 31, 2007 at 10:08:43AM -0500, Vivek Goyal wrote:
> > > > On Sat, Dec 29, 2007 at 11:11:13AM +0530, Dhaval Giani wrote:
> > > > > On Fri, Dec 28, 2007 at 09:27:39AM -0500, Vivek Goyal wrote:
> > > > > > On Fri, Dec 28, 2007 at 06:15:32PM +0530, Dhaval Giani wrote:
> > > > > > > Hi Vivek,
> > > > > > > 
> 
> I've no clue what I managed to do wrong last night, probably tried to boot
> 2.6.24-rc6-mm1 thinking it was 2.6.24-rc6. 2.6.24-rc6 boots, but -mm
> does not. Sorry for the noise.
> 

So I went ahead and bisected -mm, and the culprit is git-x86. It boots
fine before it, but with git-x86 applied, it fails to boot.

Ingo/Thomas, could you please point me to the git-x86 tree so that I can
bisect it? (with instructions on how to pull the -mm branch, I managed
to pull the master branch, but not the -mm branch)

-- 
regards,
Dhaval


Re: kexec refuses to boot latest -mm

2008-01-04 Thread Dhaval Giani
On Thu, Jan 03, 2008 at 10:42:00PM +0100, Rafael J. Wysocki wrote:
> On Thursday, 3 of January 2008, Dhaval Giani wrote:
> > On Mon, Dec 31, 2007 at 10:08:43AM -0500, Vivek Goyal wrote:
> > > On Sat, Dec 29, 2007 at 11:11:13AM +0530, Dhaval Giani wrote:
> > > > On Fri, Dec 28, 2007 at 09:27:39AM -0500, Vivek Goyal wrote:
> > > > > On Fri, Dec 28, 2007 at 06:15:32PM +0530, Dhaval Giani wrote:
> > > > > > Hi Vivek,
> > > > > > 
> > > > > > I can't seem to get the latest -mm (2.6.24-rc6-mm1) to boot with
> > > > > > kexec. It just gets stuck at "Starting new kernel".
> > > > > > 
> > > > > > It does boot normally when booted as the first kernel.
> > > > > > 
> > > > > > Any hints debugging? (x86 architecture)
> > > > > 
> > > > > I generally try few things.
> > > > > 
> > > > > - Specify earlyprintk= parameter for second kernel and see if control
> > > > >   is reaching to second kernel.
> > > > > 
> > > > > - Otherwise specify --console-serial parameter on "kexec -l" 
> > > > > commandline
> > > > >   and it should display a message "I am in purgatory" on serial 
> > > > > console.
> > > > >   This will just mean that control has reached at least till 
> > > > > purgatory.
> > > > > 
> > > > 
> > > > Ok, so it reaches till here. I get "I'm in purgatory" on the console.
> > > 
> > > One more thing. Is 2.6.24-rc6 working properly?
> > > 
> > 
> > 2.6.24-rc5 boots, so does 2.6.24-rc5-mm1. 2.6.24-rc6 does not boot, nor
> > does 2.6.24-rc6-mm1. It's a regression.
> 
> Added to the list of reported regressions as:
> http://bugzilla.kernel.org/show_bug.cgi?id=9682
> 

I've no clue what I managed to do wrong last night, probably tried to boot
2.6.24-rc6-mm1 thinking it was 2.6.24-rc6. 2.6.24-rc6 boots, but -mm
does not. Sorry for the noise.

With the help of the outbs I have figured out that it is dying somewhere
in setup_memory(). Will look deeper into it.

-- 
regards,
Dhaval


Re: kexec refuses to boot latest -mm

2008-01-03 Thread Dhaval Giani
On Mon, Dec 31, 2007 at 10:08:43AM -0500, Vivek Goyal wrote:
> On Sat, Dec 29, 2007 at 11:11:13AM +0530, Dhaval Giani wrote:
> > On Fri, Dec 28, 2007 at 09:27:39AM -0500, Vivek Goyal wrote:
> > > On Fri, Dec 28, 2007 at 06:15:32PM +0530, Dhaval Giani wrote:
> > > > Hi Vivek,
> > > > 
> > > > I can't seem to get the latest -mm (2.6.24-rc6-mm1) to boot with
> > > > kexec. It just gets stuck at "Starting new kernel".
> > > > 
> > > > It does boot normally when booted as the first kernel.
> > > > 
> > > > Any hints debugging? (x86 architecture)
> > > 
> > > I generally try few things.
> > > 
> > > - Specify earlyprintk= parameter for second kernel and see if control
> > >   is reaching to second kernel.
> > > 
> > > - Otherwise specify --console-serial parameter on "kexec -l" commandline
> > >   and it should display a message "I am in purgatory" on serial console.
> > >   This will just mean that control has reached at least till purgatory.
> > > 
> > 
> > Ok, so it reaches till here. I get "I'm in purgatory" on the console.
> 
> One more thing. Is 2.6.24-rc6 working properly?
> 

2.6.24-rc5 boots, so does 2.6.24-rc5-mm1. 2.6.24-rc6 does not boot, nor
does 2.6.24-rc6-mm1. It's a regression.

I will add more outbs and respond back, and try a bisect if that does not
help.

Thanks
-- 
regards,
Dhaval


Re: 2.6.22-stable causes oomkiller to be invoked

2008-01-02 Thread Dhaval Giani
On Wed, Jan 02, 2008 at 01:54:12PM -0800, Christoph Lameter wrote:
> Just traced it again on my system: It is okay for the number of pages on 
> the quicklist to reach the high count that we see (although the 16 bit 
> limits are weird. You have around 4GB of memory in the system?). Up to 
> 1/16th of free memory of a node can be allocated for quicklists (this 
> allows the effective shutting down and restarting of large amounts of 
> processes)
> 
> The problem may be that this is run on a HIGHMEM system and the 
> calculation of allowable pages on the quicklists does not take into 
> account that highmem pages are not usable for quicklists (not sure about 
> ZONE_MOVABLE on i386. Maybe we need to take that into account as well?)
> 
> Here is a patch that removes the HIGHMEM portion from the calculation. 
> Does this change anything:
> 

Yep. This one hits it. I don't see the obvious signs of the oom
happening in the 5 mins I have run the script. I will let it run for
some more time.

Thanks!
-- 
regards,
Dhaval


Re: 2.6.22-stable causes oomkiller to be invoked

2008-01-02 Thread Dhaval Giani
On Thu, Jan 03, 2008 at 09:29:42AM +0530, Dhaval Giani wrote:
> On Wed, Jan 02, 2008 at 01:54:12PM -0800, Christoph Lameter wrote:
> > Just traced it again on my system: It is okay for the number of pages on 
> > the quicklist to reach the high count that we see (although the 16 bit 
> > limits are weird. You have around 4GB of memory in the system?). Up to 
> > 1/16th of free memory of a node can be allocated for quicklists (this 
> > allows the effective shutting down and restarting of large amounts of 
> > processes)
> > 
> > The problem may be that this is run on a HIGHMEM system and the 
> > calculation of allowable pages on the quicklists does not take into 
> > account that highmem pages are not usable for quicklists (not sure about 
> > ZONE_MOVABLE on i386. Maybe we need to take that into account as well?)
> > 
> > Here is a patch that removes the HIGHMEM portion from the calculation. 
> > Does this change anything:
> > 
> 
> Yep. This one hits it. I don't see the obvious signs of the oom
> happening in the 5 mins I have run the script. I will let it run for
> some more time.
> 

Yes, no oom even after 20 mins of running (which is double the normal
time for the oom to occur), also no changes in free lowmem.

Thanks for the fix. Feel free to add a 

Tested-by: Dhaval Giani <[EMAIL PROTECTED]>

-- 
regards,
Dhaval


Re: kexec refuses to boot latest -mm

2007-12-31 Thread Dhaval Giani
On Mon, Dec 31, 2007 at 10:06:58AM -0500, Vivek Goyal wrote:
> On Sat, Dec 29, 2007 at 11:11:13AM +0530, Dhaval Giani wrote:
> > On Fri, Dec 28, 2007 at 09:27:39AM -0500, Vivek Goyal wrote:
> > > On Fri, Dec 28, 2007 at 06:15:32PM +0530, Dhaval Giani wrote:
> > > > Hi Vivek,
> > > > 
> > > > I can't seem to get the latest -mm (2.6.24-rc6-mm1) to boot with
> > > > kexec. It just gets stuck at "Starting new kernel".
> > > > 
> > > > It does boot normally when booted as the first kernel.
> > > > 
> > > > Any hints debugging? (x86 architecture)
> > > 
> > > I generally try few things.
> > > 
> > > - Specify earlyprintk= parameter for second kernel and see if control
> > >   is reaching to second kernel.
> > > 
> > > - Otherwise specify --console-serial parameter on "kexec -l" commandline
> > >   and it should display a message "I am in purgatory" on serial console.
> > >   This will just mean that control has reached at least till purgatory.
> > > 
> > 
> > Ok, so it reaches till here. I get "I'm in purgatory" on the console.
> > 
> 
> Ok. So it at least reaches till purgatory. My hunch is that it also reaches
> the second kernel but dies early in that kernel before any message can
> be printed on screen.
> 
> Can you put some "outb()" messages on serial console in start_kernel() in
> second kernel and see up to what point do you reach.
> 
> My 2.6.24-rc6-mm1 kernel panics during boot on 32bit box. Let me try it on
> 64 bit box.
> 

Apply the hotfixes :-)

-- 
regards,
Dhaval


Re: 2.6.22-stable causes oomkiller to be invoked

2007-12-30 Thread Dhaval Giani
On Sun, Dec 30, 2007 at 03:01:16PM +0100, Ingo Molnar wrote:
> 
> * Christoph Lameter <[EMAIL PROTECTED]> wrote:
> 
> > Index: linux-2.6/arch/x86/mm/pgtable_32.c
> > ===
> --- linux-2.6.orig/arch/x86/mm/pgtable_32.c 2007-12-26 12:55:10.0 -0800
> +++ linux-2.6/arch/x86/mm/pgtable_32.c  2007-12-26 12:55:54.0 -0800
> > @@ -366,6 +366,15 @@ void pgd_free(pgd_t *pgd)
> > }
> > /* in the non-PAE case, free_pgtables() clears user pgd entries */
> > quicklist_free(0, pgd_dtor, pgd);
> > +
> > +   /*
> > +* We must call check_pgd_cache() here because the pgd is freed after
> > +* tlb flushing and the call to check_pgd_cache. In some cases the VM
> > +* may not call tlb_flush_mmu during process termination (??).
> 
> that's incorrect i think: during process termination exit_mmap() calls 
> tlb_finish_mmu() unconditionally which calls tlb_flush_mmu().
> 
> > +* If this is repeated then we may never call check_pgd_cache.
> > +* The quicklist will grow and grow. So call check_pgd_cache here.
> > +*/
> > +   check_pgt_cache();
> >  }
> 
> so we still dont seem to understand the failure mode well enough. This 
> also looks like a quite dangerous change so late in the v2.6.24 cycle. 
> Does it really fix the OOM? If yes, why exactly?
> 

No it does not. I've sent out some more information if it helps, will
send to you separately.

-- 
regards,
Dhaval


Re: kexec refuses to boot latest -mm

2007-12-28 Thread Dhaval Giani
On Fri, Dec 28, 2007 at 09:27:39AM -0500, Vivek Goyal wrote:
> On Fri, Dec 28, 2007 at 06:15:32PM +0530, Dhaval Giani wrote:
> > Hi Vivek,
> > 
> > I can't seem to get the latest -mm (2.6.24-rc6-mm1) to boot with
> > kexec. It just gets stuck at "Starting new kernel".
> > 
> > It does boot normally when booted as the first kernel.
> > 
> > Any hints debugging? (x86 architecture)
> 
> I generally try few things.
> 
> - Specify earlyprintk= parameter for second kernel and see if control
>   is reaching to second kernel.
> 
> - Otherwise specify --console-serial parameter on "kexec -l" commandline
>   and it should display a message "I am in purgatory" on serial console.
>   This will just mean that control has reached at least till purgatory.
> 

Ok, so it reaches till here. I get "I'm in purgatory" on the console.

-- 
regards,
Dhaval


Re: kexec refuses to boot latest -mm

2007-12-28 Thread Dhaval Giani
On Fri, Dec 28, 2007 at 02:54:45PM -0500, Neil Horman wrote:
> On Fri, Dec 28, 2007 at 06:15:32PM +0530, Dhaval Giani wrote:
> > Hi Vivek,
> > 
> > I can't seem to get the latest -mm (2.6.24-rc6-mm1) to boot with
> > kexec. It just gets stuck at "Starting new kernel".
> > 
> > It does boot normally when booted as the first kernel.
> > 
> > Any hints debugging? (x86 architecture)
> > 
> > -- 
> > regards,
> > Dhaval
> > 
> Out of curiosity, how are you booting the kexec kernel?  Are you doing a kexec
> -l, or a kexec -p, followed by a system panic?  If its the later, what are you
> doing to panic the system?  sysrq-c, custom module executing explicit
> crash-code, hang followed by nmi panic?
> 

kexec -l, followed by a normal reboot.

-- 
regards,
Dhaval


Re: 2.6.24-rc6-mm1 Kernel panics at different functions ()

2007-12-28 Thread Dhaval Giani
> 
> While doing the git bisect, following panic was seen
> 
> Unable to handle kernel paging request at 401e RIP: 
>  [] load_balance_monitor+0x15e/0x2a4
> PGD 0 
> Oops:  [1] SMP 
> last sysfs file: 
> /devices/pci:00/:00:0a.0/:02:04.0/host0/target0:0:6/0:0:6:0/type
> CPU 1 
> Modules linked in:
> Pid: 15, comm: load_balance_mo Not tainted 2.6.24-rc6-mm1-autokern1 #1
> RIP: 0010:[]  [] 
> load_balance_monitor+0x15e/0x2a4
> RSP: :81007ffb7eb0  EFLAGS: 00010297
> RAX:  RBX: 0001 RCX: 
> RDX: 401e RSI: 81007ffb7ed8 RDI: 
> RBP: 81007ffb7f20 R08: 81007ffb6000 R09: 81007ffb6000
> R10: 81007ffb6000 R11:  R12: 
> R13: 0003 R14: 0800 R15: 8101fe997f00
> FS:  () GS:8100e3b1() knlGS:f73e1bb0
> CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
> CR2: 401e CR3: 00201000 CR4: 06e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> Process load_balance_mo (pid: 15, threadinfo 81007ffb6000, task 
> 81007ff94790)
> Stack:  2000  810001009cc0 0001e3b29d90
>  0080 000f 81007f0be780 000f
>  00017ffb7f20  fffc 
> Call Trace:
>  [] load_balance_monitor+0x0/0x2a4
>  [] kthread+0x3d/0x63
>  [] child_rip+0xa/0x12
>  [] kthread+0x0/0x63
>  [] child_rip+0x0/0x12
> 
> 
> Code: 48 8b 04 c2 48 8b 10 48 01 55 98 e8 ce 40 12 00 83 f8 07 41 
> RIP  [] load_balance_monitor+0x15e/0x2a4
>  RSP 
> CR2: 401e
> 

Hmmm. Looking into it :-).

-- 
regards,
Dhaval


kexec refuses to boot latest -mm

2007-12-28 Thread Dhaval Giani
Hi Vivek,

I can't seem to get the latest -mm (2.6.24-rc6-mm1) to boot with
kexec. It just gets stuck at "Starting new kernel".

It does boot normally when booted as the first kernel.

Any hints debugging? (x86 architecture)

-- 
regards,
Dhaval


Re: 2.6.22-stable causes oomkiller to be invoked

2007-12-28 Thread Dhaval Giani
On Thu, Dec 27, 2007 at 11:22:34AM -0800, Christoph Lameter wrote:
> On Thu, 27 Dec 2007, Dhaval Giani wrote:
> 
> > anything specific you are looking for? I still hit the oom.
> 
> Weird WTH is this? You run an unmodified upstream tree? Can you add a 
> printk in quicklist_trim that shows
> 

Hi,

I am running 2.6.24-rc5-mm1 here.

> A) that it is called
> 
> B) what the control values q->nr_pages and min_pages are?
> 

Trying to print these using printks renders the system unbootable. With
help from the RAS folks around me, we put together a systemtap script:

probe kernel.statement("[EMAIL PROTECTED]/quicklist.c:56")
{
	printf(" q->nr_pages is %d, min_pages is %d > %s\n",
		$q->nr_pages, $min_pages, execname());
}

we managed to get your required information. Last 10,000 lines are
attached (The uncompressed file comes to 500 kb).

Hope it helps.

Thanks,
-- 
regards,
Dhaval


systp.out.1.bz2
Description: BZip2 compressed data


Circular locking dependency

2007-12-24 Thread Dhaval Giani
Hi,

Just hit this on sched-devel. (Not sure how to reproduce it yet, can't
try now. I believe I can hit it on mainline as well, as there is nothing
scheduler-specific.)

===
[ INFO: possible circular locking dependency detected ]
2.6.24-rc6 #1
---
bash/17982 is trying to acquire lock:
 (&journal->j_list_lock){--..}, at: []
__journal_try_to_free_buffer+0x2a/0x8a

but task is already holding lock:
 (inode_lock){--..}, at: [] drop_pagecache_sb+0x12/0x74

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (inode_lock){--..}:
   [] check_prev_add+0xb8/0x1ad
   [] check_prevs_add+0x5d/0xcf
   [] validate_chain+0x286/0x300
   [] __lock_acquire+0x67f/0x6ff
   [] lock_acquire+0x71/0x8b
   [] _spin_lock+0x2b/0x38
   [] __mark_inode_dirty+0xd0/0x15b
   [] __set_page_dirty+0x10c/0x11b
   [] mark_buffer_dirty+0x9a/0xa1
   [] __journal_temp_unlink_buffer+0xbf/0xc3
   [] __journal_unfile_buffer+0xb/0x15
   [] __journal_refile_buffer+0x3c/0x86
   [] journal_commit_transaction+0x89c/0xa05
   [] kjournald+0xab/0x1ff
   [] kthread+0x37/0x59
   [] kernel_thread_helper+0x7/0x10
   [] 0x

-> #0 (&journal->j_list_lock){--..}:
   [] check_prev_add+0x2e/0x1ad
   [] check_prevs_add+0x5d/0xcf
   [] validate_chain+0x286/0x300
   [] __lock_acquire+0x67f/0x6ff
   [] lock_acquire+0x71/0x8b
   [] _spin_lock+0x2b/0x38
   [] __journal_try_to_free_buffer+0x2a/0x8a
   [] journal_try_to_free_buffers+0x61/0x9e
   [] ext3_releasepage+0x68/0x74
   [] try_to_release_page+0x33/0x47
   [] invalidate_complete_page+0x1e/0x35
   [] __invalidate_mapping_pages+0x6b/0xc2
   [] drop_pagecache_sb+0x4c/0x74
   [] drop_pagecache+0x4a/0x78
   [] drop_caches_sysctl_handler+0x36/0x4e
   [] proc_sys_write+0x6b/0x85
   [] vfs_write+0x8c/0x10b
   [] sys_write+0x3d/0x61
   [] sysenter_past_esp+0x5f/0xa5
   [] 0x

other info that might help us debug this:

2 locks held by bash/17982:
 #0:  (&type->s_umount_key#15){}, at: []
drop_pagecache+0x3d/0x78
 #1:  (inode_lock){--..}, at: [] drop_pagecache_sb+0x12/0x74

stack backtrace:
Pid: 17982, comm: bash Not tainted 2.6.24-rc6 #1
 [] show_trace_log_lvl+0x19/0x2e
 [] show_trace+0x12/0x14
 [] dump_stack+0x6c/0x72
 [] print_circular_bug_tail+0x5f/0x68
 [] check_prev_add+0x2e/0x1ad
 [] check_prevs_add+0x5d/0xcf
 [] validate_chain+0x286/0x300
 [] __lock_acquire+0x67f/0x6ff
 [] lock_acquire+0x71/0x8b
 [] _spin_lock+0x2b/0x38
 [] __journal_try_to_free_buffer+0x2a/0x8a
 [] journal_try_to_free_buffers+0x61/0x9e
 [] ext3_releasepage+0x68/0x74
 [] try_to_release_page+0x33/0x47
 [] invalidate_complete_page+0x1e/0x35
 [] __invalidate_mapping_pages+0x6b/0xc2
 [] drop_pagecache_sb+0x4c/0x74
 [] drop_pagecache+0x4a/0x78
 [] drop_caches_sysctl_handler+0x36/0x4e
 [] proc_sys_write+0x6b/0x85
 [] vfs_write+0x8c/0x10b
 [] sys_write+0x3d/0x61
 [] sysenter_past_esp+0x5f/0xa5
 ===
[EMAIL PROTECTED] kernel]# 

Last thing I did was echo 1 > /proc/sys/vm/drop_caches

(Not sure whom to cc, hopefully others will know better, also no time to
debug further, sorry!)

-- 
regards,
Dhaval


Re: 2.6.22-stable causes oomkiller to be invoked

2007-12-20 Thread Dhaval Giani
> > It was just
> > 
> > while echo ; do cat /sys/kernel/ ; done
> > 
> > it's all in the email threads somewhere..
> 
> The patch that was posted in the thread that I mentioned earlier is here. 
> I ran the test for 15 minutes and things are still fine.
> 
> 
> 
> quicklist: Set tlb->need_flush if pages are remaining in quicklist 0
> 
> This ensures that the quicklists are drained. Otherwise draining may only 
> occur when the processor reaches an idle state.
> 

Hi Christoph,

No, it does not stop the oom I am seeing here.

Thanks,

> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> 
> Index: linux-2.6/include/asm-generic/tlb.h
> ===
> --- linux-2.6.orig/include/asm-generic/tlb.h  2007-12-13 14:45:38.0 
> -0800
> +++ linux-2.6/include/asm-generic/tlb.h   2007-12-13 14:51:07.0 
> -0800
> @@ -14,6 +14,7 @@
>  #define _ASM_GENERIC__TLB_H
> 
>  #include 
> +#include 
>  #include 
>  #include 
> 
> @@ -85,6 +86,9 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
>  static inline void
>  tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long 
> end)
>  {
> +#ifdef CONFIG_QUICKLIST
> + tlb->need_flush += &__get_cpu_var(quicklist)[0].nr_pages != 0;
> +#endif
>   tlb_flush_mmu(tlb, start, end);
> 
>   /* keep the page table cache within bounds */

-- 
regards,
Dhaval


Re: 2.6.22-stable causes oomkiller to be invoked

2007-12-15 Thread Dhaval Giani
On Fri, Dec 14, 2007 at 10:00:30PM -0800, Andrew Morton wrote:
> On Sat, 15 Dec 2007 09:22:00 +0530 Dhaval Giani <[EMAIL PROTECTED]> wrote:
> 
> > > Is it really the case that the bug only turns up when you run tests like
> > > 
> > >   while echo; do cat /sys/kernel/kexec_crash_loaded; done
> > > and
> > >   while echo; do cat /sys/kernel/uevent_seqnum ; done;
> > > 
> > > or will any fork-intensive workload also do it?  Say,
> > > 
> > >   while echo ; do true ; done
> > > 
> > 
> > This does not leak, but having a simple text file and reading it in a
> > loop causes it.
> 
> hm.
> 
> > > ?
> > > 
> > > Another interesting factoid here is that after the oomkilling you 
> > > slabinfo has
> > > 
> > > mm_struct 38 9858471 : tunables   32   16
> > > 8 : slabdata 14 14  0 : globalstat278119649   31  
> > >   01000 : cpustat 368800  
> > > 11864 368920  11721
> > > 
> > > so we aren't leaking mm_structs.  In fact we aren't leaking anything from
> > > slab.   But we are leaking pgds.
> > > 
> > > iirc the most recent change we've made in the pgd_t area is the quicklist
> > > management which went into 2.6.22-rc1.  You say the bug was present in
> > > 2.6.22.  Can you test 2.6.21?  
> > 
> > Nope, leak is not present in 2.6.21.7
> 
> Could you try this debug patch please?
> 

Here is the dmesg with that patch,

use, ignoring.
PCI: Unable to reserve mem region #2:[EMAIL PROTECTED] for device :08:0a.1
aic7xxx:  at PCI 8/10/1
aic7xxx: I/O ports already in use, ignoring.
megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
megasas: 00.00.03.16-rc1 Thu. Nov. 07 10:09:32 PDT 2007
st: Version 20070203, fixed bufsize 32768, s/g segs 256
osst :I: Tape driver with OnStream support version 0.99.4
osst :I: $Id: osst.c,v 1.73 2005/01/01 21:13:34 wriede Exp $
sd 1:0:0:0: [sda] 71096640 512-byte hardware sectors (36401 MB)
sd 1:0:0:0: [sda] Write Protect is off
sd 1:0:0:0: [sda] Mode Sense: cb 00 00 08
sd 1:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support 
DPO or FUA
sd 1:0:0:0: [sda] 71096640 512-byte hardware sectors (36401 MB)
sd 1:0:0:0: [sda] Write Protect is off
sd 1:0:0:0: [sda] Mode Sense: cb 00 00 08
sd 1:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support 
DPO or FUA
 sda: sda1
sd 1:0:0:0: [sda] Attached SCSI disk
sd 1:0:1:0: [sdb] 71096640 512-byte hardware sectors (36401 MB)
sd 1:0:1:0: [sdb] Write Protect is off
sd 1:0:1:0: [sdb] Mode Sense: cb 00 00 08
sd 1:0:1:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support 
DPO or FUA
sd 1:0:1:0: [sdb] 71096640 512-byte hardware sectors (36401 MB)
sd 1:0:1:0: [sdb] Write Protect is off
sd 1:0:1:0: [sdb] Mode Sense: cb 00 00 08
sd 1:0:1:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support 
DPO or FUA
 sdb: sdb1 sdb2 sdb3 sdb4
sd 1:0:1:0: [sdb] Attached SCSI disk
sd 1:0:2:0: [sdc] 71096640 512-byte hardware sectors (36401 MB)
sd 1:0:2:0: [sdc] Write Protect is off
sd 1:0:2:0: [sdc] Mode Sense: cb 00 00 08
sd 1:0:2:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support 
DPO or FUA
sd 1:0:2:0: [sdc] 71096640 512-byte hardware sectors (36401 MB)
sd 1:0:2:0: [sdc] Write Protect is off
sd 1:0:2:0: [sdc] Mode Sense: cb 00 00 08
sd 1:0:2:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support 
DPO or FUA
 sdc: sdc1 sdc2
sd 1:0:2:0: [sdc] Attached SCSI disk
sd 1:0:3:0: [sdd] 71096640 512-byte hardware sectors (36401 MB)
sd 1:0:3:0: [sdd] Write Protect is off
sd 1:0:3:0: [sdd] Mode Sense: cb 00 00 08
sd 1:0:3:0: [sdd] Write cache: disabled, read cache: enabled, doesn't support 
DPO or FUA
sd 1:0:3:0: [sdd] 71096640 512-byte hardware sectors (36401 MB)
sd 1:0:3:0: [sdd] Write Protect is off
sd 1:0:3:0: [sdd] Mode Sense: cb 00 00 08
sd 1:0:3:0: [sdd] Write cache: disabled, read cache: enabled, doesn't support 
DPO or FUA
 sdd: sdd1 sdd2 sdd3
sd 1:0:3:0: [sdd] Attached SCSI disk
sd 1:0:4:0: [sde] 71096640 512-byte hardware sectors (36401 MB)
sd 1:0:4:0: [sde] Write Protect is off
sd 1:0:4:0: [sde] Mode Sense: cb 00 00 08
sd 1:0:4:0: [sde] Write cache: disabled, read cache: enabled, doesn't support 
DPO or FUA
sd 1:0:4:0: [sde] 71096640 512-byte hardware sectors (36401 MB)
sd 1:0:4:0: [sde] Write Protect is off
sd 1:0:4:0: [sde] Mode Sense: cb 00 00 08
sd 1:0:4:0: [sde] Write cache: disabled, read cache: enabled, doesn't support 
DPO or FUA
 sde: sde1
sd 1:0:4:0: [sde] Attached SCSI disk
sd 1:0:5:0: [sdf] 71096640 512-byte hardware sectors (36401 

Re: 2.6.22-stable causes oomkiller to be invoked

2007-12-14 Thread Dhaval Giani
> Is it really the case that the bug only turns up when you run tests like
> 
>   while echo; do cat /sys/kernel/kexec_crash_loaded; done
> and
>   while echo; do cat /sys/kernel/uevent_seqnum ; done;
> 
> or will any fork-intensive workload also do it?  Say,
> 
>   while echo ; do true ; done
> 

This does not leak, but having a simple text file and reading it in a
loop causes it.

> ?
> 
> Another interesting factoid here is that after the oomkilling you slabinfo has
> 
> mm_struct 38 9858471 : tunables   32   168 : 
> slabdata 14 14  0 : globalstat278119649   31  
>   01000 : cpustat 368800  11864 368920  
> 11721
> 
> so we aren't leaking mm_structs.  In fact we aren't leaking anything from
> slab.   But we are leaking pgds.
> 
> iirc the most recent change we've made in the pgd_t area is the quicklist
> management which went into 2.6.22-rc1.  You say the bug was present in
> 2.6.22.  Can you test 2.6.21?  

Nope, leak is not present in 2.6.21.7

-- 
regards,
Dhaval
