Questions about data protection in the kernel
Full disclosure: I am a software engineer at Arcserve, responsible for maintaining the Linux kernel driver component of our data protection software suites. I am here looking for answers to two specific questions. Not knowing who to ask, or really how to communicate some of my questions, it seemed most appropriate to start my journey here with Kernel Newbies.

Why does the kernel only provide disk snapshot capabilities via Device Mapper (and LVM)?

I am aware of multiple companies that offer Linux products that come with their own kernel modules to provide full-featured snapshot capabilities. Some of these Linux offerings are decades old and some are open source now. If they required Device Mapper to be set up beforehand, it would be difficult to insert them into many environments. All of the drivers seem to perform the same hook by replacing fops->submit_bio (previously known as q->make_request_fn; more on this in the next question). It seems to me there is at least some demand for data protection functionality outside of Device Mapper, and this hook could be so much cleaner if it were officially supported by the kernel. (I would be thrilled to learn that the answer is "because nobody has volunteered to write it.")

In the context of Multi-Queue Block IO (blk-mq), what is the future of the older Single Queue interface?

I have pieced together some of blk-mq's history simply by interacting with it as it has become necessary, reading code and git commit messages. What I have gleaned so far has me wondering if and when the future will be Multi-Queue only. I have read as many LWN articles as I can find on the subject, but the future of IO queuing is still unclear to me.

Thank you for your time and for helping me learn how to engage with the Linux kernel community. Any feedback on how and where to ask questions would also be greatly appreciated.
Kai Meyer | Sr. Software Engineer | arcserve.com
___
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
Re: VFAT i_pos value
On 12/02/2011 11:23 PM, OGAWA Hirofumi wrote: > OGAWA Hirofumi writes: > >> Kai Meyer writes: >> >>> Thanks for the helpful response. I'm not entirely sure I understand the >>> next part though. I hacked a dirty entry dumper tool: >>> >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> >>> int main(int argc, char** argv) >>> { >>> off_t pos = atoi(argv[2]); >>> unsigned long block; >>> off_t sector; >>> unsigned int offset; >>> int fd = open(argv[1], O_RDONLY); >>> char buf[512]; >>> struct msdos_dir_entry dirent; >>> block = pos / (4096 / 32); >>> sector = block * 8; >>> offset = pos % (4096 / 32); >>> printf("block %lu, sector %lu, offset %u\n", block, sector, >>> offset); >>> lseek(fd, sector * 512, SEEK_SET); >>> if (read(fd, buf, 512)< 0) { >>> fprintf(stderr, "Unable to read from device %s\n", >>> argv[1]); >>> return -1; >>> } >>> memcpy(&dirent, buf + offset, sizeof(dirent)); >>> printf("name %s\n", dirent.name); >>> printf("attr %u\n", dirent.attr); >>> printf("lcase %u\n", dirent.lcase); >>> printf("ctime_cs %u\n", dirent.ctime_cs); >>> printf("ctime %u\n", dirent.ctime); >>> printf("cdate %u\n", dirent.cdate); >>> printf("adate %u\n", dirent.adate); >>> printf("starthi %u\n", dirent.starthi); >>> printf("time %u\n", dirent.time); >>> printf("date %u\n", dirent.date); >>> printf("start %u\n", dirent.start); >>> printf("size %u\n", dirent.size); >>> } >>> >>> Here's what it outputs: >>> >>> ./vfat_entry /dev/sblsnap0 523793 >>> block 4092, sector 32736, offset 17 >>> name >>> attr 255 >>> lcase 255 >>> ctime_cs 255 >>> ctime 12799 >>> cdate 12670 >>> adate 8224 >>> starthi 8224 >>> time 23072 >>> date 21061 >>> start 32 >>> size 2171155456 >>> >>> So, I take starthi, and shift 16 bits left, then and in the start value. >>> That should give me the byte address of the first cluster of the file, >>> correct? >>> >>> Then I need to follow the cluster chain until I get a bad value. >> It looks like wrong as dirent. 
Did you use 523793 really? If so, I think
>> 523791 is correct value. :)
> And I didn't mention about offset correctly. offset means number of
> entries, not bytes offset. So, bytes offset is "buf + offset * 32".
> (32 == sizeof(struct msdos_dir_entry))
>
> Thanks.

Ok, I fixed the buf + offset * 32. I have a new volume, so the error is now:

fat_get_cluster: invalid cluster chain (i_pos 523781)

I added a few lines at the end to print the start value:

pos = dirent.starthi << 16;
pos |= dirent.start;
printf("next pos: %u\n", sector);

[root@dev1 sblsnap]# ./vfat_entry /dev/sblsnap0 523781
block 4092, sector 32736, offset 5
name 3~1 ZER
attr 32
lcase 0
ctime_cs 100
ctime 29092
cdate 16264
adate 16264
starthi 4
time 29092
date 16264
start 7427
size 37748736
next pos: 32736

[root@dev1 sblsnap]# ./vfat_entry /dev/sblsnap0 32736
block 255, sector 2040, offset 96
name
attr 0
lcase 0
ctime_cs 0
ctime 0
cdate 0
adate 0
starthi 0
time 0
date 0
start 0
size 0
next pos: 2040

Does that look like what would be causing my error? meaning, sector 2040 has bad data?
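A minimal sketch of the two calculations discussed in this message, written as plain userspace C. It shows the byte offset of an entry inside the block buffer (entry index times 32) and how starthi and start combine into a FAT32 first-cluster number. Note that the combined value is a cluster number, not a byte address and not an i_pos, so it cannot be fed back into the tool as its second argument. Note also that the added printf above prints `sector` rather than `pos`, which would explain why "next pos" repeats the sector value. The function names and the `data_start_sector` / `sectors_per_cluster` parameters are assumptions for illustration; the real values come from the volume's boot sector.

```c
#include <stdint.h>

#define DIRENT_SIZE 32u  /* sizeof(struct msdos_dir_entry) */

/* Byte offset of entry number `entry` inside the block buffer. */
static uint32_t dirent_byte_offset(uint32_t entry)
{
    return entry * DIRENT_SIZE;
}

/* FAT32 first-cluster number built from the two 16-bit dirent fields. */
static uint32_t first_cluster(uint16_t starthi, uint16_t start)
{
    return ((uint32_t)starthi << 16) | start;
}

/*
 * First sector of a data cluster. Cluster numbering starts at 2.
 * data_start_sector and sectors_per_cluster are hypothetical inputs
 * that would be derived from the boot sector of the volume.
 */
static uint64_t cluster_to_sector(uint32_t cluster,
                                  uint64_t data_start_sector,
                                  uint32_t sectors_per_cluster)
{
    return data_start_sector + (uint64_t)(cluster - 2) * sectors_per_cluster;
}
```

With the values printed above (starthi 4, start 7427), `first_cluster()` yields cluster 269571; that is the cluster whose FAT entry starts the chain to walk.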
Re: What does inconsistent lock state mean?
On 12/08/2011 07:47 AM, Srivatsa Bhat wrote:
> 2 things:
> 1. Documentation/lockdep-design.txt explains the "cryptic lock state symbols".
> 2. Please post the lockdep splat _exactly_ as it appears, and _in full_
> (and without line-wrapping, if possible). Usually lockdep is intelligent
> enough to tell you the possible scenario that would lock up your system.
> That gives a very good clue, in case you find it difficult to make out
> what is wrong from the cryptic symbols.
>
> Regards,
> Srivatsa S. Bhat

Oh, sorry. I suppose I only included things that made any sense to me. If I were to hazard a guess after reading through the design doc, it sounds like there's a problem with the context in which locks are being acquired? That seems odd to me, since I don't get the inconsistent lock state until I'm trying to spin_unlock &sblsnap_snapshot_table[i].sblsnap_lock (which is why I assume it's listed as one that's currently held).

Dec 7 15:52:20 dev2 kernel: =
Dec 7 15:52:20 dev2 kernel: [ INFO: inconsistent lock state ]
Dec 7 15:52:20 dev2 kernel: 2.6.32-220.el6.x86_64.debug #1
Dec 7 15:52:20 dev2 kernel: -
Dec 7 15:52:20 dev2 kernel: inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
Dec 7 15:52:20 dev2 kernel: tee/1966 [HC0[0]:SC0[0]:HE1:SE1] takes:
Dec 7 15:52:20 dev2 kernel: (&vblk->lock){?.-...}, at: [] sblsnap_snap_now+0x25a/0x2a0 [sblsnap]
Dec 7 15:52:20 dev2 kernel: {IN-HARDIRQ-W} state was registered at:
Dec 7 15:52:20 dev2 kernel: [] __lock_acquire+0x77a/0x1570
Dec 7 15:52:20 dev2 kernel: [] lock_acquire+0xa4/0x120
Dec 7 15:52:20 dev2 kernel: [] _spin_lock_irqsave+0x55/0xa0
Dec 7 15:52:20 dev2 kernel: [] blk_done+0x2b/0x110 [virtio_blk]
Dec 7 15:52:20 dev2 kernel: [] vring_interrupt+0x3c/0xd0 [virtio_ring]
Dec 7 15:52:20 dev2 kernel: [] handle_IRQ_event+0x50/0x160
Dec 7 15:52:20 dev2 kernel: [] handle_edge_irq+0xe0/0x170
Dec 7 15:52:20 dev2 kernel: [] handle_irq+0x49/0xa0
Dec 7 15:52:20 dev2 kernel: [] do_IRQ+0x6c/0xf0
Dec 7 15:52:20 dev2 kernel: [] ret_from_intr+0x0/0x16
Dec 7 15:52:20 dev2 kernel: [] default_idle+0x52/0xc0
Dec 7 15:52:20 dev2 kernel: [] cpu_idle+0xbb/0x110
Dec 7 15:52:20 dev2 kernel: [] start_secondary+0x211/0x254
Dec 7 15:52:20 dev2 kernel: irq event stamp: 4699
Dec 7 15:52:20 dev2 kernel: hardirqs last enabled at (4699): [] __kmalloc+0x241/0x330
Dec 7 15:52:20 dev2 kernel: hardirqs last disabled at (4698): [] __kmalloc+0x120/0x330
Dec 7 15:52:20 dev2 kernel: softirqs last enabled at (4696): [] __do_softirq+0x14a/0x200
Dec 7 15:52:20 dev2 kernel: softirqs last disabled at (4681): [] call_softirq+0x1c/0x30
Dec 7 15:52:20 dev2 kernel:
Dec 7 15:52:20 dev2 kernel: other info that might help us debug this:
Dec 7 15:52:20 dev2 kernel: 1 lock held by tee/1966:
Dec 7 15:52:20 dev2 kernel: #0: (&sblsnap_snapshot_table[i].sblsnap_lock){+.+.+.}, at: [] sblsnap_snap_now+0xac/0x2a0 [sblsnap]
Dec 7 15:52:20 dev2 kernel:
Dec 7 15:52:20 dev2 kernel: stack backtrace:
Dec 7 15:52:20 dev2 kernel: Pid: 1966, comm: tee Not tainted 2.6.32-220.el6.x86_64.debug #1
Dec 7 15:52:20 dev2 kernel: Call Trace:
Dec 7 15:52:20 dev2 kernel: [] ? print_usage_bug+0x177/0x180
Dec 7 15:52:20 dev2 kernel: [] ? mark_lock+0x35d/0x430
Dec 7 15:52:20 dev2 kernel: [] ? __lock_acquire+0x609/0x1570
Dec 7 15:52:20 dev2 kernel: [] ? trace_hardirqs_off+0xd/0x10
Dec 7 15:52:20 dev2 kernel: [] ? _spin_unlock_irqrestore+0x67/0x80
Dec 7 15:52:20 dev2 kernel: [] ? release_console_sem+0x203/0x250
Dec 7 15:52:20 dev2 kernel: [] ? lock_acquire+0xa4/0x120
Dec 7 15:52:20 dev2 kernel: [] ? sblsnap_snap_now+0x25a/0x2a0 [sblsnap]
Dec 7 15:52:20 dev2 kernel: [] ? _spin_lock+0x36/0x70
Dec 7 15:52:20 dev2 kernel: [] ? sblsnap_snap_now+0x25a/0x2a0 [sblsnap]
Dec 7 15:52:20 dev2 kernel: [] ? sblsnap_snap_now+0x25a/0x2a0 [sblsnap]
Dec 7 15:52:20 dev2 kernel: [] ? sblsnap_patch_blkdev+0x74/0x120 [sblsnap]
Dec 7 15:52:20 dev2 kernel: [] ? sblsnap_get_snapshot+0x1f/0x60 [sblsnap]
Dec 7 15:52:20 dev2 kernel: [] ? sblsnap_create_snapshot+0x69/0x120 [sblsnap]
Dec 7 15:52:20 dev2 kernel: [] ? sblsnap_config_write+0x26b/0x2c0 [sblsnap]
Dec 7 15:52:20 dev2 kernel: [] ? proc_file_write+0x73/0xb0
Dec 7 15:52:20 dev2 kernel: [] ? proc_file_write+0x0/0xb0
Dec 7 15:52:20 dev2 kernel: [] ? proc_reg_write+0x85/0xc0
Dec 7 15:52:20 dev2 kernel: [] ? vfs_write+0xb8/0x1a0
Dec 7 15:52:20 dev2 kernel: [] ? audit_syscall_entry+0x272/0x2a0
Dec 7 15:52:20 dev2 kernel: [] ? sys_write+0x51/0x90
Dec 7 15:52:20 dev2 kernel: [] ? system_call_fastpath+0x16/0x1b
What does inconsistent lock state mean?
I'm getting this when I try to spin_unlock a recently acquired lock with spin_lock. IRQs are still somewhat of a mystery to me, and cryptic lock state symbols (IN-HARDIRQ-W, HARDIRQ-ON-W) are unintelligible to me.

Dec 7 15:52:20 dev2 kernel: =
Dec 7 15:52:20 dev2 kernel: [ INFO: inconsistent lock state ]
Dec 7 15:52:20 dev2 kernel: 2.6.32-220.el6.x86_64.debug #1
Dec 7 15:52:20 dev2 kernel: -
Dec 7 15:52:20 dev2 kernel: inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.

Looking at lockdep.c isn't giving me any help either. It's obfuscated beyond my ability to grok by simply reading the code.

It seems like this portion should help me, but it doesn't

Dec 7 15:52:20 dev2 kernel: {IN-HARDIRQ-W} state was registered at:
Dec 7 15:52:20 dev2 kernel: [] __lock_acquire+0x77a/0x1570
Dec 7 15:52:20 dev2 kernel: [] lock_acquire+0xa4/0x120
Dec 7 15:52:20 dev2 kernel: [] _spin_lock_irqsave+0x55/0xa0
Dec 7 15:52:20 dev2 kernel: [] blk_done+0x2b/0x110 [virtio_blk]
Dec 7 15:52:20 dev2 kernel: [] vring_interrupt+0x3c/0xd0 [virtio_ring]
Dec 7 15:52:20 dev2 kernel: [] handle_IRQ_event+0x50/0x160
Dec 7 15:52:20 dev2 kernel: [] handle_edge_irq+0xe0/0x170
Dec 7 15:52:20 dev2 kernel: [] handle_irq+0x49/0xa0
Dec 7 15:52:20 dev2 kernel: [] do_IRQ+0x6c/0xf0
Dec 7 15:52:20 dev2 kernel: [] ret_from_intr+0x0/0x16
Dec 7 15:52:20 dev2 kernel: [] default_idle+0x52/0xc0
Dec 7 15:52:20 dev2 kernel: [] cpu_idle+0xbb/0x110
Dec 7 15:52:20 dev2 kernel: [] start_secondary+0x211/0x254

Then later it tells me that I'm holding 1 lock, which is the one that I mentioned at the beginning that was just recently locked.
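A toy model of the rule behind this report, as a hedged reading of the lockdep design document: {IN-HARDIRQ-W} records that the lock class was once taken from hard-IRQ context, {HARDIRQ-ON-W} records that it was taken with hard IRQs enabled, and a class that has both is flagged because an interrupt could arrive on the same CPU while the lock is held and spin on it forever. In the full splat posted in this thread the class is &vblk->lock: virtio_blk takes it in blk_done() via spin_lock_irqsave(), while sblsnap_snap_now() appears to take it with plain spin_lock(). The struct and function names below are hypothetical and this is a userspace illustration, not lockdep's implementation.

```c
#include <stdbool.h>

/* Usage bits recorded per lock class (simplified, hypothetical model). */
struct lock_usage {
    bool in_hardirq;  /* ever acquired from hard-IRQ context  */
    bool hardirq_on;  /* ever acquired with hard IRQs enabled */
};

/*
 * Record one acquisition. Returns false once the class has been seen
 * both inside a hard IRQ and with hard IRQs enabled, which is the
 * combination reported as "inconsistent lock state".
 */
static bool record_acquire(struct lock_usage *u, bool in_irq, bool irqs_enabled)
{
    if (in_irq)
        u->in_hardirq = true;
    if (irqs_enabled)
        u->hardirq_on = true;
    return !(u->in_hardirq && u->hardirq_on);
}
```

Under this model, taking the lock from process context with IRQs disabled (which is what spin_lock_irqsave() / spin_unlock_irqrestore() do in the kernel) keeps the class consistent, while a plain spin_lock() with IRQs enabled does not.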
Re: VFAT i_pos value
On 12/01/2011 12:20 PM, OGAWA Hirofumi wrote: > Kai Meyer writes: > >>> The i_pos means directory entry (contains inode information in unix-fs) >>> position, >>> >>> block number == i_pos / (logical-blocksize / 32) >>> offset == i_pos& (logical-blocksize / 32) >>> >>> the above position's directory entry contains information for >>> problematic file. This is how to use i_pos information. >>> >>> FWIW, in this error case, the cluster chain in FAT table which is >>> pointed by that entry, it has invalid cluster value. >>> >>> Thanks. >> If you would verify my math for me, I would appreciate it. >> >> In this case, my logical block size is 4096, because byte 13 of the 8Gb >> file system is 8, and I take that to be 8 * 512, which is 4096. So: >> >> block_number = 523791 / (4096 / 32) = 4092 >> offset = 523791 % (4096 / 32) = 15 // I assume you meant modulo in your >> original post, and not binary AND. > Whoops, you are right. (I forgot "-1") > >> So if the block_number is 4092, I would multiply that by 8 (sectors per >> logical block) to get the sector number: >> 32736 > Right. > >> Does the error indicate that sector contains the corrupted data? > No. > >> Or is it the sector that contains the information that points to the >> corrupted data? > Right. > > The i_pos is pointing a directory entry (include/linux/msdos_fs.h: > struct msdos_dir_entry). > > And starthi (if FAT32) and start contain the pointer to next cluster > number. That message was outputted when walking in cluster chain. > > If you want to see actual corrupted data, you can check the cluster > chain by pointing from that directory entry. > > Thanks. Thanks for the helpful response. I'm not entirely sure I understand the next part though. 
I hacked a dirty entry dumper tool:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <linux/msdos_fs.h>

int main(int argc, char** argv)
{
    off_t pos = atoi(argv[2]);
    unsigned long block;
    off_t sector;
    unsigned int offset;
    int fd = open(argv[1], O_RDONLY);
    char buf[512];
    struct msdos_dir_entry dirent;

    block = pos / (4096 / 32);
    sector = block * 8;
    offset = pos % (4096 / 32);
    printf("block %lu, sector %lu, offset %u\n", block, sector, offset);
    lseek(fd, sector * 512, SEEK_SET);
    if (read(fd, buf, 512) < 0) {
        fprintf(stderr, "Unable to read from device %s\n", argv[1]);
        return -1;
    }
    memcpy(&dirent, buf + offset, sizeof(dirent));
    printf("name %s\n", dirent.name);
    printf("attr %u\n", dirent.attr);
    printf("lcase %u\n", dirent.lcase);
    printf("ctime_cs %u\n", dirent.ctime_cs);
    printf("ctime %u\n", dirent.ctime);
    printf("cdate %u\n", dirent.cdate);
    printf("adate %u\n", dirent.adate);
    printf("starthi %u\n", dirent.starthi);
    printf("time %u\n", dirent.time);
    printf("date %u\n", dirent.date);
    printf("start %u\n", dirent.start);
    printf("size %u\n", dirent.size);
}

Here's what it outputs:

./vfat_entry /dev/sblsnap0 523793
block 4092, sector 32736, offset 17
name
attr 255
lcase 255
ctime_cs 255
ctime 12799
cdate 12670
adate 8224
starthi 8224
time 23072
date 21061
start 32
size 2171155456

So, I take starthi, and shift 16 bits left, then and in the start value. That should give me the byte address of the first cluster of the file, correct?

Then I need to follow the cluster chain until I get a bad value.

Thanks
Re: Understanding kmap/kunmap
Correction. The problem occurs when 8 bios of size 512 with 1 bvec each all share the same page. I made a bad assumption previously.

-Kai Meyer

On 12/01/2011 10:49 AM, Kai Meyer wrote:
> I want to be able to copy data into a struct bio *, so I use
> bio_for_each_segment to loop through each bvec, like so:
>
> void some_function(struct bio *bio, char *some_data) {
>     struct bio_vec *bvec;
>     int i;
>     unsigned int bio_so_far = 0;
>     bio_for_each_segment(bvec, bio, i) {
>         char *bio_buffer = __bio_kmap_atomic(bio, i, KM_USER0);
>         memcpy(bio_buffer, some_data + bio_so_far, bvec->bv_len);
>         __bio_kunmap_atomic(bio, KM_USER0);
>         bio_so_far += bvec->bv_len;
>     }
> }
>
> There's lots more to the function, but this is basically the distilled
> version with out any extra stuff.
>
> What I'm finding is that when the bio has multiple bvecs that share the
> same page, only the first bvec's data actually gets copied back up to
> user-space, the rest is garbage or null (meaning, what was there
> already). For instance, I see a lot of bios from vfat that are 4096
> bytes long but are comprised of 8 bvecs that are 512 bytes long that all
> have an offset to the same page.
>
> I've tried doing just one kmap_atomic on the page by keeping track of
> what the last page I kmap'ed was, but that didn't fix the problem either.
>
> Any documentation or high level explanation of kmap/kunmap or other
> ideas to try are welcome.
>
> -Kai Meyer
Understanding kmap/kunmap
I want to be able to copy data into a struct bio *, so I use bio_for_each_segment to loop through each bvec, like so:

void some_function(struct bio *bio, char *some_data) {
    struct bio_vec *bvec;
    int i;
    unsigned int bio_so_far = 0;
    bio_for_each_segment(bvec, bio, i) {
        char *bio_buffer = __bio_kmap_atomic(bio, i, KM_USER0);
        memcpy(bio_buffer, some_data + bio_so_far, bvec->bv_len);
        __bio_kunmap_atomic(bio, KM_USER0);
        bio_so_far += bvec->bv_len;
    }
}

There's lots more to the function, but this is basically the distilled version with out any extra stuff.

What I'm finding is that when the bio has multiple bvecs that share the same page, only the first bvec's data actually gets copied back up to user-space, the rest is garbage or null (meaning, what was there already). For instance, I see a lot of bios from vfat that are 4096 bytes long but are comprised of 8 bvecs that are 512 bytes long that all have an offset to the same page.

I've tried doing just one kmap_atomic on the page by keeping track of what the last page I kmap'ed was, but that didn't fix the problem either.

Any documentation or high level explanation of kmap/kunmap or other ideas to try are welcome.

-Kai Meyer
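A userspace sketch of the copy loop above, assuming a hypothetical `struct seg` that stands in for struct bio_vec (page pointer, offset, length). It shows the one invariant the loop depends on: each segment's destination is the page base plus that segment's own offset, so eight 512-byte segments sharing one 4096-byte page fill it completely. In the 2.6.32-era include/linux/bio.h, __bio_kmap_atomic() appears to add bv_offset to the mapping already, and __bio_kunmap_atomic() appears to expect the mapped address as its first argument rather than the bio; both points are worth checking against the tree in use.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for struct bio_vec. */
struct seg {
    unsigned char *page;    /* base of the (already mapped) page */
    unsigned int   offset;  /* like bv_offset */
    unsigned int   len;     /* like bv_len */
};

/* Copy `data` into each segment in order; returns the bytes copied. */
static size_t copy_to_segs(const struct seg *segs, size_t nsegs,
                           const unsigned char *data)
{
    size_t done = 0;
    size_t i;

    for (i = 0; i < nsegs; i++) {
        /* Destination is the page base plus this segment's offset. */
        memcpy(segs[i].page + segs[i].offset, data + done, segs[i].len);
        done += segs[i].len;
    }
    return done;
}
```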
Re: VFAT i_pos value
On 12/01/2011 07:38 AM, OGAWA Hirofumi wrote:
> Kai Meyer writes:
>
>> I'm getting this error:
>> FAT: Filesystem error (dev sblsnap0)
>> fat_get_cluster: invalid cluster chain (i_pos 523791)
>>
>> I'm wondering if there was a way to figure out what sector is causing
>> the error? I would like to try and track down what is changing that
>> sector and fix the problem. Is there a straight forward way to convert
>> i_pos to a sector value? I've been staring at the fat.c and fat.h code
>> all morning, and I'm having trouble grok'ing the flow.
> The i_pos means directory entry (contains inode information in unix-fs)
> position,
>
> block number == i_pos / (logical-blocksize / 32)
> offset == i_pos& (logical-blocksize / 32)
>
> the above position's directory entry contains information for
> problematic file. This is how to use i_pos information.
>
> FWIW, in this error case, the cluster chain in FAT table which is
> pointed by that entry, it has invalid cluster value.
>
> Thanks.

If you would verify my math for me, I would appreciate it.

In this case, my logical block size is 4096, because byte 13 of the 8Gb file system is 8, and I take that to be 8 * 512, which is 4096. So:

block_number = 523791 / (4096 / 32) = 4092
offset = 523791 % (4096 / 32) = 15 // I assume you meant modulo in your original post, and not binary AND.

So if the block_number is 4092, I would multiply that by 8 (sectors per logical block) to get the sector number:
32736

Does the error indicate that sector contains the corrupted data? Or is it the sector that contains the information that points to the corrupted data? Or is it something entirely different?

-Kai Meyer
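The arithmetic in this message can be sketched as a small userspace helper. It assumes 32-byte directory entries and 512-byte sectors, as in the thread; the struct and function names are made up for illustration. For power-of-two block sizes, the modulo below is equivalent to the AND with (entries per block - 1) that the follow-up in this thread mentions.

```c
#include <stdint.h>

#define DIRENT_SIZE 32u   /* bytes per FAT directory entry */
#define SECTOR_SIZE 512u  /* assumed sector size */

struct ipos_location {
    uint64_t block;   /* logical block holding the directory entry */
    uint32_t entry;   /* entry index inside that block */
    uint64_t sector;  /* first 512-byte sector of that block */
};

/* Split an i_pos value into block, entry index and starting sector. */
static struct ipos_location locate_ipos(uint64_t i_pos, uint32_t block_size)
{
    uint32_t entries_per_block = block_size / DIRENT_SIZE;
    struct ipos_location loc;

    loc.block  = i_pos / entries_per_block;
    loc.entry  = (uint32_t)(i_pos % entries_per_block);
    loc.sector = loc.block * (block_size / SECTOR_SIZE);
    return loc;
}
```

For i_pos 523791 and a 4096-byte block this gives block 4092, entry 15 and sector 32736, matching the numbers worked out by hand above.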
VFAT i_pos value
I'm getting this error:
FAT: Filesystem error (dev sblsnap0)
fat_get_cluster: invalid cluster chain (i_pos 523791)

I'm wondering if there was a way to figure out what sector is causing the error? I would like to try and track down what is changing that sector and fix the problem. Is there a straight forward way to convert i_pos to a sector value? I've been staring at the fat.c and fat.h code all morning, and I'm having trouble grok'ing the flow.

-Kai Meyer
Freeing work_struct memory
I've got a bug I'm having trouble identifying. It seems like it could be related to my work_struct usage. I essentially have this:

struct my_worker {
    struct work_struct work;
    /* some other data */
};

void worker_fn(struct work_struct *work)
{
    struct my_worker *worker = container_of(work, struct my_worker, work);
    /* ... do some stuff ... */
    kfree(worker);
}

void worker_caller(void)
{
    struct my_worker *worker = kmalloc(sizeof(*worker), GFP_KERNEL);
    INIT_WORK(&worker->work, worker_fn);
    /* ... add some other stuff to *worker ... */
    schedule_work(&worker->work);
}

I frequently get a kernel panic with a specific test, but the stack trace is rarely the same, which seems to indicate to me that I'm corrupting data somewhere.

So my question is: Can I free the memory for "struct my_worker *worker" inside worker_fn? Or does the work_queue stuff need to continue to use the "struct work_struct work" member after the end of worker_fn?

-Kai Meyer
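A userspace sketch of the pattern in question, with a hypothetical `struct fake_work` standing in for struct work_struct and malloc/free standing in for kmalloc/kfree. It shows the two conditions under which freeing inside the handler is safe: the handler recovers the container with container_of(), and whoever invoked the handler never touches the work item again after the call. As far as kernel/workqueue.c goes, a comment there states that freeing the work_struct from inside its own function is permitted, which suggests the pattern itself is fine as long as nothing else still references the item; that reading should be verified against the kernel version in use.

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct work_struct. */
struct fake_work {
    void (*func)(struct fake_work *work);
};

struct my_worker {
    struct fake_work work;
    int payload;
    int *result;
};

static void worker_fn(struct fake_work *work)
{
    struct my_worker *worker = container_of(work, struct my_worker, work);

    *worker->result = worker->payload;
    free(worker);  /* nothing may touch `work` after this point */
}

/* Runs one item the way a queue would: fetch func, call it, never look back. */
static int run_once(int payload)
{
    int result = -1;
    struct my_worker *worker = malloc(sizeof(*worker));
    void (*func)(struct fake_work *);

    if (!worker)
        return -1;  /* the allocation can fail; kmalloc can too */
    worker->work.func = worker_fn;
    worker->payload = payload;
    worker->result = &result;

    func = worker->work.func;
    func(&worker->work);  /* worker is freed inside the handler */
    return result;
}
```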
Re: Generic I/O
On 11/15/2011 11:13 AM, mic...@michaelblizek.twilightparadox.com wrote:
> Hi!
>
> On 12:15 Mon 14 Nov , Kai Meyer wrote:
> ...
>
>> My
>> caller function has an atomic_t value that I set equal to the number of
>> bios I want to submit. Then I pass a pointer to that atomic_t around to
>> each of the bios which decrement it in the endio function for that bio.
>>
>> Then the caller does this:
>> while(atomic_read(numbios)> 0)
>> msleep(1);
>>
>> I'm finding the msleep(1) is a really really really long time,
>> relatively. It seems to work ok if I just have an empty loop, but it
>> also seems to me like I'm re-inventing a wheel here.
> ...
>
> You might want to take a look at wait queues (the kernel equivalent to pthread
> "conditions"). Basically, instead of calling msleep(), you call
> wait_event(). In the function which decrements numbios, you check whether it
> is 0 and if so call wake_up().
>
> -Michi

That sounds very promising. When I read up on wait_event here:
lxr.linux.no/#linux+v2.6.32/include/linux/wait.h#L191
it sounds like it's basically doing the same thing. I would call it like so:

wait_event(wq, atomic_read(numbios) == 0);

To make sure I understand, this seems very much like what I'm doing, except I'm being woken up every time a bio finishes instead of being woken up once every millisecond. That is, I'm assuming I would use the same wait queue for all my bios.

During my testing, when I do a lot of disk I/O, I may potentially have hundreds of threads waiting on anywhere between 1 and 32 bios. Help me understand the sort of impact you think I might see between having hundreds waiting for a millisecond, and having hundreds get woken up each time a bio completes. It seems like it would be very helpful in low I/O scenarios, especially when there are fast disks involved. I'm concerned that during heavy I/O loads, I'll be doing a lot of atomic_reads, and I have the impression that atomic_read isn't the cheapest operation.

-Kai Meyer
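A userspace model of the completion side of this scheme, using C11 atomics in place of the kernel's atomic_t. It illustrates the point suggested in the reply: if the endio path decrements and tests in one step, only the completion that reaches zero issues a wake-up, so a waiter is woken once per batch rather than once per bio. In the kernel the corresponding calls would be atomic_dec_and_test() followed by wake_up() on the completion side and wait_event() on the caller side; giving each batch its own wait queue head, or using a struct completion, would avoid waking waiters that belong to other batches. The `io_batch` names are hypothetical.

```c
#include <stdatomic.h>

/* One batch of outstanding I/Os (hypothetical userspace model). */
struct io_batch {
    atomic_int remaining;  /* bios still in flight */
    int wakeups;           /* times the waiter would have been woken */
};

static void batch_init(struct io_batch *b, int nr)
{
    atomic_init(&b->remaining, nr);
    b->wakeups = 0;
}

/* Completion callback: only the final completion wakes the waiter. */
static void batch_io_done(struct io_batch *b)
{
    /* atomic_fetch_sub returns the old value; 1 means we reached zero. */
    if (atomic_fetch_sub(&b->remaining, 1) == 1)
        b->wakeups++;  /* stands in for wake_up(&wq) */
}
```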
Generic I/O
I'm finding it's really simple to write generic I/O functions for block devices (via a "struct block_device") to mimic the posix read() and write() functions (I have to supply the position, since I don't have a fd to keep a position for me, but that's perfectly ok).

I've got a little hack that allows me to run synchronously or asynchronously, relying on submit_bio() to create the threads for me. My caller function has an atomic_t value that I set equal to the number of bios I want to submit. Then I pass a pointer to that atomic_t around to each of the bios which decrement it in the endio function for that bio.

Then the caller does this:

while(atomic_read(numbios) > 0)
    msleep(1);

I'm finding the msleep(1) is a really really really long time, relatively. It seems to work ok if I just have an empty loop, but it also seems to me like I'm re-inventing a wheel here. Are there mechanisms that are better suited for waiting for tasks to complete? Or even for generic block I/O functions?

-Kai Meyer
Re: Spinlocks and interrupts
On 11/10/2011 11:58 PM, Rajat Sharma wrote: > For most of the block drivers bio_endio runs in a context of its > tasklet, so it is indeed atomic context. > > -Rajat > > On Fri, Nov 11, 2011 at 4:50 AM, Kai Meyer wrote: >> >> On 11/10/2011 04:00 PM, Jeff Haran wrote: >>>> -Original Message- >>>> From: kernelnewbies-boun...@kernelnewbies.org [mailto:kernelnewbies- >>>> boun...@kernelnewbies.org] On Behalf Of Kai Meyer >>>> Sent: Thursday, November 10, 2011 1:55 PM >>>> To: kernelnewbies@kernelnewbies.org >>>> Subject: Re: Spinlocks and interrupts >>>> >>>> Alright, to summarize, for my benefit mostly, >>>> >>>> I'm writing a block device driver, which has 2 entry points into my >>> code >>>> that will reach this critical section. It's either the make request >>>> function for the block device, or the resulting bio->bi_end_io >>> function. >>>> I do some waiting with msleep() (for now) from the make request >>> function >>>> entry point, so I'm confident that entry point is not in an atomic >>>> context. I also only end up requesting the critical section to call >>>> kmalloc from this context, which is why I never ran into the >>> scheduling >>>> while atomic issue before. >>>> >>>> I'm fairly certain the critical section executes in thread context not >>>> interrupt context from either entry point. >>>> >>>> I'm certain that the spinlock_t is only ever used in one function (a I >>>> posted a simplified version of the critical section earlier). >>>> >>>> It seems that the critical section is often called in an atomic >>> context. >>>> The spin_lock function sounds like it will only cause a second call to >>>> spin_lock to spin if it is called on a separate core. >>>> >>>> But, since I'm certain the critical section is never called from >>>> interrupt context, only thread context, the fact that pre-emption is >>>> disabled on the core should provide the protection I need with out >>>> having to disable IRQs. 
Disabling IRQs would prevent an interrupt from >>>> occurring while the lock is acquired. I would like to avoid disabling >>>> interrupts if I don't need to. >>>> >>>> So it sounds like spin_lock/spin_unlock is the correct choice? >>>> >>>> In addition, I'd like to be more confident in my assumptions above. >>> Can >>>> I test for atomic context? For instance, I know that you can call >>>> irqs_disabled(), is there a similar is_atomic() function I can call? I >>>> would like to put a few calls in different places to learn what sort >>> of >>>> context I'm. >>>> >>>> -Kai Meyer >>>> >>>> On 11/10/2011 12:19 PM, Jeff Haran wrote: >>>>>> -Original Message- >>>>>> From: kernelnewbies- >>>> bounces+jharan=bytemobile@kernelnewbies.org >>>>>> [mailto:kernelnewbies- >>>>>> bounces+jharan=bytemobile@kernelnewbies.org] On Behalf Of >>>> Dave >>>>>> Hylands >>>>>> Sent: Thursday, November 10, 2011 11:07 AM >>>>>> To: Kai Meyer >>>>>> Cc: kernelnewbies@kernelnewbies.org >>>>>> Subject: Re: Spinlocks and interrupts >>>>>> >>>>>> Hi Kai, >>>>>> >>>>>> On Thu, Nov 10, 2011 at 10:14 AM, Kai Meyer wrote: >>>>>>> I think I get it. I'm hitting the scheduling while atomic because >>>>> I'm >>>>>>> calling my function from a struct bio's endio function, which is >>>>>>> probably running with a lock held somewhere else, and then my >>> mutex >>>>>>> sleeps, while the spin_lock functions do not sleep. >>>>>> Actually, just holding a lock doesn't create an atomic context. >>>>> I believe on kernels with kernel pre-emption enabled the act of >>> taking >>>>> the lock disables pre-emption. If it didn't work this way you could >>> end >>>>> up taking the lock in one process context and while the lock was >>> held >>>>> get pre-empted. Then another process tries to take the lock and you >>> dead >>>>> lock. >
Re: Spinlocks and interrupts
On 11/10/2011 04:00 PM, Jeff Haran wrote: >> -Original Message- >> From: kernelnewbies-boun...@kernelnewbies.org [mailto:kernelnewbies- >> boun...@kernelnewbies.org] On Behalf Of Kai Meyer >> Sent: Thursday, November 10, 2011 1:55 PM >> To: kernelnewbies@kernelnewbies.org >> Subject: Re: Spinlocks and interrupts >> >> Alright, to summarize, for my benefit mostly, >> >> I'm writing a block device driver, which has 2 entry points into my > code >> that will reach this critical section. It's either the make request >> function for the block device, or the resulting bio->bi_end_io > function. >> I do some waiting with msleep() (for now) from the make request > function >> entry point, so I'm confident that entry point is not in an atomic >> context. I also only end up requesting the critical section to call >> kmalloc from this context, which is why I never ran into the > scheduling >> while atomic issue before. >> >> I'm fairly certain the critical section executes in thread context not >> interrupt context from either entry point. >> >> I'm certain that the spinlock_t is only ever used in one function (a I >> posted a simplified version of the critical section earlier). >> >> It seems that the critical section is often called in an atomic > context. >> The spin_lock function sounds like it will only cause a second call to >> spin_lock to spin if it is called on a separate core. >> >> But, since I'm certain the critical section is never called from >> interrupt context, only thread context, the fact that pre-emption is >> disabled on the core should provide the protection I need with out >> having to disable IRQs. Disabling IRQs would prevent an interrupt from >> occurring while the lock is acquired. I would like to avoid disabling >> interrupts if I don't need to. >> >> So it sounds like spin_lock/spin_unlock is the correct choice? >> >> In addition, I'd like to be more confident in my assumptions above. > Can >> I test for atomic context? 
For instance, I know that you can call >> irqs_disabled(), is there a similar is_atomic() function I can call? I >> would like to put a few calls in different places to learn what sort > of >> context I'm. >> >> -Kai Meyer >> >> On 11/10/2011 12:19 PM, Jeff Haran wrote: >>>> -Original Message- >>>> From: kernelnewbies- >> bounces+jharan=bytemobile@kernelnewbies.org >>>> [mailto:kernelnewbies- >>>> bounces+jharan=bytemobile@kernelnewbies.org] On Behalf Of >> Dave >>>> Hylands >>>> Sent: Thursday, November 10, 2011 11:07 AM >>>> To: Kai Meyer >>>> Cc: kernelnewbies@kernelnewbies.org >>>> Subject: Re: Spinlocks and interrupts >>>> >>>> Hi Kai, >>>> >>>> On Thu, Nov 10, 2011 at 10:14 AM, Kai Meyer wrote: >>>>> I think I get it. I'm hitting the scheduling while atomic because >>> I'm >>>>> calling my function from a struct bio's endio function, which is >>>>> probably running with a lock held somewhere else, and then my > mutex >>>>> sleeps, while the spin_lock functions do not sleep. >>>> Actually, just holding a lock doesn't create an atomic context. >>> I believe on kernels with kernel pre-emption enabled the act of > taking >>> the lock disables pre-emption. If it didn't work this way you could > end >>> up taking the lock in one process context and while the lock was > held >>> get pre-empted. Then another process tries to take the lock and you > dead >>> lock. >>> >>> Jeff Haran >>> > Kai, you might want to try bottom posting. It is the standard on these > lists. It makes it easier for others to follow the thread. > > I know of no kernel call that you can make to test for current execution > context. There are the in_irq(), in_interrupt() and in_softirq() macros > in hardirq.h, but when I've looked at the code that implements them I've > come to the conclusion that they sometimes will lie. in_softirq() > returns non-zero if you are in a software IRQ. Fair enough. 
But based on > my reading in the past it's looked to me like it will also return > non-zero if you've disabled bottom halves from process context with say > a call to spin_lock_bh(). > > It would be nice if there were some way of asking the kernel what > context you are in, for debugging if for no other reason, but if it's > there I haven't found it. > > I'd love to be proven wrong here, BTW. If others know better, please > enlighten me. > > Jeff Haran > > > > > ___ > Kernelnewbies mailing list > Kernelnewbies@kernelnewbies.org > http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies I try to remember to bottom post on message lists, but obviously I've been negligent :) Perhaps I'll just add some calls to msleep() at various places to help me identify when portions of my code are in an atomic context, just to help me learn what's going on. -Kai Meyer ___ Kernelnewbies mailing list Kernelnewbies@kernelnewbies.org http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
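Jeff's remark that in_softirq() can "lie" is easier to see with a small user-space model of how the kernel packs context information into a single preemption counter. This is only a sketch: the field widths and every `model_*` name are assumptions made for illustration, not copies of any particular kernel version. It models the counter as bumped by taking a spinlock (on a preemptible kernel), by disabling bottom halves, and by entering a softirq; testing the whole softirq field cannot tell the last two apart, while testing only its lowest bit can.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout only: widths are assumed, not taken from a specific kernel. */
#define MODEL_PREEMPT_BITS      8
#define MODEL_SOFTIRQ_BITS      8
#define MODEL_SOFTIRQ_SHIFT     MODEL_PREEMPT_BITS

#define MODEL_PREEMPT_OFFSET    1u
#define MODEL_SOFTIRQ_OFFSET    (1u << MODEL_SOFTIRQ_SHIFT)
#define MODEL_SOFTIRQ_MASK      (((1u << MODEL_SOFTIRQ_BITS) - 1u) << MODEL_SOFTIRQ_SHIFT)
/* Disabling bottom halves adds two softirq units, so the field's low bit stays clear. */
#define MODEL_BH_DISABLE_OFFSET (2u * MODEL_SOFTIRQ_OFFSET)

struct ctx {
    uint32_t count;   /* stand-in for the per-task preemption counter */
};

/* Each helper only adjusts the counter; no real locking happens in this model. */
static inline void model_spin_lock(struct ctx *c)     { c->count += MODEL_PREEMPT_OFFSET; }
static inline void model_spin_unlock(struct ctx *c)   { c->count -= MODEL_PREEMPT_OFFSET; }
static inline void model_bh_disable(struct ctx *c)    { c->count += MODEL_BH_DISABLE_OFFSET; }
static inline void model_bh_enable(struct ctx *c)     { c->count -= MODEL_BH_DISABLE_OFFSET; }
static inline void model_softirq_enter(struct ctx *c) { c->count += MODEL_SOFTIRQ_OFFSET; }
static inline void model_softirq_exit(struct ctx *c)  { c->count -= MODEL_SOFTIRQ_OFFSET; }

/* Non-zero if any bit of the softirq field is set (serving OR merely disabled). */
static inline int model_in_softirq(const struct ctx *c)
{
    return (c->count & MODEL_SOFTIRQ_MASK) != 0;
}

/* Non-zero only while a softirq handler is actually running. */
static inline int model_in_serving_softirq(const struct ctx *c)
{
    return (c->count & MODEL_SOFTIRQ_OFFSET) != 0;
}

/* Non-zero whenever anything at all has raised the counter. */
static inline int model_in_atomic(const struct ctx *c)
{
    return c->count != 0;
}
```

In the real kernel the closest counterparts are in_serving_softirq() and in_atomic(); the latter carries a warning in its own comments that it cannot see spinlocks held on non-preemptible kernels, so it is a debugging aid at best rather than a reliable test.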
Re: Spinlocks and interrupts
Alright, to summarize, mostly for my own benefit:

I'm writing a block device driver, which has two entry points into my code that will reach this critical section: either the make request function for the block device, or the resulting bio->bi_end_io function. I do some waiting with msleep() (for now) from the make request function entry point, so I'm confident that entry point is not in an atomic context. I also only end up requesting the critical section to call kmalloc from this context, which is why I never ran into the "scheduling while atomic" issue before.

I'm fairly certain the critical section executes in thread context, not interrupt context, from either entry point.

I'm certain that the spinlock_t is only ever used in one function (I posted a simplified version of the critical section earlier).

It seems that the critical section is often called in an atomic context. The spin_lock function sounds like it will only cause a second call to spin_lock to spin if it is called on a separate core.

But since I'm certain the critical section is never called from interrupt context, only thread context, the fact that pre-emption is disabled on the core should provide the protection I need without having to disable IRQs. Disabling IRQs would prevent an interrupt from occurring while the lock is acquired. I would like to avoid disabling interrupts if I don't need to.

So it sounds like spin_lock/spin_unlock is the correct choice?

In addition, I'd like to be more confident in my assumptions above. Can I test for atomic context? For instance, I know that you can call irqs_disabled(); is there a similar is_atomic() function I can call? I would like to put a few calls in different places to learn what sort of context I'm in.
-Kai Meyer On 11/10/2011 12:19 PM, Jeff Haran wrote: >> -Original Message- >> From: kernelnewbies-bounces+jharan=bytemobile@kernelnewbies.org >> [mailto:kernelnewbies- >> bounces+jharan=bytemobile@kernelnewbies.org] On Behalf Of Dave >> Hylands >> Sent: Thursday, November 10, 2011 11:07 AM >> To: Kai Meyer >> Cc: kernelnewbies@kernelnewbies.org >> Subject: Re: Spinlocks and interrupts >> >> Hi Kai, >> >> On Thu, Nov 10, 2011 at 10:14 AM, Kai Meyer wrote: >>> I think I get it. I'm hitting the scheduling while atomic because > I'm >>> calling my function from a struct bio's endio function, which is >>> probably running with a lock held somewhere else, and then my mutex >>> sleeps, while the spin_lock functions do not sleep. >> Actually, just holding a lock doesn't create an atomic context. > I believe on kernels with kernel pre-emption enabled the act of taking > the lock disables pre-emption. If it didn't work this way you could end > up taking the lock in one process context and while the lock was held > get pre-empted. Then another process tries to take the lock and you dead > lock. > > Jeff Haran > > > > > ___ > Kernelnewbies mailing list > Kernelnewbies@kernelnewbies.org > http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies ___ Kernelnewbies mailing list Kernelnewbies@kernelnewbies.org http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
Re: Spinlocks and interrupts
I think I get it. I'm hitting the scheduling while atomic because I'm calling my function from a struct bio's endio function, which is probably running with a lock held somewhere else, and then my mutex sleeps, while the spin_lock functions do not sleep. Perhaps I need to learn more about the context in which my endio function is being called. On 11/10/2011 11:02 AM, Kai Meyer wrote: > Well, I changed my code to use a mutex instead of a spinlock, and now I get: > BUG: scheduling while atomic: swapper/0/0x1001 > All I changed was the spinlock_t to a struct mutex, and call mutex_init, > mutex_lock, and mutex_unlock where I was previously calling the > spin_lock variations. I'm confused. What does mutex_lock do besides set > values in an atomic_t? > > -Kai Meyer > > On 11/10/2011 10:02 AM, Kai Meyer wrote: >> On 11/09/2011 08:38 PM, Dave Hylands wrote: >>> Hi Kai, >>> >>> On Wed, Nov 9, 2011 at 3:12 PM, Kai Meyerwrote: >>>> Ok, I need mutual exclusion on a data structure regardless of interrupts >>>> and core. It sounds like it can be done by using a spinlock and >>>> disabling interrupts, but you mention that "spinlocks are intended to >>>> provide mutual exclsion between interrupt context and non-interrupt >>>> context." Should I be using a semaphore (mutex) instead? >>> It depends. If the function is only called from thread context, then >>> you probably want to use a mutex. If there is a possibility that it >>> might be called from interrupt context, then you can't use a mutex. >>> >>> Also, remember that spin-locks are no-ops on a single processor >>> machine, so as coded, you have no protection on a single-processor >>> machine if you're calling from thread context. >>> >> To make sure I understand you, it sounds like there's two contexts I >> need to be concerned about, thread context and interrupt context. As far >> as I can be sure, this code will only run in thread context. 
If you >> could verify for me that a block device's make request function is only >> reached in thread context, then that would make me doubly sure. >>>> Perhaps I could explain my problem with some code: >>>> struct my_struct *get_data(spinlock_t *mylock, int ALLOC_DATA) >>>> { >>>>struct my_struct *mydata = NULL; >>>>spin_lock(mylock); >>>>if (test_bit(index, mybitmap)) >>>>mydata = retrieve_data(); >>>>if (!mydata&&ALLOC_DATA) { >>>>mydata = alloc_data(); >>>>set_bit(index, mybitmap); >>>>} >>>>spin_unlock(mylock); >>>>return mydata; >>>> } >>>> >>>> I need to prevent retrieve_data from being called if the index bit is >>>> set in mybitmap and alloc_data has not completed, so I use a bitmap to >>>> indicate that alloc_data has completed. I also need to protect >>>> alloc_data from being run multiple times, so I use the spin_lock to >>>> ensure that test_bit (and possibly retrieve_data) is not run while >>>> alloc_data is being run (because it runs while the bit is cleared). >>> If alloc_data might block, then you can't disable interrupts and you >>> definitely shouldn't be using spinlocks. >>> >> alloc_data will call kmalloc(size, GFP_KERNEL), which I think may block, >> so disabling irqs is out. >> >> Between thread context and kmalloc with GFP_KERNEL, it sounds like your >> suggestion would be to use a mutex. Is that correct? >> >> -Kai Meyer >> >> ___ >> Kernelnewbies mailing list >> Kernelnewbies@kernelnewbies.org >> http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies > ___ > Kernelnewbies mailing list > Kernelnewbies@kernelnewbies.org > http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies ___ Kernelnewbies mailing list Kernelnewbies@kernelnewbies.org http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
Re: Spinlocks and interrupts
Well, I changed my code to use a mutex instead of a spinlock, and now I get: BUG: scheduling while atomic: swapper/0/0x1001 All I changed was the spinlock_t to a struct mutex, and call mutex_init, mutex_lock, and mutex_unlock where I was previously calling the spin_lock variations. I'm confused. What does mutex_lock do besides set values in an atomic_t? -Kai Meyer On 11/10/2011 10:02 AM, Kai Meyer wrote: > > On 11/09/2011 08:38 PM, Dave Hylands wrote: >> Hi Kai, >> >> On Wed, Nov 9, 2011 at 3:12 PM, Kai Meyer wrote: >>> Ok, I need mutual exclusion on a data structure regardless of interrupts >>> and core. It sounds like it can be done by using a spinlock and >>> disabling interrupts, but you mention that "spinlocks are intended to >>> provide mutual exclsion between interrupt context and non-interrupt >>> context." Should I be using a semaphore (mutex) instead? >> It depends. If the function is only called from thread context, then >> you probably want to use a mutex. If there is a possibility that it >> might be called from interrupt context, then you can't use a mutex. >> >> Also, remember that spin-locks are no-ops on a single processor >> machine, so as coded, you have no protection on a single-processor >> machine if you're calling from thread context. >> > To make sure I understand you, it sounds like there's two contexts I > need to be concerned about, thread context and interrupt context. As far > as I can be sure, this code will only run in thread context. If you > could verify for me that a block device's make request function is only > reached in thread context, then that would make me doubly sure. 
>>> Perhaps I could explain my problem with some code: >>> struct my_struct *get_data(spinlock_t *mylock, int ALLOC_DATA) >>> { >>> struct my_struct *mydata = NULL; >>> spin_lock(mylock); >>> if (test_bit(index, mybitmap)) >>> mydata = retrieve_data(); >>> if (!mydata&& ALLOC_DATA) { >>> mydata = alloc_data(); >>> set_bit(index, mybitmap); >>> } >>> spin_unlock(mylock); >>> return mydata; >>> } >>> >>> I need to prevent retrieve_data from being called if the index bit is >>> set in mybitmap and alloc_data has not completed, so I use a bitmap to >>> indicate that alloc_data has completed. I also need to protect >>> alloc_data from being run multiple times, so I use the spin_lock to >>> ensure that test_bit (and possibly retrieve_data) is not run while >>> alloc_data is being run (because it runs while the bit is cleared). >> If alloc_data might block, then you can't disable interrupts and you >> definitely shouldn't be using spinlocks. >> > alloc_data will call kmalloc(size, GFP_KERNEL), which I think may block, > so disabling irqs is out. > > Between thread context and kmalloc with GFP_KERNEL, it sounds like your > suggestion would be to use a mutex. Is that correct? > > -Kai Meyer > > ___ > Kernelnewbies mailing list > Kernelnewbies@kernelnewbies.org > http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies ___ Kernelnewbies mailing list Kernelnewbies@kernelnewbies.org http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
Re: Spinlocks and interrupts
On 11/09/2011 08:38 PM, Dave Hylands wrote: > Hi Kai, > > On Wed, Nov 9, 2011 at 3:12 PM, Kai Meyer wrote: >> Ok, I need mutual exclusion on a data structure regardless of interrupts >> and core. It sounds like it can be done by using a spinlock and >> disabling interrupts, but you mention that "spinlocks are intended to >> provide mutual exclsion between interrupt context and non-interrupt >> context." Should I be using a semaphore (mutex) instead? > It depends. If the function is only called from thread context, then > you probably want to use a mutex. If there is a possibility that it > might be called from interrupt context, then you can't use a mutex. > > Also, remember that spin-locks are no-ops on a single processor > machine, so as coded, you have no protection on a single-processor > machine if you're calling from thread context. > To make sure I understand you, it sounds like there's two contexts I need to be concerned about, thread context and interrupt context. As far as I can be sure, this code will only run in thread context. If you could verify for me that a block device's make request function is only reached in thread context, then that would make me doubly sure. >> Perhaps I could explain my problem with some code: >> struct my_struct *get_data(spinlock_t *mylock, int ALLOC_DATA) >> { >> struct my_struct *mydata = NULL; >> spin_lock(mylock); >> if (test_bit(index, mybitmap)) >> mydata = retrieve_data(); >> if (!mydata&& ALLOC_DATA) { >> mydata = alloc_data(); >> set_bit(index, mybitmap); >> } >> spin_unlock(mylock); >> return mydata; >> } >> >> I need to prevent retrieve_data from being called if the index bit is >> set in mybitmap and alloc_data has not completed, so I use a bitmap to >> indicate that alloc_data has completed. 
I also need to protect >> alloc_data from being run multiple times, so I use the spin_lock to >> ensure that test_bit (and possibly retrieve_data) is not run while >> alloc_data is being run (because it runs while the bit is cleared). > If alloc_data might block, then you can't disable interrupts and you > definitely shouldn't be using spinlocks. > alloc_data will call kmalloc(size, GFP_KERNEL), which I think may block, so disabling irqs is out. Between thread context and kmalloc with GFP_KERNEL, it sounds like your suggestion would be to use a mutex. Is that correct? -Kai Meyer ___ Kernelnewbies mailing list Kernelnewbies@kernelnewbies.org http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
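One common way out of the "allocation may block, but the check needs a lock" bind is to allocate with no lock held and only take the lock to check and publish the result. The sketch below is a user-space model of that shape, not the driver's code: `model_lock` is a hypothetical test-and-set stand-in for spinlock_t, calloc() stands in for kmalloc(size, GFP_KERNEL), and the NULL-ness of the pointer replaces the bitmap from the post. In the kernel the same structure would keep the allocation outside the spin_lock()/spin_unlock() pair.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct my_struct {
    int value;
};

/* Hypothetical user-space stand-in for spinlock_t. */
struct model_lock {
    atomic_flag held;
};

static inline void model_lock_acquire(struct model_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;   /* spin until the holder releases */
}

static inline void model_lock_release(struct model_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct slot {
    struct model_lock lock;
    struct my_struct *data;   /* NULL until an allocation has been published */
};

static inline void slot_init(struct slot *s)
{
    atomic_flag_clear(&s->lock.held);
    s->data = NULL;
}

static inline struct my_struct *get_data(struct slot *s, int alloc)
{
    struct my_struct *found, *fresh;

    /* Fast path: short critical section, nothing in it can block. */
    model_lock_acquire(&s->lock);
    found = s->data;
    model_lock_release(&s->lock);
    if (found || !alloc)
        return found;

    /* The allocation may block, so it runs with no lock held. */
    fresh = calloc(1, sizeof(*fresh));
    if (!fresh)
        return NULL;

    /* Re-check under the lock: another caller may have published first. */
    model_lock_acquire(&s->lock);
    if (!s->data) {
        s->data = fresh;
        fresh = NULL;         /* ownership moved into the slot */
    }
    found = s->data;
    model_lock_release(&s->lock);

    free(fresh);              /* only non-NULL if we lost the race */
    return found;
}
```

The cost of this design is an occasional wasted allocation when two callers race; the benefit is that the locked region never sleeps, so it stays valid whether the lock is a spinlock or a mutex.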
Re: Spinlocks and interrupts
Ok, I need mutual exclusion on a data structure regardless of interrupts and core. It sounds like it can be done by using a spinlock and disabling interrupts, but you mention that "spinlocks are intended to provide mutual exclusion between interrupt context and non-interrupt context." Should I be using a semaphore (mutex) instead?

Perhaps I could explain my problem with some code:

    struct my_struct *get_data(spinlock_t *mylock, int ALLOC_DATA)
    {
        struct my_struct *mydata = NULL;

        spin_lock(mylock);
        if (test_bit(index, mybitmap))
            mydata = retrieve_data();
        if (!mydata && ALLOC_DATA) {
            mydata = alloc_data();
            set_bit(index, mybitmap);
        }
        spin_unlock(mylock);
        return mydata;
    }

I need to prevent retrieve_data from being called if the index bit is set in mybitmap and alloc_data has not completed, so I use a bitmap to indicate that alloc_data has completed. I also need to protect alloc_data from being run multiple times, so I use the spin_lock to ensure that test_bit (and possibly retrieve_data) is not run while alloc_data is being run (because it runs while the bit is cleared).

-Kai Meyer

On 11/09/2011 02:40 PM, Dave Hylands wrote:
> Hi Kai,
>
> On Wed, Nov 9, 2011 at 1:07 PM, Kai Meyer wrote:
>> When I read up on spinlocks, it seems like I need to choose between
>> disabling interrupts and not. If a spinlock_t is never used during an
>> interrupt, am I safe to leave interrupts enabled while I hold the lock?
>> (Same question for read/write locks if it is different.)
>
> So the intention behind using a spinlock is to provide mutual exclusion.
>
> A spinlock by itself only really provides mutual exclusion between 2
> cores, and not within the same core. To provide the mutual exclusion
> within the same core, you need to disable interrupts.
>
> Normally, you would disable interrupts and acquire the spinlock to
> guarantee that mutual exclusion, and the only reason you would
> normally use the spinlock without disabling interrupts is when you
> know that interrupts are already disabled.
>
> The danger of acquiring a spinlock with interrupts enabled is that if
> another interrupt fired (or the same interrupt fired again) and it
> tried to acquire the same spinlock, then you could have deadlock.
>
> If no interrupts touch the spinlock, then you're probably using the
> wrong mutual exclusion mechanism. Spinlocks are really intended to
> provide mutual exclusion between interrupt context and non-interrupt
> context.
>
> Also remember that on a non-SMP (aka UP) build, spinlocks become
> no-ops (except when certain debug checking code is enabled).
Spinlocks and interrupts
When I read up on spinlocks, it seems like I need to choose between disabling interrupts and not. If a spinlock_t is never used during an interrupt, am I safe to leave interrupts enabled while I hold the lock? (Same question for read/write locks, if the answer is different.)

-Kai Meyer
generic I/O
Are there existing generic block device I/O operations available already? I am familiar with constructing and submitting 'struct bio's, but what I'd like to do would be greatly simplified if there was an existing I/O interface similar to the POSIX 'read' and 'write' functions. If they don't exist, I would probably end up writing functions like:

    int blk_read(struct block_device *bdev, void *buffer, off_t length);
    int blk_write(struct block_device *bdev, void *buffer, off_t length);

What are the pros and cons of this sort of approach?

-Kai Meyer
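One design point the proposed prototypes raise: they take a length but no starting offset, and a bio addresses the device in 512-byte sectors rather than bytes. A wrapper of this kind would therefore need some byte-to-sector conversion and an alignment policy. The following is a minimal user-space sketch of that arithmetic only; `bytes_to_sectors`, `struct sector_span`, and the `MODEL_*` constants are hypothetical names, and rejecting unaligned requests is just one possible policy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9                          /* 512-byte sectors */
#define MODEL_SECTOR_SIZE  (1u << MODEL_SECTOR_SHIFT)

struct sector_span {
    uint64_t first;   /* index of the first 512-byte sector */
    uint32_t count;   /* number of sectors covered */
};

/*
 * Convert a byte range into a sector range.
 * Returns 0 and fills *out on success, or -EINVAL if the range is empty,
 * not sector aligned, or too long to count in 32 bits.
 */
static inline int bytes_to_sectors(uint64_t offset, size_t length,
                                   struct sector_span *out)
{
    if (length == 0)
        return -EINVAL;
    if ((offset % MODEL_SECTOR_SIZE) || (length % MODEL_SECTOR_SIZE))
        return -EINVAL;
    if ((uint64_t)(length >> MODEL_SECTOR_SHIFT) > UINT32_MAX)
        return -EINVAL;

    out->first = offset >> MODEL_SECTOR_SHIFT;
    out->count = (uint32_t)(length >> MODEL_SECTOR_SHIFT);
    return 0;
}
```

A read-modify-write path for unaligned requests is the obvious alternative to returning an error, at the cost of extra I/O and more complicated locking.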
Re: Trouble removing character device
I do call unregister_chrdev_region. There are 5 functions in my original email that I call during the life of my module. Still no luck. -Kai Meyer On 10/19/2011 09:47 PM, rohan puri wrote: On Thu, Oct 20, 2011 at 2:25 AM, Kai Meyer <mailto:k...@gnukai.com>> wrote: Unfortunately I can't share the source code, it belongs to the company I work for. All of cdev_init, cdev_del, and unregister_chrdev_region are void functions, so they have no return value. I check the return value of alloc_chrdev_region and cdev_add and check for errors with BUG_ON (for now). -Kai Meyer On 10/19/2011 02:18 PM, Daniel Baluta wrote: > On Wed, Oct 19, 2011 at 7:04 PM, Kai Meyermailto:k...@gnukai.com>> wrote: >> I can't seem to get my character device to remove itself from the >> /proc/devices list. I'm calling all of the following functions like so: >> >> alloc_chrdev_region(&dev, 0, 5, "my_char"); >> cdev_init(&my_cdev,&my_ops); >> cdev_add(&my_cdev, MKDEV(my_major, my_minor), 1); >> cdev_del(&my_cdev); >> unregister_chrdev_region(my_major, 5); >> >> It seems like I'm missing something, but I can't find it. I'm >> referencing the Linux Device Drivers v3, chapter 3. In the example code, >> the scull_cleanup_module function calls cdev_dell and >> unregister_chrdev_region, just like I do. >> >> To be clear, after I unload my module (after calling cdev_del and >> unregister_chrdev_region), my "my_char" string still shows up in >> /proc/devices. > Did you check return codes for all functions? > Also, can you post a link to the code? > > thanks, > Daniel. > > ___ > Kernelnewbies mailing list > Kernelnewbies@kernelnewbies.org <mailto:Kernelnewbies@kernelnewbies.org> > http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies ___ Kernelnewbies mailing list Kernelnewbies@kernelnewbies.org <mailto:Kernelnewbies@kernelnewbies.org> http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies During cleanup i think you need to call function unregister_chrdev_region(). 
Regards, Rohan Puri
Re: Understanding memcpy
Thanks for catching that :) I knew it would be something simple. On 10/19/2011 10:00 PM, rohan puri wrote: On Thu, Oct 20, 2011 at 3:04 AM, Kai Meyer <mailto:k...@gnukai.com>> wrote: I'm trying to poke around an ext4 file system. I can submit a bio for the correct block, and read in what seems to be the correct information, but when I try to memcpy my char *buffer to a reference to a struct I've made, it just doesn't seem to work. The relevant code looks like this: typedef struct ext2_superblock { /* 00-03 */ uint32_t e2sb_inode_count; /* 04-07 */ uint32_t e2sb_block_count; /* 08-11 */ uint32_t e2sb_blocks_reserved; /* 12-15 */ uint32_t e2sb_unallocated_blocks; /* 16-19 */ uint32_t e2sb_unallocated_inodes; /* 20-23 */ uint32_t e2sb_sb_block; /* 24-27 */ uint32_t e2sb_log_block_size; /* 28-31 */ uint32_t e2sb_log_fragment_size; /* 32-35 */ uint32_t e2sb_num_blocks_per_group; /* 36-39 */ uint32_t e2sb_num_frag_per_group; /* 40-43 */ uint32_t e2sb_num_inodes_per_group; /* 44-47 */ uint32_t e2sb_last_mount_time; /* 48-51 */ uint32_t e2sb_last_written_time; /* 52-53 */ uint16_t e2sb_num_mounted; /* 54-55 */ uint16_t e2sb_num_allowed_mounts; /* 56-57 */ uint16_t e2sb_signature; /* 58-59 */ uint16_t e2sb_fs_state; /* 60-61 */ uint16_t e2sb_error_action; /* 62-63 */ uint16_t e2sb_ver_minor; /* 64-67 */ uint32_t e2sb_last_check; /* 68-71 */ uint32_t e2sb_time_between_checks; /* 72-75 */ uint32_t e2sb_os_id; /* 76-79 */ uint32_t e2sb_ver_major; /* 80-81 */ uint16_t e2sb_uid; /* 82-83 */ uint16_t e2sb_gid; } e2sb; char *buffer; uint32_t *pointer; e2sb sb; buffer = __bio_kmap_atomic(bio, 0, KM_USER0); pointer = (uint32_t *)buffer; printk(KERN_DEBUG "sizeof pbd->sb %lu\n", sizeof(bpd->sb)); printk(KERN_DEBUG "Inode Count: %u\n", pointer[0]); /* Works! */ printk(KERN_DEBUG "Block Count: %u\n", pointer[1]); /* Works! */ printk(KERN_DEBUG "Block Reserved: %u\n", pointer[2]); /* Works! */ printk(KERN_DEBUG "Unallocated blocks: %u\n", pointer[3]); /* Works! 
*/ printk(KERN_DEBUG "Unallocated inodes: %u\n", pointer[4]); /* Works! */ memcpy(buffer, &sb, sizeof(sb)); This should be : - memcpy(&sb, buffer, sizeof(sb)); __bio_kunmap_atomic(bio, KM_USER0); printk(KERN_DEBUG "e2sb_debug: Total number of inodes in file system %u\n", sb->e2sb_inode_count);/* Doesn't work! */ printk(KERN_DEBUG "e2sb_debug: Total number of blocks in file system%u\n", sb->e2sb_block_count); /* Doesn't work! */ My code is actually much more verbose. The values I get from indexing into pointer are correct, and match what I get from dumpe2fs. The values I get from the e2sb struct are not. They are usually 0. I would imagine that memcpy is the fastest way to copy data from buffer instead of casting the pointer to something else, and using array indexing to get the values. I struggled to find where ext4 actually does this, so I'm making this up as I go along. Any thing that you see that I should be doing a different way that isn't actually part of my question is welcome too. ___ Kernelnewbies mailing list Kernelnewbies@kernelnewbies.org <mailto:Kernelnewbies@kernelnewbies.org> http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies Regards, Rohan Puri ___ Kernelnewbies mailing list Kernelnewbies@kernelnewbies.org http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies ___ Kernelnewbies mailing list Kernelnewbies@kernelnewbies.org http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
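The fix Rohan points out is the argument order: memcpy() takes the destination first and the source second. A small user-space sketch of the corrected copy follows. It is illustrative only: `struct e2sb_head` keeps just a few of the fields from the post and pads over offsets 20-55, and `e2sb_read` is a hypothetical helper name. Note also that `sb` in the post is a struct, not a pointer, so its fields are reached with `.` rather than `->`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Trimmed-down view of the layout in the post; offsets 20-55 are not decoded. */
struct e2sb_head {
    uint32_t e2sb_inode_count;          /* 00-03 */
    uint32_t e2sb_block_count;          /* 04-07 */
    uint32_t e2sb_blocks_reserved;      /* 08-11 */
    uint32_t e2sb_unallocated_blocks;   /* 12-15 */
    uint32_t e2sb_unallocated_inodes;   /* 16-19 */
    uint8_t  e2sb_skipped[36];          /* 20-55, left opaque in this sketch */
    uint16_t e2sb_signature;            /* 56-57 */
};

/* Check at compile time that the compiler laid the struct out as on disk. */
_Static_assert(offsetof(struct e2sb_head, e2sb_signature) == 56,
               "signature must sit at byte offset 56");

/* Copy the start of a raw block into the struct: destination first, source second. */
static inline void e2sb_read(struct e2sb_head *dst, const void *block)
{
    memcpy(dst, block, sizeof(*dst));
}
```

This copy keeps the bytes in on-disk order. ext2/3/4 store these fields little-endian, so on a big-endian host each field would still need converting after the copy (the kernel uses helpers such as le32_to_cpu() for that).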
Understanding memcpy
I'm trying to poke around an ext4 file system. I can submit a bio for the correct block, and read in what seems to be the correct information, but when I try to memcpy my char *buffer to a reference to a struct I've made, it just doesn't seem to work. The relevant code looks like this:

    typedef struct ext2_superblock {
        /* 00-03 */ uint32_t e2sb_inode_count;
        /* 04-07 */ uint32_t e2sb_block_count;
        /* 08-11 */ uint32_t e2sb_blocks_reserved;
        /* 12-15 */ uint32_t e2sb_unallocated_blocks;
        /* 16-19 */ uint32_t e2sb_unallocated_inodes;
        /* 20-23 */ uint32_t e2sb_sb_block;
        /* 24-27 */ uint32_t e2sb_log_block_size;
        /* 28-31 */ uint32_t e2sb_log_fragment_size;
        /* 32-35 */ uint32_t e2sb_num_blocks_per_group;
        /* 36-39 */ uint32_t e2sb_num_frag_per_group;
        /* 40-43 */ uint32_t e2sb_num_inodes_per_group;
        /* 44-47 */ uint32_t e2sb_last_mount_time;
        /* 48-51 */ uint32_t e2sb_last_written_time;
        /* 52-53 */ uint16_t e2sb_num_mounted;
        /* 54-55 */ uint16_t e2sb_num_allowed_mounts;
        /* 56-57 */ uint16_t e2sb_signature;
        /* 58-59 */ uint16_t e2sb_fs_state;
        /* 60-61 */ uint16_t e2sb_error_action;
        /* 62-63 */ uint16_t e2sb_ver_minor;
        /* 64-67 */ uint32_t e2sb_last_check;
        /* 68-71 */ uint32_t e2sb_time_between_checks;
        /* 72-75 */ uint32_t e2sb_os_id;
        /* 76-79 */ uint32_t e2sb_ver_major;
        /* 80-81 */ uint16_t e2sb_uid;
        /* 82-83 */ uint16_t e2sb_gid;
    } e2sb;

    char *buffer;
    uint32_t *pointer;
    e2sb sb;

    buffer = __bio_kmap_atomic(bio, 0, KM_USER0);
    pointer = (uint32_t *)buffer;
    printk(KERN_DEBUG "sizeof pbd->sb %lu\n", sizeof(bpd->sb));
    printk(KERN_DEBUG "Inode Count: %u\n", pointer[0]);        /* Works! */
    printk(KERN_DEBUG "Block Count: %u\n", pointer[1]);        /* Works! */
    printk(KERN_DEBUG "Block Reserved: %u\n", pointer[2]);     /* Works! */
    printk(KERN_DEBUG "Unallocated blocks: %u\n", pointer[3]); /* Works! */
    printk(KERN_DEBUG "Unallocated inodes: %u\n", pointer[4]); /* Works! */
    memcpy(buffer, &sb, sizeof(sb));
    __bio_kunmap_atomic(bio, KM_USER0);
    printk(KERN_DEBUG "e2sb_debug: Total number of inodes in file system %u\n", sb->e2sb_inode_count); /* Doesn't work! */
    printk(KERN_DEBUG "e2sb_debug: Total number of blocks in file system%u\n", sb->e2sb_block_count);  /* Doesn't work! */

My code is actually much more verbose. The values I get from indexing into pointer are correct, and match what I get from dumpe2fs. The values I get from the e2sb struct are not. They are usually 0.

I would imagine that memcpy is the fastest way to copy data from buffer instead of casting the pointer to something else, and using array indexing to get the values.

I struggled to find where ext4 actually does this, so I'm making this up as I go along. Any thing that you see that I should be doing a different way that isn't actually part of my question is welcome too.
Re: Trouble removing character device
Unfortunately I can't share the source code, it belongs to the company I work for. All of cdev_init, cdev_del, and unregister_chrdev_region are void functions, so they have no return value. I check the return value of alloc_chrdev_region and cdev_add and check for errors with BUG_ON (for now). -Kai Meyer On 10/19/2011 02:18 PM, Daniel Baluta wrote: > On Wed, Oct 19, 2011 at 7:04 PM, Kai Meyer wrote: >> I can't seem to get my character device to remove itself from the >> /proc/devices list. I'm calling all of the following functions like so: >> >> alloc_chrdev_region(&dev, 0, 5, "my_char"); >> cdev_init(&my_cdev,&my_ops); >> cdev_add(&my_cdev, MKDEV(my_major, my_minor), 1); >> cdev_del(&my_cdev); >> unregister_chrdev_region(my_major, 5); >> >> It seems like I'm missing something, but I can't find it. I'm >> referencing the Linux Device Drivers v3, chapter 3. In the example code, >> the scull_cleanup_module function calls cdev_dell and >> unregister_chrdev_region, just like I do. >> >> To be clear, after I unload my module (after calling cdev_del and >> unregister_chrdev_region), my "my_char" string still shows up in >> /proc/devices. > Did you check return codes for all functions? > Also, can you post a link to the code? > > thanks, > Daniel. > > ___ > Kernelnewbies mailing list > Kernelnewbies@kernelnewbies.org > http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies ___ Kernelnewbies mailing list Kernelnewbies@kernelnewbies.org http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
Trouble removing character device
I can't seem to get my character device to remove itself from the /proc/devices list. I'm calling all of the following functions like so:

    alloc_chrdev_region(&dev, 0, 5, "my_char");
    cdev_init(&my_cdev, &my_ops);
    cdev_add(&my_cdev, MKDEV(my_major, my_minor), 1);
    cdev_del(&my_cdev);
    unregister_chrdev_region(my_major, 5);

It seems like I'm missing something, but I can't find it. I'm referencing Linux Device Drivers, 3rd edition, chapter 3. In the example code, the scull_cleanup_module function calls cdev_del and unregister_chrdev_region, just like I do.

To be clear, after I unload my module (after calling cdev_del and unregister_chrdev_region), my "my_char" string still shows up in /proc/devices.

-Kai Meyer
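One thing worth checking in the sequence above, offered as a likely cause rather than a confirmed diagnosis: unregister_chrdev_region() takes a dev_t as its first argument (the combined major/minor value that alloc_chrdev_region() stored in dev), not a bare major number. The user-space sketch below re-creates the kernel's major/minor packing with 20 minor bits to show what a bare major looks like when it is read as a dev_t. All `MODEL_*` and `region_key*` names are hypothetical, and 250 is just an example major.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t model_dev_t;   /* stand-in for the kernel's dev_t */

#define MODEL_MINORBITS 20
#define MODEL_MINORMASK ((1u << MODEL_MINORBITS) - 1u)
#define MODEL_MAJOR(dev)    ((unsigned int)((dev) >> MODEL_MINORBITS))
#define MODEL_MINOR(dev)    ((unsigned int)((dev) & MODEL_MINORMASK))
#define MODEL_MKDEV(ma, mi) ((((model_dev_t)(ma)) << MODEL_MINORBITS) | (mi))

/* The (major, first minor) pair a region lookup would derive from a dev_t. */
struct region_key {
    unsigned int major;
    unsigned int first_minor;
};

static inline struct region_key region_key_of(model_dev_t from)
{
    struct region_key k = { MODEL_MAJOR(from), MODEL_MINOR(from) };
    return k;
}
```

Under this packing, passing a bare major such as 250 decodes as major 0, minor 250, so the region registered under major 250 would never be found and its name would stay in /proc/devices. Passing the original dev value, or MKDEV(my_major, first_minor), decodes to the intended region.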
Re: how diff between hardlink trees works?
On 09/09/2011 12:39 PM, Vaibhav Jain wrote: On Fri, Sep 9, 2011 at 11:12 AM, Kai Meyer <mailto:k...@gnukai.com>> wrote: On 09/09/2011 09:05 AM, Vaibhav Jain wrote: Hi, I am not able to understand how diff between two trees of which one is just contains hardlinks to another's files (cp -al )ing works.I am asking this question here because I need to build a custom kernel for which I need to generate patch. So the documentation suggests to create a hardlink copy of the kernel source tree using cp -al and then make changes to one of the trees and run a diff.I am wondering that if files are hardlinks then changes to one copy will affect another in which case diff should give no output. Also, the patch I created looks a little odd as it contains complete modified files instead of just the differences. Please help! Thanks Vaibhav Jain ___ Kernelnewbies mailing list Kernelnewbies@kernelnewbies.org <mailto:Kernelnewbies@kernelnewbies.org> http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies Make the hard link copy like normal. Then delete the directory that you are making changes to (in the hard link directory), then copy the files over with out hard links. That way "most" of the kernel tree is hard linked, and just the portion you want to work on is a copy. That way the diff will work. Otherwise, skip the hard link part all together, and just make a full copy. Uses lots of disk space and takes longer to diff. -Kai Meyer Hi Kai, Thanks for the reply. I need just one more favour. Could you please look at this document describing the procedure to build custom fedora kernel. It mentions the step to create hardlink to generate but doesn't talk about deleting anything ?I just need to confirm if the article is not accurate or if there is any error in my understanding. Whenever I follow it I get a patch that contains all of the content of the changed files rather than just the changes. 
Here is the relevant portion : Copy the Source Tree and Generate a Patch This step is for applying a patch to the kernel source. If a patch is not needed, proceed to "Configure Kernel Options". Copy the source tree to preserve the original tree while making changes to the copy: cp -r ~/rpmbuild/BUILD/kernel-2.6.$ver.$fedver/linux-2.6.$ver.$arch ~/rpmbuild/BUILD/kernel-2.6.$ver$fedver.orig cp -al ~/rpmbuild/BUILD/kernel-2.6.$ver.$fedver.orig ~/rpmbuild/BUILD/kernel-2.6.$ver.$fedver.new *The second |cp| command hardlinks the |.orig| and |.new| trees to make |diff| run faster. Most text editors know how to break the hardlink correctly to avoid problems.* Using vim on FC14, it treated the hard link as a hard link and thus the above technique failed. It was necessary to repeat the original copy used for the .orig directory for the .new directory. Note that this uses twice the space. Make changes directly to the code in the |.new| source tree, or copy in a modified file. This file might come from a developer who has requested a test, from the upstream kernel sources, or from a different distribution. After the |.new| source tree is modified, generate a patch. To generate the patch, run |diff| against the entire |.new| and |.orig| source trees with the following command: cd ~/rpmbuild/BUILD diff -uNrp kernel-2.6.$ver.$fedver.orig kernel-2.6.$ver.$fedver.new> ../SOURCES/linux-2.6-my-new-patch.patch Thanks Vaibhav ___ Kernelnewbies mailing list Kernelnewbies@kernelnewbies.org http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies The article says this: "Using vim on FC14, it treated the hard link as a hard link and thus the above technique failed. It was necessary to repeat the original copy used for the .orig directory for the .new directory. Note that this uses twice the space." 
It means to say that some editors, like VIM, edit files in place, while other editors copy the original contents into some other buffer (memory or a temporary file), then effectively delete the file you're editing and write the modified file into its place. The hard-link instructions are a "trick" to save time and space when you are modifying a large code base, like the kernel. If your favorite editor behaves like the observed behavior of VIM, then you will need to delete the hard-linked file and put a regular copy of the file in place before making changes.

-Kai Meyer
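The difference between the two editor behaviors described above can be reproduced without any editor. What follows is only a minimal sketch, assuming GNU cp (for -al) and a POSIX shell; the directory and file names are hypothetical, and the "replace" step imitates an editor that writes a new file and renames it over the old one.

```shell
#!/bin/sh
# Sketch: in-place write vs. replace-on-write on a hard-linked file.
# All names below are made up for illustration (assumption, not from the thread).
set -e
work=$(mktemp -d)

mkdir "$work/orig"
printf 'line one\n' > "$work/orig/file.c"
cp -al "$work/orig" "$work/new"      # GNU cp: hard link files instead of copying

# In-place write: both names share one inode, so both trees change together.
printf 'edited in place\n' >> "$work/new/file.c"
if diff -q "$work/orig/file.c" "$work/new/file.c" >/dev/null; then
    inplace_result="no difference"
else
    inplace_result="differs"
fi

# Replace-on-write: build a new file, then rename it over the old name.
# The rename gives new/file.c a fresh inode; orig/file.c is left alone.
cp "$work/new/file.c" "$work/new/file.c.tmp"
printf 'edited via replace\n' >> "$work/new/file.c.tmp"
mv "$work/new/file.c.tmp" "$work/new/file.c"
if diff -q "$work/orig/file.c" "$work/new/file.c" >/dev/null; then
    replace_result="no difference"
else
    replace_result="differs"
fi

echo "in-place edit: $inplace_result"
echo "replace edit:  $replace_result"
rm -rf "$work"
```

The in-place edit leaves diff with nothing to report because there is only one file under two names, which is the failure the wiki note describes. The replace-style edit is the case where the hard-link trick works as intended.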
Re: how diff between hardlink trees works?
On 09/09/2011 09:05 AM, Vaibhav Jain wrote:

Hi,

I am not able to understand how diff works between two trees when one of them just contains hardlinks to the other's files (created with cp -al). I am asking this question here because I need to build a custom kernel, for which I need to generate a patch. The documentation suggests creating a hardlink copy of the kernel source tree using cp -al, then making changes to one of the trees and running diff. I am wondering: if the files are hardlinks, then changes to one copy will affect the other, in which case diff should give no output. Also, the patch I created looks a little odd, as it contains the complete modified files instead of just the differences. Please help!

Thanks
Vaibhav Jain

Make the hard link copy like normal. Then delete the directory that you are making changes to (in the hard link directory), then copy the files over without hard links. That way "most" of the kernel tree is hard linked, and just the portion you want to work on is a copy. That way the diff will work.

Otherwise, skip the hard link part altogether, and just make a full copy. That uses lots of disk space and takes longer to diff.

-Kai Meyer
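As a rough illustration of the suggestion above, the steps could look like the sketch below. It assumes GNU cp and GNU diff, and it uses a tiny stand-in tree (tree.orig, tree.new, drivers/foo are hypothetical names) rather than a real kernel source tree.

```shell
#!/bin/sh
# Sketch: hard-link the whole tree, then replace only the directory being
# edited with a real copy. All tree and file names are hypothetical.
set -e
work=$(mktemp -d)

# Build a tiny stand-in source tree.
mkdir -p "$work/tree.orig/drivers/foo" "$work/tree.orig/fs"
printf 'int foo(void) { return 0; }\n' > "$work/tree.orig/drivers/foo/foo.c"
printf 'int fs_init(void) { return 0; }\n' > "$work/tree.orig/fs/fs.c"

# 1. Hard-link copy of the entire tree.
cp -al "$work/tree.orig" "$work/tree.new"

# 2. Delete the hard-linked directory you plan to edit and copy it for real.
rm -rf "$work/tree.new/drivers/foo"
cp -a "$work/tree.orig/drivers/foo" "$work/tree.new/drivers/foo"

# 3. Edit in place; only tree.new changes, since foo.c is now a separate file.
printf 'int foo(void) { return 1; }\n' > "$work/tree.new/drivers/foo/foo.c"

# 4. Generate the patch. diff exits with status 1 when the trees differ,
#    so guard it to keep 'set -e' from aborting the script.
patch_text=$(cd "$work" && diff -uNrp tree.orig tree.new) || true
changed_files=$(printf '%s\n' "$patch_text" | grep -c '^+++ ')

printf '%s\n' "$patch_text"
echo "files in patch: $changed_files"
rm -rf "$work"
```

Only the re-copied directory costs extra disk space. Everything else stays hard linked, so diff can skip those files quickly.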
Filesystem allocation bitmap
Is there a generic filesystem method to retrieve the filesystem's allocation bitmap? I'm mostly interested in Ext filesystems, so if there's nothing generic, I'm happy with a specific solution for just Ext.

If the answer is "read https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout and then read the bitmaps directly from disk", I think I can deal with that too.

-Kai Meyer
Re: Query about custom fedora build process
Some programs (like VIM) modify files in place. Some programs (I think 'perl -pi -e' may do this) will read the file into memory and, when it's time to write the file back out, delete the original and write a new one. They made a note of this in the wiki article:

"Using vim on FC14, it treated the hard link as a hard link and thus the above technique failed. It was necessary to repeat the original copy used for the .orig directory for the .new directory. Note that this uses twice the space."

Perhaps there's a trick to make vim work around it, but I don't know of any. My suggestion would be to hardlink the entire source tree, then afterwards delete the destination hard links for the files you want to modify, and re-copy (normal copy) the original files again.

-Kai Meyer

On 09/07/2011 04:16 PM, Vaibhav Jain wrote:

Hi,

I am trying to build a custom Fedora kernel on a Fedora (FC15) machine by following the article http://fedoraproject.org/wiki/Building_a_custom_kernel but I am unable to make any progress, as my changes are not getting reflected. So I have a query about the procedure given in the article. The article asks to first create hardlinks between the files in the .new and .orig directories, then to make changes in the .new directory, and to generate a patch by running diff between the .orig and .new directories. But I am wondering: if the files in both directories are hardlinks, how can diff work? After changing a file in the .new directory, the file in the .orig directory should also change.
    cp -r ~/rpmbuild/BUILD/kernel-2.6.$ver.$fedver/linux-2.6.$ver.$arch ~/rpmbuild/BUILD/kernel-2.6.$ver.$fedver.orig
    cp -al ~/rpmbuild/BUILD/kernel-2.6.$ver.$fedver.orig ~/rpmbuild/BUILD/kernel-2.6.$ver.$fedver.new
    cd ~/rpmbuild/BUILD
    diff -uNrp kernel-2.6.$ver.$fedver.orig kernel-2.6.$ver.$fedver.new > ../SOURCES/linux-2.6-my-new-patch.patch

Thanks
Vaibhav Jain
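The per-file variant suggested in the reply at the top of this message (delete the destination hard link for each file you want to modify, then re-copy it) might be checked as in the sketch below. It assumes GNU cp and GNU stat, where 'stat -c %h' prints a file's hard link count; the src.orig and src.new names are hypothetical.

```shell
#!/bin/sh
# Sketch: break the hard link for a single file before editing it.
# Directory and file names are made up for illustration.
set -e
work=$(mktemp -d)

mkdir "$work/src.orig"
printf 'a\n' > "$work/src.orig/a.c"
printf 'b\n' > "$work/src.orig/b.c"
cp -al "$work/src.orig" "$work/src.new"

# After cp -al every file has two names, one in each tree.
links_before=$(stat -c %h "$work/src.new/a.c")

# Delete the hard-linked name, then put a regular copy in its place.
rm "$work/src.new/a.c"
cp -p "$work/src.orig/a.c" "$work/src.new/a.c"

links_after=$(stat -c %h "$work/src.new/a.c")      # private copy
links_untouched=$(stat -c %h "$work/src.new/b.c")  # still shared

echo "a.c links before: $links_before, after: $links_after"
echo "b.c links: $links_untouched"
rm -rf "$work"
```

A link count of 1 on the file you are about to edit means an in-place editor can no longer change the .orig tree through it.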