On Wed, May 11, 2011 at 10:21:58AM +0200, Trond Norbye wrote:

| Does it log that it is skipping the files, or how do you know that
| they are skipped? (trying to figure out where it could fail).

It silently skips the files.

As for how I originally noticed, opengrok wasn't finding things it should find. After struggling with that for a while, I discovered that many files weren't being indexed.

As for how I know which files aren't indexed, the index process logs its progress like this --

  INFO: Add: /whatever/assets/button-delete.png (FileAnalyzer)

and I wrote a perl script that does a find in the source directory and saves the path+name of every file seen. It then scans the indexing log for entries like the one above, removes everything reported as added from its list, and at the end dumps what's left in the list.

So I've been testing various permutations of issues.

I originally had 19 projects. I looked at the output of my perl script, found the project with the most unindexed files (over 90% of that project's files were skipped), moved every project except that one out of the source directory, and indexed. Everything was indexed -- the only files skipped were the .p4config files used for perforce (though perforce history is disabled for my testing.) (I guess opengrok ignores dot files? That's fine.) Nothing else was skipped.

So I added in three more projects and indexed. Same results -- everything was indexed minus those dot files.

So I added in one more project and indexed. Now a handful of files were skipped from this new project, but only a handful.

Oh, and side note -- I wrote another program that reads files and tells me if they have any weird characters (high bit set (not ascii), rarely used control characters), and all of these skipped files are straight text.

I now have five projects. Now, my system has four cpus. Is it related to the -T (number of threads to use to index) parameter? So I re-ran the test with -T 1 and -T 16 ... same results, with exactly the same files skipped.

So I put all the (19) projects back in and re-indexed with -T 1. Oddly enough, this time, almost everything was indexed -- just a few files weren't. To be precise ...
9647 files out of 310k were not indexed this time.

So now I'm re-running that test without the -T flag, and that will take several hours to complete. I'm not yet sure whether the -T flag has any effect on things.

What does seem to matter is the order in which directories (projects) appear in the source directory (the OS is Linux 2.6.26, the filesystem is XFS.) I'm not certain yet, but the indexer seems to do projects in the order that they're found in the directory (no sorting), and the further down the list of projects it goes, the more likely it is to silently skip files.

There are no unexplained errors in the log created while indexing.

As for which files are skipped, it seems completely random. For example, let's say I've got a directory with 100 files in it. 30 files will be skipped and 70 indexed, with no apparent rhyme or reason to it.

Actually, now that I think about it, perhaps this is related --

  https://defect.opensolaris.org/bz/show_bug.cgi?id=18227

I opened this two weeks ago as I was getting errors while indexing with perforce enabled. To simplify my testing now, I've disabled perforce and the error has gone away too -- but perhaps the JVM is still running out of file descriptors, and any errors are just being swallowed up rather than reported? (And they were swallowed up before too, but at least the perforce code complained about it?)

I have not been making sure my file-descriptor limit was the same for all my tests -- in fact, some might have had the default of 1024 and some 16384. I'll have to redo them and watch that more carefully, and use lsof during tests to see what it looks like.

As for not using the OpenGrok script, back when I was figuring OpenGrok out, I was having some problem with that script, but I got manual indexing working. I imagine the problem with the script was simply operator error, but since I got used to doing the indexing manually, I've stuck with that.
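
The check described earlier in this message (list every file under the source directory, then remove each path the indexer logged as added) can be sketched roughly as below. This is an illustrative Python sketch, not the perl script itself. The function names are hypothetical, and it assumes that the path in each "INFO: Add:" line is relative to the source directory with a leading slash, and that the analyzer name in parentheses is a single word.

```python
import os
import re

# Matches log lines such as:
#   INFO: Add: /whatever/assets/button-delete.png (FileAnalyzer)
# Assumption: the analyzer name in parentheses is a single word.
ADD_LINE = re.compile(r"INFO: Add: (?P<path>.+) \((?P<analyzer>\w+)\)\s*$")


def logged_paths(log_lines):
    """Return the set of paths the indexer reported as added."""
    added = set()
    for line in log_lines:
        match = ADD_LINE.search(line)
        if match:
            added.add(match.group("path"))
    return added


def source_paths(source_root):
    """Return every file under source_root as a '/'-prefixed relative path.

    Assumption: this matches the form of the paths in the indexing log.
    """
    found = set()
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), source_root)
            found.add("/" + rel.replace(os.sep, "/"))
    return found


def unindexed(source_root, log_lines):
    """Return files present on disk but never logged as added, sorted."""
    return sorted(source_paths(source_root) - logged_paths(log_lines))
```

Passing the source directory and an open log file (any iterable of lines works) to unindexed() returns the leftover paths. Dot files such as .p4config will show up in that list too, since the sketch does not filter them.
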
But I'll go back and get the OpenGrok script going if I need to, though so far I don't think that's going to make a difference.

--
Doug McLaren, dou...@frenzied.us
_______________________________________________
opengrok-discuss mailing list
opengrok-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/opengrok-discuss