On Wed, May 11, 2011 at 10:21:58AM +0200, Trond Norbye wrote:

| Does it log that it is skipping the files, or how do you know that
| they are skipped? (trying to figure out where it could fail).

It silently skips the files.

As for how I originally noticed, OpenGrok wasn't finding things it
should find.  After struggling with that for a while, I discovered
that many files weren't being indexed.

As for how I know which files aren't indexed, the index process logs
its progress like this --

   INFO: Add: /whatever/assets/button-delete.png (FileAnalyzer)

and I wrote a Perl script that does a find in the source directory and
saves the path+name of every file seen.  It then scans the indexing
log for entries like the one above, removes every file reported as
added from the list, and at the end dumps what's left in the list.
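
For anyone who wants to repeat the comparison, here is a minimal
sketch of the same idea in Python (my actual script is Perl and is not
shown here).  It assumes the log lines look exactly like the sample
above, and that the logged path is the file's path relative to the
source root with a leading slash; the function names and the "prefix"
parameter are made up for this illustration.

```python
import os
import re

# Assumed log format, taken from the sample line above:
#   INFO: Add: /path/to/file (AnalyzerName)
ADD_LINE = re.compile(r"^\s*INFO: Add: (?P<path>.+) \([^()]*\)\s*$")


def indexed_paths(log_lines):
    """Return the set of paths the log reports as added."""
    found = set()
    for line in log_lines:
        m = ADD_LINE.match(line)
        if m:
            found.add(m.group("path"))
    return found


def source_paths(src_root, prefix=""):
    """Walk src_root and return every file path in the form assumed for the log.

    'prefix' is a hypothetical knob for the case where the log prints
    something in front of the path relative to the source root.
    """
    paths = set()
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), src_root)
            paths.add(prefix + "/" + rel.replace(os.sep, "/"))
    return paths


def unindexed(src_root, log_lines, prefix=""):
    """Files present on disk that never show up in an 'Add:' log line."""
    return sorted(source_paths(src_root, prefix) - indexed_paths(log_lines))
```

Feeding it the source directory and the lines of the indexing log
returns the files that were on disk but never reported as added.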

So I've been testing various permutations of issues.

I originally had 19 projects.  So I looked at the output of my Perl
script, found the project with the most unindexed files (it was over
90% for this project), moved every project except that one out of the
source directory, and indexed.
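
Picking out the worst project can be sketched like this in Python.
This is only an illustration: it assumes each path starts with the
project directory name (as in "/project/dir/file"), and the function
names are hypothetical.

```python
from collections import Counter


def project_of(path):
    """First path component, e.g. '/proj/a/b.c' -> 'proj'."""
    return path.lstrip("/").split("/", 1)[0]


def skip_rates(all_paths, missing_paths):
    """Return (project, missing, total, fraction) tuples, worst project first."""
    total = Counter(project_of(p) for p in all_paths)
    missing = Counter(project_of(p) for p in missing_paths)
    rates = [
        (proj, missing[proj], total[proj], missing[proj] / total[proj])
        for proj in total
    ]
    return sorted(rates, key=lambda r: r[3], reverse=True)
```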

Everything was indexed -- the only files skipped were the .p4config
files used for Perforce (though Perforce history is disabled for my
testing).  (I guess OpenGrok ignores dot files?  That's fine.)
Nothing else was skipped.

So I added in three more projects and indexed.  Same results --
everything was indexed minus those dot files.

So I added in one more project and indexed.  Now a handful of files
were skipped from this new project, but only a handful.

Oh, and side note -- I wrote another program that reads files and
tells me if they have any weird characters (high bit set, i.e. not
ASCII, or rarely used control characters), and all of these skipped
files are straight text.
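
A rough Python equivalent of that check, for reference.  Which control
characters count as "rarely used" is my assumption here: everything
below 0x20 except tab, newline, carriage return and form feed, plus
DEL (0x7F).  The names are made up for this sketch.

```python
# Control characters treated as ordinary in text files (an assumption):
# tab, newline, form feed, carriage return.
ALLOWED_CONTROL = {0x09, 0x0A, 0x0C, 0x0D}


def suspicious_bytes(data):
    """Return (offset, byte) pairs for non-ASCII bytes and unusual controls."""
    hits = []
    for offset, b in enumerate(data):
        if b >= 0x80 or b == 0x7F or (b < 0x20 and b not in ALLOWED_CONTROL):
            hits.append((offset, b))
    return hits


def is_plain_text(path):
    """True if the file contains only printable ASCII and common whitespace."""
    with open(path, "rb") as fh:
        return not suspicious_bytes(fh.read())
```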

I now have five projects.  Now, my system has four cpus.  Is it
related to the -T (number of threads to use to index) parameter?  So I
re-ran the test with -T 1 and -T 16 ... same results, with exactly the
same files skipped.

So I put all the (19) projects back in and re-indexed with -T 1.
Oddly enough, this time, almost everything was indexed -- just a few
files weren't.

To be precise ... 9647 files out of 310k were not indexed this time.

So now I'm re-running that test without the -T flag, and that will
take several hours to complete.

I'm not sure the -T flag has any effect on things yet.  What does seem
to matter is the order that directories (projects) appear in the
source directory (the OS is Linux 2.6.26, the filesystem is XFS.)

I'm not certain yet, but the indexer seems to do projects in the order
that they're found in the directory (no sorting), and as it goes
further and further down the list of projects, the more likely it is
to silently skip files.
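
To see the unsorted order the OS hands back for the source directory,
something like the following Python sketch would do.  Whether the
indexer really walks the projects in this order is still only my
guess; the function name is hypothetical.

```python
import os


def projects_in_directory_order(src_root):
    """Return subdirectory names in the order the OS returns them (unsorted)."""
    with os.scandir(src_root) as entries:
        return [e.name for e in entries if e.is_dir()]
```

Comparing each project's position in that list with its share of
skipped files would show whether later projects really fare worse.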

There are no unexplained errors in the log created while indexing.

As for what files are skipped, it seems completely random.  For
example, let's say I've got a directory with 100 files in it.  30
files will be skipped, 70 indexed with no apparent rhyme or reason to
it.

Actually, now that I think about it, perhaps this is related --

   https://defect.opensolaris.org/bz/show_bug.cgi?id=18227

I opened this two weeks ago as I was getting errors while indexing
with perforce enabled.  To simplify my testing now, I've disabled
perforce and the error has gone away too -- but perhaps the JVM is
still running out of file descriptors, and any errors are just being
swallowed up rather than reported?  (And they were swallowed up before
too, but at least the perforce code complained about it?)

I have not been making sure my file-descriptor limit was the same for
all my tests -- in fact, some might have the default of 1024 and some
16384.  I'll have to redo them and watch that more carefully, and use
lsof during tests to see what it looks like.
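
Besides lsof, a quick way to record the limit and the current
descriptor count is a small Python sketch like this one.  It assumes
Linux (it reads /proc/<pid>/fd), and the limit it reports is the one
for the Python process itself, which matches the JVM's only if both
are started from the same shell with the same ulimit.  The function
names are made up.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count open descriptors for a process by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))
```

Calling open_fd_count() with the indexer's pid every few seconds
during a run would show whether the count creeps up toward the soft
limit (1024 or 16384 in my tests).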

As for not using the OpenGrok script, back when I was figuring
OpenGrok out, I was having some problem with that script, but I got
manual indexing working.  I imagine the problem with the script was
simply operator error, but since I got used to doing the indexing
manually, I've stuck with that.  I'll go back and get the script
going if I need to, but so far I don't think that's going to make a
difference.

 --
Doug McLaren, dou...@frenzied.us
_______________________________________________
opengrok-discuss mailing list
opengrok-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/opengrok-discuss