There is an open-source project, OpenGrok, that uses Lucene for indexing and searching source code:
http://opensolaris.org/os/project/opengrok/

It has Analyzers for different types of source files. It does not link source code to requirements, but you can take a look at the source code to see how it does the indexing.

Bill

On Thu, Feb 28, 2008 at 11:18 AM, Ken Krugler <[EMAIL PROTECTED]> wrote:

> >I am working on some sort of search mechanism to link a requirement (i.e., a
> >query) to source code files (i.e., documents). For that purpose, I indexed
> >the source code files using Lucene. Contrary to the traditional natural language
> >search scenario, we search for code files that are relevant to a given
> >requirement. One problem here is that the source files usually contain a lot
> >of abbreviations, words joined by _, or combinations of words and/or
> >abbreviations (e.g., getAccountBalanceTbl). I am wondering whether any
> >of you have already indexed (source) files or documents that contain
> >that kind of word.
>
> Yes, that's been something we've spent a fair amount of time on...see
> http://www.krugle.org (public code search).
>
> As Mathieu noted, the first thing you really want to do is split the
> file up into at least comments vs. code. Then you can use a regular
> analyzer (or perhaps something more human language-specific, e.g.
> with stemming support) on the comment text, and your own custom
> tokenizer on the code.
>
> In the code, you might further want to treat literals (strings, etc.)
> differently than other terms.
>
> And for "real" code terms, you then want to do essentially synonym
> processing, where you turn a single term into multiple terms based on
> the automatic splitting of the term using '_', '-', camelCasing,
> letter/digit transitions, etc.
>
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"
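
To make the term-splitting idea from Ken's message concrete, here is a minimal sketch in plain Python. It is not Krugle's or OpenGrok's implementation, and the name expand_identifier and the regular expressions are hypothetical, chosen only for illustration. It keeps the original identifier and adds the parts obtained by splitting on '_', '-', camelCase boundaries, and letter/digit transitions.

```python
import re

# Hypothetical helper patterns, not taken from Lucene, OpenGrok, or Krugle.
_SEPARATORS = re.compile(r"[_\-]+")
_SUBWORDS = re.compile(
    r"[A-Z]+(?=[A-Z][a-z])"  # acronym before a capitalized word: "HTTP" in "HTTPServer"
    r"|[A-Z]?[a-z]+"         # lowercase or capitalized word: "get", "Account"
    r"|[A-Z]+"               # remaining run of capitals: "ID"
    r"|[0-9]+"               # run of digits: "2"
)


def expand_identifier(identifier):
    """Return the lowercased identifier followed by its lowercased parts.

    The full identifier comes first so exact matches still work; the parts
    play the role of synonyms for the same term.
    """
    parts = []
    for chunk in _SEPARATORS.split(identifier):
        parts.extend(m.group(0).lower() for m in _SUBWORDS.finditer(chunk))

    terms = [identifier.lower()]
    for part in parts:
        if part not in terms:  # skip duplicates, keep first-seen order
            terms.append(part)
    return terms


if __name__ == "__main__":
    print(expand_identifier("getAccountBalanceTbl"))
    # ['getaccountbalancetbl', 'get', 'account', 'balance', 'tbl']
    print(expand_identifier("HTTPServer2_id"))
    # ['httpserver2_id', 'http', 'server', '2', 'id']
```

In a Lucene analyzer the same idea would normally live in a custom token filter that emits the extra terms alongside the original token, so that a query for "account" or "balance" can match getAccountBalanceTbl. Expanding abbreviations such as "Tbl" to "table" would need a separate lookup list, which this sketch does not attempt.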