By "index the entire source file" do you mean "don't index the compiler output"? If so, that doesn't sound very appealing as it loses most of the benefit of having a search engine built for searching source code.

On Thu, Jun 5, 2014 at 11:11 AM, Aditya <findbestopensou...@gmail.com> wrote:

> Just keep it simple. Index the entire source file. One source file is one
> document. While indexing, preserve dot (.), hyphen (-) and other special
> characters. You could use a whitespace analyzer.
>
> I hope it helps
>
> Regards
> Aditya
> www.findbestopensource.com
>
>
> On Wed, Jun 4, 2014 at 3:29 PM, Johan Tibell <johan.tib...@gmail.com> wrote:
>
> > The majority of queries will be look-ups of functions/types by fully
> > qualified name. For example, the query [Data.Map.insert] will find the
> > definition and all uses of the `insert` function defined in the
> > `Data.Map` module. The corpus is all Haskell open source code on
> > hackage.haskell.org.
> >
> > Being able to support qualified name queries is the main benefit of
> > indexing the output of the compiler (which has resolved unqualified names
> > to qualified names) rather than using simple text-based indexing.
> >
> > There are three levels of name qualification I want to support in
> > queries:
> >
> > * Unqualified: myFunction
> > * Module qualified: MyModule.myFunction
> > * Package and module qualified: mypackage-MyModule.myFunction
> >
> > I expect the middle one to be used the most. The last form is sometimes
> > needed for disambiguation and the first is nice to support as a shorthand
> > when the function name is unlikely to be ambiguous.
> >
> > For scoring I'd like to have a couple of attributes available. The most
> > important one is whether a term represents a use site or a definition
> > site. This would allow the definition of a function to appear as the
> > first search result.
> >
> > Is this precise enough? Naturally the scope will grow over time, but this
> > is the core of what I'm trying to do.
> >
> > -- Johan
> >
> >
> > On Wed, Jun 4, 2014 at 8:02 AM, Aditya <findbestopensou...@gmail.com>
> > wrote:
> >
> > > Hi Johan,
> > >
> > > How do you want to search? What are your search requirements? You need
> > > to index according to those. You could check DuckDuckGo or GitHub code
> > > search.
> > >
> > > The easiest approach would be to have a parser which reads each source
> > > file and indexes it as a single document. When you search, you will
> > > have a single search field which searches the index and retrieves the
> > > results. The search field accepts any text in the source file. It could
> > > be a function name, class name, comment, variable, etc.
> > >
> > > Another approach is to have different search fields for functions,
> > > classes, packages, etc. You need to parse the file, identify comments,
> > > function names, class names, etc., and index each in a separate field.
> > >
> > >
> > > Regards
> > > Aditya
> > > www.findbestopensource.com
> > >
> > >
> > > On Wed, Jun 4, 2014 at 7:02 AM, Johan Tibell <johan.tib...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I'd like to index (Haskell) source code. I've run the source code
> > > > through a compiler (GHC) to get rich information about each token
> > > > (its type, fully qualified name, etc) that I want to index (and
> > > > later use when ranking).
> > > >
> > > > I'm wondering how to approach indexing source code. I can see two
> > > > possible approaches:
> > > >
> > > > * Create a file containing all the metadata and write a custom
> > > > tokenizer/analyzer that processes the file. The file could use a
> > > > simple line-based format:
> > > >
> > > > myFunction,1:12-1:22,my-package,defined-here,more-metadata
> > > > myFunction,5:11-5:21,my-package,used-here,more-metadata
> > > > ...
> > > >
> > > > The tokenizer would use CharTermAttribute to write the function name,
> > > > OffsetAttribute to write the source span, etc.
> > > >
> > > > * Use an IndexWriter to create a Document directly, as done here:
> > > >
> > > > http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3
> > > >
> > > > I'm new to Lucene so I can't quite tell which approach is more likely
> > > > to work well. Which way would you recommend?
> > > >
> > > > Other things I'd like to do that might influence the answer:
> > > >
> > > > - Index several tokens at the same position, so I can index both the
> > > > fully qualified name (e.g. module.myFunction) and unqualified name
> > > > (e.g. myFunction) for a term.
> > > >
> > > > -- Johan
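
As an illustration of the line-based metadata format and the "several tokens at the same position" idea from the thread, here is a minimal sketch in plain Python. It is not Lucene code and not anyone's actual implementation. Two things are assumptions: the example lines in the thread carry only a name, a span, a package and a definition/use flag, so the sketch treats the "more-metadata" placeholder as a hypothetical fifth field holding the module name; and the (term, increment) pairs only mirror the position-increment convention, where an increment of 0 stacks a term on the previous position. The names Occurrence, parse_line and index_terms are made up for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Occurrence:
    name: str            # unqualified name, e.g. "myFunction"
    span: str            # source span as written, e.g. "1:12-1:22"
    package: str         # package name, e.g. "my-package"
    is_definition: bool  # True for "defined-here", False for "used-here"
    module: str          # ASSUMPTION: hypothetical fifth field with the module name


def parse_line(line: str) -> Occurrence:
    """Parse one line of the assumed five-field comma-separated format."""
    name, span, package, site, module = line.strip().split(",")
    if site not in ("defined-here", "used-here"):
        raise ValueError(f"unknown site kind: {site!r}")
    return Occurrence(name, span, package, site == "defined-here", module)


def index_terms(occ: Occurrence) -> List[Tuple[str, int]]:
    """Return (term, position_increment) pairs for one occurrence.

    The first term advances the position by 1. The other two use an
    increment of 0, so all three qualification levels share one position.
    """
    module_qualified = f"{occ.module}.{occ.name}"
    package_qualified = f"{occ.package}-{module_qualified}"
    return [(occ.name, 1), (module_qualified, 0), (package_qualified, 0)]


if __name__ == "__main__":
    occ = parse_line("myFunction,1:12-1:22,my-package,defined-here,MyModule")
    for term, increment in index_terms(occ):
        print(increment, term)
```

With terms stacked like this, a query for any of the three forms would match the same occurrence. The is_definition flag is kept separately so that a ranking step could prefer definition sites; how that flag would be stored in the index is left open here.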