By "index the entire source file" do you mean "don't index the compiler
output"? If so, that doesn't sound very appealing as it loses most of the
benefit of having a search engine built for searching source code.
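
To illustrate what would be lost, here is a minimal sketch in plain Java (not
Lucene code; the class, method, file names and line numbers are all
hypothetical). It assumes the compiler output gives, for every occurrence of a
name, the package and module that define it plus a definition/use flag. Each
occurrence is stored under all three query forms, and definitions are returned
first:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the real index; illustration only.
class QualifiedNameIndex {

    // One occurrence of a name, as resolved by the compiler.
    // pkg/module identify where the name is DEFINED, not where it is used.
    static final class Occurrence {
        final String pkg, module, name, file;
        final int line;
        final boolean isDefinition;

        Occurrence(String pkg, String module, String name,
                   String file, int line, boolean isDefinition) {
            this.pkg = pkg;
            this.module = module;
            this.name = name;
            this.file = file;
            this.line = line;
            this.isDefinition = isDefinition;
        }
    }

    private final Map<String, List<Occurrence>> postings = new HashMap<>();

    // The three query forms: unqualified, module qualified,
    // package and module qualified.
    static List<String> keysFor(String pkg, String module, String name) {
        return List.of(
            name,
            module + "." + name,
            pkg + "-" + module + "." + name);
    }

    // Store the occurrence under every query form.
    void add(Occurrence occ) {
        for (String key : keysFor(occ.pkg, occ.module, occ.name)) {
            postings.computeIfAbsent(key, k -> new ArrayList<>()).add(occ);
        }
    }

    // Exact-match lookup; definition sites sort before use sites.
    // List.sort is stable, so insertion order is kept within each group.
    List<Occurrence> search(String query) {
        List<Occurrence> hits =
            new ArrayList<>(postings.getOrDefault(query, List.of()));
        hits.sort(Comparator.comparing((Occurrence o) -> !o.isDefinition));
        return hits;
    }

    public static void main(String[] args) {
        QualifiedNameIndex index = new QualifiedNameIndex();
        // Made-up occurrences of Data.Map.insert from the containers package.
        index.add(new Occurrence("containers", "Data.Map", "insert",
                                 "Foo.hs", 12, false));
        index.add(new Occurrence("containers", "Data.Map", "insert",
                                 "Data/Map.hs", 40, true));
        for (Occurrence o : index.search("Data.Map.insert")) {
            System.out.println(o.file + ":" + o.line
                + (o.isDefinition ? " (definition)" : " (use)"));
        }
    }
}
```

A plain-text index of Foo.hs would only contain whatever the programmer typed
there (say `insert` or `M.insert`), so the query [Data.Map.insert] could not
find that use site without the resolved names from the compiler.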


On Thu, Jun 5, 2014 at 11:11 AM, Aditya <findbestopensou...@gmail.com>
wrote:

> Just keep it simple. Index the entire source file. One source file is one
> document. While indexing, preserve dot (.), hyphen (-) and other special
> characters. You could use the whitespace analyzer.
>
> I hope it helps
>
> Regards
> Aditya
> www.findbestopensource.com
>
>
> On Wed, Jun 4, 2014 at 3:29 PM, Johan Tibell <johan.tib...@gmail.com>
> wrote:
>
> > The majority of queries will be look-ups of functions/types by fully
> > qualified name. For example, the query [Data.Map.insert] will find the
> > definition and all uses of the `insert` function defined in the `Data.Map`
> > module. The corpus is all Haskell open source code on hackage.haskell.org.
> >
> > Being able to support qualified name queries is the main benefit of
> > indexing the output of the compiler (which has resolved unqualified names
> > to qualified names) rather than using a simple text-based indexing.
> >
> > There are three levels of name qualification I want to support in queries:
> >
> >  * Unqualified: myFunction
> >  * Module qualified: MyModule.myFunction
> >  * Package and module qualified: mypackage-MyModule.myFunction
> >
> > I expect the middle one to be used the most. The last form is sometimes
> > needed for disambiguation and the first is nice to support as a shorthand
> > when the function name is unlikely to be ambiguous.
> >
> > For scoring I'd like to have a couple of attributes available. The most
> > important one is whether a term represents a use site or a definition site.
> > This would allow the definition of a function to appear as the first search
> > result.
> >
> > Is this precise enough? Naturally the scope will grow over time, but this
> > is the core of what I'm trying to do.
> >
> > -- Johan
> >
> >
> > On Wed, Jun 4, 2014 at 8:02 AM, Aditya <findbestopensou...@gmail.com>
> > wrote:
> >
> > > Hi Johan,
> > >
> > > How do you want to search? What are your search requirements? You need
> > > to index according to those. You could check DuckDuckGo or GitHub code
> > > search.
> > >
> > > The easiest approach would be to have a parser which reads each source
> > > file and indexes it as a single document. When you search, you will have
> > > a single search field which searches the index and retrieves the results.
> > > The search field accepts any text in the source file: a function name,
> > > class name, comment, variable, etc.
> > >
> > > Another approach is to have different search fields for functions,
> > > classes, packages, etc. You need to parse the file, identify comments,
> > > function names, class names, etc., and index each in a separate field.
> > >
> > >
> > > Regards
> > > Aditya
> > > www.findbestopensource.com
> > >
> > >
> > >
> > >
> > > On Wed, Jun 4, 2014 at 7:02 AM, Johan Tibell <johan.tib...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I'd like to index (Haskell) source code. I've run the source code
> > > > through a compiler (GHC) to get rich information about each token (its
> > > > type, fully qualified name, etc) that I want to index (and later use
> > > > when ranking).
> > > >
> > > > I'm wondering how to approach indexing source code. I can see two
> > > > possible approaches:
> > > >
> > > >  * Create a file containing all the metadata and write a custom
> > > > tokenizer/analyzer that processes the file. The file could use a simple
> > > > line-based format:
> > > >
> > > > myFunction,1:12-1:22,my-package,defined-here,more-metadata
> > > > myFunction,5:11-5:21,my-package,used-here,more-metadata
> > > > ...
> > > >
> > > > The tokenizer would use CharTermAttribute to write the function name,
> > > > OffsetAttribute to write the source span, etc.
> > > >
> > > >  * Use an IndexWriter to create a Document directly, as done here:
> > > >
> > > > http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3
> > > >
> > > > I'm new to Lucene so I can't quite tell which approach is more likely
> > > > to work well. Which way would you recommend?
> > > >
> > > > Other things I'd like to do that might influence the answer:
> > > >
> > > >  - Index several tokens at the same position, so I can index both the
> > > > fully qualified name (e.g. module.myFunction) and unqualified name (e.g.
> > > > myFunction) for a term.
> > > >
> > > > -- Johan
> > > >
> > >
> >
>
