How to approach indexing source code?

2014-06-03 Thread Johan Tibell
Hi,

I'd like to index (Haskell) source code. I've run the source code through a
compiler (GHC) to get rich information about each token (its type, fully
qualified name, etc) that I want to index (and later use when ranking).

I'm wondering how to approach indexing source code. I can see two possible
approaches:

 * Create a file containing all the metadata and write a custom
tokenizer/analyzer that processes the file. The file could use a simple
line-based format:

myFunction,1:12-1:22,my-package,defined-here,more-metadata
myFunction,5:11-5:21,my-package,used-here,more-metadata
...

The tokenizer would use CharTermAttribute to write the function name,
OffsetAttribute to write the source span, etc.

 * Use an IndexWriter to create a Document directly, as done here:
http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3
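For the first approach, the metadata file is easy to parse before any Lucene code gets involved. A minimal sketch of a parser for the line format above (the TokenInfo type and its field names are my own invention; the field layout is just the one shown in the example lines):

```java
import java.util.Arrays;
import java.util.List;

public class MetadataLine {
    // One parsed metadata line: name, source span, package, kind, trailing fields.
    record TokenInfo(String name, String span, String pkg, String kind, List<String> extra) {}

    static TokenInfo parse(String line) {
        String[] f = line.split(",");
        return new TokenInfo(f[0], f[1], f[2], f[3],
                Arrays.asList(f).subList(4, f.length));
    }

    public static void main(String[] args) {
        TokenInfo t = parse("myFunction,1:12-1:22,my-package,defined-here,more-metadata");
        System.out.println(t.name() + " " + t.kind()); // myFunction defined-here
    }
}
```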

I'm new to Lucene so I can't quite tell which approach is more likely to
work well. Which way would you recommend?

Other things I'd like to do that might influence the answer:

 - Index several tokens at the same position, so I can index both the fully
qualified name (e.g. module.myFunction) and unqualified name (e.g.
myFunction) for a term.
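On that last point: in Lucene, stacking several tokens at one position is done by giving the extra tokens a position increment of 0 (via PositionIncrementAttribute). A sketch of the idea using plain pairs instead of Lucene attributes (the PosToken type and variants method are hypothetical names):

```java
import java.util.List;

public class SynonymStack {
    // A term plus the position increment Lucene would record for it.
    record PosToken(String term, int positionIncrement) {}

    // Emit the qualified name at the token's position, then the unqualified
    // name stacked at the same position (increment 0), the way Lucene's
    // PositionIncrementAttribute expresses synonyms.
    static List<PosToken> variants(String qualified) {
        int dot = qualified.lastIndexOf('.');
        String unqualified = dot < 0 ? qualified : qualified.substring(dot + 1);
        return List.of(new PosToken(qualified, 1), new PosToken(unqualified, 0));
    }

    public static void main(String[] args) {
        System.out.println(variants("module.myFunction"));
    }
}
```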

-- Johan


Re: How to approach indexing source code?

2014-06-03 Thread Jack Krupansky
The first question for any search app should always be: How do you intend to 
query the data? That will in large part determine how you should index the 
data.


IOW, how do you intend to use the data? Be specific.

Provide some sample queries and then work backwards to how the data needs to 
be indexed.


-- Jack Krupansky

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to approach indexing source code?

2014-06-03 Thread Aditya
Hi Johan,

How you want to search determines how you should index, so start from your
search requirements. You could look at DuckDuckGo or GitHub code search for
reference.

The easiest approach would be a parser that reads each source file and
indexes it as a single document. Searches then go through a single search
field that queries the index and retrieves results. That field accepts any
text appearing in the source file: function names, class names, comments,
variables, etc.

Another approach is to have separate search fields for functions, classes,
packages, etc. You would parse the file, identify comments, function names,
class names, and so on, and index each in its own field.


Regards
Aditya
www.findbestopensource.com






Re: How to approach indexing source code?

2014-06-04 Thread Johan Tibell
The majority of queries will be lookups of functions/types by fully
qualified name. For example, the query [Data.Map.insert] will find the
definition and all uses of the `insert` function defined in the `Data.Map`
module. The corpus is all Haskell open source code on hackage.haskell.org.

Being able to support qualified name queries is the main benefit of
indexing the output of the compiler (which has resolved unqualified names
to qualified names) rather than using simple text-based indexing.

There are three levels of name qualification I want to support in queries:

 * Unqualified: myFunction
 * Module qualified: MyModule.myFunction
 * Package and module qualified: mypackage-MyModule.myFunction

I expect the middle one to be used the most. The last form is sometimes
needed for disambiguation and the first is nice to support as a shorthand
when the function name is unlikely to be ambiguous.
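Since all three forms should match the same occurrence, one option is to index every occurrence under all three terms (e.g. stacked at one position) and keep queries as simple term lookups. A sketch of generating the three forms from their components (the separators just follow the examples above; building terms from components rather than parsing the combined string sidesteps the fact that Haskell package names can themselves contain hyphens):

```java
public class QueryForms {
    // Build the three index terms for one definition, from its components.
    // The package-Module.name separator convention is an assumption taken
    // from the examples in this message, not a fixed Lucene convention.
    static String[] forms(String pkg, String module, String name) {
        return new String[] {
            name,                            // unqualified: myFunction
            module + "." + name,             // module qualified: MyModule.myFunction
            pkg + "-" + module + "." + name  // package+module: mypackage-MyModule.myFunction
        };
    }

    public static void main(String[] args) {
        for (String f : forms("mypackage", "MyModule", "myFunction"))
            System.out.println(f);
    }
}
```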

For scoring I'd like to have a couple of attributes available. The most
important one is whether a term represents a use site or a definition site.
This would allow the definition of a function to appear as the first search
result.

Is this precise enough? Naturally the scope will grow over time, but this
is the core of what I'm trying to do.

-- Johan




Re: How to approach indexing source code?

2014-06-04 Thread Michael Sokolov
Probably the simplest thing is to define a field for each of the 
contexts you are interested in, but you might want to consider using a 
tagged-token approach.


I spent a while figuring out how to index tagged tree-structured data 
and came up with Lux (http://luxdb.org) - basically it accepts XML and 
indexes all the text using tagname prefixes; each word gets indexed as 
itself, and with tags as prefixes (something like: Data.Map.insert;
function-definition:Data.Map.insert; function-call:Data.Map.insert,
etc).  So one approach would be to convert your syntax tree into XML and 
use a generic XML indexing solution (there are others) based on Lucene.


Or you could just borrow the same idea and build your own TokenStream 
that produces tagged tokens.


With this tagged token approach, you don't need to define a field for 
every different possible tag; you can just use a generic tagged-text 
field, and include the tag as part of the indexed token in that field.  
It also makes it possible to perform proximity queries with tokens that 
have different tags; I don't know if it is possible to do that when the 
tokens are in different fields.
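The prefix scheme described above can be sketched without any Lucene machinery: each word is indexed as itself plus one tag-prefixed copy per context tag. The tag names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedTokens {
    // Emit the bare term followed by one "tag:term" copy per context tag,
    // all destined for a single generic tagged-text field.
    static List<String> tagged(String term, List<String> tags) {
        List<String> out = new ArrayList<>();
        out.add(term);
        for (String t : tags) out.add(t + ":" + term);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tagged("Data.Map.insert", List.of("function-definition")));
        // [Data.Map.insert, function-definition:Data.Map.insert]
    }
}
```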


Another option is to use payloads to store additional information about 
each token; if you search for part-of-speech tagging with Lucene you 
should find a lot of discussion about a parallel use case (people want 
to tag words as verbs, nouns, etc).  I seem to remember someone using 
payloads for that, although I think that involves more low-level Lucene 
programming than the tagged-token approach I described above.


-Mike


Re: How to approach indexing source code?

2014-06-05 Thread Aditya
Just keep it simple. Index the entire source file: one source file is one
document. While indexing, preserve dot (.), hyphen (-) and other special
characters. You could use the whitespace analyzer.
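As a quick illustration of why whitespace tokenization helps here: splitting only on whitespace (which is what Lucene's WhitespaceAnalyzer does) leaves dotted and hyphenated identifiers intact as single terms:

```java
import java.util.Arrays;

public class WhitespaceSplit {
    // Split on runs of whitespace only, as a whitespace analyzer would.
    static String[] terms(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // "Data.Map.Map" survives as one term; a standard analyzer would split it.
        String line = "insert :: Ord k => k -> a -> Data.Map.Map k a";
        System.out.println(Arrays.toString(terms(line)));
    }
}
```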

I hope it helps

Regards
Aditya
www.findbestopensource.com




Re: How to approach indexing source code?

2014-06-05 Thread Johan Tibell
By "index the entire source file" do you mean "don't index the compiler
output"? If so, that doesn't sound very appealing as it loses most of the
benefit of having a search engine built for searching source code.




Re: How to approach indexing source code?

2014-06-05 Thread Aditya
It depends on your requirements. You could index either the source files or
the compiler output. Try a proof of concept; you will get a feel for how to
move forward.

Regards
Aditya
www.findbestopensource.com





Re: How to approach indexing source code?

2014-06-05 Thread Johan Tibell
I will definitely try a prototype. My main question is whether I'm better
off creating documents directly or if I should try to parse the compiler
output using an analyzer/tokenizer.



Re: How to approach indexing source code?

2014-06-05 Thread Michael Sokolov
If you already have a parser for the language, you could use it to 
create a TokenStream that you can feed to Lucene.  That way you won't be 
trying to reinvent a parser using tools designed for natural language.
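The adapter this suggests might look like the following sketch, with the Lucene types swapped for a plain iterator so the shape is visible. In real Lucene code this class would extend TokenStream, and incrementToken() would copy each parsed token into CharTermAttribute/OffsetAttribute instead of returning it. The Tok record and its fields are hypothetical stand-ins for whatever the compiler front end produces:

```java
import java.util.Iterator;
import java.util.List;

public class ParserTokenStream implements Iterator<ParserTokenStream.Tok> {
    // What a compiler-produced token might carry (fields hypothetical).
    public record Tok(String term, int startOffset, int endOffset) {}

    private final Iterator<Tok> parsed;

    public ParserTokenStream(List<Tok> parserOutput) {
        this.parsed = parserOutput.iterator();
    }

    // In a real Lucene TokenStream, these two would collapse into
    // incrementToken(): advance, fill the attributes, return true/false.
    public boolean hasNext() { return parsed.hasNext(); }
    public Tok next() { return parsed.next(); }

    public static void main(String[] args) {
        var ts = new ParserTokenStream(List.of(new Tok("Data.Map.insert", 12, 22)));
        while (ts.hasNext()) System.out.println(ts.next().term());
    }
}
```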


-Mike
