How to approach indexing source code?
Hi,

I'd like to index (Haskell) source code. I've run the source code through a compiler (GHC) to get rich information about each token (its type, fully qualified name, etc.) that I want to index (and later use when ranking).

I'm wondering how to approach indexing source code. I can see two possible approaches:

* Create a file containing all the metadata and write a custom tokenizer/analyzer that processes the file. The file could use a simple line-based format:

myFunction,1:12-1:22,my-package,defined-here,more-metadata
myFunction,5:11-5:21,my-package,used-here,more-metadata
...

The tokenizer would use CharTermAttribute to write the function name, OffsetAttribute to write the source span, etc.

* Use an IndexWriter to create a Document directly, as done here:
http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3

I'm new to Lucene, so I can't quite tell which approach is more likely to work well. Which way would you recommend?

Other things I'd like to do that might influence the answer:

- Index several tokens at the same position, so I can index both the fully qualified name (e.g. module.myFunction) and the unqualified name (e.g. myFunction) for a term.

-- Johan
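For the first approach, the custom tokenizer would start by parsing each metadata line before copying fields into Lucene attributes. Below is a minimal sketch of just that parsing step (plain Java; the class and field names are illustrative, and the comments only indicate where each field could go in a real tokenizer):

```java
// Parses one line of the proposed format:
//   myFunction,1:12-1:22,my-package,defined-here,more-metadata
// The resulting fields are what the tokenizer would copy into Lucene's
// attributes (e.g. the name into CharTermAttribute; the span into
// OffsetAttribute or a payload, depending on the final design).
public class MetadataLine {
    public final String name;                              // term text
    public final int startLine, startCol, endLine, endCol; // source span
    public final String pkg;                               // owning package
    public final String kind;                              // defined-here / used-here

    public MetadataLine(String name, int sl, int sc, int el, int ec,
                        String pkg, String kind) {
        this.name = name;
        this.startLine = sl; this.startCol = sc;
        this.endLine = el;   this.endCol = ec;
        this.pkg = pkg;      this.kind = kind;
    }

    public static MetadataLine parse(String line) {
        String[] fields = line.split(",");
        String[] span = fields[1].split("-");   // "1:12" and "1:22"
        String[] start = span[0].split(":");
        String[] end = span[1].split(":");
        return new MetadataLine(fields[0],
                Integer.parseInt(start[0]), Integer.parseInt(start[1]),
                Integer.parseInt(end[0]), Integer.parseInt(end[1]),
                fields[2], fields[3]);
    }
}
```

Note that OffsetAttribute expects character offsets into the indexed text, while the format above carries line:column spans, so the span may be better kept in a payload or stored field.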
Re: How to approach indexing source code?
The first question for any search app should always be: how do you intend to query the data? That will in large part determine how you should index it. In other words, how do you intend to use the data? Be specific: provide some sample queries and then work backwards to how the data needs to be indexed.

-- Jack Krupansky

-----Original Message-----
From: Johan Tibell
Sent: Tuesday, June 3, 2014 9:32 PM
To: java-user@lucene.apache.org
Subject: How to approach indexing source code?
Re: How to approach indexing source code?
Hi Johan,

How do you want to search? What are your search requirements? That determines how you need to index. You could look at DuckDuckGo or GitHub code search for comparison.

The easiest approach would be a parser that reads each source file and indexes it as a single document. When you search, a single search field searches the index and retrieves the results. The search field accepts any text from the source file: function names, class names, comments, variables, etc.

Another approach is to have different search fields for functions, classes, packages, etc. You need to parse the file, identify comments, function names, class names, and so on, and index each in a separate field.

Regards
Aditya
www.findbestopensource.com
Re: How to approach indexing source code?
The majority of queries will be look-ups of functions/types by fully qualified name. For example, the query [Data.Map.insert] will find the definition and all uses of the `insert` function defined in the `Data.Map` module. The corpus is all Haskell open source code on hackage.haskell.org.

Being able to support qualified-name queries is the main benefit of indexing the output of the compiler (which has resolved unqualified names to qualified names) rather than using simple text-based indexing.

There are three levels of name qualification I want to support in queries:

* Unqualified: myFunction
* Module qualified: MyModule.myFunction
* Package and module qualified: mypackage-MyModule.myFunction

I expect the middle one to be used the most. The last form is sometimes needed for disambiguation, and the first is nice to support as a shorthand when the function name is unlikely to be ambiguous.

For scoring I'd like to have a couple of attributes available. The most important one is whether a term represents a use site or a definition site. This would allow the definition of a function to appear as the first search result.

Is this precise enough? Naturally the scope will grow over time, but this is the core of what I'm trying to do.

-- Johan
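One way to support all three qualification levels with a single indexed field is to index every variant of a name as its own term at the same token position (in Lucene, by giving the extra variants a position increment of 0), so any of the three query forms matches the same occurrence. A sketch of the variant expansion, using the hypothetical names from the examples above:

```java
import java.util.Arrays;
import java.util.List;

public class NameVariants {
    // For a name like mypackage-MyModule.myFunction, emit all three query
    // forms listed in the thread. Indexing them at one position (the
    // extras with position increment 0) lets [myFunction],
    // [MyModule.myFunction], and [mypackage-MyModule.myFunction] all hit
    // the same token.
    public static List<String> variants(String pkg, String module, String name) {
        return Arrays.asList(
                name,                               // unqualified
                module + "." + name,                // module qualified
                pkg + "-" + module + "." + name);   // package and module qualified
    }
}
```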
Re: How to approach indexing source code?
Probably the simplest thing is to define a field for each of the contexts you are interested in, but you might want to consider using a tagged-token approach. I spent a while figuring out how to index tagged tree-structured data and came up with Lux (http://luxdb.org): basically it accepts XML and indexes all the text using tag-name prefixes; each word gets indexed as itself and with tags as prefixes (something like: Data.Map.insert; function-definition:Data.Map.insert; function-call:Data.Map.insert, etc.). So one approach would be to convert your syntax tree into XML and use a generic XML indexing solution based on Lucene (there are others). Or you could just borrow the same idea and build your own TokenStream that produces tagged tokens.

With this tagged-token approach, you don't need to define a field for every possible tag; you can use a single generic tagged-text field and include the tag as part of the indexed token. It also makes it possible to perform proximity queries across tokens that have different tags; I don't know whether that is possible when the tokens are in different fields.

Another option is to use payloads to store additional information about each token. If you search for part-of-speech tagging with Lucene you should find a lot of discussion about a parallel use case (people wanting to tag words as verbs, nouns, etc.). I seem to remember someone using payloads for that, although I think it involves more low-level Lucene programming than the tagged-token approach described above.

-Mike
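The tag-prefix scheme above can be sketched independently of Lucene: each occurrence yields the bare term plus one prefixed term per tag, and in a TokenStream the prefixed terms would be stacked at the same position (position increment 0). Illustrative Java, following the Data.Map.insert example:

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedTokens {
    // Emit the bare term plus one tag-prefixed term per tag, e.g.
    //   Data.Map.insert, function-definition:Data.Map.insert
    // In a TokenStream the prefixed terms would share the bare term's
    // position, so proximity queries still work across tags.
    public static List<String> expand(String term, List<String> tags) {
        List<String> out = new ArrayList<>();
        out.add(term);
        for (String tag : tags) {
            out.add(tag + ":" + term);
        }
        return out;
    }
}
```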
Re: How to approach indexing source code?
Just keep it simple: index the entire source file, one source file as one document. While indexing, preserve dot (.), hyphen (-), and other special characters; you could use a whitespace analyzer.

I hope it helps.

Regards
Aditya
www.findbestopensource.com
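The point of a whitespace analyzer here is that, unlike analyzers that split on punctuation, it leaves '.' and '-' inside a term, so a fully qualified name survives as a single searchable token. A plain-Java illustration of the effect (not Lucene's actual WhitespaceAnalyzer, just the equivalent split):

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceDemo {
    // Splitting only on whitespace keeps "Data.Map.insert" and
    // "mypackage-MyModule.myFunction" intact as single terms, which is
    // what makes qualified-name queries possible with this approach.
    public static List<String> tokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }
}
```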
Re: How to approach indexing source code?
By "index the entire source file" do you mean "don't index the compiler output"? If so, that doesn't sound very appealing, as it loses most of the benefit of having a search engine built for searching source code.
Re: How to approach indexing source code?
It is up to your requirements. You could index either the source files or the compiler output. Try a proof of concept; you will get some idea of how to move forward.

Regards
Aditya
www.findbestopensource.com
Re: How to approach indexing source code?
I will definitely try a prototype. My main question is whether I'm better off creating Documents directly or whether I should try to parse the compiler output using an analyzer/tokenizer.
Re: How to approach indexing source code?
If you already have a parser for the language, you could use it to create a TokenStream to feed to Lucene. That way you won't be trying to reinvent a parser using tools designed for natural language.

-Mike
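A minimal model of such a parser-driven token stream, with the three attributes discussed in the thread mapped to plain fields (in real Lucene code these would be CharTermAttribute, OffsetAttribute, and PositionIncrementAttribute obtained via addAttribute; everything here is an illustrative sketch, not the actual API):

```java
import java.util.Iterator;
import java.util.List;

public class ParserTokenStream {
    // One token as a parser-backed stream would emit it.
    public static class Token {
        public final String term;    // would go into CharTermAttribute
        public final int start, end; // would go into OffsetAttribute
        public final int posInc;     // would go into PositionIncrementAttribute
                                     // (0 = stacked at the previous position)
        public Token(String term, int start, int end, int posInc) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
        }
    }

    private final Iterator<Token> it;

    public ParserTokenStream(List<Token> tokens) {
        this.it = tokens.iterator();
    }

    // Mirrors TokenStream.incrementToken(): advance the stream and report
    // the next token, or null when exhausted. A real implementation would
    // copy the fields into the shared attributes and return a boolean.
    public Token incrementToken() {
        return it.hasNext() ? it.next() : null;
    }
}
```

Stacking the unqualified name behind the qualified one (posInc 0) is how both forms end up searchable at the same position.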
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org