Re: Lucene features
On Wed, Sep 03, 2003 at 02:42:48PM -0400, Chris Sibert wrote:
> > > I am wondering if Lucene is the way to go for my project.
> >
> > Probably. Tell us a little about your project.
>
> It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB
> in size. They don't ever change, and are on a CD-ROM. Each file contains a
> bunch of small documents. I just create one index for all 4 of them. These
> documents are for an association that I belong to - they contain a history
> of the association's documents - and my application allows you to search
> them.

Well, aside from your concerns about the second list, Lucene seems perfect for your needs. You'd parse apart the four big files into a bunch of small documents, then parse those small documents and create Lucene Documents, containing Fields, and add them to the index.

> They are actually currently indexed by an application called
> 'Sonar', by Virginia Systems. But I REALLY didn't like using their
> user interface - blech - so I decided to write a new interface for
> my own use. But Sonar costs some real bucks to be able to develop
> against their search API, so I found Lucene, and decided to go with
> it.
>
> Here are the search features that 'Sonar' has :
> Boolean Searching
> Proximity Searching
> Wild Card Searching
> Field/Block Searching

I'm not sure what Field/Block means. Boolean, Proximity and WildCard are pretty typical in Lucene searches. You should probably take a look at the Query Parser syntax docs:

http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

> Relevancy Ranking / Date Ranking

Lucene search results are typically ranked by relevance, and you can tweak the search to adjust this (there's a fair bit of discussion of this in the lucene-user archives; good keywords to look for are "slop" and "boost"). Sorting output by date might take some finesse. I haven't played with sorting by date, but I'd expect to handle that by directly instantiating a QueryTerm to indicate the date issues.

> List of Occurrences in Context

I assume here that you mean displaying the results with a little snapshot of the text around each hit. There have been discussions about how best to do this (often focused around highlighting the search terms in the displayed text) on the lucene-user list. Check the list archive.

> Phonetic Searching

I'd guess you need to build this one yourself, perhaps by using a soundex algorithm when indexing the original data files.

> Synonyms/Concepts

Likewise... you'd need to come up with some sort of ontology of synonyms and concepts, then parse the fields you're indexing and generate a synonym/concept field that you'd add to the Lucene Document.

> Relational Searching
> Associated Words
> Drill Down Search Narrowing

I'm not sure what these three mean.

> I think that Lucene has all the features in the first group. How does it
> stack up against the second group ?

I'm afraid I haven't been too helpful here. Perhaps if you clarify what the above mean, folks can post about how to implement them in Lucene.

> I'm writing the whole thing in Swing, which has been time consuming,
> and so have invested quite a bit of time into this project. But I'm
> seeing the end of the tunnel, and want to make sure that I'm going
> down the right path before I spend too much more time on it.

It sounds like you ought to at least seriously consider using Lucene, if you can find or implement equivalent features, or decide you can live without them.

--
Steven J. Owens
[EMAIL PROTECTED]

"I'm going to make broad, sweeping generalizations and strong, declarative statements, because otherwise I'll be here all night and this document will be four times longer and much less fun to read. Take it all with a grain of salt." - Me at http://darksleep.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
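On the phonetic searching suggestion above, here is a minimal sketch of the classic American Soundex encoding in plain Java. It is only an illustration and not part of Lucene: the class name SoundexSketch is made up, and it handles ASCII letters only. The idea would be to store the code of each word in an extra field at indexing time and to encode the query term the same way before searching that field.

```java
import java.util.Locale;

/** Minimal American Soundex sketch (ASCII letters only); the class name is hypothetical. */
class SoundexSketch {
    // Digit for each letter A..Z; '0' marks vowels and the uncoded letters H, W, Y.
    private static final String CODES = "01230120022455012623010202";

    static String encode(String word) {
        String upper = word.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (int i = 0; i < upper.length() && out.length() < 4; i++) {
            char c = upper.charAt(i);
            if (c < 'A' || c > 'Z') {
                continue;                      // skip anything that is not an ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                 // the first letter is kept as a letter
                lastCode = code;
            } else {
                if (code != '0' && code != lastCode) {
                    out.append(code);          // new consonant group
                }
                if (c != 'H' && c != 'W') {
                    lastCode = code;           // H and W do not separate equal codes
                }
            }
        }
        while (out.length() < 4) {
            out.append('0');                   // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert"));  // R163
        System.out.println(encode("Rupert"));  // R163
    }
}
```

Words that sound alike, such as "Robert" and "Rupert", end up with the same code, so a search on the phonetic field would find both.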
Lucene app to index Java code
Hello,

Has anyone written an application that uses Lucene to index Java code, either from the source .java files, or compiled .class files?

I need to create a searchable index for Java code, so that I can use that index to check if classes or methods with certain functionality have already been written. This is an effort to remove code duplication and do more code re-use. If this application can also index Javadocs, even better!

I think I heard of somebody doing this already. Kevin Burton?

This is something that would fit nicely in Erik's Ant IndexTask (in the Lucene Sandbox), I think.

Thank you,
Otis
Re: Lucene app to index Java code
Hi Otis,

On Thursday, Sep 4, 2003, Otis Gospodnetic wrote:
> Has anyone written an application that uses Lucene to index Java code,
> either from the source .java files, or compiled .class files?

If you are talking about my ultra secret project "Zapata: Coding Mexican Style", then yes ;)

But... it uses runtime information to reach its devious ends and is more like a documentation tool than anything else...

Anyway, this is how it goes:

Given a set of binary jar files, it builds an object graph of the bytecode: packages, classes, methods and so on. Complete with interdependencies and other handy information. The bytecode is also run through a decompiler and pretty printed to normalize the source. Code segments are attached and indexed alongside their owners (class or method). All of this is fully indexed, searchable and cross-referenced.

This is built upon the same engine used by ZOE, so the end result is very much along the lines of what ZOE does for email, but for code instead... fun, fun, fun ;)

Cheers,

PA.
Re: Lucene app to index Java code
What you describe sounds interesting, but I was thinking more along the lines of this:

http://www.peerfear.org/rss/permalink/2003/07/23/LuceneForSourceManagement/

An application that I could use to find out whether I already have a 'getStudents' or 'getStudents*' method somewhere in the source code, for instance, before I start writing it. As the code base grows larger, and as the team that works with it becomes bigger, this tool becomes more and more valuable.

If this application could also index Javadocs, so that I can search for methods or classes that mention +student* +(database OR db) +update, that would be even better.

Has anyone done this? Kevin Burton mentioned something similar to what I described above, at that URL, but it looks like he didn't make his application available.

Thanks,
Otis
Re: Lucene app to index Java code
A couple of thoughts on this:

- Eclipse uses Lucene for its code indexing/searching (I learned this at the OSCON Keynote by Eclipse folks). Perhaps looking at how Eclipse does its thing would be useful even if not the solution.

- XDoclet could be used to sweep through Java code and build a text/XML file as richly as you'd like from the information there (complete with JavaDoc tags, which Zapata will miss :)), and then run Lucene on the generated files. On a related note, the XDoclet2 architecture would streamline this even further by eliminating the middle textual representation (QDox/XJavadoc reads Java as a "meta data provider" and then a Lucene "plugin" indexes things). It could be done without the intermediate text representation even in XDoclet 1.2, but it would require coding a custom subtask and be slightly out of the norm for XDoclet subtasks (but would work just fine).

- My task could be used, but it would be better to use something that built a complete object-graph of all the source code you want indexed, so that it can deal with base classes, inherited javadoc tags, and other such interactions between classes you might want to capture.

Erik
Re: Lucene app to index Java code
Hi Erik,

On Thursday, Sep 4, 2003, at 15:03 Europe/Zurich, Erik Hatcher wrote:
> - XDoclet could be used to sweep through Java code and build a
> text/XML file as richly as you'd like from the information there
> (complete with JavaDoc tags, which Zapata will miss :)),

Correct. This happens to be on purpose :) Does XDoclet build an "intertwingled" object graph of your code along the way? Performing a plain search on a code base is pretty trivial... what seems to be more interesting would be to put that in context.

Zapata does something along the lines of what MagicHat does for Objective-C:

http://homepage.mac.com/petite_abeille/MagicHat/

But from the sound of what Otis is saying, this is not what you guys are looking for... back to the pampa then...

Cheers,

PA.
Re: Lucene app to index Java code
On Thursday, September 4, 2003, at 09:19 AM, petite_abeille wrote:
> > - XDoclet could be used to sweep through Java code and build a
> > text/XML file as richly as you'd like from the information there
> > (complete with JavaDoc tags, which Zapata will miss :)),
>
> Correct. This happens to be on purpose :) Does XDoclet build an
> "intertwingled" object graph of your code along the way? Performing a
> plain search on a code base is pretty trivial... what seems to be more
> interesting would be to put that in context.

Yes, XDoclet builds a complete object graph of all the source files you hand it (as an Ant ). It actually even does binary class interpretation for the information it needs to construct a full object-graph if some dependencies are in the classpath of the taskdef as well.

> Zapata does something along the lines of what MagicHat does for
> Objective-C:
>
> http://homepage.mac.com/petite_abeille/MagicHat/

Very cool. You rock!

Erik
StandardTokenizer problem
Hi,

When I use StandardTokenizer to parse, for example, "I.B.M", the type of the Token is HOST and not ACRONYM. Why?

In StandardTokenizer.jj:

  // acronyms: U.S.A., I.B.M., etc.
  // use a post-filter to remove dots
| <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >

  // hostname
| <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >

"I.B.M" can be a host or an acronym, so there is a problem, no?
Re: StandardTokenizer problem
On Thursday, Sep 4, 2003, at 16:07 Europe/Zurich, Nicolas Maisonneuve wrote:
> "I.B.M" can be a host or an acronym, so there is a problem, no?

Perhaps as far as this parser goes... but... in practice... '.M' is not a valid TLD.

PA.
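One possible reading of the two grammar rules discussed in this thread is that ACRONYM requires a dot after every letter group, including the last one, while HOST only requires dots between groups. Under that reading "I.B.M" (no trailing dot) can only be a HOST, whereas "I.B.M." would be an ACRONYM. The sketch below checks this with java.util.regex. The two patterns are ASCII-only approximations written for this example, not the tokenizer's actual definitions, and the class name TokenTypeSketch is made up.

```java
import java.util.regex.Pattern;

/** Regex approximation of the ACRONYM and HOST rules; names and patterns are assumptions. */
class TokenTypeSketch {
    // Letters followed by a dot, at least twice in total, ending with a dot: "U.S.A."
    private static final Pattern ACRONYM =
            Pattern.compile("[A-Za-z]+\\.([A-Za-z]+\\.)+");
    // Alphanumeric groups separated by dots, with no trailing dot: "www.apache.org"
    private static final Pattern HOST =
            Pattern.compile("[A-Za-z0-9]+(\\.[A-Za-z0-9]+)+");

    /** Returns the name of the first rule that matches the whole token. */
    static String classify(String token) {
        if (ACRONYM.matcher(token).matches()) {
            return "ACRONYM";
        }
        if (HOST.matcher(token).matches()) {
            return "HOST";
        }
        return "OTHER";
    }

    public static void main(String[] args) {
        System.out.println(classify("I.B.M"));   // HOST
        System.out.println(classify("I.B.M."));  // ACRONYM
    }
}
```

If that reading is right, adding the trailing dot to the input is enough to get the ACRONYM type.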
Split results based on the value of a field
Hi,

I have a requirement where I'd like to pull search results back and split them up based on some keyword field. So for example, say there's a field named 'category'; I'd like to be able to have the results displayed as such:

Search Results for Category A: 1, 2, 3
Search Results for Category B: 1, 2, 3

The two ways I can think of to do this are:

1) A post-process of the results, collecting the first x hits for each category.
2) Running a different search per category.

This is probably a long shot, but I was wondering if the search itself has the means to filter out documents based on a limit of occurrences of a value for a given search field. So for example, if there are 5 categories, and we only want to show 5 results per category, then the maximum number of hits returned would be 25. This is because the value 'Category A' for the field 'category' can only appear 5 times, and so forth.

Can anyone think of a way to achieve this?

Thanks.
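Option 1 above (post-processing the hits) could look roughly like the sketch below. It is plain Java and does not touch the Lucene API: the Hit class, its fields and the topPerCategory method are hypothetical names, and the input list is assumed to be already sorted by relevance, with the 'category' value read from each hit's stored field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Groups ranked hits by category, keeping at most N per category (hypothetical names). */
class GroupHitsSketch {
    /** One search hit: a document id plus the value of its 'category' field. */
    static final class Hit {
        final int docId;
        final String category;

        Hit(int docId, String category) {
            this.docId = docId;
            this.category = category;
        }
    }

    /** Walks the hits in rank order and keeps the first perCategory hits of each category. */
    static Map<String, List<Hit>> topPerCategory(List<Hit> rankedHits, int perCategory) {
        Map<String, List<Hit>> groups = new LinkedHashMap<>();  // keeps first-seen category order
        for (Hit hit : rankedHits) {
            List<Hit> bucket = groups.computeIfAbsent(hit.category, k -> new ArrayList<>());
            if (bucket.size() < perCategory) {
                bucket.add(hit);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(1, "A"), new Hit(2, "B"), new Hit(3, "A"), new Hit(4, "A"));
        for (Map.Entry<String, List<Hit>> e : topPerCategory(hits, 2).entrySet()) {
            System.out.println("Search Results for Category " + e.getKey()
                    + ": " + e.getValue().size() + " hits");
        }
    }
}
```

With 5 categories and a limit of 5 this returns at most 25 hits, but it still has to walk the whole ranked list in the worst case, which is the cost that option 2 (one search per category) trades against running several queries.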
Multi-term NOT queries
Hi,

According to the QueryParser page: "The NOT operator cannot be used with just one term". Is this also true for multi-term NOT queries? E.g.

NOT "jakarta apache" AND NOT "lucene"

My tests suggest so, but I'd like to hear from someone who'd know for sure. Also, is this a limitation of the QueryParser or of the Query API itself?

Thanks!
Eugene.
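As far as I know, the behaviour comes from how boolean queries are evaluated rather than from the parser: prohibited clauses only remove documents from the set selected by the other clauses, so a query made up only of NOT clauses has nothing to remove from and matches nothing, however many terms it has. The toy model below shows that with plain sets of document ids. It is not Lucene code, and the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy set-based model of required and prohibited clauses (not Lucene code). */
class NotQuerySketch {
    /** Intersects the required sets, then subtracts every prohibited set. */
    static Set<Integer> evaluate(List<Set<Integer>> required, List<Set<Integer>> prohibited) {
        Set<Integer> result = new TreeSet<>();
        if (required.isEmpty()) {
            return result;                 // only NOT clauses: nothing to subtract from
        }
        result.addAll(required.get(0));
        for (Set<Integer> r : required) {
            result.retainAll(r);           // every required clause must match
        }
        for (Set<Integer> p : prohibited) {
            result.removeAll(p);           // drop documents matching a prohibited clause
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> allDocs = new TreeSet<>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> jakartaApache = new TreeSet<>(Arrays.asList(1, 2));
        Set<Integer> lucene = new TreeSet<>(Arrays.asList(2, 3));
        List<Set<Integer>> nots = Arrays.asList(jakartaApache, lucene);

        // Purely negative query: empty result.
        System.out.println(evaluate(Collections.<Set<Integer>>emptyList(), nots));  // []
        // Same NOT clauses plus a clause that matches every document.
        System.out.println(evaluate(Collections.singletonList(allDocs), nots));     // [4]
    }
}
```

The second call mirrors the usual workaround: give the query one positive clause that every document matches (for example, a field with the same value in all documents), and let the NOT clauses subtract from that.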
Re: Lucene app to index Java code
Otis Gospodnetic wrote:
> Hello,
>
> Has anyone written an application that uses Lucene to index Java code,
> either from the source .java files, or compiled .class files?
>
> I need to create a searchable index for Java code, so that I can use
> that index to check if classes or methods with certain functionality
> have already been written. This is an effort to remove code duplication
> and do more code re-use. If this application can also index Javadocs,
> even better!
>
> I think I heard of somebody doing this already. Kevin Burton?

I was playing with it... blogged about it here...

http://www.peerfear.org/rss/permalink/2003/07/23/LuceneForSourceManagement/

> This is something that would fit nicely in Erik's Ant IndexTask (in the
> Lucene Sandbox), I think.

Yes... I was thinking about making an ant task for it or using someone else's.

One of the cool things would be direct integration within the IDE. Also, parsing the .java file into a token stream and then indexing the tokens would make a blazingly fast doc completion facility.

Kevin

--
Help Support NewsMonster Development! Purchase NewsMonster PRO!
http://www.newsmonster.org/download-pro.html
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
AIM - sfburtonator, Web - http://www.peerfear.org/
GPG fingerprint: 4D20 40A0 C734 307E C7B4 DCAA 0303 3AC5 BD9D 7C4D
IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
Re: Lucene app to index Java code
Otis Gospodnetic wrote:
> What you describe sounds interesting, but I was thinking more along the
> lines of this:
>
> http://www.peerfear.org/rss/permalink/2003/07/23/LuceneForSourceManagement/
>
> An application that I could use to find out whether I already have a
> 'getStudents' or 'getStudents*' method somewhere in the source code,
> for instance, before I start writing it. As the code base grows larger,
> and as the team that works with it becomes bigger, this tool becomes
> more and more valuable.
>
> If this application could also index Javadocs, so that I can search for
> methods or classes that mention +student* +(database OR db) +update,
> that would be even better.
>
> Has anyone done this? Kevin Burton mentioned something similar to what
> I described above, at that URL, but it looks like he didn't make his
> application available.

It's just two source files plus Lucene, and I didn't do all the work to make it into an OSS package. 99% of OSS work isn't technical but political, maintenance, etc. If someone wants to start an OSS project for this and do all the grunt work, I will do the coding :)

I don't know what parser I want to use to tokenize the source, but a Doclet would be perfect for this. The only problem is that this wouldn't allow full differential builds and would slow down the generation.

Also, it just dawned on me that the Emacs compile-internal function parses stdout in the form of file:line#, so this would make a great way to integrate for us Emacs geeks.

Kevin
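To make the idea in this exchange concrete, here is a small sketch of the simplest version: tokenize source text into identifiers, remember where each one occurs, answer prefix queries such as getStudents*, and print the hits as file:line so an editor can jump to them. It is plain Java with an in-memory sorted map standing in for a real index. The class name, the regex tokenizer and the sample file are all assumptions made for illustration, and this is not the application described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** In-memory identifier index with prefix search (illustrative stand-in for a real index). */
class CodeIndexSketch {
    // Crude tokenizer: anything that looks like a Java identifier, keywords included.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*");

    // identifier -> "file:line" locations; sorted keys make a prefix query a range scan
    private final TreeMap<String, List<String>> postings = new TreeMap<>();

    /** Records every identifier in the source text together with its file and line number. */
    void addFile(String fileName, String source) {
        String[] lines = source.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            Matcher m = IDENTIFIER.matcher(lines[i]);
            while (m.find()) {
                postings.computeIfAbsent(m.group(), k -> new ArrayList<>())
                        .add(fileName + ":" + (i + 1));
            }
        }
    }

    /** Returns "file:line: identifier" for every identifier starting with the prefix. */
    List<String> prefixSearch(String prefix) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : postings.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;                       // left the range of keys sharing the prefix
            }
            for (String location : e.getValue()) {
                out.add(location + ": " + e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        CodeIndexSketch index = new CodeIndexSketch();
        index.addFile("Roster.java",
                "class Roster {\n"
                + "  List getStudents() { return null; }\n"
                + "  List getStudentsByYear(int y) { return null; }\n"
                + "}");
        for (String hit : index.prefixSearch("getStudents")) {
            System.out.println(hit);
        }
    }
}
```

A real tool would need a proper Java parser, so that comments, strings and keywords are handled sensibly and Javadoc text can be indexed separately, which is what the doclet and XDoclet suggestions in this thread are about.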
Re: Lucene app to index Java code
Erik Hatcher wrote:
> - XDoclet could be used to sweep through Java code and build a text/XML
> file as richly as you'd like from the information there (complete with
> JavaDoc tags, which Zapata will miss :)), and then run Lucene on the
> generated files. On a related note, the XDoclet2 architecture would
> streamline this even further by eliminating the middle textual
> representation (QDox/XJavadoc reads Java as a "meta data provider" and
> then a Lucene "plugin" indexes things). It could be done without the
> intermediate text representation even in XDoclet 1.2, but it would
> require coding a custom subtask and be slightly out of the norm for
> XDoclet subtasks (but would work just fine).

It would be faster to write a native doclet, as this would remove the XML parse overhead... The whole point of this thing is that it needs to be fast!

Kevin
Re: Lucene app to index Java code
On Thursday, September 4, 2003, at 01:30 PM, Kevin A. Burton wrote:
> > - XDoclet could be used to sweep through Java code and build a
> > text/XML file as richly as you'd like from the information there
> > (complete with JavaDoc tags, which Zapata will miss :)), and then run
> > Lucene on the generated files. On a related note, the XDoclet2
> > architecture would streamline this even further by eliminating the
> > middle textual representation (QDox/XJavadoc reads Java as a "meta
> > data provider" and then a Lucene "plugin" indexes things). It could be
> > done without the intermediate text representation even in XDoclet 1.2,
> > but it would require coding a custom subtask and be slightly out of
> > the norm for XDoclet subtasks (but would work just fine).
>
> It would be faster to write a native doclet, as this would remove the
> XML parse overhead... The whole point of this thing is that it needs
> to be fast!

Do you mean the Ant build file parsing? That would be the only XML parsing in the equation I'm proposing, unless you did it the clunkiest XDoclet 1.2 way of having an intermediate XML file.

As for speed, QDox, I've heard, is the fastest option. javadoc is the slowest parser of the three I know of (javadoc, xjavadoc, qdox).

Erik
Performance of IndexWriter.addDirectory?
What's the performance of IndexWriter.addDirectory? I assume it isn't linear but is a function of the added index. Does the size of the target index matter? What about the number of documents?

Kevin