Re: Lucene features

2003-09-04 Thread Steven J. Owens
On Wed, Sep 03, 2003 at 02:42:48PM -0400, Chris Sibert wrote:
> Lucene Users List <[EMAIL PROTECTED]>
> > > I am wondering if Lucene is the way to go for my project.
> >  Probably.  Tell us a little about your project.
> 
> It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB
> in size. They don't ever change, and are on a CD-ROM. Each file contains a
> bunch of small documents. I just create one index for all 4 of them. These
> documents are for an association that I belong to - they contain a history
> of the association's documents - and my application allows you to search
> them.

 Well, aside from your concerns about the second list, Lucene
seems perfect for your needs.  You'd parse apart the four big files
into a bunch of small documents, then parse those small documents and
create Lucene Documents, containing Fields, and add them to the index.
 
> They are actually currently indexed by an application called
> 'Sonar', by Virginia Systems. But I REALLY didn't like using their
> user interface - blech - so I decided to write a new interface for
> my own use. But Sonar costs some real bucks to be able to develop
> against their search API, so I found Lucene, and decided to go with
> it.
> 
> Here are the search features that 'Sonar' has :
>   Boolean Searching
>   Proximity Searching
>   Wild Card Searching
>   Field/Block Searching

 I'm not sure what Field/Block means.  Boolean, Proximity and
Wildcard searches are pretty typical in Lucene.  You should probably
take a look at the Query Parser syntax docs:

 http://jakarta.apache.org/lucene/docs/queryparsersyntax.html


>   Relevancy Ranking / Date Ranking

 Lucene search results are typically ranked by relevance, and you
can tweak the search to adjust this (there's a fair bit of discussion
of this in the lucene-user archives; good keywords to look for are
"slop" and "boost").

 Sorting output by date might take some finesse.  I haven't played
with sorting by date, but I'd expect to handle that by directly
instantiating a QueryTerm to indicate the date issues.

>   List of Occurrences in Context

 I assume here that you mean displaying the results with a little
snapshot of the text around it.  There have been discussions about how
best to do this (often focused around highlighting the search terms in
the displayed text) on the lucene-users list.  Check the list archive.
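
 As a rough illustration of the "occurrences in context" idea, here is a
minimal sketch in plain Java with no Lucene dependency.  The class name
Snippets, the method context, and the radius parameter are assumptions made
for this example; it simply cuts a window of text around the first
case-insensitive match of a term in the stored document text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: shows a search term with some surrounding text. */
final class Snippets {
    /**
     * Returns up to `radius` characters on each side of the first
     * case-insensitive occurrence of `term`, or "" if the term is absent.
     */
    static String context(String text, String term, int radius) {
        Matcher m = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE)
                           .matcher(text);
        if (!m.find()) {
            return "";
        }
        int start = Math.max(0, m.start() - radius);
        int end = Math.min(text.length(), m.end() + radius);
        // Mark truncation on either side with an ellipsis.
        return (start > 0 ? "..." : "")
                + text.substring(start, end)
                + (end < text.length() ? "..." : "");
    }
}
```

 This assumes the original text is available at display time (for example
kept in a stored field or re-read from the source file); it does not try to
highlight every hit or respect word boundaries.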
 
>   Phonetic Searching

 I'd guess you need to build this one yourself, perhaps by using a
soundex algorithm when indexing the original data files.
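
 A minimal sketch of that idea, in plain Java with no Lucene dependency:
compute a classic American Soundex code for each word at indexing time and
put the codes in an extra field, then encode query terms the same way.  The
class name Soundex and method encode are assumptions for this example, not
part of Lucene.

```java
import java.util.Locale;

/** Illustrative American Soundex encoder: one letter plus three digits. */
final class Soundex {
    // Code for each letter A..Z: '0' marks vowels and Y, '-' marks H and W.
    private static final String CODES = "0123012-02245501262301-202";

    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        char last = '0'; // code of the previous significant letter
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                last = code;
            } else if (code != '-') {
                // H and W are skipped without resetting `last`, so equal
                // codes separated only by H or W collapse into one digit.
                if (code != '0' && code != last) {
                    out.append(code);
                }
                last = code; // a vowel resets `last`, allowing a repeat
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

 With this sketch, "Robert" and "Rupert" both encode to R163, so a search on
the encoded field would match either spelling.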

>   Synonyms/Concepts

 Likewise... you'd need to come up with some sort of ontology of
synonyms and concepts, then parse the fields you're indexing and
generate a synonym/concept field that you'd add to the lucene
Document.
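
 As a sketch of what generating such a field might look like, the plain Java
below looks each word up in a hand-built thesaurus and returns the extra terms
as one space-separated string, which could then be indexed as an additional
field.  The class SynonymExpander, the method expand, and the thesaurus map
are all assumptions for illustration; building the thesaurus itself is the
hard part and is not shown.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that derives a synonym/concept field from text. */
final class SynonymExpander {
    /**
     * Returns the synonyms of every word in `text` that appears in the
     * thesaurus, de-duplicated, in first-seen order, joined by spaces.
     */
    static String expand(String text, Map<String, List<String>> thesaurus) {
        Set<String> extra = new LinkedHashSet<>();
        // Split on anything that is not a letter or a digit.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            extra.addAll(thesaurus.getOrDefault(word, Collections.emptyList()));
        }
        return String.join(" ", extra);
    }
}
```

 Expanding at indexing time keeps queries simple, at the cost of re-indexing
whenever the thesaurus changes.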

>   Relational Searching
>   Associated Words
>   Drill Down Search Narrowing

 I'm not sure what these three mean.

> I think that Lucene has all the features in the first group. How does it
> stack up against the second group ?

 I'm afraid I haven't been too helpful here.  Perhaps if you
clarify what the above mean, folks can post about how to implement it
in Lucene.

> I'm writing the whole thing in Swing, which has been time consuming,
> and so have invested quite a bit of time into this project. But I'm
> seeing the end of the tunnel, and want to make sure that I'm going
> down the right path before I spend too much more time on it.

 It sounds like you ought to at least seriously consider using
Lucene, if you can find or implement equivalent features, or decide
you can live without them.

-- 
Steven J. Owens
[EMAIL PROTECTED]

"I'm going to make broad, sweeping generalizations and strong,
 declarative statements, because otherwise I'll be here all night and
 this document will be four times longer and much less fun to read.
 Take it all with a grain of salt." - Me at http://darksleep.com


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene app to index Java code

2003-09-04 Thread Otis Gospodnetic
Hello,

Has anyone written an application that uses Lucene to index Java code,
either from the source .java files, or compiled .class files?

I need to create a searchable index for Java code, so that I can use
that index to check if classes or methods with certain functionality
have already been written.  This is an effort to remove code
duplication and do more code re-use.  If this application can also
index Javadocs, even better!

I think I heard of somebody doing this already.  Kevin Burton?
This is something that would fit nicely in Erik's Ant IndexTask (in the
Lucene Sandbox), I think.

Thank you,
Otis


__
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software
http://sitebuilder.yahoo.com




Re: Lucene app to index Java code

2003-09-04 Thread petite_abeille
Hi Otis,

On Thursday, Sep 4, 2003, Otis Gospodnetic wrote:

> Has anyone written an application that uses Lucene to index Java code,
> either from the source .java files, or compiled .class files?

If you are talking about my ultra secret project "Zapata: Coding 
Mexican Style", then yes ;)

But... it uses runtime information to reach its devious ends and is 
more like a documentation tool than anything else...

Anyway, this is how it goes:

Given a set of binary jar files it builds an object graph of the 
bytecode: packages, classes, methods and so on. Complete with 
interdependencies and other handy information. The bytecode is also 
run through a decompiler and pretty printed to normalize the source. 
Code segments are attached and indexed alongside their owners (class or 
method). All this fully indexed, searchable and cross referenced.

This is built upon the same engine used by ZOE, so the end result is 
very much along the lines of what ZOE does for email, but for code 
instead... fun, fun, fun ;)

Cheers,

PA.



Re: Lucene app to index Java code

2003-09-04 Thread Otis Gospodnetic
What you describe sounds interesting, but I was thinking more along the
lines of this:

http://www.peerfear.org/rss/permalink/2003/07/23/LuceneForSourceManagement/

An application that I could use to find out whether I already have a
'getStudents' or 'getStudents*' method somewhere in the source code,
for instance, before I start writing it.  As the code base grows
larger, and as the team that works with it becomes bigger, this tool
becomes more and more valuable.
If this application could also index Javadocs, so that I can search for
methods or classes that mention +student* +(database OR db) +update,
that would be even better.

Has anyone done this?
Kevin Burton mentioned something similar to what I described above, at
that URL, but it looks like he didn't make his application available.

Thanks,
Otis






Re: Lucene app to index Java code

2003-09-04 Thread Erik Hatcher
A couple of thoughts on this:

- Eclipse uses Lucene for its code indexing/searching (I learned this 
at the OSCON Keynote by Eclipse folks).  Perhaps looking at how Eclipse 
does its thing would be useful even if not the solution.

- XDoclet could be used to sweep through Java code and build a text/XML 
file as richly as you'd like from the information there (complete with 
JavaDoc tags, which Zapata will miss :)), and then run Lucene on the 
generated files.  On a related note, the XDoclet2 architecture would 
streamline this even further by eliminating the middle textual 
representation (QDox/XJavadoc reads Java as a "meta data provider" and 
then a Lucene "plugin" indexes things).  It could be done without the 
intermediate text representation even in XDoclet 1.2, but it would 
require coding a custom subtask and be slightly out of the norm for 
XDoclet subtasks (but would work just fine).

- My IndexTask could be used, but it would be better to use 
something that built a complete object-graph of all the source code you 
want indexed, so that it can deal with base classes, inherited javadoc 
tags, and other such interactions between classes you might want to 
capture.

	Erik





Re: Lucene app to index Java code

2003-09-04 Thread petite_abeille
Hi Erik,

On Thursday, Sep 4, 2003, at 15:03 Europe/Zurich, Erik Hatcher wrote:

> - XDoclet could be used to sweep through Java code and build a 
> text/XML file as richly as you'd like from the information there 
> (complete with JavaDoc tags, which Zapata will miss :)),

Correct. This happens to be on purpose :) Does XDoclet build an 
"intertwingled" object graph of your code along the way? Performing a 
plain search on a code base is pretty trivial... what seems to be more 
interesting would be to put that in context.

Zapata does something along the lines of what MagicHat does for 
Objective-C:

http://homepage.mac.com/petite_abeille/MagicHat/

But from the sound of what Otis is saying this is not what you guys are 
looking for... back to the pampa then...

Cheers,

PA.



Re: Lucene app to index Java code

2003-09-04 Thread Erik Hatcher
On Thursday, September 4, 2003, at 09:19  AM, petite_abeille wrote:
> > - XDoclet could be used to sweep through Java code and build a 
> > text/XML file as richly as you'd like from the information there 
> > (complete with JavaDoc tags, which Zapata will miss :)),
>
> Correct. This happens to be on purpose :) Does XDoclet build an 
> "intertwingled" object graph of your code along the way? Performing a 
> plain search on a code base is pretty trivial... what seems to be more 
> interesting would be to put that in context.

Yes, XDoclet builds a complete object graph of all the source files you 
hand it (as an Ant ).  It actually even does binary class 
interpretation for the information it needs to construct a full 
object-graph if some dependencies are in the classpath of the taskdef 
as well.

> Zapata does something along the lines of what MagicHat does for 
> Objective-C:
>
> http://homepage.mac.com/petite_abeille/MagicHat/

Very cool.  You rock!

	Erik



StandardTokenizer problem

2003-09-04 Thread Nicolas Maisonneuve
Hi,

When I use StandardTokenizer to parse, for example, "I.B.M",
the type of the Token is HOST and not ACRONYM.

Why?

In StandardTokenizer.jj:

  // acronyms: U.S.A., I.B.M., etc.
  // use a post-filter to remove dots
| <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >

  // hostname
| <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >

"I.B.M" can be a host or an acronym, so there is a problem, no?

> 






Re: StandardTokenizer problem

2003-09-04 Thread petite_abeille
On Thursday, Sep 4, 2003, at 16:07 Europe/Zurich, Nicolas Maisonneuve 
wrote:

> "I.B.M" can be a host or an acronym, so there is a problem, no?

Perhaps as far as this parser goes... but... in practice... '.M' is not 
a valid TLD.
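
One way to see why the token comes out as HOST is to approximate the two
grammar rules with regular expressions. The sketch below is only an
approximation of the JavaCC rules, assuming that the acronym rule requires a
dot after every letter group (including the last one) and that the hostname
rule is dot-separated runs of letters and digits. The class TokenTypeSketch
and the method classify are made up for this example and are not part of
Lucene.

```java
import java.util.regex.Pattern;

/** Regex approximation of the ACRONYM and HOST token rules (an assumption). */
final class TokenTypeSketch {
    // Letters followed by a dot, at least twice: "U.S.A.", "I.B.M."
    private static final Pattern ACRONYM =
            Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");
    // Alphanumeric runs separated by dots, no trailing dot: "www.apache.org"
    private static final Pattern HOST =
            Pattern.compile("[\\p{L}\\p{Nd}]+(\\.[\\p{L}\\p{Nd}]+)+");

    /** Returns "ACRONYM", "HOST" or "OTHER" for a whole token. */
    static String classify(String token) {
        if (ACRONYM.matcher(token).matches()) {
            return "ACRONYM";
        }
        if (HOST.matcher(token).matches()) {
            return "HOST";
        }
        return "OTHER";
    }
}
```

Under these assumptions "I.B.M" (no trailing dot) can only match the
hostname pattern, while "I.B.M." matches the acronym pattern, so the missing
final dot is what decides the token type.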

PA.



Split results based on the value of a field

2003-09-04 Thread Jon Pither
Hi,

I have a requirement whereby I'd like to pull search results back and 
split them up based on some keyword field. So for example, say there's 
a field named 'category',  I'd like to be able to have the results 
displayed as such:

Search Results for Category A:
1,
2,
3,
Search Results for Category B:
1,
2,
3.

The two ways I can think to do this are:
1) A post-process of the results collecting the first x amount of hits 
for each category.
2) Running a different search per category.

This is probably a long shot, but I was wondering if the search itself 
has the means to filter out documents based on a limit of occurrences of 
a value for a given search field.  So for example if there are 5 
categories, and we only want to show 5 results per category, then the 
maximum amount of hits returned would be 25. This is because the value 
'Category A' for the field 'category' can only appear 5 times and so 
forth.  Can anyone think of a way to achieve this?
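
For reference, option 1 (post-processing) can be sketched in plain Java as
below. This is not a Lucene API: the class CategoryGrouper, the method
topPerCategory, and the representation of a hit as a {category, id} pair are
assumptions for the example. It walks the ranked hits once and keeps the
first few per category.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical post-processor that caps the number of hits per category. */
final class CategoryGrouper {
    /**
     * `hits` is the ranked result list, best first; each element is a
     * two-element array {category, documentId}. Returns at most `limit`
     * ids per category, keeping categories in first-seen order.
     */
    static Map<String, List<String>> topPerCategory(List<String[]> hits, int limit) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] hit : hits) {
            List<String> bucket =
                    grouped.computeIfAbsent(hit[0], k -> new ArrayList<>());
            if (bucket.size() < limit) {
                bucket.add(hit[1]); // keep only the first `limit` hits
            }
        }
        return grouped;
    }
}
```

With 5 categories and a limit of 5 this returns at most 25 ids, but it still
has to scan the full hit list, which is the cost option 2 avoids.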

Thanks.



Multi-term NOT queries

2003-09-04 Thread vesi posti
Hi,

According to the QueryParser page:

"The NOT operator cannot be used with just one term".

Is this also true for multi-term NOT queries? E.g.

NOT "jakarta apache" AND NOT "lucene"

My tests suggest so, but I'd like to hear from someone
who'd know for sure.

Also, is this a limitation of the QueryParser or of
the Query API itself?

Thanks!

Eugene.




Re: Lucene app to index Java code

2003-09-04 Thread Kevin A. Burton
Otis Gospodnetic wrote:

> Hello,
>
> Has anyone written an application that uses Lucene to index Java code,
> either from the source .java files, or compiled .class files?
> I need to create a searchable index for Java code, so that I can use
> that index to check if classes or methods with certain functionality
> have already been written.  This is an effort to remove code
> duplication and do more code re-use.  If this application can also
> index Javadocs, even better!
> I think I heard of somebody doing this already.  Kevin Burton?

I was playing with it... blogged about it here...

http://www.peerfear.org/rss/permalink/2003/07/23/LuceneForSourceManagement/

> This is something that would fit nicely in Erik's Ant IndexTask (in the
> Lucene Sandbox), I think.

Yes... I was thinking about making an ant task for it or using someone 
else's.  One of the cool things would be direct integration within the IDE.

Also parsing the .java file into a token stream and then indexing the 
tokens would make a blazingly fast doc completion facility.

Kevin

--
Help Support NewsMonster Development!  Purchase NewsMonster PRO!
   http://www.newsmonster.org/download-pro.html

Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM - sfburtonator,  Web - http://www.peerfear.org/
GPG fingerprint: 4D20 40A0 C734 307E C7B4  DCAA 0303 3AC5 BD9D 7C4D
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster




Re: Lucene app to index Java code

2003-09-04 Thread Kevin A. Burton
Otis Gospodnetic wrote:

> What you describe sounds interesting, but I was thinking more along the
> lines of this:
> http://www.peerfear.org/rss/permalink/2003/07/23/LuceneForSourceManagement/
>
> An application that I could use to find out whether I already have a
> 'getStudents' or 'getStudents*' method somewhere in the source code,
> for instance, before I start writing it.  As the code base grows
> larger, and as the team that works with it becomes bigger, this tool
> becomes more and more valuable.
> If this application could also index Javadocs, so that I can search for
> methods or classes that mention +student* +(database OR db) +update,
> that would be even better.
> Has anyone done this?
> Kevin Burton mentioned something similar to what I described above, at
> that URL, but it looks like he didn't make his application available.

It's just two source files plus Lucene, and I didn't do all the work to 
make it into an OSS package.  99% of OSS work isn't technical but 
political, maintenance, etc.

If someone wants to start an OSS project for this and do all the grunt 
work I will do the coding :)  I don't know what parser I want to use to 
tokenize the source, but a Doclet would be perfect for this.  The only 
problem is that this wouldn't allow full differential builds and would 
slow down the generation.

Also it just dawned on me that the Emacs compile-internal function 
parses stdout in the form of file:line# so this would make a great way 
to integrate for us Emacs geeks.
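
To make the idea concrete, here is a rough sketch in plain Java (no Lucene,
no Doclet) of indexing identifiers by location and answering a prefix search
such as getStudents*, with results printed as file:line. Everything here is
an assumption for illustration: the class IdentifierIndex, its methods, and
the crude regex tokenizer, which also picks up keywords and words inside
comments and strings. A real tool would use a proper Java parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical in-memory index from identifier to "file:line" locations. */
final class IdentifierIndex {
    // Crude approximation of a Java identifier.
    private static final Pattern IDENT =
            Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*");

    // Sorted map, so all identifiers sharing a prefix are adjacent.
    private final TreeMap<String, List<String>> postings = new TreeMap<>();

    /** Records every identifier in `source` under fileName:lineNumber. */
    void add(String fileName, String source) {
        String[] lines = source.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            Matcher m = IDENT.matcher(lines[i]);
            while (m.find()) {
                postings.computeIfAbsent(m.group(), k -> new ArrayList<>())
                        .add(fileName + ":" + (i + 1));
            }
        }
    }

    /** Returns "file:line: identifier" for identifiers starting with prefix. */
    List<String> findPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e
                : postings.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // past the block of keys that share the prefix
            }
            for (String location : e.getValue()) {
                out.add(location + ": " + e.getKey());
            }
        }
        return out;
    }
}
```

Printing each returned string on its own line gives output in the file:line
shape mentioned above, which is what a compile-style buffer would be expected
to pick up.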

Kevin



Re: Lucene app to index Java code

2003-09-04 Thread Kevin A. Burton
Erik Hatcher wrote:

> A couple of thoughts on this:
>
> - Eclipse uses Lucene for its code indexing/searching (I learned this 
> at the OSCON Keynote by Eclipse folks).  Perhaps looking at how 
> Eclipse does its thing would be useful even if not the solution.
>
> - XDoclet could be used to sweep through Java code and build a 
> text/XML file as richly as you'd like from the information there 
> (complete with JavaDoc tags, which Zapata will miss :)), and then run 
> Lucene on the generated files.  On a related note, the XDoclet2 
> architecture would streamline this even further by eliminating the 
> middle textual representation (QDox/XJavadoc reads Java as a "meta 
> data provider" and then a Lucene "plugin" indexes things).  It could 
> be done without the intermediate text representation even in XDoclet 
> 1.2, but it would require coding a custom subtask and be slightly out 
> of the norm for XDoclet subtasks (but would work just fine).

It would be faster to write a native doclet as this would remove the XML 
parse overhead...  The whole point of this thing is that it needs to be 
fast!

Kevin



Re: Lucene app to index Java code

2003-09-04 Thread Erik Hatcher
On Thursday, September 4, 2003, at 01:30  PM, Kevin A. Burton wrote:
> > - XDoclet could be used to sweep through Java code and build a 
> > text/XML file as richly as you'd like from the information there 
> > (complete with JavaDoc tags, which Zapata will miss :)), and then run 
> > Lucene on the generated files.  On a related note, the XDoclet2 
> > architecture would streamline this even further by eliminating the 
> > middle textual representation (QDox/XJavadoc reads Java as a "meta 
> > data provider" and then a Lucene "plugin" indexes things).  It could 
> > be done without the intermediate text representation even in XDoclet 
> > 1.2, but it would require coding a custom subtask and be slightly out 
> > of the norm for XDoclet subtasks (but would work just fine).
>
> It would be faster to write a native doclet as this would remove the 
> XML parse overhead...  The whole point of this thing is that it needs 
> to be fast!

Do you mean the Ant build file parsing?  That would be the only XML 
parsing in the equation I'm proposing, unless you did it the clunkiest 
XDoclet 1.2 way of having an intermediate XML file.

As for speed, QDox, I've heard, is the fastest option.  javadoc is 
the slowest parser of the three I know of (javadoc, xjavadoc, qdox).

	Erik



Performance of IndexWriter.addDirectory?

2003-09-04 Thread Kevin A. Burton
What's the performance of IndexWriter.addDirectory?

I assume it isn't linear but is a function of the added index.  Does the 
size of the target index matter?  What about number of documents?

Kevin
