Re: Feature List
Sorry, I don't think there is such a document. It would be nice to have it, though. You could, however, gather a lot of information from pages like the query syntax, various articles, etc.

Otis

--- Chris Sibert [EMAIL PROTECTED] wrote:
> Is there any document available that lists Lucene's features ? Like does it do similarity searches, etc. I've been using Lucene, but I don't know what all it can do.

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
RE: Docco 0.2 / contribution offer
Hi Peter.

Docco is a great tool which I have been using since you posted your first announcement (version 1.0, that is). Besides the things you mention in your mail, I also generally think it's a great idea to use formal concept analysis with Lucene. I would be interested to explore the idea also for more structured data (maybe include fields and even hierarchies). Apart from this, if I had an idea of the time commitments connected, I would definitely consider joining.

Best,
Gregor

-----Original Message-----
From: Peter Becker [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 02, 2003 1:52 PM
To: Lucene Users List
Subject: ANN: Docco 0.2 / contribution offer

Hi all,

we finally finished the 0.2 release of our little personal document management tool based on Lucene:

  http://tockit.sourceforge.net/docco/index.html

This might be interesting for some readers of this list since its source contains some infrastructure for document handlers and index management. The document handlers are written with a very simple API, which just asks the implementation to fill a structure with the information retrieved from a URL. It is similar to the Ant task in the Lucene sandbox, but it separates the information collection and the actual indexing, i.e. all the decisions what should be stored and what shouldn't.

The program comes with implementations for plain text, HTML (based on Swing), XML (based on JAXP) and Open Office (using ZipStreams/SAX). We wrote plugins for POI, PDFbox and Multivalent. The latter is unfortunately a wild hack since Multivalent is the worst Java code I've seen. Literally. Bad C written in Java. The tool would be nice to use, but catching exceptions in little helper classes to do a System.exit is just insane. And that is just one of the problems -- we had to do some bad hacks to fix these issues. The other implementations should be fine, although they need some more testing.

The source (including all required libs) of the program is available via Sourceforge's CVS:

  http://sourceforge.net/cvs/?group_id=37081

The module in question is called docco. A current snapshot of only the source is here:

  http://tockit.sourceforge.net/docco/source20030902.zip (~100kb)

The relevant packages are:

  org.tockit.docco.documenthandler: the documenthandler interface and implementations
  org.tockit.docco.filefilter: some code to pick document handlers via file extensions or regexps
  org.tockit.docco.index: the model/static bits of the index management
  org.tockit.docco.indexer: the dynamic aspects of the index management: runnable, framework for handlers

The index management is probably not optimal, I strongly suspect that an expert could tweak it. But the structure should be ok.

We would be happy to contribute this code to the Lucene sandbox if there is interest. Or to turn it into a project of its own, we don't think it should be hidden in our more specific program. It should be easy to merge it with the Ant task and we are happy to give a hand if wanted. Adding some documentation would be easy, too -- at the moment the code is still more for ourselves, but it should be very readable by itself.

We require JDK 1.4, but this can be reduced by moving some more document handlers into plugins.

Anyone interested in joining in maintaining this code? Any feedback is welcome.

Cheers,
Peter
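The split Peter describes (handlers only collect information, a separate step picks the handler by file name, and indexing decisions happen elsewhere) could be sketched roughly as follows. Every name here (DocumentSummary, DocumentHandler, HandlerRegistry) is hypothetical and written in modern Java for brevity; Docco's actual classes live in org.tockit.docco.documenthandler and org.tockit.docco.filefilter and may look quite different.

```java
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical names throughout; this is not Docco's real API.

/** Plain data holder that a handler fills in; it knows nothing about indexing. */
class DocumentSummary {
    String title;
    String body;
}

/** A handler only collects information from a URL. */
interface DocumentHandler {
    DocumentSummary parse(URL url) throws Exception;
}

/** Picks a handler by matching a regular expression against the file name. */
class HandlerRegistry {
    // Insertion order is kept, so the first registered match wins.
    private final Map<Pattern, DocumentHandler> handlers = new LinkedHashMap<>();

    void register(String fileNameRegex, DocumentHandler handler) {
        handlers.put(Pattern.compile(fileNameRegex, Pattern.CASE_INSENSITIVE), handler);
    }

    Optional<DocumentHandler> find(String fileName) {
        for (Map.Entry<Pattern, DocumentHandler> entry : handlers.entrySet()) {
            if (entry.getKey().matcher(fileName).matches()) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }
}
```

The indexer would then call find() for each file, run the handler, and decide on its own which parts of the returned DocumentSummary to store or index.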
Re: TermVector again (Re: Luke v 0.2 - Lucene Index Browser)
--- Andrzej Bialecki [EMAIL PROTECTED] wrote:
> Andrzej Bialecki wrote:
>> Julien Nioche wrote:
>>> [- and almost impossible : recompose the unstored fields of a document]
>>
>> It's not impossible, just time-consuming - all information (except the parts removed by analyzer) is already there. This functionality has a high cool-ness factor, which makes it very tempting... :-)
>
> I had a look at the current Lucene API, and I realized that this is a very costly operation now. Now, if we had a TermVector support that was mentioned several times on this list, things would be very different... Does anyone know what is the status / plans regarding this?

As far as I know it is not on the to-do list of any of the more active Lucene developers. In other words, it is waiting for some external contributor with more time and with required knowledge. Code that worked with one of the 1.2 versions was posted to the list a looong time ago by Dmitry.

Otis
Re: Keyword search with space and wildcard
Great. Is there an example anywhere on how I might be able to build such a Query? QueryParser isn't really all that simple since it's built with JavaCC. What might be ideal for me is if I can continue to use the highlevel interface to build the main query (ie use it to parse my query string and return me some kind of Query - BooleanQuery, TermQuery, etc) and then build a WildcardQuery by hand and combine the two together? For example, is it as simple as calling Query.combine() to combine the two? Is there a better way? Is there a documented example like this?

Thanks!
-Brian

> This can be done, AFAIK. This is one thing that many people seem unaware of: you don't HAVE to use QueryParser to build queries. In your case it seems like you should be able to construct query you want if you either by-pass QueryParser, or create a dummy analyzer (one that does no tokenization but returns all input as one token).
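To see why the quoted advice insists on keeping the keyword as a single token, here is a toy model in plain Java (no Lucene involved). The class and method names are made up for illustration, and the wildcard handling ('*' and '?' translated to a regular expression) is an assumption of this sketch, not Lucene's WildcardQuery implementation.

```java
import java.util.regex.Pattern;

class KeywordWildcard {
    /** Converts a wildcard pattern ('*' = any run of characters, '?' = one character) to a regex. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Quote every literal character, including the space.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** True if any single indexed term matches the whole wildcard pattern. */
    static boolean anyTermMatches(String wildcard, String[] terms) {
        Pattern pattern = toRegex(wildcard);
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With the value "New York" split on whitespace into the terms "New" and "York", the pattern "New Y*" matches neither term. With the value kept whole as the single term "New York", the same pattern matches.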
RE: Keyword search with space and wildcard
Not sure about documented examples. I often find the unit tests (in src/test of Lucene's CVS) to be very useful as examples, but I didn't see any for what you are looking for.

Basically, query parser builds up a vector of BooleanClause objects, then loops over those on a BooleanQuery object calling add(BooleanClause). I agree JavaCC isn't really simple to follow, but there is a lot of plain Java in there that does the parts you are interested in, and if you build the .java file and ignore the token parsing stuff, you can look at it in your favorite Java IDE.

What you can do is cast the query you get from QueryParser to a BooleanQuery (that is the only type of Query that QueryParser will return), then create your WildcardQuery or any other queries you need that you didn't get in the query string, and add them as clauses to the BooleanQuery using add(Query query, boolean required, boolean prohibited).

I don't know how query combine works (never used it), but the javadoc comment leads me to believe it is not what you are looking for, and a bit of poking around in the sources gives me the same impression.

Eric

-----Original Message-----
From: Brian Campbell [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 02, 2003 11:05 AM
To: [EMAIL PROTECTED]
Subject: Re: Keyword search with space and wildcard

Great. Is there an example anywhere on how I might be able to build such a Query? QueryParser isn't really all that simple since it's built with JavaCC. What might be ideal for me is if I can continue to use the highlevel interface to build the main query (ie use it to parse my query string and return me some kind of Query - BooleanQuery, TermQuery, etc) and then build a WildcardQuery by hand and combine the two together? For example, is it as simple as calling Query.combine() to combine the two? Is there a better way? Is there a documented example like this?

Thanks!
-Brian

> This can be done, AFAIK. This is one thing that many people seem unaware of: you don't HAVE to use QueryParser to build queries. In your case it seems like you should be able to construct query you want if you either by-pass QueryParser, or create a dummy analyzer (one that does no tokenization but returns all input as one token).
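The effect of adding clauses with the required and prohibited flags can be modelled in a few lines of plain Java. This is only a sketch of the matching rule as I understand it (required clauses must all match, prohibited clauses must not match, and optional clauses only decide the outcome when nothing is required); ClauseCombiner is a made-up name, documents are plain strings, and none of this is Lucene's BooleanQuery code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class ClauseCombiner {
    private static final class Clause {
        final Predicate<String> query;
        final boolean required;
        final boolean prohibited;

        Clause(Predicate<String> query, boolean required, boolean prohibited) {
            this.query = query;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    private final List<Clause> clauses = new ArrayList<>();

    /** Mirrors the shape of add(Query query, boolean required, boolean prohibited). */
    void add(Predicate<String> query, boolean required, boolean prohibited) {
        if (required && prohibited) {
            throw new IllegalArgumentException("a clause cannot be both required and prohibited");
        }
        clauses.add(new Clause(query, required, prohibited));
    }

    boolean matches(String document) {
        boolean hasRequired = false;
        boolean anyOptionalMatched = false;
        for (Clause clause : clauses) {
            boolean hit = clause.query.test(document);
            if (clause.prohibited) {
                if (hit) return false;   // a prohibited clause vetoes the document
            } else if (clause.required) {
                hasRequired = true;
                if (!hit) return false;  // every required clause must match
            } else if (hit) {
                anyOptionalMatched = true;
            }
        }
        // With no required clauses, at least one optional clause has to match.
        return hasRequired || anyOptionalMatched;
    }
}
```

In Brian's case the parsed query and the hand-built wildcard query would both go in as required clauses, so a document has to satisfy both.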
One direction phrase searches
It seems when I do a search such as "covered wagon"~5 or the like, the system disregards the order of my terms. I.e., it will find covered within 5 of wagon and it will also find wagon within 5 of covered. Is there any way to make the system respond only to the order of the terms as entered in the query?

Joe Paulsen
Re: One direction phrase searches
On Tuesday, September 2, 2003, at 04:11 PM, Joe Paulsen wrote:
> It seems when I do a search such as "covered wagon"~5 or the like, the system disregards the order of my terms. I.e., it will find covered within 5 of wagon and it will also find wagon within 5 of covered.

I wanted to see this in action myself, so I coded up a small unit test:

    public void testOrderDoesntMatter() throws Exception {
      // Index a single document containing "one two"
      Directory directory = new RAMDirectory();
      IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);
      Document doc = new Document();
      doc.add(Field.Text("field", "one two"));
      writer.addDocument(doc);
      writer.optimize();
      writer.close();

      IndexSearcher searcher = new IndexSearcher(directory);

      // Search for the terms in reverse order, with slop
      PhraseQuery query = new PhraseQuery();
      query.setSlop(5);
      query.add(new Term("field", "two"));
      query.add(new Term("field", "one"));
      Hits hits = searcher.search(query);
      assertEquals(1, hits.length());

      searcher.close();
    }

Notice that I'm searching for "two one"~5 (yet indexed "one two") and it found 1 hit.

And then, like a typical programmer, I looked at the Javadocs *after* coding :) and found this on PhraseQuery:

    /** Sets the number of other words permitted between words in query phrase.
      If zero, then this is an exact phrase search.  For larger values this works
      like a <code>WITHIN</code> or <code>NEAR</code> operator.

      <p>The slop is in fact an edit-distance, where the units correspond to
      moves of terms in the query phrase out of position.  For example, to switch
      the order of two words requires two moves (the first move places the words
      atop one another), so to permit re-orderings of phrases, the slop must be
      at least two.

      <p>More exact matches are scored higher than sloppier matches, thus search
      results are sorted by exactness.

      <p>The slop is zero by default, requiring exact matches.*/
    public void setSlop(int s) { slop = s; }

So what you observe is the correct documented behavior.

> Is there any way to make the system respond only to the order of the terms as entered in the query?

I'm sure there is a way to make an OrderedPhraseQuery, although I'll need to do some more homework myself to craft such a thing. All the information to do such a thing is available, although maybe it wouldn't be as performant as PhraseQuery (just a guess, no facts to back that up yet).

Erik
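One way to make the edit-distance wording in that Javadoc concrete is the small model below. It is a sketch under my own assumptions, not a copy of Lucene's scorer: SlopModel and requiredSlop are made-up names, and the rule used (shift each term's document position back by its position in the query, then take the spread of the shifted values) is simply a formula that reproduces the "two moves to swap two words" example.

```java
class SlopModel {
    /**
     * Smallest slop at which the query terms match, given where each one
     * occurs in the document. docPositions[i] is the document position of
     * the i-th query term.
     */
    static int requiredSlop(int[] docPositions) {
        if (docPositions.length == 0) {
            return 0;
        }
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < docPositions.length; i++) {
            // Remove the term's offset within the query phrase.
            int shifted = docPositions[i] - i;
            min = Math.min(min, shifted);
            max = Math.max(max, shifted);
        }
        // An exact phrase has every shifted position equal, so the spread is 0.
        return max - min;
    }
}
```

For the test above, the query terms "two" and "one" sit at document positions 1 and 0, the shifted values are 1 and -1, and the required slop is 2. That is within the slop of 5, which is consistent with the single hit.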
Default queries to using And
Is it possible to get QueryParser.parse() to parse queries defaulting to 'AND' rather than 'OR'? Currently if you search for 'A B' that is the same as 'A OR B'. What I would like is to default to 'A AND B'. Apologies for the simple question. I'm guessing the answer is probably more complex (like Lucene returning A+B, then A, then B)? I couldn't find anything in the FAQ about this.

Cheers,
Scott
Re: Default queries to using And
This is what I'm looking for. Too easy.

Cheers,
Scott

On Tue, Sep 02, 2003 at 09:24:04PM -0400, Erik Hatcher wrote:
> Look at QueryParser.setOperator() (perhaps it was added after your version?)
>
> Erik
>
> On Tuesday, September 2, 2003, at 09:10 PM, Scott Farquhar wrote:
>> Is it possible to get QueryParser.parse() to parse queries defaulting to 'AND' rather than 'OR'? Currently if you search for 'A B' that is the same as 'A OR B'. What I would like is to default to 'A AND B'. Apologies for the simple question. I'm guessing the answer is probably more complex (like Lucene returning A+B, then A, then B)? I couldn't find anything in the FAQ about this.
>>
>> Cheers,
>> Scott
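For anyone unsure what switching the default operator changes, here is a toy illustration in plain Java. It does not call Lucene at all: DefaultOperatorDemo is a made-up class, terms are split on whitespace and lowercased, and there is no scoring, only a yes/no match.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class DefaultOperatorDemo {
    enum Operator { OR, AND }

    /** Treats a query like "A B" as "A OR B" or "A AND B", depending on the default operator. */
    static boolean matches(String query, String document, Operator defaultOperator) {
        Set<String> docTerms =
            new HashSet<>(Arrays.asList(document.toLowerCase().trim().split("\\s+")));
        for (String term : query.toLowerCase().trim().split("\\s+")) {
            boolean present = docTerms.contains(term);
            if (defaultOperator == Operator.AND && !present) {
                return false;  // AND: a single missing term rejects the document
            }
            if (defaultOperator == Operator.OR && present) {
                return true;   // OR: a single present term accepts the document
            }
        }
        // AND got through every term; OR found none.
        return defaultOperator == Operator.AND;
    }
}
```

A document containing only "a" and "c" matches the query "A B" under OR but not under AND, which is the behaviour Scott is after.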
Re: One direction phrase searches
Because I'm really interested in the guts of Lucene, I dug even deeper.

On Tuesday, September 2, 2003, at 07:39 PM, Erik Hatcher wrote:
>> Is there any way to make the system respond only to the order of the terms as entered in the query?
>
> I'm sure there is a way to make an OrderedPhraseQuery, although I'll need to do some more homework myself to craft such a thing. All the information to do such a thing is available, although maybe it wouldn't be as performant as PhraseQuery (just a guess, no facts to back that up yet).

PhraseQuery uses a SloppyPhraseScorer, and its phraseFreq method is what makes the order not matter. I'm pretty sure a new OrderedPhraseQuery that subclassed PhraseQuery and overrode createWeight and did something similar to the SloppyPhraseScorer would do the trick.

Erik
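The matching rule such an ordered query would need can be sketched on raw term positions. This is only the position check for two terms, under the assumption that slop means "at most this many other words in between"; OrderedProximity is a made-up name, and none of this is the PhraseQuery subclass or scorer described above.

```java
class OrderedProximity {
    /**
     * True if some occurrence of the first term is followed by an occurrence
     * of the second term with at most `slop` other words between them.
     */
    static boolean matchesInOrder(int[] firstPositions, int[] secondPositions, int slop) {
        for (int first : firstPositions) {
            for (int second : secondPositions) {
                int wordsBetween = second - first - 1;
                // A negative gap means the second term comes first: reject it.
                if (wordsBetween >= 0 && wordsBetween <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For a document where "covered" is at position 0 and "wagon" at position 3, the query terms in that order match with slop 5. If the document has "wagon" at 0 and "covered" at 3, the same query is rejected, which is the one-directional behaviour Joe asked for.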
Lucene features
I am wondering if Lucene is the way to go for my project. I don't know what other search engines are available out there, and how Lucene stacks up against them. I am wondering if Lucene has a full set of searching features, comparable to what I might find in a reasonably priced commercial package. Anyone with a solid knowledge of Lucene care to make me feel warm and fuzzy about my decision so far to use Lucene ?
Re: Lucene features
On Wed, Sep 03, 2003 at 12:45:25AM -0400, Chris Sibert wrote:
> I am wondering if Lucene is the way to go for my project.

Probably. Tell us a little about your project.

> I don't know what other search engines are available out there,

Lucene isn't a search engine _application_, it's a search engine _API_. Lucene gives you what you need in order to build the search engine you want, instead of spending gobs of time trying to figure out the 10,000 options available for a search engine application, or trying to warp somebody else's ideas of what you need to meet what you really need.

> and how Lucene stacks up against them.

Pretty well, if you're willing to put a (very) little time and energy into building the application you need. I know. I've done it.

> I am wondering if Lucene has a full set of searching features, comparable to what I might find in a reasonably priced commercial package.

There is no comparison :-). Lucene is a fundamentally decent piece of technology. This puts it head and shoulders above most commercial packages.

Specifically, the Lucene search engine API is blindingly fast at searching and at indexing, and comes with several built-in packages to provide several of the commonly needed functions (like a web search engine style query language parser).

Additionally, a wide variety of people have been down this road and done a wide variety of things with Lucene, so you're likely to be able to find examples, in the Lucene sandbox or in the lucene-user archives, of how to do whatever it is you want to do.

> Anyone with a solid knowledge of Lucene care to make me feel warm and fuzzy about my decision so far to use Lucene ?

Tell us a little more about your project requirements and I'll tell you enough specifics to give you a warm and fuzzy feeling. Lucene isn't perfect for _everything_ (and anybody who claims that a given technology *is* perfect for _everything_ is lying). But it's quite good for a number of things.

--
Steven J. Owens
[EMAIL PROTECTED]

"I'm going to make broad, sweeping generalizations and strong, declarative statements, because otherwise I'll be here all night and this document will be four times longer and much less fun to read. Take it all with a grain of salt." - Me at http://darksleep.com