Re: Feature List

2003-09-02 Thread Otis Gospodnetic
Sorry, I don't think there is such a document.  It would be nice to
have it, though.  You could, however, gather a lot of information from
pages like the query syntax, various articles, etc.

Otis

--- Chris Sibert [EMAIL PROTECTED] wrote:
> Is there any document available that lists Lucene's features? Like
> does it do similarity searches, etc. I've been using Lucene, but I
> don't know what all it can do.


__
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software
http://sitebuilder.yahoo.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Docco 0.2 / contribution offer

2003-09-02 Thread Gregor Heinrich
Hi Peter.

Docco is a great tool which I have been using since you posted your first
announcement (version 1.0, that is). Besides the things you mention in your
mail, I also think it's a great idea to use formal concept analysis with
Lucene. I would be interested in exploring the idea for more structured
data as well (maybe including fields and even hierarchies).

Apart from this, if I had an idea of the time commitment involved, I would
definitely consider joining.

Best,

Gregor



-----Original Message-----
From: Peter Becker [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 02, 2003 1:52 PM
To: Lucene Users List
Subject: ANN: Docco 0.2 / contribution offer


Hi all,

we finally finished the 0.2 release of our little personal document
management tool based on Lucene:

  http://tockit.sourceforge.net/docco/index.html

This might be interesting for some readers of this list since its source
contains some infrastructure for document handlers and index management.
The document handlers are written with a very simple API, which just
asks the implementation to fill a structure with the information
retrieved from a URL. It is similar to the Ant task in the Lucene
sandbox, but it separates the information collection from the actual
indexing, i.e. all the decisions about what should be stored and what
shouldn't.

The program comes with implementations for plain text, HTML (based on
Swing), XML (based on JAXP) and Open Office (using ZipStreams/SAX). We
wrote plugins for POI, PDFbox and Multivalent. The latter is
unfortunately a wild hack since Multivalent is the worst Java code I've
seen. Literally. Bad C written in Java. The tool would be nice to use,
but catching exceptions in little helper classes to do a System.exit is
just insane. And that is just one of the problems -- we had to do some
bad hacks to fix these issues. The other implementations should be fine,
although they need some more testing.

The source (including all required libs) of the program is available via
Sourceforge's CVS:

  http://sourceforge.net/cvs/?group_id=37081

The module in question is called docco. A current snapshot of only the
source is here:

  http://tockit.sourceforge.net/docco/source20030902.zip (~100kb)


The relevant packages are:

  org.tockit.docco.documenthandler: the documenthandler interface and
implementations
  org.tockit.docco.filefilter: some code to pick document handlers via
file extensions or regexps
  org.tockit.docco.index: the model/static bits of the index management
  org.tockit.docco.indexer: the dynamic aspects of the index management:
runnable, framework for handlers

The index management is probably not optimal; I strongly suspect that an
expert could tweak it. But the structure should be ok.

We would be happy to contribute this code to the Lucene sandbox if there
is interest, or to turn it into a project of its own; we don't think it
should be hidden in our more specific program. It should be easy to
merge it with the Ant task and we are happy to give a hand if wanted.
Adding some documentation would be easy, too -- at the moment the code
is still more for ourselves, but it should be very readable by itself. We
require JDK 1.4, but this requirement can be relaxed by moving some more
document handlers into plugins.

Anyone interested in joining in maintaining this code? Any feedback is
welcome.

Cheers,
   Peter





Re: TermVector again (Re: Luke v 0.2 - Lucene Index Browser)

2003-09-02 Thread Otis Gospodnetic

--- Andrzej Bialecki [EMAIL PROTECTED] wrote:
> Andrzej Bialecki wrote:
> > Julien Nioche wrote:
> > > [- and almost impossible : recompose the unstored fields of a
> > > document]
> >
> > It's not impossible, just time-consuming - all information (except the
> > parts removed by analyzer) is already there. This functionality has a
> > high cool-ness factor, which makes it very tempting... :-)
>
> I had a look at the current Lucene API, and I realized that this is a
> very costly operation now. Now, if we had a TermVector support that was
> mentioned several times on this list, things would be very different...
>
> Does anyone know what is the status / plans regarding this?

As far as I know it is not on the to-do list of any of the more active
Lucene developers.  In other words, it is waiting for some external
contributor with more time and with required knowledge.

Code that worked with one of the 1.2 versions was posted to the list a
looong time ago by Dmitry.

Otis





Re: Keyword search with space and wildcard

2003-09-02 Thread Brian Campbell
Great.  Is there an example anywhere on how I might be able to build such a 
Query?  QueryParser isn't really all that simple since it's built with 
JavaCC.

What might be ideal for me is to continue using the high-level 
interface to build the main query (i.e. use it to parse my query string and 
return some kind of Query: BooleanQuery, TermQuery, etc.) and then build 
a WildcardQuery by hand and combine the two.  For example, is it 
as simple as calling Query.combine() to combine the two?  Is there a better 
way?  Is there a documented example like this?  Thanks!

-Brian




> This can be done, AFAIK.
>
> This is one thing that many people seem unaware of: you don't HAVE to use
> QueryParser to build queries. In your case it seems like you should be able
> to construct the query you want if you either by-pass QueryParser, or create
> a dummy analyzer (one that does no tokenization but returns all input as
> one token).
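
A minimal sketch of why the "dummy analyzer" idea helps with a keyword that contains a space. It deliberately uses no Lucene classes: `KeywordWildcardSketch` and all of its methods are hypothetical names for illustration, and the wildcard-to-regex translation is an assumption about the usual `*` and `?` semantics, not Lucene's own matcher. The point is only that a wildcard is matched against individual indexed terms, so a pattern containing a space can match only if the value was indexed as a single term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; none of these names are Lucene classes.
class KeywordWildcardSketch {

    // Whitespace-style analysis: "New York" becomes the two terms "New" and "York".
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // "Dummy" analysis: no tokenization, the whole input is kept as one term.
    static List<String> singleToken(String text) {
        List<String> tokens = new ArrayList<>();
        tokens.add(text);
        return tokens;
    }

    // Translate a wildcard pattern (* = any run of characters, ? = one character)
    // into a regular expression; every other character is matched literally.
    static Pattern wildcardToRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') sb.append(".*");
            else if (c == '?') sb.append('.');
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // True if at least one indexed term matches the wildcard in full.
    static boolean anyTermMatches(List<String> terms, String wildcard) {
        Pattern p = wildcardToRegex(wildcard);
        for (String term : terms) {
            if (p.matcher(term).matches()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String value = "New York";
        // Tokenized on whitespace: neither "New" nor "York" matches "New Y*".
        System.out.println(anyTermMatches(whitespaceTokens(value), "New Y*")); // false
        // Kept as one term: "New York" matches "New Y*".
        System.out.println(anyTermMatches(singleToken(value), "New Y*"));      // true
    }
}
```

The same reasoning applies on the query side: if the query string is split on the space before the wildcard term is built, the pattern no longer corresponds to any single indexed term.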
_
Enter for your chance to IM with Bon Jovi, Seal, Bow Wow, or Mary J Blige 
using MSN Messenger http://entertainment.msn.com/imastar



RE: Keyword search with space and wildcard

2003-09-02 Thread Eric Isakson
Not sure about documented examples. I often find the unit tests (in src/test of 
Lucene's CVS) very useful as examples, but I didn't see any for what you are 
looking for.

Basically, QueryParser builds up a vector of BooleanClause objects, then loops over 
them, calling add(BooleanClause) on a BooleanQuery object. I agree JavaCC isn't really 
simple to follow, but there is a lot of plain Java in there that does the parts you 
are interested in. If you build the .java file and ignore the token parsing stuff, 
you can look at it in your favorite Java IDE.

What you can do is cast the query you get from QueryParser to a BooleanQuery (that is 
the only type of Query that QueryParser will return), then create your WildcardQuery or 
any other queries you need that you didn't get in the query string, and add them as 
clauses to the BooleanQuery using add(Query query, boolean required, boolean 
prohibited).

I don't know how Query.combine() works (never used it), but the Javadoc comment leads me 
to believe it is not what you are looking for, and a bit of poking around in the 
sources gives me the same impression.
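
A minimal sketch of the clause semantics behind add(Query query, boolean required, boolean prohibited), written as a toy model in plain Java rather than against Lucene itself. Everything here is hypothetical: `BooleanCombineSketch`, its nested `Query` and `BooleanQuery` types, and the `term` and `prefix` helpers only imitate the idea, and documents are modelled as plain sets of terms. The sketch wraps the "parsed" query and the hand-built one as two required clauses of a new boolean query, an approach that does not depend on the concrete type the parser happened to return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only; these are not the Lucene classes of the same name.
class BooleanCombineSketch {

    interface Query {
        boolean matches(Set<String> docTerms);
    }

    // Stand-in for whatever a query parser would hand back for a single word.
    static Query term(String t) {
        return doc -> doc.contains(t);
    }

    // Stand-in for a hand-built wildcard query of the form "prefix*".
    static Query prefix(String p) {
        return doc -> {
            for (String t : doc) {
                if (t.startsWith(p)) return true;
            }
            return false;
        };
    }

    static class BooleanQuery implements Query {
        private static class Clause {
            final Query query;
            final boolean required;
            final boolean prohibited;

            Clause(Query query, boolean required, boolean prohibited) {
                this.query = query;
                this.required = required;
                this.prohibited = prohibited;
            }
        }

        private final List<Clause> clauses = new ArrayList<>();

        BooleanQuery add(Query query, boolean required, boolean prohibited) {
            clauses.add(new Clause(query, required, prohibited));
            return this;
        }

        // A document matches if no prohibited clause hits, every required clause
        // hits, and (when nothing is required) at least one optional clause hits.
        public boolean matches(Set<String> docTerms) {
            boolean anyRequired = false;
            boolean anyOptionalHit = false;
            for (Clause c : clauses) {
                boolean hit = c.query.matches(docTerms);
                if (c.prohibited) {
                    if (hit) return false;
                } else if (c.required) {
                    if (!hit) return false;
                    anyRequired = true;
                } else if (hit) {
                    anyOptionalHit = true;
                }
            }
            return anyRequired || anyOptionalHit;
        }
    }

    public static void main(String[] args) {
        Query parsed = term("lucene");   // pretend this came from the parser
        Query byHand = prefix("wild");   // built by hand
        Query combined = new BooleanQuery()
                .add(parsed, true, false)
                .add(byHand, true, false);

        Set<String> both = new HashSet<>(Arrays.asList("lucene", "wildcard"));
        Set<String> onlyOne = new HashSet<>(Arrays.asList("lucene", "index"));
        System.out.println(combined.matches(both));    // true
        System.out.println(combined.matches(onlyOne)); // false
    }
}
```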

Eric 




One direction phrase searches

2003-09-02 Thread Joe Paulsen
It seems when I do a search such as "covered wagon"~5 or the like,
the system disregards the order of my terms, i.e., it will find "covered"
within 5 of "wagon" and it will also find "wagon" within 5 of "covered".

Is there any way to make the system respond only to the order of the
terms as entered in the query?

Joe Paulsen

Re: One direction phrase searches

2003-09-02 Thread Erik Hatcher
On Tuesday, September 2, 2003, at 04:11  PM, Joe Paulsen wrote:
> It seems when I do a search such as "covered wagon"~5 or the like,
> the system disregards the order of my terms.  I.E., it will find
> covered within 5 of wagon and it will also find wagon within 5 of covered.

I wanted to see this in action myself, so I coded up a small unit test:

// Needs imports from org.apache.lucene.analysis, .document, .index,
// .search and .store; the method lives in a JUnit TestCase.
public void testOrderDoesntMatter() throws Exception {
    // Index a single document whose field contains "one two".
    Directory directory = new RAMDirectory();
    IndexWriter writer =
        new IndexWriter(directory, new WhitespaceAnalyzer(), true);
    Document doc = new Document();
    doc.add(Field.Text("field", "one two"));
    writer.addDocument(doc);
    writer.optimize();
    writer.close();

    // Search for the reversed phrase "two one" with a slop of 5.
    IndexSearcher searcher = new IndexSearcher(directory);
    PhraseQuery query = new PhraseQuery();
    query.setSlop(5);
    query.add(new Term("field", "two"));
    query.add(new Term("field", "one"));
    Hits hits = searcher.search(query);
    assertEquals(1, hits.length());
    searcher.close();
}

Notice that I'm searching for "two one"~5 (yet indexed "one two") and 
it found 1 hit.

And then, like a typical programmer, I looked at the Javadocs *after* 
coding :) and found this on PhraseQuery:

  /** Sets the number of other words permitted between words in query phrase.
    If zero, then this is an exact phrase search.  For larger values this works
    like a <code>WITHIN</code> or <code>NEAR</code> operator.

    <p>The slop is in fact an edit-distance, where the units correspond to
    moves of terms in the query phrase out of position.  For example, to switch
    the order of two words requires two moves (the first move places the words
    atop one another), so to permit re-orderings of phrases, the slop must be
    at least two.

    <p>More exact matches are scored higher than sloppier matches, thus search
    results are sorted by exactness.

    <p>The slop is zero by default, requiring exact matches.*/
  public void setSlop(int s) { slop = s; }

So what you observe is the correct documented behavior.
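
The "two moves" arithmetic in that Javadoc can be worked through with a small stand-alone sketch. This is a simplified model, not Lucene's scorer: `SlopSketch` and `requiredSlop` are hypothetical names, and the model assumes each query term occurs exactly once in the document. It shifts each term's document position back by its offset in the query phrase and takes the spread of the shifted positions as the slop needed for a match.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of slop as an edit distance; not Lucene code.
class SlopSketch {

    // Smallest slop at which the query phrase would match the document,
    // assuming every query term occurs exactly once. Returns -1 if a
    // query term is missing from the document.
    static int requiredSlop(List<String> docTokens, List<String> queryTerms) {
        if (queryTerms.isEmpty()) return 0;
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int offset = 0; offset < queryTerms.size(); offset++) {
            int pos = docTokens.indexOf(queryTerms.get(offset));
            if (pos < 0) return -1;
            // Where the phrase would start if this term sat at its query offset.
            int shifted = pos - offset;
            min = Math.min(min, shifted);
            max = Math.max(max, shifted);
        }
        return max - min;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("one", "two");
        // Same order as indexed: exact match, no slop needed.
        System.out.println(requiredSlop(doc, Arrays.asList("one", "two"))); // 0
        // Reversed order: "two" shifts to 1, "one" shifts to -1, spread is 2.
        System.out.println(requiredSlop(doc, Arrays.asList("two", "one"))); // 2
    }
}
```

Under this model a slop of 5 comfortably covers the reversed two-word phrase from the unit test, which is consistent with the single hit it reports.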

> Is there any way to make the system respond only to the order of the
> terms as entered in the query?

I'm sure there is a way to make an OrderedPhraseQuery, although I'll 
need to do some more homework myself to craft such a thing.  All the 
information to do such a thing is available, although maybe it wouldn't 
be as performant as PhraseQuery (just a guess, no facts to back that up 
yet).

	Erik



Default queries to using And

2003-09-02 Thread Scott Farquhar
Is it possible to get QueryParser.parse() to parse queries defaulting to
'AND' rather than 'OR'?

Currently if you search for 'A B' that is the same as 'A OR B'.  What I
would like is to default to 'A AND B'.

Apologies for the simple question.  I'm guessing the answer is probably more complex
(like Lucene returning A+B, then A, then B)?  I couldn't find anything in
the FAQ about this.

Cheers,
Scott




Re: Default queries to using And

2003-09-02 Thread Scott Farquhar
This is what I'm looking for.  Too easy.

Cheers,
Scott

On Tue, Sep 02, 2003 at 09:24:04PM -0400, Erik Hatcher wrote:
> Look at QueryParser.setOperator() (perhaps it was added after your
> version?)
>
>   Erik
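
A minimal sketch of what the default operator changes, using a toy inverted index in plain Java rather than Lucene. `DefaultOperatorSketch`, `index` and `search` are hypothetical names, and the `defaultAnd` flag only stands in for the effect of the QueryParser.setOperator() setting mentioned above; the exact constant to pass to that method should be checked in the Javadoc for your Lucene version.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index; not Lucene code.
class DefaultOperatorSketch {

    // Map each whitespace-separated term to the ids of the documents containing it.
    static Map<String, Set<Integer>> index(String... docs) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String term : docs[id].trim().split("\\s+")) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    // With defaultAnd the per-term hit sets are intersected (A AND B);
    // otherwise they are merged (A OR B).
    static Set<Integer> search(Map<String, Set<Integer>> postings,
                               String query, boolean defaultAnd) {
        Set<Integer> result = null;
        for (String term : query.trim().split("\\s+")) {
            Set<Integer> hits = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else if (defaultAnd) {
                result.retainAll(hits);
            } else {
                result.addAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> postings = index("A B", "A", "B");
        System.out.println(search(postings, "A B", false)); // [0, 1, 2]
        System.out.println(search(postings, "A B", true));  // [0]
    }
}
```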
 
 



Re: One direction phrase searches

2003-09-02 Thread Erik Hatcher
Because I'm really interested in the guts of Lucene, I dug even 
deeper.

On Tuesday, September 2, 2003, at 07:39  PM, Erik Hatcher wrote:
>> Is there any way to make the system respond only to the order of the
>> terms as entered in the query?
>
> I'm sure there is a way to make an OrderedPhraseQuery, although I'll 
> need to do some more homework myself to craft such a thing.  All the 
> information to do such a thing is available, although maybe it 
> wouldn't be as performant as PhraseQuery (just a guess, no facts to 
> back that up yet).

PhraseQuery uses a SloppyPhraseScorer, and its phraseFreq method is what 
makes the order not matter.  I'm pretty sure a new OrderedPhraseQuery 
that subclassed PhraseQuery and overrode createWeight and did something 
similar to the SloppyPhraseScorer would do the trick.

	Erik
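
A minimal sketch of the positional check an order-sensitive phrase match would need, in plain Java. It is not an OrderedPhraseQuery and not Lucene's scorer: `OrderedPhraseSketch` and `matchesInOrder` are hypothetical names, and the sketch assumes slop means the total number of extra words allowed between consecutive query terms, with the terms required to appear in query order.

```java
import java.util.Arrays;
import java.util.List;

// Order-sensitive sloppy phrase check on a token list; not Lucene code.
class OrderedPhraseSketch {

    // True if the query terms occur in the document in query order with at
    // most `slop` extra words in total between consecutive terms.
    static boolean matchesInOrder(List<String> docTokens,
                                  List<String> queryTerms, int slop) {
        if (queryTerms.isEmpty()) return false;
        return search(docTokens, queryTerms, slop, 0, -1, 0);
    }

    // Tries every occurrence of queryTerms[termIdx] after prevPos;
    // `used` is the number of gap words spent on earlier terms.
    private static boolean search(List<String> doc, List<String> terms,
                                  int slop, int termIdx, int prevPos, int used) {
        if (termIdx == terms.size()) return true;
        for (int pos = prevPos + 1; pos < doc.size(); pos++) {
            if (!doc.get(pos).equals(terms.get(termIdx))) continue;
            int gap = (termIdx == 0) ? 0 : pos - prevPos - 1;
            if (used + gap > slop) break; // later positions only widen the gap
            if (search(doc, terms, slop, termIdx + 1, pos, used + gap)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("one", "two");
        // Indexed order matches.
        System.out.println(matchesInOrder(doc, Arrays.asList("one", "two"), 5)); // true
        // Reversed order is rejected however large the slop is.
        System.out.println(matchesInOrder(doc, Arrays.asList("two", "one"), 5)); // false
    }
}
```

A real implementation would work from term positions stored in the index rather than from a token list, but the ordering constraint would be the same.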



Lucene features

2003-09-02 Thread Chris Sibert
I am wondering if Lucene is the way to go for my project. I don't know what
other search engines are available out there, and how Lucene stacks up
against them. I am wondering if Lucene has a full set of searching features,
comparable to what I might find in a reasonably priced commercial package.
Anyone with a solid knowledge of Lucene care to make me feel warm and fuzzy
about my decision so far to use Lucene ?





Re: Lucene features

2003-09-02 Thread Steven J. Owens
On Wed, Sep 03, 2003 at 12:45:25AM -0400, Chris Sibert wrote:
> I am wondering if Lucene is the way to go for my project.

 Probably.  Tell us a little about your project.

> I don't know what other search engines are available out there,

 Lucene isn't a search engine _application_, it's a search engine
_API_.  Lucene gives you what you need in order to build the search
engine you want, instead of spending gobs of time trying to figure out
the 10,000 options available for a search engine application, or
trying to warp somebody else's ideas of what you need to meet what you
really need.

> and how Lucene stacks up against them.

 Pretty well, if you're willing to put a (very) little time and
energy into building the application you need.  I know.  I've done
it.

> I am wondering if Lucene has a full set of searching features,
> comparable to what I might find in a reasonably priced commercial
> package.

 There is no comparison :-).  Lucene is a fundamentally decent
piece of technology.  This puts it head and shoulders above most
commercial packages.

 Specifically, the Lucene search engine API is blindingly fast at
searching and at indexing, and comes with several built-in packages to
provide several of the commonly needed functions (like a web search
engine style query language parser).

 Additionally, a wide variety of people have been down this road
and done a wide variety of things with Lucene, so you're likely to be
able to find examples, in the Lucene sandbox or in the lucene-user
archives, of how to do whatever it is you want to do.

> Anyone with a solid knowledge of Lucene care to make me feel warm
> and fuzzy about my decision so far to use Lucene ?

 Tell us a little more about your project requirements and I'll
tell you enough specifics to give you a warm and fuzzy feeling.
Lucene isn't perfect for _everything_ (and anybody who claims that a
given technology *is* perfect for _everything_ is lying).  But it's
quite good for a number of things.

-- 
Steven J. Owens
[EMAIL PROTECTED]

I'm going to make broad, sweeping generalizations and strong,
 declarative statements, because otherwise I'll be here all night and
 this document will be four times longer and much less fun to read.
 Take it all with a grain of salt. - Me at http://darksleep.com

