Re: Lucene vs. in-DB-full-text-searching

2005-02-19 Thread Steven J. Owens
On Fri, Feb 18, 2005 at 04:45:50PM -0500, Mike Rose wrote:
 I can comment on this since I'm in the middle of excising Oracle text
 searching and replacing it with Lucene in one of my projects.

 Interesting, particularly as it's from somebody who's already
tried an existing in-db fulltext search feature.

 All in all, I don't think that a JDBC wrapper is going to do what
 you want.

 I wasn't thinking about trying to do the whole thing under the
JDBC driver.  Mainly I was thinking that one key point is that you
need to treat the lucene index somewhat like a cache.  This also means
that you have to watch database writes and make sure you update your
cache, which means you have to have some sort of single point of data
access to monitor.  Well, we already have that - it's called the JDBC
driver.

 The general design I was eyeing speculatively is basically that
the driver would be set up with a reference to an object that
implements a CacheManager interface.  This interface basically gives
the driver a way to notify the cache manager of when certain tables
and columns are being edited.  Exactly how is another question.  I
don't know enough of the innards of, say, a PreparedStatement, to say
more.  It could be as simple as sending the CacheManager a copy of
every SQL query string and letting the CacheManager figure out the
rest.  Ideally I'd like it to be a little bit more structured.

 From there, it's the CacheManager's job to decide what to do
about it, and how to do it.  This leaves the tricky issue of mapping
from a specific database to a specific lucene index up to the
developer.
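
 To make the simplest version of that concrete (the driver hands over
every SQL string and the manager works out the rest), here is a rough
sketch.  Everything in it is hypothetical: the CacheManager method name,
the RecordingCacheManager class and the crude regex for spotting writes
are invented for illustration and are not part of JDBC or Lucene.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical callback that a wrapping JDBC driver would invoke.
interface CacheManager {
    // Called by the driver wrapper with every SQL string it executes.
    void statementExecuted(String sql);
}

// Toy implementation: remembers which tables have been written to, so the
// matching parts of the lucene index can be refreshed later.
class RecordingCacheManager implements CacheManager {
    // Crude pattern for the table name in INSERT/UPDATE/DELETE statements.
    private static final Pattern WRITE = Pattern.compile(
        "^\\s*(?:insert\\s+into|update|delete\\s+from)\\s+([A-Za-z_][A-Za-z0-9_]*)",
        Pattern.CASE_INSENSITIVE);

    private final Set<String> dirtyTables = new LinkedHashSet<>();

    public void statementExecuted(String sql) {
        Matcher m = WRITE.matcher(sql);
        if (m.find()) {
            dirtyTables.add(m.group(1).toLowerCase(Locale.ROOT));
        }
    }

    // Tables whose rows need reindexing.
    public Set<String> dirtyTables() {
        return dirtyTables;
    }
}
```

 Parsing raw SQL like this is fragile, which is the reason for wanting
something more structured; the sketch only shows where the hook would sit.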

-- 
Steven J. Owens
[EMAIL PROTECTED]

I'm going to make broad, sweeping generalizations and strong,
 declarative statements, because otherwise I'll be here all night and
 this document will be four times longer and much less fun to read.
 Take it all with a grain of salt. - http://darksleep.com/notablog


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene vs. in-DB-full-text-searching

2005-02-18 Thread Steven J. Owens
Hi,

 I was rambling to some friends about an idea to build a
cache-aware JDBC driver wrapper, to make it easier to keep a lucene
index of a database up to date.

 They asked me a question that I have to take seriously, which is
that most RDBMSes provide some built-in fulltext searching - postgres,
mysql, even oracle - why not use that instead of adding another layer
of caching?

 I have to take this question seriously, especially since it
reminds me a lot of what Doug has often said to folks contemplating
doing similar things (caching query results, etc) with Lucene.

 Has anybody done some serious investigation into this, and could
summarize the pros and cons?

-- 
Steven J. Owens
[EMAIL PROTECTED]




Re: Collaborative Filtering API

2003-11-26 Thread Steven J. Owens
On Tue, Nov 25, 2003 at 01:18:19PM -0500, Michael Giles wrote:
 Yes, he was the lead Ph.D. student on the GroupLens project at Minnesota.

 I've actually worked on a system that bundled GroupLens.  I think
it was Vignette StoryServer.  The Vignette docs were incredibly dense
with MarketingNewSpeak, so I could never quite figure out what they
said GroupLens actually *did* (not at a web-capable terminal right
now, or I'd just google it).

 Collaborative filtering in general is a topic I'm interested in,
and is why I first got into Lucene.  I wanted and still want to build
a collaborative filtering search engine for mailing lists and the
like.

 I do remember that FireFly's engine was supposed to graph all of
the users' ratings on a topic in an N-dimensional space, then find
users close to the current user in that N-dimensional space, and
suggest topics that they'd liked but that the current user hadn't
rated.
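
 A rough sketch of that idea, to make it concrete.  This illustrates
the description above and is not FireFly's actual algorithm; the class
and method names, the use of Euclidean distance over co-rated topics,
and the "liked" threshold are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NearestUserRecommender {
    // user -> (topic -> rating); each topic is one dimension of the space.
    private final Map<String, Map<String, Double>> ratings = new HashMap<>();

    void rate(String user, String topic, double rating) {
        Map<String, Double> r = ratings.get(user);
        if (r == null) {
            r = new HashMap<>();
            ratings.put(user, r);
        }
        r.put(topic, rating);
    }

    // Euclidean distance over the topics both users rated (an assumption);
    // users with nothing in common are treated as infinitely far apart.
    double distance(String a, String b) {
        double sum = 0;
        int shared = 0;
        for (Map.Entry<String, Double> e : ratings.get(a).entrySet()) {
            Double other = ratings.get(b).get(e.getKey());
            if (other != null) {
                double d = e.getValue() - other;
                sum += d * d;
                shared++;
            }
        }
        return shared == 0 ? Double.POSITIVE_INFINITY : Math.sqrt(sum);
    }

    // Topics the closest other user liked (rating >= likeThreshold) that
    // the current user has not rated yet.
    List<String> suggest(String current, double likeThreshold) {
        List<String> out = new ArrayList<>();
        if (!ratings.containsKey(current)) return out;
        String nearest = null;
        double best = Double.POSITIVE_INFINITY;
        for (String other : ratings.keySet()) {
            if (other.equals(current)) continue;
            double d = distance(current, other);
            if (d < best) {
                best = d;
                nearest = other;
            }
        }
        if (nearest == null) return out;
        for (Map.Entry<String, Double> e : ratings.get(nearest).entrySet()) {
            if (e.getValue() >= likeThreshold
                    && !ratings.get(current).containsKey(e.getKey())) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

 A real engine would presumably look at several neighbours rather than
one and weight them by distance; a single nearest user keeps the sketch
short.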

 I'm interested in more of a free market sort of approach than
in statistical analysis; I want to build a system that helps users
express their opinions, then nurture an emerging consensus.  My
experience has been that systems/technologies that try to
facilitate the way users already do things, instead of replacing them
with new ways of doing things, tend to work better.

-- 
Steven J. Owens
[EMAIL PROTECTED]




Re: Lucene demo ideas?

2003-09-23 Thread Steven J. Owens
On Wed, Sep 17, 2003 at 08:00:42AM -0400, Erik Hatcher wrote:
 I'm about to start some refactorings on the web application demo that 
 ships with Lucene to show off its features and be usable more easily 
 and cleanly out of the box - i.e. just drop into Tomcat's webapps 
 directory and go.
 
 Does anyone have any suggestions on what they'd like to see in the demo 
 app?

 One odd thought (may be out of scope) is to put together a
google-flavored query language, since most users are going to be
unfamiliar with the default Lucene query language.  Lucene doesn't
really match google, but something google-flavored might be better at
showing off Lucene's features in the demo.

-- 
Steven J. Owens
[EMAIL PROTECTED]




Re: Lucene features

2003-09-04 Thread Steven J. Owens
On Wed, Sep 03, 2003 at 02:42:48PM -0400, Chris Sibert wrote:
 Lucene Users List [EMAIL PROTECTED]
   I am wondering if Lucene is the way to go for my project.
   Probably.  Tell us a little about your project.
 
 It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB
 in size. They don't ever change, and are on a CD-ROM. Each file contains a
 bunch of small documents. I just create one index for all 4 of them. These
 documents are for an association that I belong to - they contain a history
 of the association's documents - and my application allows you to search
 them.

 Well, aside from your concerns about the second list, Lucene
seems perfect for your needs.  You'd parse apart the four big files
into a bunch of small documents, then parse those small documents and
create lucene Documents, containing Fields, and add them to the index.
 
 They are actually currently indexed by an application called
 'Sonar', by Virginia Systems. But I REALLY didn't like using their
 user interface - blech - so I decided to write a new interface for
 my own use. But Sonar costs some real bucks to be able to develop
 against their search API, so I found Lucene, and decided to go with
 it.
 
 Here are the search features that 'Sonar' has :
   Boolean Searching
   Proximity Searching
   Wild Card Searching
   Field/Block Searching

 I'm not sure what Field/Block means.  Boolean, Proximity and
WildCard searches are pretty typical in Lucene.  You should probably
take a look at the Query Parser syntax docs:

 http://jakarta.apache.org/lucene/docs/queryparsersyntax.html


   Relevancy Ranking / Date Ranking

 Lucene search results are typically ranked by relevance, and you
can tweak the search to adjust this (there's a fair bit of discussion
of this in the lucene-user archives; good keywords to look for are
"slop" and "boost").

 Sorting output by date might take some finesse.  I haven't played
with sorting by date, but I'd expect to handle that by directly
instantiating a QueryTerm to indicate the date issues.

   List of Occurrences in Context

 I assume here that you mean displaying the results with a little
snapshot of the text around it.  There have been discussions about how
best to do this (often focused around highlighting the search terms in
the displayed text) on the lucene-users list.  Check the list archive.
 
   Phonetic Searching

 I'd guess you need to build this one yourself, perhaps by using a
soundex algorithm when indexing the original data files.
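
 For reference, a sketch of the usual American Soundex rules in plain
Java, written from the standard description of the algorithm.  The class
name is made up, and how you store the result in the index is left out.

```java
class Soundex {
    // Code for each letter A..Z: '0' = vowel-like (not coded),
    // '-' = H or W (skipped, and does not break a run of equal codes).
    private static final String CODES = "0123012-02245501262301-202";

    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char last = '0';   // code of the previous letter that was not H/W
        for (int i = 0; i < word.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') continue;       // ignore non-letters
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                      // first letter kept as is
                last = code;
            } else if (code != '-') {
                // Emit a digit only for a coded letter that differs from
                // the previous code; a vowel ('0') resets the run.
                if (code != '0' && code != last) out.append(code);
                last = code;
            }
        }
        while (out.length() < 4) out.append('0');   // pad to 4 characters
        return out.toString();
    }
}
```

 The idea would be to run each word through encode() at indexing time,
put the codes in an extra field of the Document, and encode the user's
query terms the same way before searching that field.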

   Synonyms/Concepts

 Likewise... you'd need to come up with some sort of ontology of
synonyms and concepts, then parse the fields you're indexing and
generate a synonym/concept field that you'd add to the lucene
Document.
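
 A minimal sketch of the "generate a synonym field" step, assuming
you already have a synonym table from somewhere.  SynonymExpander and
its methods are hypothetical names, and the tokenizing is deliberately
naive.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class SynonymExpander {
    private final Map<String, List<String>> table = new HashMap<>();

    // Registers the alternatives to index alongside a given word.
    void add(String word, String... alternatives) {
        table.put(word.toLowerCase(Locale.ROOT), Arrays.asList(alternatives));
    }

    // Builds the text for an extra synonym field: the synonyms of every
    // known word in the input, space-separated, without duplicates.
    String synonymFieldFor(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            List<String> alts = table.get(token);
            if (alts != null) out.addAll(alts);
        }
        return String.join(" ", out);
    }
}
```

 The returned string would go into its own field on the Document, so
that a search for a synonym still finds the original text.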

   Relational Searching
   Associated Words
   Drill Down Search Narrowing

 I'm not sure what these three mean.

 I think that Lucene has all the features in the first group. How does it
 stack up against the second group ?

 I'm afraid I haven't been too helpful here.  Perhaps if you
clarify what the above mean, folks can post about how to implement it
in Lucene.

 I'm writing the whole thing in Swing, which has been time consuming,
 and so have invested quite a bit of time into this project. But I'm
 seeing the end of the tunnel, and want to make sure that I'm going
 down the right path before I spend too much more time on it.

 It sounds like you ought to at least seriously consider using
Lucene, if you can find or implement equivalent features, or decide
you can live without them.

-- 
Steven J. Owens
[EMAIL PROTECTED]




Re: Lucene features

2003-09-02 Thread Steven J. Owens
On Wed, Sep 03, 2003 at 12:45:25AM -0400, Chris Sibert wrote:
 I am wondering if Lucene is the way to go for my project.

 Probably.  Tell us a little about your project.

 I don't know what other search engines are available out there,

 Lucene isn't a search engine _application_, it's a search engine
_API_.  Lucene gives you what you need in order to build the search
engine you want, instead of spending gobs of time trying to figure out
the 10,000 options available for a search engine application, or
trying to warp somebody else's ideas of what you need to meet what you
really need.

 and how Lucene stacks up against them.

 Pretty well, if you're willing to put a (very) little time and
energy into building the application you need.  I know.  I've done
it.

 I am wondering if Lucene has a full set of searching features,
 comparable to what I might find in a reasonably priced commercial
 package.

 There is no comparison :-).  Lucene is a fundamentally decent
piece of technology.  This puts it head and shoulders above most
commercial packages.

 Specifically, the Lucene search engine API is blindingly fast at
searching and at indexing, and comes with several built-in packages to
provide several of the commonly needed functions (like a web search
engine style query language parser).  

 Additionally, a wide variety of people have been down this road
and done a wide variety of things with Lucene, so you're likely to be
able to find examples, in the Lucene sandbox or in the lucene-user
archives, of how to do whatever it is you want to do.

 Anyone with a solid knowledge of Lucene care to make me feel warm
 and fuzzy about my decision so far to use Lucene ?

 Tell us a little more about your project requirements and I'll
tell you enough specifics to give you a warm and fuzzy feeling.
Lucene isn't perfect for _everything_ (and anybody who claims that a
given technology *is* perfect for _everything_ is lying).  But it's
quite good for a number of things.

-- 
Steven J. Owens
[EMAIL PROTECTED]




Apache Wiki at nagoya.apache.org

2002-12-21 Thread Steven J. Owens
Hey folks,

 They've put an apache wiki at nagoya.  I took the liberty of
writing a paragraph about Lucene, please feel free to completely
revise it :-).

http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneProjectPages

Steven J. Owens
[EMAIL PROTECTED]





Re: Let me get started

2002-11-13 Thread Steven J. Owens
On Wed, Nov 13, 2002 at 05:14:25PM +0530, Uma Maheswar wrote:
 Thanks for the messages. Yes I wanted to index .jsp files also. Is
 it possible?

 It's possible, but you'll need to write code to select and parse
the jsp files.  There may be code in the sandbox area at
jakarta.apache.org/lucene for doing this, though I don't see it.

 I thought we need a database to store some values and then retrive
 them back. Dont we need database for it?

 Nope, lucene stores search data in its own files.  You can easily
use lucene to build a search engine for data that's stored in a
database, but you don't need a database.

Steven J. Owens
[EMAIL PROTECTED]



--
To unsubscribe, e-mail:   mailto:lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: mailto:lucene-user-help@jakarta.apache.org




Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)

2002-04-29 Thread Steven J. Owens

petite,

On Mon, Apr 29, 2002 at 07:54:43PM +0200, petite_abeille wrote:
 As a final note, several people suggested to increase the number of file 
 descriptors per process with something like ulimit...

 Just be glad you aren't doing this on Solaris with JDK 1.1.6,
where I first ran into ulimit issues - back when I encountered this
problem, the solaris default ulimit setting was 24 files, and JDK
1.1.6 reported the problem as an OutOfMemory error!  Looks like
things are improving :-).

 From what I learned today, I think it's a *bad* idea to have to
 change some system parameters just because your/my app is written in
 such a way that it may run out of some system resources. Your/my app
 has to fit in the system.  Hacking ulimit and/or other system
 parameters is just a quick patch that will -at best- delay dealing
 with the real problem that's usually one of design.

 Yes and no.  Setting ulimit to a reasonable number of open files
is not only not a patch, it's the right way to do it.  I understand
where you're coming from, really, and in a certain way, it makes
sense, BUT... sometimes the impulse for clean, good design takes you
too far down a blind alley.  Sometimes there is no elegant solution.
Sometimes there is no best way, only one of a limited set of options
with different tradeoffs.

 By definition, Lucene is an application that trades off up front
CPU (for indexing) and file resources (for storage) for request-time
speed.  The OS's job is to manage resources, and open files are one of
those resources.  That's the tradeoff here, and it's reasonable and
expected.  Most serious applications have to have some sort of OS
variable tweaking, you're just used to having it done invisibly and
painlessly.

 That said, since you're working on a client/desktop application,
not a server application, you need to think about ways to handle this:

 You could figure out the right way to set the system
configuration on install or launch.

 You could look at the alternative techniques for indexing in
Lucene, and see if any approaches there can help - for example, maybe
doing a lot of the more intense indexing work in a RAMDirectory, then
merging it into a normal file-based Directory.

 You could look more closely at what your application is doing,
and see if there's anything you're doing wrong (perhaps opening files
and not closing them, and leaving them for the garbage collector to
eventually get around to closing?) or if you have a pessimal usage
pattern that exacerbates the situation.

 You could take a closer look at the lucene indexing and file
management stuff, and see if you can come up with a scheme to run
Lucene indexing with modified code for keeping track of file
resources. 
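
 On the "opening files and not closing them" possibility: a sketch of
closing a file deterministically instead of waiting for the garbage
collector.  It uses try-with-resources, which needs Java 7 or later (on
older JDKs the equivalent is a try/finally that calls close()).  The
FileText class and readAll() method are made-up names.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

class FileText {
    // Reads a whole file.  The reader, and with it the underlying file
    // descriptor, is closed when the try block exits, even on an exception.
    static String readAll(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}
```

 If every file your own code touches while building Documents is
handled this way, whatever descriptors remain open belong to the index
itself, which at least narrows down where the limit is being hit.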

 I'll bet Doug and the other developers would rather not add
open-file management as a main, permanent part of lucene, since it
would add overhead to all uses of lucene just to deal with an
anomalous situation (use on a client/desktop machine).  But they might
be interested in a way to offer it as an optional feature, where
people using lucene in a constrained environment could configure
lucene to be careful about how many files it keeps open at any given
time.

Steven J. Owens
[EMAIL PROTECTED]





Re: rc4 and FileNotFoundException: an update

2002-04-27 Thread Steven J. Owens


On Fri, Apr 26, 2002 at 07:05:23PM +0200, petite_abeille wrote:
 I guess it's really not my day...
 [...]
 Well, it's pretty ugly. Whatever I'm doing with Lucene in the previous 
 package (com.lucene) is magnified many folds in rc4. After processing a 
 paltry 16 objects I got:
 
 SZFinder.findObjectsWithSpecificationInStore: 
 java.io.FileNotFoundException: _2.f14 (Too many open files)

 Sounds like a pretty nasty situation.  

 One suggestion I have for you is that Doug is usually very
helpful with problems like this IF you can first narrow down what is
happening to the point that you can post a clear, specific, isolated
test that consistently causes the problem to happen.  This makes sense
- any effort to solve the problem will first involve isolating the
bug, and that's a task you're best suited for, since you know your
system best.

 So maybe your best approach would be to take a copy of your
system as above, and start gradually stripping out stuff, testing
between each run, until you have most of the application-specific
stuff removed, but the problem is still recurring consistently.
Then post your code and ask if some of the more lucene-knowledgeable
folks can take a look.

 Re: index integrity, I agree that it would be really, really nice
to have some sort of sanity check.  I have yet to actually get into
the internals of the index, but I'd guess that there must be some sort
of at least superficial check, maybe some sort of format check.  

 If I was going to kludge something together, the first approach
I'd take would be to just open the index and roll through all of the
Documents in it, accessing all of the fields (or maybe just a few main
fields per Document).  I'm not sure what I'd *do* with the field
values (printing them out to the screen might take a while), other
than perhaps checking for nulls.  But I suspect that if the code gets
through that without causing an exception or getting null values,
then at least the index's internal format is intact.  Maybe the test
code could save the number of lucene Document objects in the index in
between checks (and, of course, update this number when you add or
remove documents), and make sure it still has the right number of
documents.

 As for repairing an index, I think that's working sort of against
the grain of Lucene.  In your case, it sounds like rebuilding the
index is important, because you're using Lucene as a data store.  I
have some similar issues myself in some things I want to build (I end
up wanting both a data store and a search index; ultimately I've ended
up choosing to have a separate data store for the extra data).  But
Lucene is a search index, meant to be used more in a cache-like style,
so there's an underlying assumption that the original data is always
around to reindex.  Thus, repairing an index is less important, since
it is assumed you can always rebuild it.  

 I don't know much of the theories behind data store systems.  It
occurs to me that using Lucene as a data store, you'll always be
working against the grain, always swimming upstream.  Maybe it'd be a
better idea to figure out some way to use Lucene as the indexing
technology in a data store, the way traditional RDBMSes use indexes,
for speeding access.  

 Or possibly you should look at Xindice (http://xml.apache.org/xindice/)
which is an XML database.  You might find it easier to adapt that to your
needs.  I'm kind of curious as to how fast Xindice's XPath execution is, and
what their indexing is based on - there might be a use for Lucene there.

Steven J. Owens
[EMAIL PROTECTED]





Re: Relevance Feedback

2002-03-29 Thread Steven J. Owens

On Fri, Mar 29, 2002 at 12:11:03PM -0800, Joshua O'Madadhain wrote:
 On Fri, 29 Mar 2002, Nathan G. Freier wrote:
  I'm just beginning to plan out some mechanisms for query expansion
  and relevance feedback.
 
 I've also been doing research in IR using the Lucene API.
 [...]
 If you'd like to discuss this offline (since we may be getting off the
 list topic), feel free to email me.

 I'm curious about this topic, although I have absolutely no
familiarity with it (beyond reading, many years ago, about the real
estate browsing UI experiment where they let users click on
inappropriate listings and refined the search based on that - a
feature I've often wished for with web search engines).

 If you could either include me in the CC list, or send me a
summary, or possibly (if others are also interested), continue the
discussion here, I'd appreciate it.

Steven J. Owens
[EMAIL PROTECTED]






Re: Question on the FAQ list with filters

2002-03-27 Thread Steven J. Owens

On Wed, Mar 27, 2002 at 03:52:21PM -0600, Armbrust, Daniel C. wrote:
 From the FAQ:
 16. What is filtering and how is it performed ?
 * Search Query - in this approach, provide your custom filter object to the
 when you call the search() method. This filter will be called exactly once
 to evaluate every document that resulted in non zero score.
 * Selective Collection - in this approach you perform the regular search and
 when you get back the hit list, collect only those that matches your
 filtering criteria. In this approach, your filter is called only for hits
 that returned by the search method which may be only a subset of the non
 zero matches (useful when evaluating your search filter is expensive). 
 
 ***
 
 I don't see why the second way is useful.  Yes, your filter is called only
 for hits that got returned by the search method, but aren't those the same
 hits that the search() method would run through the filter?  Maybe I'm just
 not reading it close enough.
 
 Is my assumption that it is faster to provide a filter to the search()
 method, than to do a selective collation correct?  

 It Depends.  That's more or less the point of the FAQ answer,
though it could be more clearly expressed.  The gist of the FAQ seems
to be that you can either do the filtering BEFORE you do the search,
or AFTER you do the search.

 Obviously the question is, which is more expensive, filtering out
inappropriate documents, or searching for the possible hits?  If
filtering is cheaper, you do the filtering first, then do the search.
If filtering is expensive, you do the search first, then do the
filtering.  You should also factor in which is more restrictive - will
either the filter or the search drop out a large number of the
documents?  If you can arrange it so one is both cheaper and drops out
the majority of the documents, you win.

 In either case, you implement some sort of object which you can
hand an org.apache.lucene.index.TermDocs and get back a yes or no as to
whether it's a valid possible search result.

 From looking at the source for:

 org.apache.lucene.search.Filter,
 org.apache.lucene.search.DateFilter, and
 org.apache.lucene.search.IndexSearcher, 

 ...it appears that you instantiate your Filter subclass, then for
filtering BEFORE the search, you pass YourFilter an IndexReader and
get back a BitSet.  Or more to the point, when you invoke
IndexSearcher.search(), you pass it YourFilter and a HitCollector,
and IndexSearcher.search() gets the BitSet from YourFilter.

 A BitSet, from the JDK API, is a vector of bit values (i.e. 1 or
0, corresponding to the java boolean values true and false).

 It appears, from looking at the source, that each bit in the
BitSet corresponds to the document at the same sequential position
(document number) in the index.  IndexSearcher.search() has an inner
class (this is a bit ambiguous and it's been a year since I've looked
at inner classes, so I'm going to just handwave and move along :-)
with a collect() method that is called for the matching documents and
skips the ones for which BitSet.get() returns false.
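
 The mechanism can be illustrated with nothing but java.util.BitSet.
This is a stand-alone sketch of the bit-per-document idea, not Lucene's
actual IndexSearcher code; the class name, the method and the int[] of
matching document numbers are stand-ins invented for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class BitSetFilterDemo {
    // matches: document numbers the query matched.
    // allowed: one bit per document number; a set bit means "keep".
    static List<Integer> collect(int[] matches, BitSet allowed) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : matches) {
            if (!allowed.get(doc)) continue;   // filtered out, skip it
            kept.add(doc);
        }
        return kept;
    }
}
```

 The point is that the per-hit check is a single bit lookup; whatever
expensive work the filter does happens once, up front, when the BitSet
is built.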

 I'm not sure exactly how you would use an
org.apache.lucene.search.Filter to do the filtering AFTER, but
presumably that would involve just handing it the TermDocs in
question, or maybe IndexReader and Hits both implement a common
interface... uhm, no, that's not it.  Well, I guess you use your own
class for the filter.  That's what I ended up doing anyway, in my
ignorance of the Filter abstract class.  I ended up doing my filtering
AFTER, btw, because it involved some expensive lookups in other
documents.

 There's actually a third option, figure out a way to implement
your filter as an additional boolean phrase on your search.  However,
that may or may not be feasible, or the Lucene Filter mechanism may
not have been intended to address such cases.  

 To be honest, the design of the Filter seems less
well-thought-out than the rest of Lucene, like it's an afterthought.
I really oughta join the developers list, I guess, so I can put my
money where my mouth is, and submit changes to clarify the docs, etc,
when I go roaming through the source.

Steven J. Owens
[EMAIL PROTECTED]





Re: corrupted index

2002-03-16 Thread Steven J. Owens

Otis,

 You can remove the .lock file and try re-indexing or continuing
 indexing where you left off.
 I am not sure about the corrupt index.  I have never seen it happen,
 and I believe I recall reading some messages from Doug Cutting saying
 that index should never be left in an inconsistent state.  

 Obviously never should be, but if something's pulling the rug
out from under his JRE, changes could be only partially written,
right?  

 Or is the writing format in some sense transactionally safe?
I've never worked directly on something like this, but I worked at a
database software company where they used transaction semantics and a
journaling scheme to fake a bulletproof file system.  Is this how
the index-writing code is implemented?

 In general, I can guess Doug's response - just torch the old
index directory and rebuild it; Lucene's indexing is fast enough that
you don't need to get clever.  This seems to be Doug's stance in
general (i.e. don't get fancy, I already put all the fanciness you'll
need into extremely fast indexing and searching).  So far, it seems
to work :-).

 I could be making this up, though, so I suggest you search through
 lucene-user and lucene-dev archives on www.mail-archive.com.
 A search for corrupt should do it.
 Once you figure things out maybe you can post a summary here.

 I got a little curious, so I went and did the searches.  There is
exactly one message in each list archive (dev and users) with the
keyword "corrupt" in it.  The lucene-users instance is irrelevant:

http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00557.html

 The lucene-dev instance is more useful:

http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00157.html

 It's a post from Doug, dated Sept 27, 2001, about adding not just
thread-safety but process-safety:

  It should be impossible to corrupt an index through the Lucene API.
  However if a Lucene process exits unexpectedly it can leave the index
  locked.  The remedy is simply to, at a time when it is certain that no
  processes are accessing the index, remove all lock files.
  
 So it sounds like it's worth trying just removing the lock files.
Hm, is there a way to come up with a sanity check you can run on an
index to make sure it's not corrupted?  This might be an excellent
thing to reassure yourself with: something went wrong?  Run a sanity
check, if it fails just reindex.

Steven J. Owens
[EMAIL PROTECTED]





Re: Indexing Dynamically Generated Web Pages

2002-02-20 Thread Steven J. Owens

Paul Friedman [EMAIL PROTECTED] asks:
 I am a novice developer researching Lucene for use on a web site
 that primarily uses JSPs.  How do you index dynamically generated
 web pages with Lucene?  Or is it even possible?
 The JSPs themselves don't have searchable data, only methods to get
 that data.  When parsing these files for indexing, the necessary
 data for the search would not yet be in the page.

 Lucene doesn't do any crawling for you - it's your responsibility
to write the code that crawls your website, i.e. selects the data
sources to be indexed, chooses how to convert them to Lucene
Documents, and adds them to the indexes.  I suspect you're assuming
you would use the demo application.  That would not be appropriate for
your project.

 Instead, what you should do is:

1) write some java code that walks through your data source - i.e. if
your data source is a database, it would select each row in the
appropriate table - and composes a Lucene Document, and adds it to
your index.

2) write some servlets/jsps that do the search, and when it gets to
the point where you would redirect the user's browser off to the
appropriate HTML page, you either:

 a) invoke the appropriate JSP in your site with appropriate
arguments

 or

 b) compose a page dynamically, roughly the same way the JSP page
would compose it, and return it to the user.

 or

 c) redirect the user to a new JSP page, which you will have to
write, which looks much like your old JSP page, but is
designed to be invoked with an argument specifying what
data to load.

 You may find http://darksleep.com/puff/lucene/lucene.html useful
in further clarifying this.  There is also a getting started article
at http://jakarta.apache.org/lucene/docs/index.html that may be of
some use.

Steven J. Owens
[EMAIL PROTECTED]





Re: excluding files / refining search

2002-02-14 Thread Steven J. Owens

Brian Rook [EMAIL PROTECTED] writes:
 The site I'm working on has a lot of small html files that are used for page
 construction (nav bars, footers, etc) and they're being returned high in the
 results because they contain the search term(s) I'm looking for and are
 small so they rank higher than larger documents.
 
 I want to exclude them from the index and I've come up with two ideas:
 
 1) move them to a directory, which I will exclude from the index, but I'll
 have to change a bunch of links
 
 2) detect them with some sort of flag and exclude them from the index.  We
 were thinking that we could have a fake tag that lucene would detect and not
 index those pages.

Why not just have an exclude list of some sort?  In the code you
wrote to select files for indexing, just have it check against a list
of files you want to exclude.  In the demo application, you would edit

 jakarta-lucene/src/demo/org/apache/lucene/demo/IndexFiles.java


 The quick and dirty method would be to edit this section of code:

   public static void indexDocs(IndexWriter writer, File file)
       throws Exception {
     if (file.isDirectory()) {
       String[] files = file.list();
       for (int i = 0; i < files.length; i++)
         indexDocs(writer, new File(file, files[i]));
     } else {
       System.out.println("adding " + file);
       writer.addDocument(FileDocument.Document(file));
     }
   }
 
 
 To something like this:
 
 
   public static void indexDocs(IndexWriter writer, File file)
       throws Exception {
     if (file.isDirectory()) {
       String[] files = file.list();
       for (int i = 0; i < files.length; i++)
         indexDocs(writer, new File(file, files[i]));
     } else {
       // checkFileName() returns true when the file should be indexed
       if (!checkFileName(file)) {
         System.out.println("skipping " + file);
       } else {
         System.out.println("adding " + file);
         writer.addDocument(FileDocument.Document(file));
       }
     }
   }

   public static boolean checkFileName(File file) {
     String name = file.getName();
     // compare with equals(); == would compare object references
     if (name.equals("footer.html") ||
         name.equals("header.html") ||
         name.equals("menu.html") ||
         name.equals("navbar.html")) {
       return false;
     }
     return true;
   }
 
 
 A more realistic implementation would use an exclude file of
filenames to ignore, load them into a collection (probably a HashSet)
and keep that collection around as an instance variable.  Then
checkFileName() just returns !excludedSet.contains(name).
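
 A minimal sketch of that exclude-file approach, assuming one
filename per line in the exclude file.  The class name ExcludeList,
the '#' comment convention and the sample filenames are illustrative
assumptions, not part of the demo application.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; names are illustrative only.
class ExcludeList {
    private final Set<String> excludedSet = new HashSet<String>();

    // Reads one filename per line; blank lines and lines starting with '#' are ignored.
    ExcludeList(Reader source) {
        try {
            BufferedReader in = new BufferedReader(source);
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.length() > 0 && !line.startsWith("#")) {
                    excludedSet.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if the file should be indexed, false if it is on the exclude list.
    boolean checkFileName(File file) {
        return !excludedSet.contains(file.getName());
    }

    public static void main(String[] args) {
        ExcludeList list = new ExcludeList(new StringReader("footer.html\nheader.html\n"));
        System.out.println(list.checkFileName(new File("site/footer.html"))); // prints false
        System.out.println(list.checkFileName(new File("site/index.html")));  // prints true
    }
}
```

 In a real indexer you would pass a FileReader for the exclude file
instead of the StringReader, build the ExcludeList once before the
crawl starts, and call checkFileName() from indexDocs().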

Steven J. Owens
[EMAIL PROTECTED]





eyebrowse database dependency?

2001-12-13 Thread Steven J. Owens

Hi folks,

  I've been looking at Eyebrowse, an email archive indexing
program built on top of Lucene.  Eyebrowse also uses MySQL to store
database tables about messages (tables containing mbox filenames,
message locations, authors, subjects, threads, date ranges, etc).
When I was thinking about designing a similar store, I expected that
most of the details could be stored as Lucene fields.  Some of them
might be a bit of a stretch, and certainly there would have to be at
least a few non-Lucene details stored elsewhere (like the most recent
mbox file and the location of the end of the most recently indexed
message), but those seem, to my inexperienced eye, tolerable to avoid
the overhead of an entire database added to the system.

 I believe one of the eyebrowse developers is a member here.  I'm
curious as to the design factors that influenced choosing to use a
database in addition to Lucene itself.

Steven J. Owens
[EMAIL PROTECTED]





Re: OutOfMemoryError

2001-11-29 Thread Steven J. Owens

Chantal,
 For what I know now, I had a bug in my own code. still I don't understand 
 where these OutOfMemoryErrors came from. I will try to index again in one 
 thread without RAMDirectory just to check if the program is sane.

 Java often has misleading error messages.  For example, on
Solaris machines the default ulimit used to be 24 - that's 24 open
file handles!  Yeesh.  Running out of file handles can show up as an
OutOfMemoryError.  So don't assume it's actually a memory problem,
particularly if a memory problem doesn't make sense.  Just a thought.

Steven J. Owens
[EMAIL PROTECTED]





Re: OutOfMemoryError

2001-11-29 Thread Steven J. Owens

I wrote:
   Java often has misleading error messages.  For example, on
  solaris machines the default ulimit used to be 24 - that's 24 open
  file handles!  Yeesh. This will cause an OutOfMemoryError.  So don't

Jeff Trent replied:
 Wow.  I did not know that!
 
 I also don't see an option to increase that limit from java -X.  Do you know
 how to increase that limit?

 That's "used to be"; I think it's larger on newer machines.  I
don't think there's a java command line option to set this; it's a
system limit.  The Solaris command to check it is ulimit.  To set the
open file limit for a given login shell (assuming sufficient
privileges) use ulimit -n number (e.g. ulimit -n 128).  Without the -n
option, most shells treat the number as a file size limit instead.
ulimit -a prints out all limits.
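
 The limit can also be inspected from inside a running JVM.  This
is a minimal sketch that assumes a current HotSpot/OpenJDK-style JVM
on a Unix system, where the operating system MXBean implements
com.sun.management.UnixOperatingSystemMXBean; that API is much newer
than the JVMs discussed in this thread.  On other JVMs or platforms
the method simply returns null.  The class name FileHandleCheck is
illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper; reports file descriptor usage where the JVM exposes it.
class FileHandleCheck {
    // Returns {open, max} file descriptor counts, or null if this JVM does not expose them.
    static long[] descriptorCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null;
    }

    public static void main(String[] args) {
        long[] counts = descriptorCounts();
        if (counts == null) {
            System.out.println("file descriptor counts are not available on this JVM");
        } else {
            System.out.println("open file descriptors: " + counts[0] + " of " + counts[1]);
        }
    }
}
```

 Logging these two numbers just before the point of failure is a
cheap way to tell a file handle shortage from a real memory shortage.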

Steven J. Owens
[EMAIL PROTECTED]







Re: Tutorial

2001-10-31 Thread Steven J. Owens

 Is anyone else out there creating a tutorial. I would be willing to 
 compile and coordinate all the different parts for the people who want 
 to contribute.

 I wrote something up a few months ago, and posted it to the list.
I've been meaning to get back to it, edit it again, spiff it up a bit,
maybe come up with some updated code for the examples written on my own
time (instead of my employer's) so I can publish it, etc.  But right now,
the current draft is at:

http://darksleep.com/puff/lucene.html

Steven J. Owens
[EMAIL PROTECTED]





Re: Context specific summary with the search term

2001-10-26 Thread Steven J. Owens

Lee Mallabone wrote:
 Okay, I'm now not entirely certain how useful a generic solution will be
 to me, given the non-generic nature of the content I'm indexing. I think
 there a lot of optomizations I can make that wouldn't be generic.

 Early optimization is the root of all evil.  

 Seriously, though, one thing I see Doug say often is that
Lucene's indexing and searching are designed to be extremely fast.  He
often responds to questions about odd details - for example, the
classic "do a search and cache the search results for paging across
multiple web pages" - by saying to just use the brute force approach
and rely on the speed of the Lucene index.
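
 For the paging example, the brute force approach amounts to
re-running the search on every page request and slicing out the hits
for that page, with nothing cached between requests.  A minimal
sketch, in which the Search interface is a hypothetical stand-in for
whatever code runs the Lucene query (it is not a Lucene type) and
hits are plain strings:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: names and types here are assumptions, not Lucene API.
class BruteForcePaging {
    // Hypothetical stand-in for the code that runs the query against the index.
    interface Search {
        List<String> run(String query);
    }

    // Re-runs the whole search and returns only the requested page.
    // pageNumber is zero-based and assumed to be non-negative.
    static List<String> page(Search search, String query, int pageNumber, int pageSize) {
        List<String> hits = search.run(query);
        int from = Math.min(pageNumber * pageSize, hits.size());
        int to = Math.min(from + pageSize, hits.size());
        return new ArrayList<String>(hits.subList(from, to));
    }

    public static void main(String[] args) {
        Search fake = query -> Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(page(fake, "anything", 0, 2)); // prints [a, b]
        System.out.println(page(fake, "anything", 2, 2)); // prints [e]
    }
}
```

 The only state carried between requests is the query string and
the page number, so there is no cache to expire or invalidate.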

 I like to say, I assume that there are people out there with a
lot more on the ball than me about things like optimization.  I try to
use their brains as much as possible :-).  For example, with
compilers, I assume the compiler writer knew a lot more about
optimization than I do.  People talk about the compiler not having the
human judgement to know what's best.  That's true, but the way to deal
with that is not to try to hand-optimize my code and outguess the
compiler (which will only confuse the compiler and prevent
it from doing what it was designed to do).  The compiler can best
optimize the program if I focus on making it clear what my intent is,
what the program is meant to do, in the structure of the code first.

 This leads to another optimization slogan that I remember reading
- algorithmic optimization is much better than spot optimization.  In
other words, before you try to figure out a faster way to do
something, figure out if you're doing the thing that accomplishes your
true goal in the fastest way.  And figure out how important that thing
is in the grand scheme of things.

Steven J. Owens
[EMAIL PROTECTED]



Re: new Lucene release: 1.2 RC2

2001-10-22 Thread Steven J. Owens

Sunil,

 Two weeks back I did have the problem which I stated.
 But I am unable to reproduce the results currently. I tested and retested
 but couldnt repeat the same.
 Doug have U guys fixed the issue long back itself ?
 (The only thing I have done fresh is to download the latest
 lucene-1.2-rc1.zip file and re-installed lucene  - since it came along with
 source code)

 I believe what Sunil was trying to describe was:

a) successfully created index
b) did searches on index
c) started to update index
d) clicked exit from app before updating completed
e) ran app again 
f) could not repeat searches from step b.

 Doug has recently mentioned several improvements in the past
month or so.  I'm looking forward to moving over to the new version,
which is among other things thread safe.  This sounds like the area
where you were probably having problems, which means it's likely that
Doug's changes fixed it.

 This is, by the way, why it's important to report problems
early, ideally along with test data and code to reproduce it, or at
least detailed descriptions of how to reproduce it...

Steven J. Owens
[EMAIL PROTECTED]



Re: Index Optimization: Which is Better?

2001-10-11 Thread Steven J. Owens

Doug wrote:

 I'm having trouble getting a clear picture of your indexing scheme.

 I've been doing a lot of thinking about this same problem, so I
may be a little more in tune with what Elliot's saying.  By the way,
Elliot, I'm very interested in your results.  I considered the basic
approach you're using, but I thought it was a bit extreme in terms of
having zillions of tiny lucene Documents.  I'm working on a quick
kludge that may serve my immediate purposes (if it does, I'm planning
to post the details here).
 
 Could you provide some simple examples, e.g., for the xml:

   <tag1>this is some text
     <tag2>and some other text</tag2>
   </tag1>
 would you have something like the following?
   doc1
     node_type: tag1
     contents: this is some text
   doc2
     node_type: tag2
     contents: and some other text
   doc3
     node_type: all_contents
     contents: this is some text and some other text

 I think that's exactly what Elliot is intending.
 

 My first instinct would be to have something like:
   doc1
     tag1: this is some text
     tag2: and some other text
     all-tags: this is some text and some other text
 What do you need that that does not achieve?

 Name collision - you can have multiple Elements at different
levels, and you may have attributes and tags having the same name.
Obviously one way around this is "Don't do that", but that could get
really tiresome, quickly.

 If you just conflate the elements and attributes under the same
name (i.e. field blah contains a concatenated set of values from all
occurrences of both elements and attributes) then your searches become
much more limited in what you can specify.  This is, by the way, the
approach I'm trying out, with a second stage to refine the results and
drop out false positives.  But I'll have to wait on saying any more
about that.
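
 To make the conflating approach concrete, here is a minimal
sketch using only the JDK's DOM parser.  It walks an arbitrary XML
document and builds one field per name, joining the values from every
element and attribute that shares that name.  The class name
XmlFieldFlattener and the choice to join values with a single space
are assumptions for illustration; the resulting map stands in for the
fields of one Lucene Document and is not Lucene API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Illustrative only: flattens XML into name -> concatenated text.
class XmlFieldFlattener {
    // Parses the XML and returns field name -> values joined by spaces.
    static Map<String, String> flatten(String xml) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            collect(root, fields);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return fields;
    }

    // Adds this element's attributes and direct text, then recurses into child elements.
    private static void collect(Element element, Map<String, String> fields) {
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            append(fields, attribute.getNodeName(), attribute.getNodeValue());
        }
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                append(fields, element.getTagName(), child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect((Element) child, fields);
            }
        }
    }

    // Appends non-blank text to the named field, separated by a space.
    private static void append(Map<String, String> fields, String name, String value) {
        String text = value.trim();
        if (text.length() == 0) {
            return;
        }
        String existing = fields.get(name);
        fields.put(name, existing == null ? text : existing + " " + text);
    }

    public static void main(String[] args) {
        Map<String, String> fields = flatten(
            "<doc title=\"intro\"><title>Hello</title><body>some text</body></doc>");
        System.out.println(fields.get("title")); // prints intro Hello
        System.out.println(fields.get("body"));  // prints some text
    }
}
```

 Note that the title attribute and the title element land in the
same field.  A search on that field can no longer tell them apart,
which is why a second pass is needed to drop false positives.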

 All of this, of course, is in the context of having arbitrary XML
documents.  If you have predefined XML schemas then you can hand-code
the mappings from elements to lucene document fields.  But then you
trade a heck of a lot of flexibility for a lot of maintenance.

Steven J. Owens
[EMAIL PROTECTED]