Re: Zip Files

2005-03-01 Thread Chris Lamprecht
Luke,

Look at the javadocs for java.io.ByteArrayInputStream - it wraps a
byte array and makes it accessible as an InputStream.  Also see
java.util.zip.ZipFile.  You should be able to read and parse all
contents of the zip file in memory.

http://java.sun.com/j2se/1.4.2/docs/api/java/io/ByteArrayInputStream.html
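
For illustration, here is a minimal sketch of reading a zip entirely in
memory, using the ZipInputStream mentioned below (the class name here is
made up):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class InMemoryZip {
    // iterate the entries of a zip held in a byte array, all in memory
    public static void list(byte[] zipBytes) throws IOException {
        ZipInputStream zis =
            new ZipInputStream(new ByteArrayInputStream(zipBytes));
        try {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                System.out.println(entry.getName());
                // read this entry's bytes from zis here and hand
                // them to the right parser, then move on
                zis.closeEntry();
            }
        } finally {
            zis.close();
        }
    }
}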


On Tue, 1 Mar 2005 12:39:17 -0500, Luke Shannon
[EMAIL PROTECTED] wrote:
 Thanks Ernesto.
 
 I'm struggling with how I can work with an "array of bytes" instead of a
 Java File.
 
 It would be easier to unzip the zip to a temp directory, parse the files, and
 then delete the directory. But this would greatly slow indexing and use up
 disk space.
 
 Luke
 
 - Original Message -
 From: Ernesto De Santis [EMAIL PROTECTED]
 To: Lucene Users List lucene-user@jakarta.apache.org
 Sent: Tuesday, March 01, 2005 10:48 AM
 Subject: Re: Zip Files
 
  Hello
 
  first, you need a parser for each file type: pdf, txt, word, etc.
  and use a java api to iterate zip content, see:
 
  http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html
 
  use the getNextEntry() method
 
  a little example:
 
  ZipInputStream zis = new ZipInputStream(fileInputStream);
  ZipEntry zipEntry;
  while ((zipEntry = zis.getNextEntry()) != null) {
      // use zipEntry to get the name, etc.
      // pick the proper parser for the current entry
      // use that parser with zis (the ZipInputStream)
  }
 
  good luck
  Ernesto
 
  Luke Shannon escribió:
 
  Hello;
  
  Anyone have an ideas on how to index the contents within zip files?
  
  Thanks,
  
  Luke
  
  
  --
  Ernesto De Santis - Colaborativa.net
  Córdoba 1147 Piso 6 Oficinas 3 y 4
  (S2000AWO) Rosario, SF, Argentina.
 
 
 



Re: Search Performance

2005-02-18 Thread Chris Lamprecht
I should have mentioned, the reason for not doing this the obvious,
simple way (just close the Searcher and reopen it if a new version is
available) is that some threads could be in the middle of iterating
through the search Hits.  If you close the Searcher, they get a "Bad
file descriptor" IOException.  As I found out the hard way :)


On Fri, 18 Feb 2005 15:03:29 -0600, Chris Lamprecht
[EMAIL PROTECTED] wrote:
 I recently dealt with the issue of re-using a Searcher with an index
 that changes often.  I wrote a class that allows my searching classes
 to "check out" a Lucene Searcher, perform a search, and then "return"
 the Searcher.  It's similar to a database connection pool, except that
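
A minimal sketch of that check-out/return idea (a hypothetical class
against the Lucene 1.4-era API, not the actual code): each searcher
carries a reference count, so an old searcher is closed only after the
last in-flight search returns it.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.IndexSearcher;

public class SearcherPool {
    private IndexSearcher current;
    private final Map refCounts = new HashMap();  // IndexSearcher -> Integer

    public SearcherPool(String indexPath) throws IOException {
        current = new IndexSearcher(indexPath);
        refCounts.put(current, new Integer(1));    // the pool's own reference
    }

    // borrow the current searcher for one search
    public synchronized IndexSearcher checkOut() {
        bump(current, +1);
        return current;
    }

    // return a searcher; the last user of a replaced searcher closes it
    public synchronized void checkIn(IndexSearcher s) throws IOException {
        if (bump(s, -1) == 0) s.close();
    }

    // swap in a searcher over the new index version
    public synchronized void reopen(String indexPath) throws IOException {
        IndexSearcher old = current;
        current = new IndexSearcher(indexPath);
        refCounts.put(current, new Integer(1));
        checkIn(old);                              // drop the pool's reference
    }

    private int bump(IndexSearcher s, int delta) {
        int count = ((Integer) refCounts.get(s)).intValue() + delta;
        if (count == 0) refCounts.remove(s);
        else refCounts.put(s, new Integer(count));
        return count;
    }
}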




Re: Search Performance

2005-02-18 Thread Chris Lamprecht
Wouldn't this leave open file handles?   I had a problem where there
were lots of open file handles for deleted index files, because the
old searchers were not being closed.

On Fri, 18 Feb 2005 13:41:37 -0800 (PST), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 Or you could just open a new IndexSearcher, forget the old one, and
 have GC collect it when everyone is done with it.
 
 Otis





Re: Subversion conversion

2005-02-02 Thread Chris Lamprecht
One thing about Subversion branches (from "Key Concepts Behind
Branches" in chapter 4 of the Subversion book):

2. Subversion has no internal concept of a branch; only copies. When
you copy a directory, the resulting directory is only a "branch"
because you attach that meaning to it. You may think of the directory
differently, or treat it differently, but to Subversion it's just an
ordinary directory that happens to have been created by copying.


On Wed, 2 Feb 2005 19:49:53 -0500, Chakra Yadavalli
[EMAIL PROTECTED] wrote:
 Hello ALL, it might not be the right place for it, but as we are talking
 about SCM, I have a quick question. First, I haven't used CVS/SVN on any
 project; I am a ClearCase/PVCS guy. I just would like to know WHICH
 CONFIGURATION MANAGEMENT PLAN YOU FOLLOW IN LUCENE DEVELOPMENT.
 
 PLAN A: DEVELOP IN TRUNK AND BRANCH OFF ON RELEASE
 Recently I had a discussion with a friend about developing in the TRUNK
 (which is /main in ClearCase speak), which my friend claims is how it is
 done in the APACHE/Open Source projects. The main advantage he pointed
 out was that merging could be avoided if you are developing in the TRUNK.
 And when there is a release, they create a new branch (say a LUCENE_1.5
 branch) and label it. That branch will be used for maintenance, and any
 code deltas will be merged back to the TRUNK as needed.
 
 PLAN B: BRANCH OFF BEFORE PLANNED RELEASE AND MERGE BACK TO MAIN/TRUNK
 As I am from the "private workspace/isolated development" school of
 thought promoted by ClearCase, I am used to creating a branch at the
 project/release initiation and developing in that branch (say /main/dev).
 Similarly, we have /main/int for making changes when the project goes to
 the integration phase, and a /main/acp branch for acceptance. In this
 school, the /main will always have fewer versions of files, and the
 difference between any two consecutive versions is the NET CHANGE of
 that SCM element (either file or dir) between two releases (say LUCENE
 1.4 and 1.5).
 
 Thanks in advance for your time.
 Chakra Yadavalli
 http://jroller.com/page/cyblogue
 
  -Original Message-
  From: aurora [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, February 02, 2005 4:25 PM
  To: lucene-user@jakarta.apache.org
  Subject: Re: Subversion conversion
 
  Subversion rocks!
 
  I have just set up the Windows svn client TortoiseSVN with my favourite
  file manager, Total Commander 6.5. The svn status and commands are readily
  integrated with the file manager. Offline diff and revert are two things I
  really like about svn.
 
   The conversion to Subversion is complete.  The new repository is
   available to users read-only at:
  
 http://svn.apache.org/repos/asf/lucene/java/trunk
  
   Besides /trunk, there are also /branches and /tags.  /tags contains all
   the CVS tags made so that you could grab a snapshot of a previous
   version.  /trunk is analogous to CVS HEAD.  You can learn more about the
   Apache repository configuration, and how to use the command-line
   client to check out the repository, here:
  
 http://www.apache.org/dev/version-control.html
  
   Learn about Subversion, including the complete O'Reilly Subversion book
   in electronic form for free, here:
  
 http://subversion.tigris.org
  
   For committers, check out the repository using https and your Apache
   username/password.
  
   The Lucene sandbox has been integrated into our single Subversion
   repository, under /java/trunk/sandbox:
  
 http://svn.apache.org/repos/asf/lucene/java/trunk/sandbox/
  
   The Lucene CVS repositories have been locked for read-only.
  
   If there are any issues with this conversion, let me know and I'll bring
   them to the Apache infrastructure group.
  
 Erik
 
 
 
 --
 Visit my weblog: http://www.jroller.com/page/cyblogue
 
 





Re: Adding Fields to Document (with same name)

2005-02-01 Thread Chris Lamprecht
Hi Karl,

From _Lucene in Action_, section 2.2: when you add the same field with
different values, "Internally, Lucene appends all the words together
and indexes them in a single Field ..., allowing you to use any of the
given words when searching."

See also http://www.lucenebook.com/search?query=appendable+fields
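
A quick way to see this (a sketch against the Lucene 1.4-era API; the
field name and values are just examples):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class RepeatedFieldDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("bla", "this is my first text"));
        doc.add(Field.Text("bla", "this is my second text"));
        writer.addDocument(doc);
        writer.close();

        // words from either value match, as if the values were appended
        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(
            QueryParser.parse("first second", "bla", new StandardAnalyzer()));
        System.out.println(hits.length());  // prints 1
        searcher.close();
    }
}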

-chris

On Tue, 1 Feb 2005 11:42:23 +0100 (MET), [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
 Hi,
 
 what happens when I add two fields with the same name to one Document?
 
 Document doc = new Document();
 doc.add(Field.Text("bla", "this is my first text"));
 doc.add(Field.Text("bla", "this is my second text"));
 
 Will the second text overwrite the first, because only one field can be held
 with the same name in one document?
 
 Will the first and the second text be merged when I search in the field
 "bla" (e.g. with the query bla:text)?
 
 I am working on XML indexing and did not get an error when having repeated
 XML fields. Now I am wondering...
 
 Karl
 
 
 





Re: Searching with words that contain % , / and the like

2005-01-27 Thread Chris Lamprecht
Without looking at the source, my guess is that StandardAnalyzer (and
StandardTokenizer) is the culprit.  The StandardAnalyzer grammar (in
StandardTokenizer.jj) is probably defined so "x/y" parses into two
tokens, "x" and "y".  "s" is a default stopword (see
StopAnalyzer.ENGLISH_STOP_WORDS), so it gets filtered out, while "p"
does not.

To get what you want, you can use a WhitespaceAnalyzer, write your own
custom Analyzer or Tokenizer, or modify the StandardTokenizer.jj
grammar to suit your needs.  WhitespaceAnalyzer is much simpler than
StandardAnalyzer, so you may see some other things being tokenized
differently.
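
A quick way to check what an analyzer does to your input before you
ever get to searching (a sketch, Lucene 1.4-era API; the expected
output in the comments follows the behavior described above):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ShowTokens {
    public static void show(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.print("[" + t.termText() + "] ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        show(new StandardAnalyzer(), "test/s");    // [test]  ("s" is a stopword)
        show(new StandardAnalyzer(), "test/p");    // [test] [p]
        show(new WhitespaceAnalyzer(), "test/s");  // [test/s]  (kept whole)
    }
}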

-Chris

On Thu, 27 Jan 2005 12:12:16 +0530, Robinson Raju
[EMAIL PROTECTED] wrote:
 Hi ,
 
 Is there a way to search for words that contain "/" or "%"?
 If my query is "test/s", it is just taken as "test".
 If my query is "test/p", it is just taken as "test p".
 Has anyone done this / faced such an issue?
 
 Regards
 Robin
 
 





Re: Reloading an index

2005-01-27 Thread Chris Lamprecht
I just ran into a similar issue.  When you close an IndexSearcher, it
doesn't necessarily close the underlying IndexReader.  It depends
which constructor you used to create the IndexSearcher.  See the
constructors' javadocs or source for the details.
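
Roughly, the difference looks like this (a sketch, Lucene 1.4-era API;
the paths are hypothetical):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

// case 1: the searcher opened its own reader from a path,
// so close() releases that reader too
IndexSearcher s1 = new IndexSearcher("/path/to/index");
s1.close();      // underlying reader closed as well

// case 2: the searcher wraps a reader you opened yourself,
// so close() leaves that reader (and its file handles) open
IndexReader reader = IndexReader.open("/path/to/index");
IndexSearcher s2 = new IndexSearcher(reader);
s2.close();      // reader is still open here
reader.close();  // now the file handles are released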

In my case, we were updating and optimizing the index from another
process, and reopening IndexSearchers.  We would eventually run out of
disk space because it was leaving open file handles to deleted files,
so the disk space was never being made available, until the JVM
processes ended.  If you're under Linux, try running the 'lsof'
command to see if there are any handles to files marked '(deleted)'.

-Chris

On Thu, 27 Jan 2005 08:28:30 -0800 (PST), Greg Gershman
[EMAIL PROTECTED] wrote:
 I have an index that is frequently updated.  When
 indexing is completed, an event triggers a new
 Searcher to be opened.  When the new Searcher is
 opened, incoming searches are redirected to the new
 Searcher, the old Searcher is closed and nulled, but I
 still see about twice the amount of memory in use well
 after the original searcher has been closed.   Is
 there something else I can do to get this memory
 reclaimed?  Should I explicitly call garbarge
 collection?  Any ideas?
 
 Thanks.
 
 Greg Gershman
 
 
 





Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Chris Lamprecht
As they say, nothing lasts forever ;)

I like the idea.  If a project like this gets going, I think I'd be
interested in helping.

The Google mini looks very well done (they have two demos on the web
page).  For $5000, it's probably a very good solution for many
businesses.  If the demos are accurate, it seems like you almost
literally plug it in, configure a few things using the web interface,
and you're in business.   Demos are at
http://www.google.com/enterprise/mini/product_tours_demos.html

-chris

On Thu, 27 Jan 2005 17:40:53 -0800 (PST), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 I discuss this with myself a lot inside my head... :)
 Seriously, I agree with Erik.  I think this is a business opportunity.
 How many people are hating me now and going "shh"?  Raise your
 hands!
 
 Otis
 
 --- David Spencer [EMAIL PROTECTED] wrote:
 
  This reminds me, has anyone ever discussed something similar:
 
  - rackmount server (or, for the coolness factor, that Mac mini)
  - web i/f for config/control
 
  - of course the server would have the following s/w:
  -- web server
  -- lucene / nutch
 
  Part of the work here I think is having a decent web i/f to configure
  the thing and to customize the L&F of the search results.
 
 
 
  jian chen wrote:
   Hi,
  
   I was searching using Google and just found that there was a new
   feature called Google Mini. Initially I thought it was another "free"
   service for small companies. Then I realized that it costs quite some
   money ($4,995) for the hardware and software. (I guess the proprietary
   software costs a whole lot more than the actual hardware.)
  
   The "nice" feature is that you can only index up to 50,000 documents
   at this price. If you need to index more, sorry, send in the
   check...
  
   It seems to me that any small biz will be ripped off if they install
   this Google Mini thing, compared to using Lucene to implement an
   easy-to-use search application, which could search up to whatever
   number of documents you could imagine.
  
   I hope the Lucene project could get exposed more to the enterprise so
   that people know that they have not only cheaper but, more importantly,
   BETTER alternatives.
  
   Jian
  
  
  
 
 
 
 
 
 





Re: LUCENE + EXCEPTION

2005-01-24 Thread Chris Lamprecht
Hi Karthik,

If you are talking about SingleThreadModel (i.e. your servlet
implements javax.servlet.SingleThreadModel), this does not guarantee
that two different instances of your servlet won't be run at the same
time.  It only guarantees that each instance of your servlet will only
be run by one thread at a time.  See:

http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/servlet/SingleThreadModel.html

If you are accessing a shared resource (a lucene index), you'll have
to prevent concurrent modifications somehow other than
SingleThreadModel.
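
For example, one simple way (a sketch, not the only way; the index path
is hypothetical) is to funnel every index modification through a single
lock shared by all servlet instances in the webapp:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class IndexUpdater {
    // one lock shared by every servlet instance in this classloader
    private static final Object INDEX_LOCK = new Object();

    public void addDocument(Document doc) throws IOException {
        synchronized (INDEX_LOCK) {
            // false = open the existing index rather than create a new one
            IndexWriter writer = new IndexWriter("/path/to/index",
                                                 new StandardAnalyzer(), false);
            try {
                writer.addDocument(doc);
            } finally {
                writer.close();
            }
        }
    }
}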

I think they've finally deprecated SingleThreadModel in the latest
(maybe not even out yet) servlet spec.

-chris

 
 On STANDALONE usage of UPDATION/DELETION/ADDITION of documents into the
 MergerIndex, my code runs PERFECTLY without any problems.
 
 But when the same code is plugged into a WEBAPP on TOMCAT with a servlet
 running in SINGLE THREAD MODE, I sometimes get the error below:




Re: Stemming

2005-01-21 Thread Chris Lamprecht
Also if you can't wait, see page 2 of
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html

or the LIA e-book ;)
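
The heart of that article's example is just a custom Analyzer; a
minimal version looks roughly like this (a sketch along those lines,
not the article's exact code):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

public class PorterStemAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // lowercase, split on non-letters, then stem each token
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}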

On Fri, 21 Jan 2005 09:27:42 -0500, Kevin L. Cobb
[EMAIL PROTECTED] wrote:
 OK, OK ... I'll buy the book. I guess it's about time, since I am deeply
 and forever in love with Lucene. Might as well take the final plunge.
 
 
 -Original Message-
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
 Sent: Friday, January 21, 2005 9:12 AM
 To: Lucene Users List
 Subject: Re: Stemming
 
 Hi Kevin,
 
 Stemming is an optional operation and is done in the analysis step.
 Lucene comes with a Porter stemmer and a Filter that you can use in an
 Analyzer:
 
 ./src/java/org/apache/lucene/analysis/PorterStemFilter.java
 ./src/java/org/apache/lucene/analysis/PorterStemmer.java
 
 You can find more about it here:
 http://www.lucenebook.com/search?query=stemming
 You can also see mentions of SnowballAnalyzer in those search results,
 and you can find an adapter for SnowballAnalyzers in Lucene Sandbox.
 
 Otis
 
 --- Kevin L. Cobb [EMAIL PROTECTED] wrote:
 
  I want to understand how Lucene uses stemming but can't find any
  documentation on the Lucene site. I'll continue to google but hope
  that
  this list can help narrow my search. I have several questions on the
  subject currently but hesitate to list them here since finding a good
  document on the subject may answer most of them.
 
 
 
  Thanks in advance for any pointers,
 
 
 
  Kevin
 
 
 
 
 
 
 
 





Re: How do I unlock?

2005-01-11 Thread Chris Lamprecht
What about a shutdown hook?
  
Runtime.getRuntime().addShutdownHook(new Thread() {
    public void run() { /* whatever */ }
});

see also http://www.onjava.com/pub/a/onjava/2003/03/26/shutdownhook.html
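
For instance, a hook that force-unlocks an index on shutdown might look
like this (a sketch, Lucene 1.4-era API; only safe when you know no
other process is writing to the index):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

final Directory dir = FSDirectory.getDirectory("/path/to/index", false);
Runtime.getRuntime().addShutdownHook(new Thread() {
    public void run() {
        try {
            if (IndexReader.isLocked(dir)) {
                IndexReader.unlock(dir);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
});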


On Tue, 11 Jan 2005 13:21:42 -0800, Doug Cutting [EMAIL PROTECTED] wrote:
 Joseph Ottinger wrote:
  As one for whom the question's come up recently, I'd say that locks need
  to be terminated gracefully, instead. I've noticed a number of cases where
  the locks get abandoned in exceptional conditions, which is almost exactly
  what you don't want.
 
 The problem is that this is hard to do from Java.  A typical approach is
 to put the process id in the lock file, then, if that process is dead,
 ignore the lock file.  But Java does not let one know process ids.  Java
 1.4 provides a FileLock mechanism which should mostly solve this, but
 Lucene 1.4.3 does not yet require Java 1.4 and hence cannot use that
 feature.  Lucene 2.0 is likely to require Java 1.4 and should be able to
 do a better job of automatically unlocking indexes when processes die.
 
 Doug
 
 





Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-10 Thread Chris Lamprecht
Very cool, thanks for posting this!  

Google's feature doesn't seem to do a search on every keystroke
necessarily.  Instead, it waits until you haven't typed a character
for a short period (I'm guessing about 100 or 150 milliseconds).  So
if you type fast, it doesn't hit the server until you pause.  There
are some more detailed postings on slashdot about how it works.
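
That wait-for-a-pause trick is easy to replicate; a rough sketch with
java.util.Timer (the names are made up, and the 150ms is a guess):

import java.util.Timer;
import java.util.TimerTask;

public class SearchDebouncer {
    private final Timer timer = new Timer(true);  // daemon thread
    private TimerTask pending;

    public synchronized void keyTyped(final String query) {
        if (pending != null) {
            pending.cancel();         // still typing; start the wait over
        }
        pending = new TimerTask() {
            public void run() { doSearch(query); }
        };
        timer.schedule(pending, 150); // fire after 150ms of quiet
    }

    private void doSearch(String query) {
        // run the Lucene search and update the page here
    }
}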

On Fri, 10 Dec 2004 16:36:27 -0800, David Spencer
[EMAIL PROTECTED] wrote:
 
 Google just came out with a page that gives you feedback as to how many
 pages will match your query and variations on it:
 
 http://www.google.com/webhp?complete=1&hl=en
 
 I had an unexposed experiment I had done with Lucene a few months ago
 that this has inspired me to expose - it's not the same, but it's
 similar in that as you type in a query you're given *immediate* feedback
 as to how many pages match.
 
 Try it here: http://www.searchmorph.com/kat/isearch.html
 
 This is my SearchMorph site which has an index of ~90k pages of open
 source javadoc packages.
 
 As you type in a query, on every keystroke it does at least one Lucene
 search to show results in the bottom part of the page.
 
 It also gives spelling corrections (using my NGramSpeller
 contribution) and also suggests popular tokens that start the same way
 as your search query.
 
 For one way to see corrections in action, type in "rollback" character
 by character (don't do a cut and paste).
 
 Note that:
 -- this is not how the Google page works - just similar to it
 -- I do single word suggestions while google does the more useful whole
 phrase suggestions (TBD I'll try to copy them)
 -- They do lots of javascript magic, whereas I use old school frames mostly
 -- this is relatively expensive, as it does 1 query per character, and
 when it's doing spelling correction there is even more work going on
 -- this is just an experiment and the page may be unstable as I fool w/ it
 
 What's nice is when you get used to immediate results, going back to the
 batch way of searching seems backward, slow, and old fashioned.
 
 There are too many idle CPUs in the world - this is one way to keep them
 busier :)
 
 -- Dave
 
 PS Weblog entry updated too:
 http://www.searchmorph.com/weblog/index.php?id=26
 
 





Re: Too many open files issue

2004-11-22 Thread Chris Lamprecht
A useful resource for increasing the number of file handles on various
operating systems is the Volano Report:

http://www.volano.com/report/

 I had requested help on an issue we have been facing with the "Too many
 open files" exception garbling the search indexes and crashing the
 search on the web site.




Re: Considering intermediary solution before Lucene question

2004-11-17 Thread Chris Lamprecht
John,

It actually should be pretty easy to use just the parts of Lucene you
want (the analyzers, etc) without using the rest.  See the example of
the PorterStemmer from this article:

http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2

You could feed a Reader to the tokenStream() method of
PorterStemAnalyzer, and get back a TokenStream, from which you pull
the tokens using the next() method.
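
Putting that together, a minimal sketch (a hypothetical class; it works
with any Analyzer, e.g. the article's PorterStemAnalyzer):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class KeywordExtractor {
    // turn a phrase into a list of analyzed keywords (stemmed,
    // stop words removed -- whatever the given Analyzer does)
    public static List extractKeywords(Analyzer analyzer, String phrase)
            throws IOException {
        List keywords = new ArrayList();
        TokenStream ts = analyzer.tokenStream("contents",
                                              new StringReader(phrase));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            keywords.add(t.termText());
        }
        ts.close();
        return keywords;
    }
}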



On Wed, 17 Nov 2004 18:54:07 -0500, [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
 
 Is there a way to use Lucene stemming and stop word removal without using the
 rest of the tool?  I am downloading the code now, but I imagine the answer
 might be deeply buried.  I would like to be able to send in a phrase and get
 back a collection of keywords if possible.
 
 I am thinking of using an intermediary solution before moving fully to 
 Lucene.  I don't have time to spend a month making a carefully tested, 
 administratable Lucene solution for my site yet, but I intend to do so over 
 time.  Funny thing is, the Lucene code likely would only take up a couple
 hundred lines, but integration and administration would take me much more
 time.
 
 In the meantime, I am thinking I could perhaps use Lucene stemming and parsing
 of words, then stick each search word along with the associated primary key
 in an indexed MySql table.  Each record I would need to do this to is small,
 with maybe only an average of 15 useful words.  I would be able to have an
 in-database solution, though ranking, etc. would not exist.  This is better
 than the exact-word searching I have currently, which is really bad.
 
 By the way, MySql 4.1.1 has some Lucene-type handling, but it too does not
 have stemming, and I am sure it is very slow compared to Lucene.  Cpanel is
 still stuck on MySql 4.0.*, so many people would not have access to even this
 basic ability in production systems for some time yet.
 
 JohnE
 
 





Re: Index Locking Issues Resolved...I hope

2004-11-16 Thread Chris Lamprecht
MySQL does offer a basic fulltext search (with MyISAM tables), but it
doesn't really approach the functionality of Lucene, such as pluggable
tokenizers, stemming, etc.  I think MS SQL server has fulltext search
as well, but I have no idea if it's any good.

See http://www.google.com/search?hl=en&lr=&safe=off&c2coff=1&q=mysql+fulltext

 I have not seen clearly yet because it is all new.  I wish a database Text
 field could have this sort of mechanism built into it.  MySql (what I am
 using) does not do this, but I am going to check into other databases now.
 OJB will work with most all of them, so that would help if there is a
 database type of solution that will allow that sleep-at-night thing to
 happen!!!





How to efficiently get # of search results, per attribute

2004-11-13 Thread Chris Lamprecht
I'd like to implement a search across several types of entities,
let's say, classes, professors, and departments.  I want the user to
be able to enter a simple, single query and not have to specify what
they're looking for.  Then I want the search results to be something
like this:

Search results for: philosophy boyer

Found: 121 classes - 5 professors - 2 departments

search results here...


I know I could iterate through every hit returned and count them up
myself, but that seems inefficient if there are lots of results.  Is
there some other way to get this kind of information from the search
result set?  My other ideas are: doing a separate search for each result
type, or storing different types in different indexes.  Any
suggestions?  Thanks for your help!

-Chris




Re: How to efficiently get # of search results, per attribute

2004-11-13 Thread Chris Lamprecht
Nader and Chuck,

Thanks for the responses, they're both helpful.  My index sizes will
begin on the order of 200,000 classes and 20,000 instructors (and
far fewer departments), and grow over time to maybe a few million
classes.  Compared to some of the numbers I've seen on this mailing
list, my dataset is fairly small.  I think I'll not worry about
performance for now, unless and until it becomes an issue.
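
For reference, the brute-force count (with the capped scan Chuck
describes below) might look like this -- a sketch, assuming each
document stores its entity type in a "type" field:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.Hits;

public class TypeCounter {
    // count hits per entity type, scanning at most the first 1,000
    public static Map countByType(Hits hits) throws IOException {
        Map counts = new HashMap();   // type -> Integer
        int n = Math.min(hits.length(), 1000);
        for (int i = 0; i < n; i++) {
            String type = hits.doc(i).get("type");
            Integer c = (Integer) counts.get(type);
            counts.put(type, new Integer(c == null ? 1 : c.intValue() + 1));
        }
        return counts;
    }
}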

-Chris

On Sat, 13 Nov 2004 15:36:11 -0800, Chuck Williams [EMAIL PROTECTED] wrote:
 My Lucene application includes multi-faceted navigation that does a more
 complex version of the below.  I've got 5 different taxonomies into
 which every indexed item is classified.  The largest of the taxonomies
 has over 15,000 entries while the other 4 are much smaller. For every
 search query, I determine the best small set of nodes from each taxonomy
 to present to the user as drill down options, and provide the counts
 regarding how many results fall under each of these nodes.  At present I
 only have about 25,000 indexed objects and usually no more than 1,000
 results from the initial query.  To determine the drill-down options and
 counts, I scan up to 1,000 results computing the counts for all nodes
 into which these results classify.  Then for each taxonomy I pick the
 best drill-down options available (orthogonal set with reasonable
 branching factor that covers all results) and present them with their
 counts.  If there are more than 1,000 results, I extrapolate the
 computed counts to estimate the actual counts on the entire set of
 results.  This is all done with a single index and a single search.
 
 The total time required for performing this computation for the one
 large taxonomy is under 10ms, running in full debug mode in my ide.  The
 query response time overall is subjectively instantaneous at the UI
 (Google-speed or better).  So, unless some dimension of the problem is
 much bigger than mine, I doubt performance will be an issue.
 
 Chuck
 
 
 
   -Original Message-
   From: Nader Henein [mailto:[EMAIL PROTECTED]
   Sent: Saturday, November 13, 2004 2:29 AM
   To: Lucene Users List
   Subject: Re: How to efficiently get # of search results, per attribute
  
   It depends on how many results they're looking through; here are two
   scenarios I see:
  
   1] If you don't have that many records, you can fetch all the results
   and then do a post-parsing step to determine totals.
  
   2] If you have a lot of entries in each category and you're worried
   about fetching thousands of records every time, you can just have
   separate indices per category and search them in parallel (not
   Lucene parallel search) and you can get up to 100 hits for each one
   (efficiency), but you'll also have the total from the search to
   display.
  
   Either way you can boost up speed using RAMDirectory if you need more
   speed from the search, but whichever approach you choose, I would
   recommend that you sit down and do some number crunching to figure out
   which way to go.
  
  
   Hope this helps
  
   Nader Henein
  
  
  
   Chris Lamprecht wrote:
  
    I'd like to implement a search across several types of entities,
    let's say, classes, professors, and departments.  I want the user to
    be able to enter a simple, single query and not have to specify what
    they're looking for.  Then I want the search results to be something
    like this:
    
    Search results for: philosophy boyer
    
    Found: 121 classes - 5 professors - 2 departments
    
    search results here...
    
    I know I could iterate through every hit returned and count them up
    myself, but that seems inefficient if there are lots of results.  Is
    there some other way to get this kind of information from the search
    result set?  My other ideas are: doing a separate search for each result
    type, or storing different types in different indexes.  Any
    suggestions?  Thanks for your help!
    
    -Chris
   
  
 

