Re: Zip Files

2005-03-01 Thread Chris Lamprecht
Luke,

Look at the javadocs for java.io.ByteArrayInputStream - it wraps a
byte array and makes it accessible as an InputStream.  Also see
java.util.zip.ZipFile.  You should be able to read and parse all
contents of the zip file in memory.

http://java.sun.com/j2se/1.4.2/docs/api/java/io/ByteArrayInputStream.html
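For illustration, here's a minimal sketch of the in-memory approach
(class and variable names are mine, purely hypothetical).  Note that
ZipFile itself wants a file on disk, so ZipInputStream wrapped around a
ByteArrayInputStream is the route when the bytes are already in memory:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class InMemoryZip {
    // Iterate over a zip held entirely in a byte[], without touching disk.
    public static void process(byte[] zipBytes) throws IOException {
        ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes));
        try {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                // zis is now positioned at this entry's uncompressed data;
                // read it into a buffer and hand it to the right parser
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = zis.read(buf)) != -1) {
                    baos.write(buf, 0, n);
                }
                System.out.println(entry.getName() + ": " + baos.size() + " bytes");
            }
        } finally {
            zis.close();
        }
    }
}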


On Tue, 1 Mar 2005 12:39:17 -0500, Luke Shannon
<[EMAIL PROTECTED]> wrote:
> Thanks Ernesto.
> 
> I'm struggling with how I can work with an array of bytes instead of a
> Java File.
> 
> It would be easier to unzip the zip to a temp directory, parse the files and
> then delete the directory. But this would greatly slow indexing and use up
> disk space.
> 
> Luke
> 
> - Original Message -
> From: "Ernesto De Santis" <[EMAIL PROTECTED]>
> To: "Lucene Users List" 
> Sent: Tuesday, March 01, 2005 10:48 AM
> Subject: Re: Zip Files
> 
> > Hello
> >
> > First, you need a parser for each file type (pdf, txt, word, etc.),
> > and a Java API to iterate over the zip content; see:
> >
> > http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html
> >
> > use the getNextEntry() method
> >
> > a little example:
> >
> > ZipInputStream zis = new ZipInputStream(fileInputStream);
> > ZipEntry zipEntry;
> > while ((zipEntry = zis.getNextEntry()) != null) {
> >     // use zipEntry to get the name, etc.
> >     // get the proper parser for the current entry
> >     // use the parser with zis (ZipInputStream)
> > }
> >
> > good luck
> > Ernesto
> >
> > Luke Shannon wrote:
> >
> > >Hello;
> > >
> > >Anyone have an ideas on how to index the contents within zip files?
> > >
> > >Thanks,
> > >
> > >Luke
> > >
> > >
> >
> > --
> > Ernesto De Santis - Colaborativa.net
> > Córdoba 1147 Piso 6 Oficinas 3 y 4
> > (S2000AWO) Rosario, SF, Argentina.
> >



Re: Search Performance

2005-02-18 Thread Chris Lamprecht
Wouldn't this leave open file handles?   I had a problem where there
were lots of open file handles for deleted index files, because the
old searchers were not being closed.

On Fri, 18 Feb 2005 13:41:37 -0800 (PST), Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> Or you could just open a new IndexSearcher, forget the old one, and
> have GC collect it when everyone is done with it.
> 
> Otis



Re: Search Performance

2005-02-18 Thread Chris Lamprecht
I should have mentioned: the reason for not doing this the obvious,
simple way (just close the Searcher and reopen it if a new version is
available) is that some threads could be in the middle of iterating
through the search Hits.  If you close the Searcher, they get a "Bad
file descriptor" IOException.  As I found out the hard way :)


On Fri, 18 Feb 2005 15:03:29 -0600, Chris Lamprecht
<[EMAIL PROTECTED]> wrote:
> I recently dealt with the issue of re-using a Searcher with an index
> that changes often.  I wrote a class that allows my searching classes
> to "check out" a lucene Searcher, perform a search, and then return
> the Searcher.  It's similar to a database connection pool, except that




Re: Search Performance

2005-02-18 Thread Chris Lamprecht
I recently dealt with the issue of re-using a Searcher with an index
that changes often.  I wrote a class that allows my searching classes
to "check out" a lucene Searcher, perform a search, and then return
the Searcher.  It's similar to a database connection pool, except that
all clients can share the same Searcher (I don't think there is any
benefit to keeping a true "pool" and giving a different Searcher to
each client -- someone let me know if this is incorrect).

So I just keep a reference count on my Searcher, which gets
incremented at checkout and decremented at checkin.  The logic is
approximately:

initialize lastVersion to -1

checkout:
if (lucene index version != lastVersion) {
   create a new IndexSearcher and update lastVersion
}
refcount++;
return the searcher

And on checkin:
refcount--;
if (refcount == 0 and there is a newer lucene index version) {
   close the searcher being checked in
}


Of course there are some more details: keeping info on the open
searchers, making it thread-safe, etc.  I also plan to check for a
new index only if some minimum time threshold has passed (5 minutes
or so).  I'd be interested in hearing others' solutions/patterns for
this.
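For what it's worth, a fleshed-out Java sketch of the above (class and
method names are hypothetical, and the time-threshold check is omitted
for brevity):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherManager {
    private final String indexDir;
    private IndexSearcher current;               // handed to new checkouts
    private long lastVersion = -1;
    private final Map refCounts = new HashMap(); // IndexSearcher -> Integer

    public SearcherManager(String indexDir) { this.indexDir = indexDir; }

    public synchronized IndexSearcher checkout() throws IOException {
        long version = IndexReader.getCurrentVersion(indexDir);
        if (version != lastVersion) {
            // close the outgoing searcher now if it has no users left
            if (current != null && getCount(current) == 0) {
                refCounts.remove(current);
                current.close();
            }
            current = new IndexSearcher(indexDir);
            lastVersion = version;
            refCounts.put(current, new Integer(0));
        }
        refCounts.put(current, new Integer(getCount(current) + 1));
        return current;
    }

    public synchronized void checkin(IndexSearcher s) throws IOException {
        int n = getCount(s) - 1;
        refCounts.put(s, new Integer(n));
        // close only stale searchers that no one is still using
        if (n == 0 && s != current) {
            refCounts.remove(s);
            s.close();
        }
    }

    private int getCount(IndexSearcher s) {
        return ((Integer) refCounts.get(s)).intValue();
    }
}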

-Chris

On Fri, 18 Feb 2005 11:57:32 -0500, Michael Celona
<[EMAIL PROTECTED]> wrote:
> My index is changing constantly, in real time... in this case I guess this
> will not work for me.  Any suggestions?
> 
> Michael
> 
> -Original Message-
> From: David Townsend [mailto:[EMAIL PROTECTED]
> Sent: Friday, February 18, 2005 11:50 AM
> To: Lucene Users List
> Subject: RE: Search Performance
> 
> IndexSearchers are thread safe, so you can use the same object on multiple
> requests.  If the index is static and not constantly updating, just keep one
> IndexSearcher for the life of the app.  If the index changes and you need
> that instantly reflected in the results, you need to check if the index has
> changed; if it has, create a new cached IndexSearcher.  To check for
> changes, you'll need to monitor the version number of the index, obtained
> via
> 
> IndexReader.getCurrentVersion(indexName)
> 
> David
> 
> -Original Message-
> From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
> Sent: 18 February 2005 16:15
> To: Lucene Users List
> Subject: Re: Search Performance
> 
> Try a singleton pattern or a static field.
> 
> Stefan
> 
> Michael Celona wrote:
> 
> >I am creating new IndexSearchers... how do I cache my IndexSearcher...
> >
> >Michael
> >
> >-Original Message-
> >From: David Townsend [mailto:[EMAIL PROTECTED]
> >Sent: Friday, February 18, 2005 11:00 AM
> >To: Lucene Users List
> >Subject: RE: Search Performance
> >
> >Are you creating new IndexSearchers or IndexReaders on each search?
> >Caching your IndexSearchers has a dramatic effect on speed.
> >
> >David Townsend
> >
> >-Original Message-
> >From: Michael Celona [mailto:[EMAIL PROTECTED]
> >Sent: 18 February 2005 15:55
> >To: Lucene Users List
> >Subject: Search Performance
> >
> >
> >What is single-handedly the best way to improve search performance?  I have
> >an index in the 2G range stored on the local file system of the searcher.
> >Under a load test of 5 simultaneous users my average search time is ~4700
> >ms.  Under a load test of 10 simultaneous users my average search time is
> >~1 ms.  I have given the JVM 2G of memory and am using dual 3GHz
> >Xeons.  Any ideas?
> >
> >
> >
> >Michael



Re: Subversion conversion

2005-02-02 Thread Chris Lamprecht
One thing about subversion branches (from "Key Concepts Behind
Branches" in chapter 4 of the subversion book):

"2. Subversion has no internal concept of a branchâonly copies. When
you copy a directory, the resulting directory is only a "branch"
because you attach that meaning to it. You may think of the directory
differently, or treat it differently, but to Subversion it's just an
ordinary directory that happens to have been created by copying."


On Wed, 2 Feb 2005 19:49:53 -0500, Chakra Yadavalli
<[EMAIL PROTECTED]> wrote:
> Hello ALL. This might not be the right place for it, but as we are talking
> about SCM, I have a quick question. First, I haven't used CVS/SVN on any
> project; I am a ClearCase/PVCS guy. I would just like to know WHICH
> CONFIGURATION MANAGEMENT PLAN YOU FOLLOW IN LUCENE DEVELOPMENT.
> 
> PLAN A: DEVELOP IN TRUNK AND BRANCH OFF ON RELEASE
> Recently I had a discussion with a friend about developing in the TRUNK
> (which is /main in ClearCase speak), which my friend claims is done in
> the APACHE/Open Source projects. The main advantage he pointed out was
> that merging could be avoided if you are developing in the TRUNK. And
> when there is a release, they create a new branch (say a LUCENE_1.5
> branch) and label it. That branch will be used for maintenance, and any
> code deltas will be merged back to TRUNK as needed.
> 
> PLAN B: BRANCH OFF BEFORE PLANNED RELEASE AND MERGE BACK TO MAIN/TRUNK
> As I am from the "private workspace"/"isolated development" school of
> thought promoted by ClearCase, I am used to creating a branch at
> project/release initiation and developing in that branch (say /main/dev).
> Similarly, we have /main/int for making changes when the project goes to
> the integration phase, and a /main/acp branch for acceptance. In this
> school, the /main will always have fewer versions of files, and the
> difference between any two consecutive versions is the NET CHANGE of
> that SCM element (either file or dir) between two releases (say LUCENE
> 1.4 and 1.5).
> 
> Thanks in advance for your time.
> Chakra Yadavalli
> http://jroller.com/page/cyblogue
> 
> > -Original Message-
> > From: aurora [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, February 02, 2005 4:25 PM
> > To: lucene-user@jakarta.apache.org
> > Subject: Re: Subversion conversion
> >
> > Subversion rocks!
> >
> > I have just set up the Windows svn client TortoiseSVN with my favourite
> > file manager, Total Commander 6.5. The svn status and commands are
> > readily integrated with the file manager. Offline diff and revert are
> > two things I really like about svn.
> >
> > > The conversion to Subversion is complete.  The new repository is
> > > available to users read-only at:
> > >
> > >   http://svn.apache.org/repos/asf/lucene/java/trunk
> > >
> > > Besides /trunk, there is also /branches and /tags.  /tags contains all
> > > the CVS tags made so that you could grab a snapshot of a previous
> > > version.  /trunk is analogous to CVS HEAD.  You can learn more about
> > > the Apache repository configuration and how to use the command-line
> > > client to check out the repository here:
> > >
> > >   http://www.apache.org/dev/version-control.html
> > >
> > > Learn about Subversion, including the complete O'Reilly Subversion
> > > book in electronic form for free, here:
> > >
> > >   http://subversion.tigris.org
> > >
> > > For committers, check out the repository using https and your Apache
> > > username/password.
> > >
> > > The Lucene sandbox has been integrated into our single Subversion
> > > repository, under /java/trunk/sandbox:
> > >
> > >   http://svn.apache.org/repos/asf/lucene/java/trunk/sandbox/
> > >
> > > The Lucene CVS repositories have been locked for read-only.
> > >
> > > If there are any issues with this conversion, let me know and I'll
> > > bring them to the Apache infrastructure group.
> > >
> > >   Erik
> >
> 
> --
> Visit my weblog: http://www.jroller.com/page/cyblogue
> 



Re: Adding Fields to Document (with same name)

2005-02-01 Thread Chris Lamprecht
Hi Karl,

From _Lucene in Action_, section 2.2, when you add the same field with
different values: "Internally, Lucene appends all the words together
and indexes them in a single Field ..., allowing you to use any of the
given words when searching."

See also http://www.lucenebook.com/search?query=appendable+fields
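To make this concrete, a small self-contained sketch (using the 1.4-era
API from the thread; it should print 1, since words from both values
land in the same "bla" field):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class RepeatedFieldDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("bla", "this is my first text"));
        doc.add(Field.Text("bla", "this is my second text"));
        writer.addDocument(doc);
        writer.close();

        // words from both values match in the single "bla" field
        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(
            QueryParser.parse("first AND second", "bla", new StandardAnalyzer()));
        System.out.println(hits.length());  // prints 1
        searcher.close();
    }
}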

-chris

On Tue, 1 Feb 2005 11:42:23 +0100 (MET), [EMAIL PROTECTED]
<[EMAIL PROTECTED]> wrote:
> Hi,
> 
> what happens when I add two fields with the same name to one Document?
> 
> Document doc = new Document();
> doc.add(Field.Text("bla", "this is my first text"));
> doc.add(Field.Text("bla", "this is my second text"));
> 
> Will the second text overwrite the first, because only one field can be held
> with the same name in one document?
> 
> Will the first and the second text be merged when I search in the field bla
> (e.g. with the query "bla:text")?
> 
> I am working on XML indexing and did not get an error when having repeated
> XML fields. Now I am wondering...
> 
> Karl
> 
> --
> Saving begins with GMX DSL: http://www.gmx.net/de/go/dsl
> 



Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Chris Lamprecht
As they say, nothing lasts forever ;)

I like the idea.  If a project like this gets going, I think I'd be
interested in helping.

The Google mini looks very well done (they have two demos on the web
page).  For $5000, it's probably a very good solution for many
businesses.  If the demos are accurate, it seems like you almost
literally plug it in, configure a few things using the web interface,
and you're in business.   Demos are at
http://www.google.com/enterprise/mini/product_tours_demos.html

-chris

On Thu, 27 Jan 2005 17:40:53 -0800 (PST), Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> I discuss this with myself a lot inside my head... :)
> Seriously, I agree with Erik.  I think this is a business opportunity.
> How many people are hating me now and going "shh"?  Raise your
> hands!
> 
> Otis
> 
> --- David Spencer <[EMAIL PROTECTED]> wrote:
> 
> > This reminds me, has anyone ever discussed something similar:
> >
> > - rackmount server (or, for coolness factor, that Mac mini)
> > - web i/f for config/control
> >
> > - of course the server would have the following s/w:
> > -- web server
> > -- lucene / nutch
> >
> > Part of the work here I think is having a decent web i/f to configure
> > the thing and to customize the L&F of the search results.
> >
> >
> >
> > jian chen wrote:
> > > Hi,
> > >
> > > I was searching using google and just found that there was a new
> > > feature called "google mini". Initially I thought it was another free
> > > service for small companies. Then I realized that it costs quite some
> > > money ($4,995) for the hardware and software. (I guess the proprietary
> > > software costs a whole lot more than the actual hardware.)
> > >
> > > The "nice" feature is that you can only index up to 50,000 documents
> > > at this price. If you need to index more, sorry, send in the
> > > check...
> > >
> > > It seems to me that any small biz will be ripped off if they install
> > > this google mini thing, compared to using Lucene to implement
> > > easy-to-use search software, which could search up to whatever number
> > > of documents you could imagine.
> > >
> > > I hope the lucene project could get exposed more to the enterprise so
> > > that people know that they have not only cheaper but, more
> > > importantly, BETTER alternatives.
> > >
> > > Jian



Re: Reloading an index

2005-01-27 Thread Chris Lamprecht
I just ran into a similar issue.  When you close an IndexSearcher, it
doesn't necessarily close the underlying IndexReader.  It depends on
which constructor you used to create the IndexSearcher.  See the
constructors' javadocs or source for the details.
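In other words, something like this (a sketch of the distinction using
the 1.4-era constructors):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class CloseBehavior {
    public static void demo(String indexDir) throws Exception {
        // The searcher opens its own reader internally,
        // and close() releases it:
        IndexSearcher s1 = new IndexSearcher(indexDir);
        s1.close();              // underlying reader is closed too

        // The reader is supplied by the caller, so closing the
        // searcher does NOT close it -- you must do that yourself:
        IndexReader reader = IndexReader.open(indexDir);
        IndexSearcher s2 = new IndexSearcher(reader);
        s2.close();              // reader (and its file handles) still open
        reader.close();          // now the file handles are released
    }
}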

In my case, we were updating and optimizing the index from another
process, and reopening IndexSearchers.  We would eventually run out of
disk space because it was leaving open file handles to deleted files,
so the disk space was never being made available, until the JVM
processes ended.  If you're under linux, try running the 'lsof'
command to see if there are any handles to files marked "(deleted)".

-Chris

On Thu, 27 Jan 2005 08:28:30 -0800 (PST), Greg Gershman
<[EMAIL PROTECTED]> wrote:
> I have an index that is frequently updated.  When
> indexing is completed, an event triggers a new
> Searcher to be opened.  When the new Searcher is
> opened, incoming searches are redirected to the new
> Searcher, the old Searcher is closed and nulled, but I
> still see about twice the amount of memory in use well
> after the original searcher has been closed.   Is
> there something else I can do to get this memory
> reclaimed?  Should I explicitly call garbage
> collection?  Any ideas?
> 
> Thanks.
> 
> Greg Gershman
> 
> __
> Do you Yahoo!?
> Meet the all-new My Yahoo! - Try it today!
> http://my.yahoo.com
> 



Re: Searching with words that contain % , / and the like

2005-01-27 Thread Chris Lamprecht
Without looking at the source, my guess is that StandardAnalyzer (and
StandardTokenizer) is the culprit.  The StandardAnalyzer grammar (in
StandardTokenizer.jj) is probably defined so "x/y" parses into two
tokens, "x" and "y".  "s" is a default stopword (see
StopAnalyzer.ENGLISH_STOP_WORDS), so it gets filtered out, while "p"
does not.

To get what you want, you can use a WhitespaceAnalyzer, write your own
custom Analyzer or Tokenizer, or modify the StandardTokenizer.jj
grammar to suit your needs.  WhitespaceAnalyzer is much simpler than
StandardAnalyzer, so you may see some other things being tokenized
differently.
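A quick harness to see the difference (the expected token output, shown
in the comments, follows the behavior described above):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerDemo {
    // print the tokens an analyzer produces for some text
    public static void show(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        Token t;
        while ((t = ts.next()) != null) {
            System.out.print("[" + t.termText() + "] ");
        }
        System.out.println();
        ts.close();
    }

    public static void main(String[] args) throws Exception {
        show(new StandardAnalyzer(), "test/s test/p");   // [test] [test] [p]
        show(new WhitespaceAnalyzer(), "test/s test/p"); // [test/s] [test/p]
    }
}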

-Chris

On Thu, 27 Jan 2005 12:12:16 +0530, Robinson Raju
<[EMAIL PROTECTED]> wrote:
> Hi ,
> 
> Is there a way to search for words that contain "/" or "%"?
> If my query is "test/s", it is just taken as "test".
> If my query is "test/p", it is just taken as "test p".
> Has anyone done this / faced such an issue?
> 
> Regards
> Robin
> 



Re: LUCENE + EXCEPTION

2005-01-24 Thread Chris Lamprecht
Hi Karthik,

If you are talking about SingleThreadModel (i.e. your servlet
implements javax.servlet.SingleThreadModel), this does not guarantee
that two different instances of your servlet won't be run at the same
time.  It only guarantees that each instance of your servlet will only
be run by one thread at a time.  See:

http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/servlet/SingleThreadModel.html

If you are accessing a shared resource (a lucene index), you'll have
to prevent concurrent modifications somehow other than
SingleThreadModel.
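One common approach is to funnel all index modifications through a
single JVM-wide object, so it makes no difference how many servlet
instances the container creates.  A rough sketch (names are mine, and
the delete-then-add scheme is just one way to do updates):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IndexUpdater {
    private static final IndexUpdater INSTANCE = new IndexUpdater();
    public static IndexUpdater getInstance() { return INSTANCE; }

    // synchronized, so concurrent servlet threads can't interleave updates
    public synchronized void update(String indexDir, Term idTerm, Document doc)
            throws IOException {
        // delete any old version of the document...
        IndexReader reader = IndexReader.open(indexDir);
        reader.delete(idTerm);
        reader.close();
        // ...then add the new version
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        writer.addDocument(doc);
        writer.close();
    }
}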

I think they've finally deprecated SingleThreadModel in the latest
(maybe not even out yet) servlet spec.

-chris

 
> On STANDALONE usage of UPDATION/DELETION/ADDITION of documents into
> MergerIndex, my code runs PERFECTLY without any problems.
> 
> But when the same code is plugged into a WEBAPP on TOMCAT with a servlet
> running in SINGLE THREAD MODE, I frequently get the error below




Re: Stemming

2005-01-21 Thread Chris Lamprecht
Also if you can't wait, see page 2 of
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html

or the LIA e-book ;)
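The analyzer from that article boils down to something like this (a
sketch; the article's version may differ in details):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public class PorterStemAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // lowercase, drop common English stopwords, then stem
        return new PorterStemFilter(
            new StopFilter(new LowerCaseTokenizer(reader),
                           StopAnalyzer.ENGLISH_STOP_WORDS));
    }
}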

On Fri, 21 Jan 2005 09:27:42 -0500, Kevin L. Cobb
<[EMAIL PROTECTED]> wrote:
> OK, OK ... I'll buy the book. I guess it's about time since I am deeply
> and forever in love with Lucene. Might as well take the final plunge.
> 
> 
> -Original Message-
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Friday, January 21, 2005 9:12 AM
> To: Lucene Users List
> Subject: Re: Stemming
> 
> Hi Kevin,
> 
> Stemming is an optional operation and is done in the analysis step.
> Lucene comes with a Porter stemmer and a Filter that you can use in an
> Analyzer:
> 
> ./src/java/org/apache/lucene/analysis/PorterStemFilter.java
> ./src/java/org/apache/lucene/analysis/PorterStemmer.java
> 
> You can find more about it here:
> http://www.lucenebook.com/search?query=stemming
> You can also see mentions of SnowballAnalyzer in those search results,
> and you can find an adapter for SnowballAnalyzers in Lucene Sandbox.
> 
> Otis
> 
> --- "Kevin L. Cobb" <[EMAIL PROTECTED]> wrote:
> 
> > I want to understand how Lucene uses stemming but can't find any
> > documentation on the Lucene site. I'll continue to google but hope
> > that
> > this list can help narrow my search. I have several questions on the
> > subject currently but hesitate to list them here since finding a good
> > document on the subject may answer most of them.
> >
> >
> >
> > Thanks in advance for any pointers,
> >
> >
> >
> > Kevin
> >
> >
> >
> >
> >
> >
> 



Re: StandardAnalyzer unit tests?

2005-01-17 Thread Chris Lamprecht
Erik, Paul, Daniel,

I submitted a testcase --
http://issues.apache.org/bugzilla/show_bug.cgi?id=33134

On a related note, what do you all think about updating the
StandardAnalyzer grammar to treat "C#" and "C++" as tokens?  It's a
small modification to the grammar -- NutchAnalysis.jj has it.

-Chris

On Mon, 17 Jan 2005 03:23:41 -0500, Erik Hatcher
<[EMAIL PROTECTED]> wrote:
> I don't see any tests of StandardAnalyzer either.  Your contribution
> would be most welcome.  There are tests that use StandardAnalyzer, but
> not to test it directly.
>




Re: StandardAnalyzer unit tests?

2005-01-16 Thread Chris Lamprecht
PS: I didn't find any in the Lucene CVS HEAD, and I'd be glad to contribute
some unit tests.


> Does anyone have a unit test for StandardAnalyzer?  I've modified the




StandardAnalyzer unit tests?

2005-01-16 Thread Chris Lamprecht
Does anyone have a unit test for StandardAnalyzer?  I've modified the
StandardAnalyzer javacc grammar to tokenize "c#" and "c++" without
removing the "#" and "++" parts, using pieces of the grammar from
Nutch.  Now I'd like to make sure I didn't change the way it parses
any other tokens.  thanks,
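In case it's a useful starting point, here's the shape of such a test
(JUnit 3 style; a sketch pinning down current behavior before changing
the grammar):

import java.io.StringReader;
import junit.framework.TestCase;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TestStandardAnalyzer extends TestCase {

    private void assertTokens(String input, String[] expected) throws Exception {
        TokenStream ts = new StandardAnalyzer()
                .tokenStream("field", new StringReader(input));
        for (int i = 0; i < expected.length; i++) {
            Token t = ts.next();
            assertNotNull("ran out of tokens at " + i, t);
            assertEquals(expected[i], t.termText());
        }
        assertNull("unexpected extra token", ts.next());
    }

    public void testLowerCases() throws Exception {
        assertTokens("FOO bar", new String[] { "foo", "bar" });
    }

    public void testRemovesStopWords() throws Exception {
        assertTokens("this is a test", new String[] { "test" });
    }
}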

-Chris




Re: How do I unlock?

2005-01-11 Thread Chris Lamprecht
What about a shutdown hook?

Runtime.getRuntime().addShutdownHook(new Thread() {
    public void run() { /* whatever */ }
});

see also http://www.onjava.com/pub/a/onjava/2003/03/26/shutdownhook.html
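For the unlock case specifically, a hedged sketch of the idea
(IndexReader.unlock forcibly removes the lock, so only do this when you
know no other process is writing to the index):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class UnlockOnExit {
    public static void install(final String indexPath) {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            public void run() {
                try {
                    Directory dir = FSDirectory.getDirectory(indexPath, false);
                    if (IndexReader.isLocked(dir)) {
                        IndexReader.unlock(dir);  // forcibly clears the lock
                    }
                    dir.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }
}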


On Tue, 11 Jan 2005 13:21:42 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Joseph Ottinger wrote:
> > As one for whom the question's come up recently, I'd say that locks need
> > to be terminated gracefully, instead. I've noticed a number of cases where
> > the locks get abandoned in exceptional conditions, which is almost exactly
> > what you don't want.
> 
> The problem is that this is hard to do from Java.  A typical approach is
> to put the process id in the lock file, then, if that process is dead,
> ignore the lock file.  But Java does not let one know process ids.  Java
> 1.4 provides a FileLock mechanism which should mostly solve this, but
> Lucene 1.4.3 does not yet require Java 1.4 and hence cannot use that
> feature.  Lucene 2.0 is likely to require Java 1.4 and should be able to
> do a better job of automatically unlocking indexes when processes die.
> 
> Doug
> 



Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-10 Thread Chris Lamprecht
Very cool, thanks for posting this!  

Google's feature doesn't seem to do a search on every keystroke
necessarily.  Instead, it waits until you haven't typed a character
for a short period (I'm guessing about 100 or 150 milliseconds).  So
if you type fast, it doesn't hit the server until you pause.  There
are some more detailed postings on slashdot about how it works.

On Fri, 10 Dec 2004 16:36:27 -0800, David Spencer
<[EMAIL PROTECTED]> wrote:
> 
> Google just came out with a page that gives you feedback as to how many
> pages will match your query and variations on it:
> 
> http://www.google.com/webhp?complete=1&hl=en
> 
> I had an unexposed experiment I had done with Lucene a few months ago
> that this has inspired me to expose - it's not the same, but it's
> similar in that as you type in a query you're given *immediate* feedback
> as to how many pages match.
> 
> Try it here: http://www.searchmorph.com/kat/isearch.html
> 
> This is my "SearchMorph" site which has an index of ~90k pages of open
> source javadoc packages.
> 
> As you type in a query, on every keystroke it does at least one Lucene
> search to show results in the bottom part of the page.
> 
> It also gives spelling corrections (using my "NGramSpeller"
> contribution) and also suggests popular tokens that start the same way
> as your search query.
> 
> For one way to see corrections in action, type in "rollback" character
> by character (don't do a cut and paste).
> 
> Note that:
> -- this is not how the Google page works - just similar to it
> -- I do single word suggestions while google does the more useful whole
> phrase suggestions (TBD I'll try to copy them)
> -- They do lots of javascript magic, whereas I use old school frames mostly
> -- this is relatively expensive, as it does 1 query per character, and
> when it's doing spelling correction there is even more work going on
> -- this is just an experiment and the page may be unstable as I fool w/ it
> 
> What's nice is that once you get used to immediate results, going back to
> the "batch" way of searching seems backward, slow, and old-fashioned.
> 
> There are too many idle CPUs in the world - this is one way to keep them
> busier :)
> 
> -- Dave
> 
> PS Weblog entry updated too:
> http://www.searchmorph.com/weblog/index.php?id=26
> 



Re: Too many open files issue

2004-11-22 Thread Chris Lamprecht
A useful resource for increasing the number of file handles on various
operating systems is the Volano Report:

http://www.volano.com/report/

> I had requested help on an issue we have been facing with the "Too many
> open files" Exception garbling the search indexes and crashing the
> search on the web site.




Re: Considering intermediary solution before Lucene question

2004-11-17 Thread Chris Lamprecht
John,

It actually should be pretty easy to use just the parts of Lucene you
want (the analyzers, etc) without using the rest.  See the example of
the PorterStemmer from this article:

http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2

You could feed a Reader to the tokenStream() method of
PorterStemAnalyzer, and get back a TokenStream, from which you pull
the tokens using the next() method.
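So the whole keyword-extraction step might look roughly like this (a
sketch; the method name is mine, and the analyzer is whatever you
assemble, e.g. the article's PorterStemAnalyzer):

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class KeywordExtractor {
    // run a phrase through an analyzer and collect the resulting terms
    public static List extractKeywords(Analyzer analyzer, String phrase)
            throws Exception {
        TokenStream ts = analyzer.tokenStream("contents", new StringReader(phrase));
        List keywords = new ArrayList();
        Token t;
        while ((t = ts.next()) != null) {
            // stemmed terms (stopword-free, if the analyzer filters them)
            keywords.add(t.termText());
        }
        ts.close();
        return keywords;
    }
}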



On Wed, 17 Nov 2004 18:54:07 -0500, [EMAIL PROTECTED]
<[EMAIL PROTECTED]> wrote:
> 
> Is there a way to use Lucene stemming and stop word removal without using the
> rest of the tool?   I am downloading the code now, but I imagine the answer
> might be deeply buried.  I would like to be able to send in a phrase and get
> back a collection of keywords if possible.
> 
> I am thinking of using an intermediary solution before moving fully to
> Lucene.  I don't have time to spend a month making a carefully tested,
> administrable Lucene solution for my site yet, but I intend to do so over
> time.  Funny thing is, the Lucene code likely would only take up a couple
> hundred lines, but integration and administration would take me much more
> time.
> 
> In the meantime, I am thinking I could perhaps use Lucene stemming and parsing
> of words, then stick each search word along with the associated primary key
> in an indexed MySql table.   Each record I would need to do this to is small,
> with maybe only 15 useful words on average.   I would be able to have an
> in-database solution, though ranking, etc. would not exist.   This is better
> than the exact word searching I have currently, which is really bad.
> 
> By the way, MySql 4.1.1 has some Lucene-type handling, but it too does not
> have stemming and I am sure it is very slow compared to Lucene.   Cpanel is
> still stuck on MySql 4.0.*, so many people would not have access to even this
> basic ability in production systems for some time yet.
> 
> JohnE
> 



Re: Index Locking Issues Resolved...I hope

2004-11-16 Thread Chris Lamprecht
MySQL does offer a basic fulltext search (with MyISAM tables), but it
doesn't really approach the functionality of Lucene, such as pluggable
tokenizers, stemming, etc.  I think MS SQL server has fulltext search
as well, but I have no idea if it's any good.

See http://www.google.com/search?hl=en&lr=&safe=off&c2coff=1&q=mysql+fulltext

> I have not seen clearly yet because it is all new.   I wish a database Text
> field could have this sort of mechanism built into it.   MySql (which is
> what I am using) does not do this, but I am going to check into other
> databases now.  OJB will work with most all of them, so that would help if
> there is a database type of solution that will allow that sleep-at-night
> thing to happen!!!
>




Re: How to efficiently get # of search results, per attribute

2004-11-13 Thread Chris Lamprecht
Nader and Chuck,

Thanks for the responses, they're both helpful.  My index sizes will
begin on the order of 200,000 classes, and 20,000 instructors (and
much fewer departments), and grow over time to maybe a few million
classes.  Compared to some of the numbers I've seen on this mailing
list, my dataset is fairly small.  I think I'll not worry about
performance for now, until & unless it becomes an issue.

-Chris

On Sat, 13 Nov 2004 15:36:11 -0800, Chuck Williams <[EMAIL PROTECTED]> wrote:
> My Lucene application includes multi-faceted navigation that does a more
> complex version of the below.  I've got 5 different taxonomies into
> which every indexed item is classified.  The largest of the taxonomies
> has over 15,000 entries while the other 4 are much smaller. For every
> search query, I determine the best small set of nodes from each taxonomy
> to present to the user as drill down options, and provide the counts
> regarding how many results fall under each of these nodes.  At present I
> only have about 25,000 indexed objects and usually no more than 1,000
> results from the initial query.  To determine the drill-down options and
> counts, I scan up to 1,000 results computing the counts for all nodes
> into which these results classify.  Then for each taxonomy I pick the
> best drill-down options available (orthogonal set with reasonable
> branching factor that covers all results) and present them with their
> counts.  If there are more than 1,000 results, I extrapolate the
> computed counts to estimate the actual counts on the entire set of
> results.  This is all done with a single index and a single search.
> 
> The total time required for performing this computation for the one
> large taxonomy is under 10ms, running in full debug mode in my ide.  The
> query response time overall is subjectively instantaneous at the UI
> (Google-speed or better).  So, unless some dimension of the problem is
> much bigger than mine, I doubt performance will be an issue.
> 
> Chuck
> 
> 
> 
>  > -Original Message-
>  > From: Nader Henein [mailto:[EMAIL PROTECTED]
>  > Sent: Saturday, November 13, 2004 2:29 AM
>  > To: Lucene Users List
>  > Subject: Re: How to efficiently get # of search results, per
> attribute
>  >
>  > It depends on how many results they're looking through; here are two
>  > scenarios I see:
>  >
>  > 1] If you don't have that many records, you can fetch all the results
>  > and then do a post-parsing step to determine totals.
>  >
>  > 2] If you have a lot of entries in each category and you're worried
>  > about fetching thousands of records every time, you can just have
>  > separate indices per category and search them in parallel (not
>  > Lucene Parallel Search), and you can get up to 100 hits for each one
>  > (efficiency), but you'll also have the total from the search to
>  > display.
>  >
>  > Either way you can boost speed using RAMDirectory if you need more
>  > speed from the search, but whichever approach you choose, I would
>  > recommend that you sit down and do some number crunching to figure
>  > out which way to go.
>  >
>  >
>  > Hope this helps
>  >
>  > Nader Henein
>  >
>  >
>  >
>  > Chris Lamprecht wrote:
>  >
>  > >I'd like to implement a search across several types of "entities",
>  > >let's say, classes, professors, and departments.  I want the user to
>  > >be able to enter a simple, single query and not have to specify what
>  > >they're looking for.  Then I want the search results to be something
>  > >like this:
>  > >
>  > >Search results for: "philosophy boyer"
>  > >
>  > >Found: 121 classes - 5 professors - 2 departments
>  > >
>  > >I know I could iterate through every hit returned and count them up
>  > >myself, but that seems inefficient if there are lots of results.  Is
>  > >there some other way to get this kind of information from the search
>  > >result set?  My other ideas are: doing a separate search for each
>  > >result type, or storing different types in different indexes.  Any
>  > >suggestions?  Thanks for your help!
>  > >
>  > >-Chris



How to efficiently get # of search results, per attribute

2004-11-13 Thread Chris Lamprecht
I'd like to implement a search across several types of "entities",
let's say, classes, professors, and departments.  I want the user to
be able to enter a simple, single query and not have to specify what
they're looking for.  Then I want the search results to be something
like this:

Search results for: "philosophy boyer"

Found: 121 classes - 5 professors - 2 departments




I know I could iterate through every hit returned and count them up
myself, but that seems inefficient if there are lots of results.  Is
there some other way to get this kind of information from the search
result set?  My other ideas are: doing a separate search for each result
type, or storing different types in different indexes.  Any
suggestions?  Thanks for your help!

-Chris
