list moving to lucene.apache.org

2005-03-01 Thread Roy T . Fielding
This list is about to be moved to java-user at lucene.apache.org.
Please excuse the temporary inconvenience.
Cheers,
Roy T. Fielding, co-founder, The Apache Software Foundation
 ([EMAIL PROTECTED])  


Re: Multiple indexes

2005-03-01 Thread Otis Gospodnetic
Ben,

You do need to use a separate instance of those 3 classes for each
index, yes.  But this is really something like:

IndexWriter writer = new IndexWriter();

So it's a normal code-writing process; you don't really have to create
anything new, just use the existing Lucene API.  As for locking, again you
don't need to create anything.  Lucene does have a locking mechanism,
but most of it should be completely invisible to you if you follow the
concurrency rules.
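
For illustration, a minimal sketch of what that looks like with the 1.4 API
(the directory paths and field names are just made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TwoIndexes {
  public static void main(String[] args) throws Exception {
    // one Directory/IndexWriter pair per index
    Directory productDir = FSDirectory.getDirectory("/indexes/products", true);
    Directory articleDir = FSDirectory.getDirectory("/indexes/articles", true);

    IndexWriter productWriter =
        new IndexWriter(productDir, new StandardAnalyzer(), true);
    IndexWriter articleWriter =
        new IndexWriter(articleDir, new StandardAnalyzer(), true);

    Document product = new Document();
    product.add(Field.Keyword("id", "p1"));
    product.add(Field.Text("name", "red widget"));
    productWriter.addDocument(product);

    Document article = new Document();
    article.add(Field.Keyword("id", "a1"));
    article.add(Field.Text("body", "how to use widgets"));
    articleWriter.addDocument(article);

    productWriter.close();
    articleWriter.close();
  }
}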

I hope this helps.

Otis

--- Ben <[EMAIL PROTECTED]> wrote:

> Is it true that for each index I have to create a separate instance
> of FSDirectory, IndexWriter and IndexReader? Do I need to create a
> separate locking mechanism as well?
> 
> I have already implemented a program using just one index.
> 
> Thanks,
> Ben
> 
> On Tue, 1 Mar 2005 22:09:05 -0500, Erik Hatcher
> <[EMAIL PROTECTED]> wrote:
> > It's hard to answer such a general question with anything very
> precise,
> > so sorry if this doesn't hit the mark.  Come back with more details
> and
> > we'll gladly assist though.
> > 
> > First, certainly do not copy/paste code.  Use standard reuse
> practices,
> > perhaps the same program can build the two different indexes if
> passed
> > different parameters, or share code between two different programs
> as a
> > JAR.
> > 
> > What specifically are the issues you're encountering?
> > 
> > Erik
> > 
> > 
> > On Mar 1, 2005, at 8:06 PM, Ben wrote:
> > 
> > > Hi
> > >
> > > My site has two types of documents with different structure. I
> would
> > > like to create an index for each type of document. What is the
> best
> > > way to implement this?
> > >
> > > I have been trying to implement this but found out that 90% of
> the
> > > code is the same.
> > >
> > > In Lucene in Action book, there is a case study on jGuru, it just
> > > mentions them using multiple indexes. I would like to do
> something
> > > like them.
> > >
> > > Any resources on the Internet that I can learn from?
> > >
> > > Thanks,
> > > Ben
> > >
> > >





Re: Best Practices for Distributing Lucene Indexing and Searching

2005-03-01 Thread Doug Cutting
Yonik Seeley wrote:
6. Index locally and synchronize changes periodically. This is an
interesting idea and bears looking into. Lucene can combine multiple
indexes into a single one, which can be written out somewhere else, and
then distributed back to the search nodes to replace their existing
index.
This is a promising idea for handling a high update volume because it
avoids all of the search nodes having to do the analysis phase.
A clever way to do this is to take advantage of Lucene's index file 
structure.  Indexes are directories of files.  As the index changes 
through additions and deletions most files in the index stay the same. 
So you can efficiently synchronize multiple copies of an index by only 
copying the files that change.

The way I did this for Technorati was to:
1. On the index master, periodically checkpoint the index.  Every minute 
or so the IndexWriter is closed and a 'cp -lr index index.DATE' command 
is executed from Java, where DATE is the current date and time.  This 
efficiently makes a copy of the index when it's in a consistent state by 
constructing a tree of hard links.  If Lucene re-writes any files (e.g., 
the segments file) a new inode is created and the copy is unchanged.

2. From a crontab on each search slave, periodically poll for new 
checkpoints.  When a new index.DATE is found, use 'cp -lr index 
index.DATE' to prepare a copy, then use 'rsync -W --delete 
master:index.DATE index.DATE' to get the incremental index changes. 
Then atomically install the updated index with a symbolic link (ln -fsn 
index.DATE index).

3. In Java on the slave, re-open 'index' when its version changes. 
This is best done in a separate thread that periodically checks the 
index version.  When it changes, the new version is opened and a few 
typical queries are performed on it to pre-load Lucene's caches.  Then, 
in a synchronized block, the Searcher variable used in production is 
updated.

4. In a crontab on the master, periodically remove the oldest checkpoint 
indexes.

Technorati's Lucene index is updated this way every minute.  A 
mergeFactor of 2 is used on the master in order to minimize the number 
of segments in production.  The master has a hot spare.
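
A rough sketch of what step 3 can look like (assuming the Lucene 1.4 API,
in particular IndexReader.getCurrentVersion(); the index path, warm-up query
and polling interval are placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class SearcherManager extends Thread {
  private final String indexPath;
  private IndexSearcher current;
  private long lastVersion = -1;

  public SearcherManager(String indexPath) throws Exception {
    this.indexPath = indexPath;
    this.current = new IndexSearcher(indexPath);
    setDaemon(true);
  }

  // production code calls this for every query
  public synchronized IndexSearcher getSearcher() {
    return current;
  }

  public void run() {
    while (true) {
      try {
        long version = IndexReader.getCurrentVersion(indexPath);
        if (version != lastVersion) {
          IndexSearcher fresh = new IndexSearcher(indexPath);
          // a few typical queries to pre-load Lucene's caches (placeholder term)
          fresh.search(new TermQuery(new Term("contents", "lucene")));
          synchronized (this) {
            IndexSearcher old = current;
            current = fresh;               // install the new searcher
            old.close();                   // in production, close only when idle
          }
          lastVersion = version;
        }
        Thread.sleep(60 * 1000);           // poll roughly once a minute
      } catch (Exception e) {
        // log and keep polling
      }
    }
  }
}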

Doug


Re: Multiple indexes

2005-03-01 Thread Ben
Is it true that for each index I have to create a separate instance
of FSDirectory, IndexWriter and IndexReader? Do I need to create a
separate locking mechanism as well?

I have already implemented a program using just one index.

Thanks,
Ben

On Tue, 1 Mar 2005 22:09:05 -0500, Erik Hatcher
<[EMAIL PROTECTED]> wrote:
> It's hard to answer such a general question with anything very precise,
> so sorry if this doesn't hit the mark.  Come back with more details and
> we'll gladly assist though.
> 
> First, certainly do not copy/paste code.  Use standard reuse practices,
> perhaps the same program can build the two different indexes if passed
> different parameters, or share code between two different programs as a
> JAR.
> 
> What specifically are the issues you're encountering?
> 
> Erik
> 
> 
> On Mar 1, 2005, at 8:06 PM, Ben wrote:
> 
> > Hi
> >
> > My site has two types of documents with different structure. I would
> > like to create an index for each type of document. What is the best
> > way to implement this?
> >
> > I have been trying to implement this but found out that 90% of the
> > code is the same.
> >
> > In Lucene in Action book, there is a case study on jGuru, it just
> > mentions them using multiple indexes. I would like to do something
> > like them.
> >
> > Any resources on the Internet that I can learn from?
> >
> > Thanks,
> > Ben
> >




Re: Best Practices for Distributing Lucene Indexing and Searching

2005-03-01 Thread Yonik Seeley
> 6. Index locally and synchronize changes periodically. This is an
> interesting idea and bears looking into. Lucene can combine multiple
> indexes into a single one, which can be written out somewhere else, and
> then distributed back to the search nodes to replace their existing
> index.

This is a promising idea for handling a high update volume because it
avoids all of the search nodes having to do the analysis phase.

Unfortunately, the way addIndexes() is implemented looks like it's
going to present some new problems:

  public synchronized void addIndexes(Directory[] dirs)
  throws IOException {
optimize();   // start with zero or 1 seg
for (int i = 0; i < dirs.length; i++) {
  SegmentInfos sis = new SegmentInfos();  // read infos from dir
  sis.read(dirs[i]);
  for (int j = 0; j < sis.size(); j++) {
segmentInfos.addElement(sis.info(j)); // add each info
  }
}
optimize();   // final cleanup
  }

We need to deal with some very large indexes (40G+), and an optimize
rewrites the entire index, no matter how few documents were added. 
Since our strategy calls for deleting some docs on the primary index
before calling addIndexes() this means *both* calls to optimize() will
end up rewriting the entire index!

The ideal behavior would be that of addDocument() - segments are only
merged occasionally.   That said, I'll throw out a replacement
implementation that probably doesn't work, but hopefully will spur
someone with more knowledge of Lucene internals to take a look at
this.

  public synchronized void addIndexes(Directory[] dirs)
  throws IOException {
// REMOVED: optimize();
for (int i = 0; i < dirs.length; i++) {
  SegmentInfos sis = new SegmentInfos();  // read infos from dir
  sis.read(dirs[i]);
  for (int j = 0; j < sis.size(); j++) {
segmentInfos.addElement(sis.info(j)); // add each info
  }
}
maybeMergeSegments();   // replaces optimize
  }

-Yonik




Re: Multiple indexes

2005-03-01 Thread Erik Hatcher
It's hard to answer such a general question with anything very precise, 
so sorry if this doesn't hit the mark.  Come back with more details and 
we'll gladly assist though.

First, certainly do not copy/paste code.  Use standard reuse practices: 
perhaps the same program can build the two different indexes if passed 
different parameters, or share code between two different programs as a 
JAR.

What specifically are the issues you're encountering?
Erik
On Mar 1, 2005, at 8:06 PM, Ben wrote:
Hi
My site has two types of documents with different structure. I would
like to create an index for each type of document. What is the best
way to implement this?
I have been trying to implement this but found out that 90% of the
code is the same.
In Lucene in Action book, there is a case study on jGuru, it just
mentions them using multiple indexes. I would like to do something
like them.
Any resources on the Internet that I can learn from?
Thanks,
Ben


RE: How to manipulate the lucene index table

2005-03-01 Thread Kyong Kwak
You can try Luke
http://www.getopt.org/luke/ 

-Original Message-
From: Srimant Mishra [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 01, 2005 4:39 PM
To: lucene-user@jakarta.apache.org
Subject: How to manipulate the lucene index table

Hi all, 

 

 I have a web-based application that we use to index text
documents as well as images; the indexes fields are either
Field.Unstored or Field.Keyword. 

 

 

 Currently, we plan to modify some of the index field names. For
example, if the index field name was DOCLOCALE, we plan to break it up
into two fields: DOCUMENTTYPE and LOCALE. Since, the index files that
lucene creates have become quite big (close to 1 gig), we are looking
for a way to be able to read the index entries and modify them via a
standalone Java program.

 

Does lucene provide any APIs to read these index entries and update
them? Is there an easy way to do it?

 

Thanks in advance

Srimant





How to manipulate the lucene index table

2005-03-01 Thread Srimant Mishra
Hi all, 

 

 I have a web-based application that we use to index text documents
as well as images; the indexed fields are either Field.Unstored or
Field.Keyword. 

 

 

 Currently, we plan to modify some of the index field names. For
example, if the index field name was DOCLOCALE, we plan to break it up into
two fields: DOCUMENTTYPE and LOCALE. Since, the index files that lucene
creates have become quite big (close to 1 gig), we are looking for a way to
be able to read the index entries and modify them via a standalone Java
program.

 

Does lucene provide any APIs to read these index entries and update them? Is
there an easy way to do it?

 

Thanks in advance

Srimant



Re: Best Practices for Distributing Lucene Indexing and Searching

2005-03-01 Thread Chris Hostetter
: We have a requirement for a new version of our software that it run in a
: clustered environment. Any node should be able to go down but the
: application must keep functioning.

My application is looking at similar problems.  We aren't yet live, but
the only practical solution we have implemented so far is the "apply all
adds/deletes to all instances in parallel or sequence" model, which we
don't really like very much.  I don't consider it a viable option for our
launch given the volume of updates we need to be able to handle in a
timely manner.

I'm also curious as to what ideas people on this list have about reliable
index replication.  I've included my thoughts on some of the possible
solutions below...


: 2. Don't distribute indexing. Searching is distributed by storing the
: index on NFS. A single indexing node would process all requests.
: However, using Lucene on NFS is *not* recommended. See:

I don't really consider reading/writing to an NFS-mounted FSDirectory to
be viable for the very reasons you listed; but I haven't really found any
evidence of problems if you take the approach that a single "writer"
node indexes to local disk, which is NFS mounted by all of your other
nodes for doing queries.  Concurrent updates/queries may still not be safe
(I'm not sure), but you could have the writer node "clone" the entire index
into a new directory, apply the updates and then signal the other nodes to
stop using the old FSDirectory and start using the new one.

: 3. Distribute indexing and searching into separate indexes for each
: node. Combine results using ParallelMultiSearcher. If a box went down, a
: piece of the index would be unavailable. Also, there would be serious

I haven't really considered this option because it would be unacceptable
for my application.

: 4. Distribute indexing and searching, but index everything at each node.
: Each node would have a complete copy of the index. Indexing would be
: slower. We could move to a 5 or 15 minute batch approach.

As I said, this is our current "last resort", but there are some serious
issues I worry about with this under high concurrent update/query load.
They are the same issues you would face if you only had one box -- but
frankly one of the main goals I see for a distributed solution is to reduce
the total amount of processing that needs to be done -- not multiply it by
the number of boxes, so I'm trying to find something better.

: 5. Index centrally and push updated indexes to search nodes on a
: periodic basis. This would be easy and might avoid the problems with
: using NFS.
:
: 6. Index locally and synchronize changes periodically. This is an
: interesting idea and bears looking into. Lucene can combine multiple
: indexes into a single one, which can be written out somewhere else, and
: then distributed back to the search nodes to replace their existing
: index.

Agreed.  These are two of the most promising ideas we're currently
considering, but we haven't actually tried implementing them yet.  The other
thing we have considered is having a pool of "updater" nodes which process
batches of additions into a small index, which is then copied out to all
of the other nodes.  These nodes can then either multi-search between their
existing index and the new one, or they can actually merge the new one
in (based on their current load).

The concern I have with approaches like this is that they still require
the individual nodes to all duplicate the work of merging, and ultimately
optimizing.  That's something I don't want them to have to do, especially
under potentially heavy query load.

What I'd really like is a single "primary indexer" box that builds up
lots of small RAMDirectory indexes as updates come in, and periodically
writes them to files to be copied over to "warm standby indexer" boxes.
All of the indexer boxes eventually merge these small indexes into the
master, which is versioned on a regular basis.  The primary indexer would
also be the main one to decide how often to do an optimize().
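
A rough sketch of that buffering idea with the 1.4 API (paths and the flush
threshold are placeholders; the master index is assumed to already exist):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class BufferingIndexer {
  private RAMDirectory buffer = new RAMDirectory();
  private IndexWriter bufferWriter;
  private int buffered = 0;

  public BufferingIndexer() throws Exception {
    bufferWriter = new IndexWriter(buffer, new StandardAnalyzer(), true);
  }

  // updates come in here and land in the small in-memory index
  public synchronized void add(Document doc) throws Exception {
    bufferWriter.addDocument(doc);
    if (++buffered >= 1000) {
      flushToMaster("/indexes/master");
    }
  }

  // periodically merge the small index into the master index on disk
  public synchronized void flushToMaster(String masterPath) throws Exception {
    bufferWriter.close();
    Directory master = FSDirectory.getDirectory(masterPath, false);
    IndexWriter masterWriter =
        new IndexWriter(master, new StandardAnalyzer(), false);
    masterWriter.addIndexes(new Directory[] { buffer });  // note: this optimizes
    masterWriter.close();
    buffer = new RAMDirectory();                          // start a fresh buffer
    bufferWriter = new IndexWriter(buffer, new StandardAnalyzer(), true);
    buffered = 0;
  }
}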

If the primary indexer goes down, one of the warm standby indexers can take
over with minimal loss of updates.

Then the various "query boxes" can periodically copy the most recent rev
of the index over whenever they want, close their existing IndexReader and
open a new one pointed at the new rev.

Problems that come up:

  1) For indexes big enough to warrant these kinds of reliability
     concerns, you need a lot of bandwidth to copy that much data around.
  2) Our application has an expectation that issuing the same query to two
     different nodes in the cluster at the same time should give you the
     same results.  For that to be true, an approach like the one I
     described would require some coordination mechanism to know the highest
     rev# of the index that had been copied to all of the boxes, and then
     signal them all to start using that rev at the same time.




-Hoss


--

Multiple indexes

2005-03-01 Thread Ben
Hi

My site has two types of documents with different structure. I would
like to create an index for each type of document. What is the best
way to implement this?

I have been trying to implement this but found out that 90% of the
code is the same.

In Lucene in Action book, there is a case study on jGuru, it just
mentions them using multiple indexes. I would like to do something
like them.

Any resources on the Internet that I can learn from?

Thanks,
Ben




Re: Fast access to a random page of the search results.

2005-03-01 Thread Doug Cutting
Daniel Naber wrote:
After fixing this I can reproduce the problem with a local index that 
contains about 220.000 documents (700MB). Fetching the first document 
takes for example 30ms, fetching the last one takes >100ms. Of course I 
tested this with a query that returns many results (about 50.000). 
Actually it happens even with the default sorting, no need to sort by some 
specific field.
In part this is due to the fact that Hits first searches for the 
top-scoring 100 documents.  Then, if you ask for a hit after that, it 
must re-query.  In part this is also due to the fact that maintaining a 
queue of the top 50k hits is more expensive than maintaining a queue of 
the top 100 hits, so the second query is slower.  And in part this could 
be caused by other things, such as that the highest ranking document 
might tend to be cached and not require disk io.

One could perform profiling to determine which is the largest factor. 
Of these, only the first is really fixable: if you know you'll need hit 
50k then you could tell this to Hits and have it perform only a single 
query.  But the algorithmic cost of keeping the queue of the top 50k is 
the same as collecting all the hits and sorting them.  So, in part, 
getting hits 49,990 through 50,000 is inherently slower than getting 
hits 0-10.  We can minimize that, but not eliminate it.
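
Hits itself doesn't take such a hint, but the lower-level search(Query,
Filter, int) call that returns TopDocs lets you ask for the top N in a single
pass; a minimal fragment (the query, field name and 50,000 cut-off are
placeholders, and 'searcher' is an open IndexSearcher):

// single pass over the index, no re-query when deep hits are requested
TopDocs top = searcher.search(query, null, 50000);
ScoreDoc[] scoreDocs = top.scoreDocs;
if (scoreDocs.length > 0) {
  ScoreDoc last = scoreDocs[scoreDocs.length - 1];
  Document doc = searcher.doc(last.doc);   // fetch the deepest returned hit
  System.out.println(doc.get("path") + " score=" + last.score);
}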

Doug


Best Practices for Distributing Lucene Indexing and Searching

2005-03-01 Thread Luke Francl
Lucene Users,

We have a requirement for a new version of our software that it run in a
clustered environment. Any node should be able to go down but the
application must keep functioning.

Currently, we use Lucene on a single node but this won't meet our fail
over requirements. If we can't find a solution, we'll have to stop using
Lucene and switch to something else, like full text indexing inside the
database.

So I'm looking for best practices on distributing Lucene indexing and
searching. I'd like to hear from those of you using Lucene in a
multi-process environment what is working for you. I've done some
research, and based on what I've seen so far, here's a bit of
brainstorming on what seems to be possible:

1. Don't. Have a single indexing and searching node. [Note: this is the
last resort.]

2. Don't distribute indexing. Searching is distributed by storing the
index on NFS. A single indexing node would process all requests.
However, using Lucene on NFS is *not* recommended. See:
http://lucenebook.com/search?query=nfs ...it can result in "stale NFS
file handle" problem:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12481.html
So we'd have to investigate this option. Indexing could use a JMS queue
so if the box goes down, when it comes back up, indexing could resume
where it left off.

3. Distribute indexing and searching into separate indexes for each
node. Combine results using ParallelMultiSearcher. If a box went down, a
piece of the index would be unavailable. Also, there would be serious
issues making sure assets are indexed in the right place to prevent
duplicates, stale results, or deleted assets from showing up in the
index. Another possibility would be a hashing scheme for
indexing...assets could be put into buckets based on their
IDs to prevent duplication. Keeping results consistent as you're
changing the number of the buckets as the nodes come up and down would
be a challenge, though.

4. Distribute indexing and searching, but index everything at each node.
Each node would have a complete copy of the index. Indexing would be
slower. We could move to a 5 or 15 minute batch approach.

5. Index centrally and push updated indexes to search nodes on a
periodic basis. This would be easy and might avoid the problems with
using NFS.

6. Index locally and synchronize changes periodically. This is an
interesting idea and bears looking into. Lucene can combine multiple
indexes into a single one, which can be written out somewhere else, and
then distributed back to the search nodes to replace their existing
index.

7. Create a JDBCDirectory implementation and let the database handle the
clustering. A JDBCDirectory exists
(http://ppinew.mnis.com/jdbcdirectory/), but has only been tested with
MySQL. It would probably require modification (the code is under the
LGPL). At one time, an OracleDirectory implementation existed but that
was in 2000 and so it is surely badly outdated. But in principle, the
concept is possible. However, these database-based directories are
slower at indexing and searching than the traditional style, probably
mostly due to BLOB handling.

8. Can the Berkeley DB-based DBDirectory help us? I am not sure what
advantages it would bring over the traditional FSDirectory, but maybe
someone else has some ideas.

Please let me know if you've got any other ideas or a best practice to
follow.

Thanks,
Luke Francl





RE: Investigating Lucene For Project

2005-03-01 Thread Runde, Kevin
Also there is a book called "Lucene in Action" that was released
recently. It is a great introduction to Lucene and has sections
dedicated to indexing different text document types (txt, html, pdf,
doc, rtf). FYI I am in no way related to the book or the authors so this
is a real recommendation. It will help you quickly learn what Lucene is
and can do. It has lots of pointers to other projects that use Lucene or
expand upon its functionality.

Thanks,
Kevin 

-Original Message-
From: Ben Litchfield [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 01, 2005 3:08 PM
To: Lucene Users List
Subject: Re: Investigating Lucene For Project 


See inlined comments below.

> We have had requests from some clients who would like the ability to
> "index"  PDF files, now and possibly other text files in the future.
The
> PDF files live on a server and are in a structured environment. I
would
> like to somehow index the content inside the PDF and be able to run
> searches on that information from a web-form. The result MUST BE a
text
> snippet (that being some text prior to the searched word and after the
> searched word).  Does this make sense? And can Lucene do this?


Lucene indexes text documents, so you will need to convert your PDF to a
text document.  PDFBox (http://www.pdfbox.org/) can do that, PDFBox
provides a summary of the document, which is just the first x number of
characters.  If you wanted a smarter summary you would need to create
that
yourself.

> If the product can do this, how is the best way to get rolling on a
> project of this nature? Purchase an example book, or are there simple
> examples one can pick up on? Does Lucene have a large learning curve?
or
> reasonably quick?

There are tutorials available on the website, and I would recommend
the "Lucene in Action" book.  There is a learning curve for lucene, but
it
sounds like your requirements are pretty basic so it shouldn't be that
hard.



> If all the above will work, what kind of license does this require? I
> have not been able to find a link to that yet on the jakarta site.

http://www.apache.org/licenses/LICENSE-2.0

Ben




Re: Investigating Lucene For Project

2005-03-01 Thread Ben Litchfield

See inlined comments below.

> We have had requests from some clients who would like the ability to
> "index"  PDF files, now and possibly other text files in the future. The
> PDF files live on a server and are in a structured environment. I would
> like to somehow index the content inside the PDF and be able to run
> searches on that information from a web-form. The result MUST BE a text
> snippet (that being some text prior to the searched word and after the
> searched word).  Does this make sense? And can Lucene do this?


Lucene indexes text documents, so you will need to convert your PDF to a
text document.  PDFBox (http://www.pdfbox.org/) can do that.  PDFBox
provides a summary of the document, which is just the first x number of
characters.  If you wanted a smarter summary you would need to create that
yourself.
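
For illustration, a rough sketch of that conversion (assuming the PDFBox
PDDocument and PDFTextStripper classes from that era; package and method
names are from memory, so please check the PDFBox javadocs):

import java.io.File;
import java.io.FileInputStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.pdfbox.pdmodel.PDDocument;        // assumed PDFBox 0.x package name
import org.pdfbox.util.PDFTextStripper;

public class PdfToLuceneDoc {
  public static Document convert(File pdf) throws Exception {
    PDDocument pd = PDDocument.load(new FileInputStream(pdf));
    try {
      String text = new PDFTextStripper().getText(pd);     // full extracted text
      Document doc = new Document();
      doc.add(Field.Keyword("path", pdf.getPath()));        // stored, not analyzed
      doc.add(Field.UnStored("contents", text));            // analyzed, not stored
      // keep the first 200 characters as a simple "summary"
      doc.add(Field.UnIndexed("summary",
          text.substring(0, Math.min(200, text.length()))));
      return doc;
    } finally {
      pd.close();
    }
  }
}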

> If the product can do this, how is the best way to get rolling on a
> project of this nature? Purchase an example book, or are there simple
> examples one can pick up on? Does Lucene have a large learning curve? or
> reasonably quick?

There are tutorials available on the website, and I would recommend
the "Lucene in Action" book.  There is a learning curve for lucene, but it
sounds like your requirements are pretty basic so it shouldn't be that
hard.



> If all the above will work, what kind of license does this require? I
> have not been able to find a link to that yet on the jakarta site.

http://www.apache.org/licenses/LICENSE-2.0

Ben




Investigating Lucene For Project

2005-03-01 Thread Scott Purcell
I am looking for a solution to a problem I am having. We have a web-based asset 
management solution where we manage customers' assets.
 
We have had requests from some clients who would like the ability to "index"  
PDF files, now and possibly other text files in the future. The PDF files live 
on a server and are in a structured environment. I would like to somehow index 
the content inside the PDF and be able to run searches on that information from 
a web-form. The result MUST BE  a text snippet (that being some text prior to 
the searched word and after the searched word). 
Does this make sense? And can Lucene do this?
 
If the product can do this, what is the best way to get rolling on a project of 
this nature? Purchase an example book, or are there simple examples one can 
pick up on? Does Lucene have a large learning curve, or is it reasonably quick?
 
If all the above will work, what kind of license does this require? I have not 
been able to find a link to that yet on the jakarta site.
 
I sincerely appreciate any input into this.
 
Sincerely
Scott 
 


Re: Zip Files

2005-03-01 Thread Chris Lamprecht
Luke,

Look at the javadocs for java.io.ByteArrayInputStream - it wraps a
byte array and makes it accessible as an InputStream.  Also see
java.util.zip.ZipFile.  You should be able to read and parse all
contents of the zip file in memory.

http://java.sun.com/j2se/1.4.2/docs/api/java/io/ByteArrayInputStream.html
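
For example, a minimal sketch (the per-file-type parsers are placeholders for
whatever classes you already have):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipReader {
  public static void readZip(String zipPath) throws Exception {
    ZipInputStream zis = new ZipInputStream(new FileInputStream(zipPath));
    ZipEntry entry;
    while ((entry = zis.getNextEntry()) != null) {
      if (entry.isDirectory()) continue;
      // read the whole entry into memory
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      int n;
      while ((n = zis.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
      // wrap the bytes so an InputStream-based parser can handle them
      InputStream in = new ByteArrayInputStream(buf.toByteArray());
      System.out.println(entry.getName() + ": " + buf.size() + " bytes");
      // hand 'in' to the parser for this entry's file type here
    }
    zis.close();
  }
}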


On Tue, 1 Mar 2005 12:39:17 -0500, Luke Shannon
<[EMAIL PROTECTED]> wrote:
> Thanks Ernesto.
> 
> I'm struggling with how I can work with an  array of bytes  instead of a
> Java File.
> 
> It would be easier to unzip the zip to a temp directory, parse the files and
> than delete the directory. But this would greatly slow indexing and use up
> disk space.
> 
> Luke
> 
> - Original Message -
> From: "Ernesto De Santis" <[EMAIL PROTECTED]>
> To: "Lucene Users List" 
> Sent: Tuesday, March 01, 2005 10:48 AM
> Subject: Re: Zip Files
> 
> > Hello
> >
> > first, you need a parser for each file type: pdf, txt, word, etc.
> > and use a java api to iterate zip content, see:
> >
> > http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html
> >
> > use getNextEntry() method
> >
> > little example:
> >
> > ZipInputStream zis = new ZipInputStream(fileInputStream);
> > ZipEntry zipEntry;
> > while ((zipEntry = zis.getNextEntry()) != null) {
> > //use zipEntry to get name, etc.
> > //get properly parser for current entry
> > //use parser with zis (ZipInputStream)
> > }
> >
> > good luck
> > Ernesto
> >
> > Luke Shannon escribió:
> >
> > >Hello;
> > >
> > >Anyone have an ideas on how to index the contents within zip files?
> > >
> > >Thanks,
> > >
> > >Luke
> > >
> > >
> >
> > --
> > Ernesto De Santis - Colaborativa.net
> > Córdoba 1147 Piso 6 Oficinas 3 y 4
> > (S2000AWO) Rosario, SF, Argentina.
> >
> >
> >
> >
> 




Re: Fast access to a random page of the search results.

2005-03-01 Thread Daniel Naber
On Tuesday 01 March 2005 19:15, Doug Cutting wrote:

> 'nHits - nHits' always equals zero.  So you're actually printing the
> first document, not the last.  The last document would be accessed with
> 'hits.doc(nHits - 1)'.

After fixing this I can reproduce the problem with a local index that 
contains about 220.000 documents (700MB). Fetching the first document 
takes for example 30ms, fetching the last one takes >100ms. Of course I 
tested this with a query that returns many results (about 50.000). 
Actually it happens even with the default sorting, no need to sort by some 
specific field.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Custom filters & document numbers

2005-03-01 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
Does this happen frequently?  Like Stanislav has been asking... what sort of
operations on the index cause the document number to change for any given
document?
Documents are only re-numbered after there have been deletions.  Once 
there have been deletions, renumbering may be triggered by any document 
addition or index optimization.  Once an index is optimized, no 
renumbering will be performed until more deletions are made.

If the document numbers change frequently, is there a
straightforward way to modify Lucene to keep the document numbers the same for
the life of the document?  I'd like to have mappings in my sql database that
point to the document numbers that Lucene search returns in its Hits objects.
If you require a persistent document id that survives deletions, then 
add it as a field to your documents.
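
For example (a minimal fragment; 'writer', 'text' and the field names are
placeholders):

// At index time: give every document a stable, application-assigned key.
Document doc = new Document();
doc.add(Field.Keyword("uid", "12345"));   // stored, indexed, not analyzed
doc.add(Field.Text("contents", text));
writer.addDocument(doc);

// Later: delete (or re-add) by the stable key, never by Lucene's doc number.
IndexReader reader = IndexReader.open("/path/to/index");
reader.delete(new Term("uid", "12345"));
reader.close();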

Doug


Re: Fast access to a random page of the search results.

2005-03-01 Thread Doug Cutting
Stanislav Jordanov wrote:
startTs = System.currentTimeMillis();
dummyMethod(hits.doc(nHits - nHits));
stopTs = System.currentTimeMillis();
System.out.println("Last doc accessed in " + (stopTs -
startTs)
+ "ms");
'nHits - nHits' always equals zero.  So you're actually printing the 
first document, not the last.  The last document would be accessed with 
'hits.doc(nHits - 1)'.  Accessing the last document should not be much 
slower (or faster) than accessing the first.

200+ milliseconds to access a document does seem slow.  Where is your 
index stored?  On a local hard drive?

Doug


RE: Zip Files

2005-03-01 Thread Crump, Michael
Not sure what you are using as your indexing classes but if you changed them to 
use InputStream I think it would go a long way towards making them more 
flexible and solving your problem.

> -Original Message-
> From: Luke Shannon [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 01, 2005 12:39 PM
> To: Lucene Users List
> Subject: Re: Zip Files
> 
> Thanks Ernesto.
> 
> The issue I'm working with now (this is more lack of experience than
> anything) is getting an input I can index. All my indexing classes (doc,
> pdf, xml, ppt) take a File object as a parameter and return a Lucene
> Document containing all the fields I need.
> 
> I'm struggling with how I can work with an  array of bytes  instead of a
> Java File.
> 
> It would be easier to unzip the zip to a temp directory, parse the files
> and
> than delete the directory. But this would greatly slow indexing and use up
> disk space.
> 
> Luke
> 
> - Original Message -
> From: "Ernesto De Santis" <[EMAIL PROTECTED]>
> To: "Lucene Users List" 
> Sent: Tuesday, March 01, 2005 10:48 AM
> Subject: Re: Zip Files
> 
> 
> > Hello
> >
> > first, you need a parser for each file type: pdf, txt, word, etc.
> > and use a java api to iterate zip content, see:
> >
> >
> http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html
> >
> > use getNextEntry() method
> >
> > little example:
> >
> > ZipInputStream zis = new ZipInputStream(fileInputStream);
> > ZipEntry zipEntry;
> > while ((zipEntry = zis.getNextEntry()) != null) {
> > //use zipEntry to get name, etc.
> > //get properly parser for current entry
> > //use parser with zis (ZipInputStream)
> > }
> >
> > good luck
> > Ernesto
> >
> > Luke Shannon escribió:
> >
> > >Hello;
> > >
> > >Anyone have an ideas on how to index the contents within zip files?
> > >
> > >Thanks,
> > >
> > >Luke
> > >
> > >
> >
> > --
> > Ernesto De Santis - Colaborativa.net
> > Córdoba 1147 Piso 6 Oficinas 3 y 4
> > (S2000AWO) Rosario, SF, Argentina.
> >
> >
> >
> >
> 
> 
> 





Re: 1.4.x TermInfosWriter.indexInterval not public static ?

2005-03-01 Thread Doug Cutting
Kevin A. Burton wrote:
BTW.. can you define "a bit"...
Merriam-Webster says:
  a bit : SOMEWHAT, RATHER
Is "a bit" 5%?  10%?  Benchmarks would be ncie but I'm not that picky.  
If you want benchmarks, make benchmarks.
I just want to see what performance hits/benefits I could see by 
tweaking the values.
This parameter determines the amount of computation required per query 
term, regardless of the number of documents that contain that term.  In 
particular, it is the maximum number of other terms that must be scanned 
before a term is located and its frequency and position information may 
be processed.  In a large index with user-entered query terms, query 
processing time is likely to be dominated not by term lookup but rather 
by the processing of frequency and positional data.  In a small index or 
when many uncommon query terms are generated (e.g., by wildcard queries) 
term lookup may become a dominant cost.  Benchmarking your application 
is the best way to determine this.

There is no single percentage answer.  There are cases where 99% of the 
query processing is in term lookup and there are cases where 1% of the 
query processing is in term lookup.  Chances are that, with a large 
index and user-entered query terms, only a small percentage of the time 
is spent in term lookup and thus increasing this value somewhat will not 
affect overall performance much.

If you need something more precise than "much" or "a bit", measure it.
Doug


Re: Zip Files

2005-03-01 Thread Luke Shannon
Thanks Ernesto.

The issue I'm working with now (this is more lack of experience than
anything) is getting an input I can index. All my indexing classes (doc,
pdf, xml, ppt) take a File object as a parameter and return a Lucene
Document containing all the fields I need.

I'm struggling with how I can work with an  array of bytes  instead of a
Java File.

It would be easier to unzip the zip to a temp directory, parse the files and
then delete the directory. But this would greatly slow indexing and use up
disk space.

Luke

- Original Message - 
From: "Ernesto De Santis" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Tuesday, March 01, 2005 10:48 AM
Subject: Re: Zip Files


> Hello
>
> first, you need a parser for each file type: pdf, txt, word, etc.
> and use a java api to iterate zip content, see:
>
> http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html
>
> use getNextEntry() method
>
> little example:
>
> ZipInputStream zis = new ZipInputStream(fileInputStream);
> ZipEntry zipEntry;
> while ((zipEntry = zis.getNextEntry()) != null) {
> //use zipEntry to get name, etc.
> //get properly parser for current entry
> //use parser with zis (ZipInputStream)
> }
>
> good luck
> Ernesto
>
> Luke Shannon escribió:
>
> >Hello;
> >
> >Anyone have an ideas on how to index the contents within zip files?
> >
> >Thanks,
> >
> >Luke
> >
> >
>
> -- 
> Ernesto De Santis - Colaborativa.net
> Córdoba 1147 Piso 6 Oficinas 3 y 4
> (S2000AWO) Rosario, SF, Argentina.
>
>
>






Large Index managing

2005-03-01 Thread Volodymyr Bychkoviak
Hi,
Just an idea for how to manage a large index that is updated very often.
Very often there is a need to update a document in the index. To update a
document you should delete the old document from the index and then add the
new one. In most cases this requires you to open an IndexReader, delete the
document, close the IndexReader, create an IndexWriter, add the document, close
the IndexWriter, and re-open the IndexSearcher (if the index is searched heavily).
Profiling some applications, I found that most of the time is spent in the
IndexReader.open() method. It also produces many objects, so it adds GC
overhead as well.

The idea to optimize this process is to create two indexes: one main index
that could be very large, and a second index that serves as a "change
buffer". We can keep one IndexReader open for the first index (and use
it for searching and for deleting old documents). The second index is small,
so we can reopen its IndexReader frequently when needed.

When the second index reaches some number of documents, we can merge it with
the main index.
To search this "multi" index we could use a MultiSearcher over the two
indexes, but with a little trick: the first IndexSearcher is kept the same
the whole time, until the second index is merged into the main one, while the
second IndexSearcher is reopened whenever the second index changes.
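
A rough sketch of that arrangement with the 1.4 API (paths, field names and
the merge trigger are placeholders):

// Main index: large, its searcher is kept open for a long time.
IndexSearcher mainSearcher = new IndexSearcher("/indexes/main");

// Buffer index: small, its searcher is reopened whenever it changes.
IndexSearcher bufferSearcher = new IndexSearcher("/indexes/buffer");

// Search both as one logical index.
MultiSearcher searcher = new MultiSearcher(
    new Searchable[] { mainSearcher, bufferSearcher });
Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));

// When the buffer grows past some threshold, merge it into the main index,
// start a fresh buffer, and reopen both searchers afterwards.
IndexWriter mainWriter =
    new IndexWriter("/indexes/main", new StandardAnalyzer(), false);
mainWriter.addIndexes(new Directory[] {
    FSDirectory.getDirectory("/indexes/buffer", false) });
mainWriter.close();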

It is just an idea (not tested).
Will it help to improve the speed of updating a large index and lower the
memory overhead?
Any comments?

Regards,
Volodymyr Bychkoviak



Re: Zip Files

2005-03-01 Thread Ernesto De Santis
Hello
first, you need a parser for each file type: pdf, txt, word, etc.
and use a java api to iterate zip content, see:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html
use getNextEntry() method
little example:
ZipInputStream zis = new ZipInputStream(fileInputStream);
ZipEntry zipEntry;
while ((zipEntry = zis.getNextEntry()) != null) {
   //use zipEntry to get the name, etc.
   //get the proper parser for the current entry
   //use that parser with zis (ZipInputStream)
}
good luck
Ernesto
Luke Shannon escribió:
Hello;
Anyone have an ideas on how to index the contents within zip files?
Thanks,
Luke
 

--
Ernesto De Santis - Colaborativa.net
Córdoba 1147 Piso 6 Oficinas 3 y 4
(S2000AWO) Rosario, SF, Argentina.



Zip Files

2005-03-01 Thread Luke Shannon
Hello;

Anyone have an ideas on how to index the contents within zip files?

Thanks,

Luke





Re: Remove document fails

2005-03-01 Thread Volodymyr Bychkoviak
Maybe you have an IndexWriter open at the same time you are trying to
delete the document.
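
In other words, make sure the write lock is released before deleting; a
minimal fragment ('writer', 'indexPath', 'analyzer' and 'docNum' are assumed
to already exist in your code):

// Only one of IndexWriter / a deleting IndexReader may hold the write lock.
writer.close();                              // release the write lock first

IndexReader reader = IndexReader.open(indexPath);
reader.delete(docNum);                       // or reader.delete(new Term("uid", id))
reader.close();                              // releases the lock again

writer = new IndexWriter(indexPath, analyzer, false);   // safe to reopen now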

Alex Kiselevski wrote:
Hi,
I have a problem doing IndexReader.delete(int doc)
and it fails on lock error.


Alex Kiselevski
+9.729.776.4346 (desk)
+9.729.776.1504 (fax)
AMDOCS > INTEGRATED CUSTOMER MANAGEMENT


 



RE: Re[2]: Is IndexSearcher thread safe?

2005-03-01 Thread Cocula Remi

I probably had the same trouble (but I'm not sure).
I have run a test program that was creating a lot of IndexSearchers (but also 
closing and freeing them).
It ended with an OutOfMemory exception.
But I'm not finished with that problem (I need to use a profiler).


>But I have discovered one strange fact. When you have an IndexSearcher on a
>big index, the IndexSearcher object takes a lot of memory (900Mb), and
>when you create a new IndexSearcher after deleting all references to the
>old IndexSearcher, the memory consumed by the old IndexSearcher will never
>be freed.
>What can the community say about this strange fact?

Yura Smolsky.






RE: Is IndexSearcher thread safe?

2005-03-01 Thread Cocula Remi


>Additional question.
>If I'm sharing one instance of IndexSearcher between different threads 
>Is it good to just to drop this instance to GC.
>Because I don't know if some thread is still using this searcher or done 
>with it.

Note that as long as one of the threads keeps a reference to the IndexSearcher, it 
cannot be garbage collected.
Perhaps you meant that you do not know how a thread can declare that it no 
longer needs the IndexSearcher.

To cope with this, I created an IndexSearcher pool.
The pool contains a list of IndexSearchers, and each one is associated with a 
counter.
To get an IndexSearcher reference one must request it from the pool, and then the 
counter is incremented.
(To make it cleaner I had the idea to replace the IndexSearcher references in the 
pool with proxy objects, so the pool never distributes references to 
IndexSearchers to client objects.
The counter can be managed inside the proxy.)

The pool has the ability to close and dereference an IndexSearcher when it is 
no longer used (counter = 0).
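
A very small sketch of such a pool, with just the reference counting (the
proxy idea is left out):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.IndexSearcher;

public class SearcherPool {
  // pairs an IndexSearcher with its own usage counter
  private static class Entry {
    IndexSearcher searcher;
    int refCount = 0;
    Entry(IndexSearcher s) { searcher = s; }
  }

  private Entry current;
  private final Map entries = new HashMap();   // IndexSearcher -> Entry

  public SearcherPool(String indexPath) throws IOException {
    current = new Entry(new IndexSearcher(indexPath));
    entries.put(current.searcher, current);
  }

  // clients must call release() once for every acquire()
  public synchronized IndexSearcher acquire() {
    current.refCount++;
    return current.searcher;
  }

  public synchronized void release(IndexSearcher s) throws IOException {
    Entry e = (Entry) entries.get(s);
    if (e == null) return;
    e.refCount--;
    if (e.refCount == 0 && e != current) {
      entries.remove(s);
      s.close();               // last user of an outdated searcher closes it
    }
  }

  // call this when the index has changed; old searchers linger until released
  public synchronized void reopen(String indexPath) throws IOException {
    Entry old = current;
    current = new Entry(new IndexSearcher(indexPath));
    entries.put(current.searcher, current);
    if (old.refCount == 0) {
      entries.remove(old.searcher);
      old.searcher.close();
    }
  }
}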

Hope it helps.






Remove document fails

2005-03-01 Thread Alex Kiselevski

Hi,
I have a problem doing IndexReader.delete(int doc)
and it fails on lock error.



Alex Kiselevski

+9.729.776.4346 (desk)
+9.729.776.1504 (fax)

AMDOCS > INTEGRATED CUSTOMER MANAGEMENT





RE: help with boolean expression

2005-03-01 Thread Omar Didi
I found something kind of weird about the way Lucene interprets boolean 
expressions without parentheses.
When I run the query A AND B OR C, it returns only the documents that have A (in 
other words, as if the query were just the term A). 
When I run the query A OR B AND C, it returns only the documents that have B 
AND C (as if the query were just B AND C). I set the default operator in my 
application to be AND. 
Can anyone explain this behavior? Thanks.

-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Monday, February 28, 2005 2:40 AM
To: Lucene Users List
Subject: Re: help with boolean expression


Omar Didi writes:
> I have a problem understanding how Lucene would interpret this boolean 
> expression: A AND B OR C.
> It neither returns the same count as when I enter (A AND B) OR C nor A AND (B 
> OR C). 
> If anyone knows how it is interpreted I would be thankful.
> thanks

A AND B OR C creates a query that requires A and B. C influences the 
score, but is neither sufficient nor required for a match.

IMO the query parser is broken for queries mixing AND and OR without explicit
parentheses.
My favorite example is `a AND b OR c AND d', which equals `a AND b AND c AND d'
in the query parser.
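
Until that's fixed, one way to avoid the ambiguity is to build the query
programmatically instead of parsing it; a minimal fragment against the 1.4
BooleanQuery API (field and terms are placeholders, 'searcher' is an open
IndexSearcher):

// Build (A AND B) OR C explicitly, leaving no room for parser ambiguity.
BooleanQuery aAndB = new BooleanQuery();
aAndB.add(new TermQuery(new Term("contents", "a")), true, false);   // required
aAndB.add(new TermQuery(new Term("contents", "b")), true, false);   // required

BooleanQuery query = new BooleanQuery();
query.add(aAndB, false, false);                                     // optional
query.add(new TermQuery(new Term("contents", "c")), false, false);  // optional

Hits hits = searcher.search(query);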

I suggested a patch some time ago, but it's still pending in bugzilla.
http://issues.apache.org/bugzilla/show_bug.cgi?id=25820

Don't know if it's still usable with current sources.

Morus




Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

2005-03-01 Thread Jonathan O'Connor
Apologies Erik,
This must be one of those apostrophe in email address problems I always
get. Recently I removed the apostrophe from the email address I give out.
Our server recognizes both email addresses, but some of these mail lists
don't like the O'Connor clann!
Ciao,
Jonathan O'Connor
XCOM Dublin



Erik Hatcher <[EMAIL PROTECTED]>
01/03/2005 12:16
Please respond to
"Lucene Users List" 


To
"Lucene Users List" 
cc

Subject
Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]






I had to moderate both Jonathan and Jon's messages in to the list.
Please subscribe to the list and post to it with the address you've
subscribed.  I cannot always guarantee I'll catch moderation messages
and send them through in a timely fashion.

 Erik

On Mar 1, 2005, at 6:18 AM, Jonathan O'Connor wrote:

> Jon,
> I too found some problems with the German analyser recently. Here's
> what
> may help:
> 1. You can try reading Joerg Caumanns' paper "A Fast and Simple
> Stemming
> Algorithm for German Words". This paper describes the algorithm
> implemented by GermanAnalyser.
> 2. I guess German nouns all capitalized, so maybe that's why. Although
> you
> would want to be indexing well written German and not emails or text
> messages!
> 3. The German Stemmer converts umlauts into some funny form (the code
> is a
> bit tricky, and I didn't spend any time looking at it), so maybe thats
> why
> you can't find umlauts properly. I think the main reason for this
> umlaut
> change is that many plurals are formed by umlauting: E.g. Haus, Haeuser
> (that ae is a umlaut).
>
> Finally, to really understand what's happening, get your hands on
> Luke. I
> just got it last week, and its brilliant. It shows you everything about
> your indexes. You can also feed text to an Analyser, and see what it
> makes
> of it. This will show you the real reason why your umlaut search is
> failing.
> Ciao,
> Jonathan O'Connor
> XCOM Dublin
>
>
>
> "Jon Humble" <[EMAIL PROTECTED]>
> 01/03/2005 09:35
> Please respond to
> "Lucene Users List" 
>
>
> To
> 
> cc
>
> Subject
> Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]
>
>
>
>
>
>
> Hello,
>
> We're using the GermanAnalyzer/Stemmer to index/search our (German)
> Website.
> I have a few questions:
>
> (1) Why is the GermanAnalyzer case-sensitive? None of the other
> language indexers seem to be. What does this feature add?
> (2) With the German Analyzer, wildcard searches containing extended
> German characters do not seem to work. So, a* is fine but anä* or ö*
> always find zero results.
> (3) In a similar vein to (2), wildcard searches with escaped
> special
> characters fail to find results. So a search for co\-operative works
> but
> a search for co\-op* fails.
>
> I will be grateful for any light that can be shed on these problems.
>
> With Thanks,
>
> Jon.
>
> Jon Humble
> BSc (hons,)
> Software Engineer
> eMail: [EMAIL PROTECTED]
>
> TecSphere Ltd
> Centre for Advanced Industry
> Coble Dene, Royal Quays
> Newcastle upon Tyne NE29 6DE
> United Kingdom
>
> Direct Dial: +44 (191) 270 31 06
> Fax: +44 (191) 270 31 09
> http://www.tecsphere.com
>
>
>
>
>
>
>
>








Re[2]: Is IndexSearcher thread safe?

2005-03-01 Thread Yura Smolsky
Hello, Volodymyr.

VB> Additional question.
VB> If I'm sharing one instance of IndexSearcher between different threads
VB> Is it good to just to drop this instance to GC.
VB> Because I don't know if some thread is still using this searcher or done
VB> with it.

It is safe to share one instance between many threads, and it should be
safe to drop the old object to GC.

But I have discovered one strange fact. When you have an IndexSearcher on a
big index, the IndexSearcher object takes a lot of memory (900Mb), and
when you create a new IndexSearcher after deleting all references to the
old IndexSearcher, the memory consumed by the old IndexSearcher will never
be freed.
What can the community say about this strange fact?

Yura Smolsky.






Re: Custom filters & document numbers

2005-03-01 Thread tomsdepot-lucene
I'm also interested in knowing what can change the doc numbers.

Does this happen frequently?  Like Stanislav has been asking... what sort of
operations on the index cause the document number to change for any given
document?  If the document numbers change frequently, is there a
straightforward way to modify Lucene to keep the document numbers the same for
the life of the document?  I'd like to have mappings in my sql database that
point to the document numbers that Lucene search returns in its Hits objects.

Thanks,

-Tom-

--- Stanislav Jordanov <[EMAIL PROTECTED]> wrote:

> The first statement is clear to me:
> I know that an IndexReader sees a 'snapshot' of the document set that was
> taken in the moment of the Reader's creation.
> 
> What I don't know is whether this 'snapshot' has also its doc numbers fixed
> or they may change asynchronously.
> And another thing I don't know is what are the index operations that may
> cause the (doc -> doc number) mapping to change.
> Is it only after delete or there are other ocasions, or I'd better not count
> on this at all.
> 
> StJ
> 
> - Original Message - 
> From: "Vanlerberghe, Luc" <[EMAIL PROTECTED]>
> To: "Lucene Users List" 
> Sent: Thursday, February 24, 2005 4:07 PM
> Subject: RE: Custom filters & document numbers
> 
> 
> > An IndexReader will always see the same set of documents.
> > Even if another process deletes some documents, adds new ones or
> > optimizes the complete index, your IndexReader instance will not see
> > those changes.
> >
> > If you detect that the Lucene index changed (e.g. by calling
> > IndexReader.getCurrentVersion(...) once in a while), you should close
> > and reopen your 'current' IndexReader and recalculate any data that
> > relies on the Lucene document numbers.
> >
> > Regards, Luc.
> >
> > -Original Message-
> > From: Stanislav Jordanov [mailto:[EMAIL PROTECTED]
> > Sent: donderdag 24 februari 2005 14:18
> > To: Lucene Users List
> > Subject: Custom filters & document numbers
> >
> > Given an IndexReader a custom filter is supposed to create a bit set,
> > that maps each document numbers to {'visible', 'invisible'} On the other
> > hand, it is stated that Lucene is allowed to change document numbers.
> > Is it guaranteed that this BitSet's view of document numbers won't
> > change while the BitSet is still in use (or perhaps the corresponding
> > IndexReader is still opened) ?
> >
> > And another (more low-level) question.
> > When Lucene may change document numbers?
> > Is it only when the index is optimized after there has been a delete
> > operation?
> >
> > Regards: StJ
> >
> >



Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

2005-03-01 Thread Erik Hatcher
I had to moderate both Jonathan and Jon's messages in to the list.  
Please subscribe to the list and post to it with the address you've 
subscribed.  I cannot always guarantee I'll catch moderation messages 
and send them through in a timely fashion.

Erik
On Mar 1, 2005, at 6:18 AM, Jonathan O'Connor wrote:
Jon,
I too found some problems with the German analyser recently. Here's what
may help:
1. You can try reading Joerg Caumanns' paper "A Fast and Simple Stemming
Algorithm for German Words". This paper describes the algorithm
implemented by GermanAnalyzer.
2. I guess it's because German nouns are all capitalized. Although you
would want to be indexing well-written German and not emails or text
messages!
3. The German Stemmer converts umlauts into some funny form (the code is a
bit tricky, and I didn't spend any time looking at it), so maybe that's why
you can't find umlauts properly. I think the main reason for this umlaut
change is that many plurals are formed by umlauting: e.g. Haus, Haeuser
(that ae is an umlaut).

Finally, to really understand what's happening, get your hands on Luke. I
just got it last week, and it's brilliant. It shows you everything about
your indexes. You can also feed text to an Analyser and see what it makes
of it. This will show you the real reason why your umlaut search is
failing.
Ciao,
Jonathan O'Connor
XCOM Dublin


"Jon Humble" <[EMAIL PROTECTED]>
01/03/2005 09:35
Please respond to
"Lucene Users List" 
To

cc
Subject
Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]


Hello,
We're using the GermanAnalyzer/Stemmer to index/search our (German)
Website.
I have a few questions:
(1) Why is the GermanAnalyzer case-sensitive? None of the other
language indexers seem to be. What does this feature add?
(2) With the German Analyzer, wildcard searches containing extended
German characters do not seem to work. So, a* is fine but ä* or ö*
always find zero results.
(3) In a similar vein to (2), wildcard searches with escaped special
characters fail to find results. So a search for co\-operative works but
a search for co\-op* fails.

I will be grateful for any light that can be shed on these problems.
With Thanks,
Jon.
Jon Humble
BSc (hons,)
Software Engineer
eMail: [EMAIL PROTECTED]
TecSphere Ltd
Centre for Advanced Industry
Coble Dene, Royal Quays
Newcastle upon Tyne NE29 6DE
United Kingdom
Direct Dial: +44 (191) 270 31 06
Fax: +44 (191) 270 31 09
http://www.tecsphere.com



*** XCOM AG Legal Disclaimer ***

This email may contain material that is confidential and for the sole 
use of the intended recipient. Any review, distribution by others or 
forwarding without express permission is strictly prohibited. If you 
are not the intended recipient, please contact the sender and delete 
all copies.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

2005-03-01 Thread Jonathan O'Connor
Jon,
I too found some problems with the German analyser recently. Here's what
may help:
1. You can try reading Joerg Caumanns' paper "A Fast and Simple Stemming
Algorithm for German Words". This paper describes the algorithm
implemented by GermanAnalyzer.
2. I guess it's because German nouns are all capitalized. Although you
would want to be indexing well-written German and not emails or text
messages!
3. The German Stemmer converts umlauts into some funny form (the code is a
bit tricky, and I didn't spend any time looking at it), so maybe that's why
you can't find umlauts properly. I think the main reason for this umlaut
change is that many plurals are formed by umlauting: e.g. Haus, Haeuser
(that ae is an umlaut).

Finally, to really understand what's happening, get your hands on Luke. I
just got it last week, and it's brilliant. It shows you everything about
your indexes. You can also feed text to an Analyser and see what it makes
of it. This will show you the real reason why your umlaut search is
failing.
Ciao,
Jonathan O'Connor
XCOM Dublin



"Jon Humble" <[EMAIL PROTECTED]>
01/03/2005 09:35
Please respond to
"Lucene Users List" 


To

cc

Subject
Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]






Hello,

We're using the GermanAnalyzer/Stemmer to index/search our (German)
Website.
I have a few questions:

(1) Why is the GermanAnalyzer case-sensitive? None of the other
language indexers seem to be. What does this feature add?
(2) With the German Analyzer, wildcard searches containing extended
German characters do not seem to work. So, a* is fine but ä* or ö*
always find zero results.
(3) In a similar vein to (2), wildcard searches with escaped special
characters fail to find results. So a search for co\-operative works but
a search for co\-op* fails.

I will be grateful for any light that can be shed on these problems.

With Thanks,

Jon.

Jon Humble
BSc (hons,)
Software Engineer
eMail: [EMAIL PROTECTED]

TecSphere Ltd
Centre for Advanced Industry
Coble Dene, Royal Quays
Newcastle upon Tyne NE29 6DE
United Kingdom

Direct Dial: +44 (191) 270 31 06
Fax: +44 (191) 270 31 09
http://www.tecsphere.com








*** XCOM AG Legal Disclaimer ***


This email may contain material that is confidential and for the sole use of 
the intended recipient. Any review, distribution by others or forwarding 
without express permission is strictly prohibited. If you are not the intended 
recipient, please contact the sender and delete all copies.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
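
If you do not want to fire up Luke just to inspect tokens, here is a tiny
stand-alone sketch in the same spirit (the field name and sample text are
arbitrary placeholders); it prints exactly what the GermanAnalyzer emits,
which makes the umlaut substitution visible:

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanAnalyzer;

import java.io.StringReader;

public class ShowGermanTokens {
    public static void main(String[] args) throws Exception {
        String text = args.length > 0 ? args[0] : "Haus Häuser häuslich";
        // The field name is irrelevant for analysis; any string will do.
        TokenStream ts = new GermanAnalyzer().tokenStream("contents", new StringReader(text));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText());
        }
        ts.close();
    }
}

Running it over a few umlauted words and their plain forms shows which terms
actually end up in the index, and therefore which wildcard prefixes can match
them.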



Re: Is IndexSearcher thread safe?

2005-03-01 Thread Volodymyr Bychkoviak
Additional question.
If I'm sharing one instance of IndexSearcher between different threads, 
is it OK to just drop this instance and let the GC collect it? 
Because I don't know whether some thread is still using this searcher or is done 
with it.

Regards,
Volodymyr Bychkoviak
Volodymyr Bychkoviak wrote:
Is it thread-safe to share one
instance of IndexSearcher between multiple threads?
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
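
For what it is worth, running concurrent searches through one shared
IndexSearcher is the usual pattern; the part that needs care is closing it.
Below is a minimal, hypothetical sketch (the index path, field name and query
string are placeholders) that simply defers the close until every thread has
finished:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SharedSearcherDemo {
    public static void main(String[] args) throws Exception {
        final IndexSearcher searcher = new IndexSearcher(args[0]); // path to an existing index
        Runnable job = new Runnable() {
            public void run() {
                try {
                    Query q = QueryParser.parse("lucene", "contents", new StandardAnalyzer());
                    Hits hits = searcher.search(q);
                    System.out.println(Thread.currentThread().getName()
                            + ": " + hits.length() + " hits");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        };
        Thread t1 = new Thread(job);
        Thread t2 = new Thread(job);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        searcher.close(); // close only after all threads are done with it
    }
}

If you cannot easily tell when the last thread is done, one common approach is
to let the old instance become unreferenced instead of closing it: searches in
flight keep working, at the cost of the index files staying open until the
reader is eventually reclaimed.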


Is IndexSearcher thread safe?

2005-03-01 Thread Volodymyr Bychkoviak
Is it thread-safe to share one
instance of IndexSearcher between multiple threads?
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Questions about GermanAnalyzer/Stemmer

2005-03-01 Thread Jon Humble
Hello,
 
We're using the GermanAnalyzer/Stemmer to index/search our (German)
Website.
I have a few questions:
 
(1) Why is the GermanAnalyzer case-sensitive? None of the other
language indexers seem to be. What does this feature add?
(2) With the German Analyzer, wildcard searches containing extended
German characters do not seem to work. So, a* is fine but ä* or ö*
always find zero results. 
(3) In a similar vein to (2), wildcard searches with escaped special
characters fail to find results. So a search for co\-operative works but
a search for co\-op* fails.
 
I will be grateful for any light that can be shed on these problems.
 
With Thanks,
 
Jon.
 
Jon Humble
BSc (hons,)
Software Engineer
eMail: [EMAIL PROTECTED]

TecSphere Ltd
Centre for Advanced Industry
Coble Dene, Royal Quays
Newcastle upon Tyne NE29 6DE
United Kingdom
 
Direct Dial: +44 (191) 270 31 06
Fax: +44 (191) 270 31 09
http://www.tecsphere.com
 
 


Re: Fast access to a random page of the search results.

2005-03-01 Thread Stanislav Jordanov
// The test source code (second attempt).
// Just in case the .txt attachment does not pass through
// I am pasting the code here:

package index_test;

import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import java.io.*;
import java.util.Enumeration;
import java.util.StringTokenizer;
import java.util.ArrayList;

public class Search {
    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            throw new Exception("Usage: " + Search.class.getName() + " <index-dir>");
        }

        File indexDir = new File(args[0]);

        if (!indexDir.exists() || !indexDir.isDirectory()) {
            throw new Exception(indexDir + " does not exist or is not a directory.");
        }

        System.out.println("Using index: " + indexDir.getCanonicalPath());

        BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
        search(indexDir);
    }

    public static void search(File indexDir) throws Exception {
        Directory fsDir = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher is = null;
        BufferedReader brdr = new BufferedReader(new InputStreamReader(System.in));

        String q;
        Sort sort = null;
        while (!(q = brdr.readLine()).equals("exit")) {
            q = q.trim();
            // "newsearcher" re-opens the searcher so a changed index becomes visible.
            if (is == null || q.equals("newsearcher")) {
                is = new IndexSearcher(fsDir);
                if (q.equals("newsearcher")) {
                    continue;
                }
            }
            // "sort field1 -field2 ..." sets the sort order; a leading '-' means reverse.
            if (q.startsWith("sort ")) {
                StringTokenizer tkz = new StringTokenizer(q);
                tkz.nextToken(); // skip the "sort" word
                ArrayList sortFields = new ArrayList();
                while (tkz.hasMoreTokens()) {
                    String tok = tkz.nextToken();
                    boolean reverse = false;
                    if (tok.startsWith("-")) {
                        tok = tok.substring(1);
                        reverse = true;
                    }
                    sortFields.add(new SortField(tok, reverse));
                }
                sort = new Sort((SortField[]) sortFields.toArray(new SortField[0]));
                System.out.println("Sorting by " + sort);
                continue;
            }
            if (q.equals("nosort")) {
                sort = null;
                System.out.println("Sorting is off");
                continue;
            }
            long startTs = System.currentTimeMillis();
            Query query = null;
            try {
                query = QueryParser.parse(q, "qcontent", new StandardAnalyzer(new String[0]));
            } catch (ParseException exn) {
                // catch Lucene's ParseException so a bad query doesn't abort the loop
                exn.printStackTrace();
                continue;
            }
            Hits hits = (sort != null ? is.search(query, sort) : is.search(query));
            int nHits = hits.length();
            long stopTs = System.currentTimeMillis();
            System.out.println("Found " + nHits + " document(s) that matched query '" + q + "'");
            System.out.println("Sorting by " + sort);
            System.out.println("query executed in " + (stopTs - startTs) + "ms");

            if (nHits > 0) {
                startTs = System.currentTimeMillis();
                dummyMethod(hits.doc(nHits - 1)); // touch the last hit to force it to load
                stopTs = System.currentTimeMillis();
                System.out.println("Last doc accessed in " + (stopTs - startTs) + "ms");
            }
        }
    }

    public static double dummyMethod(Document doc) {
        return doc.getBoost();
    }

    // Unused here, but handy for debugging: dumps all stored fields of a document.
    private static void dumpDocument(Document doc) throws IOException {
        System.out.println("");
        for (Enumeration e = doc.fields(); e.hasMoreElements(); ) {
            Field f = (Field) e.nextElement();
            System.out.println(f.name() + " ::>> '" + f.stringValue() + "'");
        }
        System.out.println("");
    }
}
package index_test;

import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import java.io.*;
import java.util.Enumeration;
import java.util.StringTokenizer;
import java.util.ArrayList;

import com.odi.util.query.QueryParseException;

public class Search {
public static void main(String[] args) throws Exception {
if (args.length != 1) {
throw new Exception("Usage: " + Search.class.getName() + " ");
}

File indexDir = ne