Using the highlighter from the sandbox with a prefix query.

2005-02-16 Thread lucuser4851
Dear All,
We have been using the highlighter from the Lucene sandbox, which works
very nicely most of the time. However, when we try to use it with a
prefix query (which is what you get after parsing a wildcard query), it
doesn't return any highlighted sections. Has anyone else experienced
this problem, or found a way around it?
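
For reference, a minimal sketch (not from this thread) of one angle that is
sometimes suggested for term-based scorers: rewriting the query against an
IndexReader so the PrefixQuery expands into the concrete terms it matches
before it reaches the scorer. Class names assume the sandbox highlighter
package; the index path, field name, and documentText are placeholders.

  // Sketch only: expand the prefix query into real terms, then highlight.
  IndexReader reader = IndexReader.open("/path/to/index");          // placeholder path
  Query query = QueryParser.parse("hig*", "contents", new StandardAnalyzer());
  Query rewritten = query.rewrite(reader);                          // PrefixQuery -> the terms it matches

  Highlighter highlighter = new Highlighter(new QueryScorer(rewritten));
  TokenStream tokens = new StandardAnalyzer().tokenStream("contents",
          new StringReader(documentText));                          // documentText: the stored field text
  String fragment = highlighter.getBestFragment(tokens, documentText);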

Thanks a lot for your suggestions!!






bookkeeping documents cause problems in Sort

2005-02-16 Thread aurora
I understand that, unlike a relational database, Lucene is flexible about
documents having different sets of fields. My index has documents with a date
and a content field. There are also a few bookkeeping documents that do
not have the date field. Things work well except in one case:

  Sort sort = new Sort("date");
  Hits hits = searcher.search(query, sort);
In this case an exception is thrown:
  java.lang.RuntimeException: field "date" does not appear to be indexed
It does not make sense to sort by 'date' when a document does not have a
'date' field. On the other hand, I don't expect search() to return any
bookkeeping documents at all, since the query looks for fields that are not
in those documents. Is this an implementation issue, or is there an inherent
reason why all documents need to have the 'date' field if it is used for
sorting?
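
Not from the thread, just a hedged illustration of one way around it: give the
bookkeeping documents a sentinel value for 'date' at index time, or keep them
out of any query that is sorted on 'date'. The "type" field and the sentinel
value below are made up for the example.

  // Sketch: make sure every document carries the sort field.
  Document doc = new Document();
  doc.add(Field.Keyword("type", "bookkeeping"));   // hypothetical marker field
  doc.add(Field.Keyword("date", "00000000"));      // sentinel that sorts before real dates
  writer.addDocument(doc);

  // ...or exclude such documents from the sorted search instead:
  BooleanQuery q = new BooleanQuery();
  q.add(query, true, false);                                            // original query, required
  q.add(new TermQuery(new Term("type", "bookkeeping")), false, true);   // prohibited
  Hits hits = searcher.search(q, new Sort("date"));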



knowing which field contributed the search result

2005-02-16 Thread John Wang
Hi:

   Is there a way, given a hit from a search, to find out which
fields contributed to the hit?

e.g.

If I search for:

contents1="brown fox" OR contents2="black bear"

can the document found by this query also carry information on
whether it was found via contents1, contents2, or both?
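
There may be no direct flag for this, but as a hedged sketch, two things that
can be probed per hit: the scoring explanation, or the individual clauses
re-run against the hit. Field and variable names follow the example above.

  Hits hits = searcher.search(query);
  for (int i = 0; i < hits.length(); i++) {
      int docId = hits.id(i);

      // A human-readable breakdown showing which clauses scored for this doc.
      Explanation explanation = searcher.explain(query, docId);
      System.out.println(explanation.toString());

      // Or test one field's clause on its own (clause1 = the contents1 query):
      // boolean viaContents1 = searcher.explain(clause1, docId).getValue() > 0.0f;
  }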


Thanks

-John




Re: Re[2]: big index and multi threaded IndexSearcher

2005-02-16 Thread PA
On Feb 16, 2005, at 21:28, Yura Smolsky wrote:
> Well, I don't have 6 CPUs in one box :)
What about 6 boxes with 1 CPU each? :P
Cheers
--
PA, Onnay Equitursay
http://alt.textdrive.com/


Re[2]: big index and multi threaded IndexSearcher

2005-02-16 Thread Yura Smolsky
Hello, Erik.

EH> Are you using multiple IndexSearcher instances? Or only one and
EH> sharing it across multiple threads?

EH> If using a single shared IndexSearcher instance doesn't help, it may be
EH> beneficial to port your code to Java and try it there.

I have a single instance of IndexSearcher and I pass a reference to it to
each thread. I will port the code to Java if no other ideas come to
mind...

EH> On Feb 16, 2005, at 3:04 PM, Yura Smolsky wrote:

>> Hello.
>>
>> I use PyLucene, the Python port of Lucene.
>>
>> I have a problem using a big index (50 GB) with IndexSearcher
>> from many threads.
>> I use IndexSearcher from PyLucene's PythonThread. It's really a wrapper
>> around a Java/libgcj thread that Python is tricked into thinking
>> is one of its own.
>>
>> The core of the problem:
>> when I have many threads (more than 5) I receive this exception:
>>   File "/usr/lib/python2.4/site-packages/PyLucene.py", line 2241, in search
>> def search(*args): return _PyLucene.Searcher_search(*args)
>> ValueError: java.lang.OutOfMemoryError
>><>
>>
>> When I decrease the number of threads to 3 or even 1, the search works.
>> How can the number of threads affect this exception?
>>
>> I have 2 GB of memory, and with one thread the process takes about
>> 1200-1300 MB.
>>
>> Andi Vajda suggested that "There may be overhead involved in having
>> multiple threads against a given index."
>>
>> Does anyone here have experience in handling big indexes with many
>> threads?


Yura Smolsky.






Re[2]: big index and multi threaded IndexSearcher

2005-02-16 Thread Yura Smolsky
Hello, PA.


>> Does anyone here have experience in handling big indexes with many
>> threads?
P> What about turning the problem around and splitting your index into
P> several chunks? Then you could search those (smaller) indices in 
P> parallel and consolidate the final result, no?

Well, I don't have 6 CPUs in one box :)

Yura Smolsky.






Re: big index and multi threaded IndexSearcher

2005-02-16 Thread Erik Hatcher
Are you using multiple IndexSearcher instances? Or only one and
sharing it across multiple threads?

If using a single shared IndexSearcher instance doesn't help, it may be 
beneficial to port your code to Java and try it there.
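
(For illustration only, a minimal Java sketch of the single-shared-searcher
pattern; the index path and query are placeholders.)

  // One IndexSearcher, opened once, shared by every search thread.
  final IndexSearcher searcher = new IndexSearcher("/path/to/index");

  Runnable worker = new Runnable() {
      public void run() {
          try {
              Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
              System.out.println(Thread.currentThread().getName() + ": " + hits.length());
          } catch (IOException e) {
              e.printStackTrace();
          }
      }
  };
  for (int i = 0; i < 5; i++) {
      new Thread(worker).start();
  }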

I'm just now getting into PyLucene myself - building a demo for a Unix 
User's Group presentation I'm giving.

Erik
On Feb 16, 2005, at 3:04 PM, Yura Smolsky wrote:
Hello.
I use PyLucene, the Python port of Lucene.
I have a problem using a big index (50 GB) with IndexSearcher
from many threads.
I use IndexSearcher from PyLucene's PythonThread. It's really a wrapper
around a Java/libgcj thread that Python is tricked into thinking
is one of its own.
The core of the problem:
when I have many threads (more than 5) I receive this exception:
  File "/usr/lib/python2.4/site-packages/PyLucene.py", line 2241, in search
def search(*args): return _PyLucene.Searcher_search(*args)
ValueError: java.lang.OutOfMemoryError
   <>

When I decrease the number of threads to 3 or even 1, the search works.
How can the number of threads affect this exception?
I have 2 GB of memory, and with one thread the process takes about
1200-1300 MB.
Andi Vajda suggested that "There may be overhead involved in having
multiple threads against a given index."
Does anyone here have experience in handling big indexes with many
threads?
Any ideas are appreciated.
Yura Smolsky.



Re: big index and multi threaded IndexSearcher

2005-02-16 Thread PA
On Feb 16, 2005, at 21:04, Yura Smolsky wrote:
> Does anyone here have experience in handling big indexes with many
> threads?
What about turning the problem around and splitting your index into
several chunks? Then you could search those (smaller) indices in 
parallel and consolidate the final result, no?
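
(Roughly like this, as a sketch: MultiSearcher runs the sub-indices one after
another, while ParallelMultiSearcher, if your Lucene version includes it,
gives each its own thread and merges the results. Paths are placeholders.)

  // Sketch: several smaller indices searched as one logical index.
  Searchable[] shards = new Searchable[] {
      new IndexSearcher("/indexes/part1"),
      new IndexSearcher("/indexes/part2"),
      new IndexSearcher("/indexes/part3"),
  };
  Searcher searcher = new ParallelMultiSearcher(shards);
  Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));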

Cheers
--
PA, Onnay Equitursay
http://alt.textdrive.com/


big index and multi threaded IndexSearcher

2005-02-16 Thread Yura Smolsky
Hello.

I use PyLucene, the Python port of Lucene.

I have a problem using a big index (50 GB) with IndexSearcher
from many threads.
I use IndexSearcher from PyLucene's PythonThread. It's really a wrapper
around a Java/libgcj thread that Python is tricked into thinking
is one of its own.

The core of the problem:
when I have many threads (more than 5) I receive this exception:
  File "/usr/lib/python2.4/site-packages/PyLucene.py", line 2241, in search
def search(*args): return _PyLucene.Searcher_search(*args)
ValueError: java.lang.OutOfMemoryError
   <>

When I decrease the number of threads to 3 or even 1, the search works.
How can the number of threads affect this exception?

I have 2 GB of memory, and with one thread the process takes about
1200-1300 MB.

Andi Vajda suggested that "There may be overhead involved in having
multiple threads against a given index."

Does anyone here have experience in handling big indexes with many
threads?

Any ideas are appreciated.

Yura Smolsky.






RE: Concurrent searching & re-indexing

2005-02-16 Thread Paul Mellor
But all write access to the index is synchronized, so that although multiple
threads are creating an IndexWriter for the same directory and using it to
totally recreate that index, only one thread is doing this at once.

I was concerned about the safety of using an IndexSearcher to perform
queries on an index that is in the process of being recreated from scratch,
but I guess that if the IndexSearcher takes a snapshot of the index when it
is created (and in my code this creation is synchronized with the write
operations as well so that the threads wait for the write operations to
finish before instantiating an IndexSearcher, and vice versa) this can't be
a problem.
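
Purely as an illustrative sketch of that kind of wrapper (one lock shared by
full re-indexing and searcher creation; whether searching itself must also be
excluded is exactly the open question):

  import java.io.IOException;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.search.IndexSearcher;

  // Sketch: one lock guards both full re-indexing and searcher creation.
  public class IndexManager {
      private final Object lock = new Object();
      private final String path;

      public IndexManager(String path) { this.path = path; }

      public void rebuild(Document[] docs, Analyzer analyzer) throws IOException {
          synchronized (lock) {
              IndexWriter writer = new IndexWriter(path, analyzer, true); // create=true wipes the index
              try {
                  for (int i = 0; i < docs.length; i++) {
                      writer.addDocument(docs[i]);
                  }
                  writer.optimize();
              } finally {
                  writer.close();
              }
          }
      }

      public IndexSearcher getSearcher() throws IOException {
          synchronized (lock) {                 // don't open a searcher mid-rebuild
              return new IndexSearcher(path);
          }
      }
  }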

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 16 February 2005 17:30
To: Lucene Users List
Subject: Re: Concurrent searching & re-indexing


Hi Paul,

If I understand your setup correctly, it looks like you are running
multiple threads that each create an IndexWriter for the same directory.
That's a "no-no".

This section (first hit) describes the various concurrency issues with
regard to adds, updates, optimization, and searches:
  http://www.lucenebook.com/search?query=concurrent

IndexSearcher (IndexReader, really) does take a snapshot of the index
state when it is opened, so at that time the index segments listed in the
segments file should be in a complete state.  It also reads index files
when searching, of course.

Otis


--- Paul Mellor <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> I've read from various sources on the Internet that it is perfectly
> safe to
> simultaneously search a Lucene index that is being updated from
> another
> Thread, as long as all write access to the index is synchronized. 
> But does
> this apply only to updating the index (i.e. deleting and adding
> documents),
> or to a complete re-indexing (i.e. create a new IndexWriter with the
> 'create' argument true and then re-add all the documents)?
> 
> I have a class which encapsulates all access to my index, so that
> writes can
> be synchronized.  This class also exposes a method to obtain an
> IndexSearcher for the index.  I'm running unit tests to test this
> which
> create many threads - each thread does a complete re-indexing and
> then
> obtains an IndexSearcher and does a query.
> 
> I'm finding that with sufficiently high numbers of threads, I'm
> getting the
> occasional failure, with the following exception thrown when
> attempting to
> construct a new IndexWriter (during the reindexing) -
> 
> java.io.IOException: couldn't delete _a.f1
> at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
> at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
> at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:151)
> ...
> 
> The exception occurs quite infrequently (usually for somewhere
> between 1-5%
> of the Threads).
> 
> Does the IndexSearcher take a 'snapshot' of the index at creation? 
> Or does
> it access the filesystem whilst searching?  I am also synchronizing
> creation
> of the IndexSearcher with the write lock, so that the IndexSearcher
> is not
> created whilst the index is being recreated (and vice versa).  But do
> I need
> to ensure that the IndexSearcher cannot search whilst the index is
> being
> recreated as well?
> 
> Note that a similar unit test where the threads update the index
> (rather
> than recreate it from scratch) works fine, as expected.
> 
> This is running on Windows 2000.
> 
> Any help would be much appreciated!
> 
> Paul
> 





Re: Concurrent searching & re-indexing

2005-02-16 Thread Otis Gospodnetic
Hi Paul,

If I understand your setup correctly, it looks like you are running
multiple threads that each create an IndexWriter for the same directory.
That's a "no-no".

This section (first hit) describes the various concurrency issues with
regard to adds, updates, optimization, and searches:
  http://www.lucenebook.com/search?query=concurrent

IndexSearcher (IndexReader, really) does take a snapshot of the index
state when it is opened, so at that time the index segments listed in the
segments file should be in a complete state.  It also reads index files
when searching, of course.
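
(A small sketch of that snapshot behaviour: the searcher opened before a
rebuild keeps answering from the segments it saw when it was opened, and has
to be reopened to see the rebuilt index. Paths and query are placeholders.)

  IndexSearcher before = new IndexSearcher("/path/to/index");  // snapshot of the current segments

  // ... another thread recreates the index with new IndexWriter(path, analyzer, true) ...

  // 'before' still searches the old snapshot; on Windows it also still holds
  // the old files open, which is one plausible reason the delete inside
  // FSDirectory.create() can fail.
  Hits oldView = before.search(query);

  before.close();                                              // release the old files
  IndexSearcher after = new IndexSearcher("/path/to/index");   // sees the rebuilt index
  Hits newView = after.search(query);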

Otis


--- Paul Mellor <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> I've read from various sources on the Internet that it is perfectly
> safe to
> simultaneously search a Lucene index that is being updated from
> another
> Thread, as long as all write access to the index is synchronized. 
> But does
> this apply only to updating the index (i.e. deleting and adding
> documents),
> or to a complete re-indexing (i.e. create a new IndexWriter with the
> 'create' argument true and then re-add all the documents)?
> 
> I have a class which encapsulates all access to my index, so that
> writes can
> be synchronized.  This class also exposes a method to obtain an
> IndexSearcher for the index.  I'm running unit tests to test this
> which
> create many threads - each thread does a complete re-indexing and
> then
> obtains an IndexSearcher and does a query.
> 
> I'm finding that with sufficiently high numbers of threads, I'm
> getting the
> occasional failure, with the following exception thrown when
> attempting to
> construct a new IndexWriter (during the reindexing) -
> 
> java.io.IOException: couldn't delete _a.f1
> at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
> at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
> at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:151)
> ...
> 
> The exception occurs quite infrequently (usually for somewhere
> between 1-5%
> of the Threads).
> 
> Does the IndexSearcher take a 'snapshot' of the index at creation? 
> Or does
> it access the filesystem whilst searching?  I am also synchronizing
> creation
> of the IndexSearcher with the write lock, so that the IndexSearcher
> is not
> created whilst the index is being recreated (and vice versa).  But do
> I need
> to ensure that the IndexSearcher cannot search whilst the index is
> being
> recreated as well?
> 
> Note that a similar unit test where the threads update the index
> (rather
> than recreate it from scratch) works fine, as expected.
> 
> This is running on Windows 2000.
> 
> Any help would be much appreciated!
> 
> Paul
> 





Concurrent searching & re-indexing

2005-02-16 Thread Paul Mellor
Hi,

I've read from various sources on the Internet that it is perfectly safe to
simultaneously search a Lucene index that is being updated from another
Thread, as long as all write access to the index is synchronized.  But does
this apply only to updating the index (i.e. deleting and adding documents),
or to a complete re-indexing (i.e. create a new IndexWriter with the
'create' argument true and then re-add all the documents)?

I have a class which encapsulates all access to my index, so that writes can
be synchronized.  This class also exposes a method to obtain an
IndexSearcher for the index.  I'm running unit tests to test this which
create many threads - each thread does a complete re-indexing and then
obtains an IndexSearcher and does a query.

I'm finding that with sufficiently high numbers of threads, I'm getting the
occasional failure, with the following exception thrown when attempting to
construct a new IndexWriter (during the reindexing) -

java.io.IOException: couldn't delete _a.f1
at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:151)
...

The exception occurs quite infrequently (usually for somewhere between 1-5%
of the Threads).

Does the IndexSearcher take a 'snapshot' of the index at creation?  Or does
it access the filesystem whilst searching?  I am also synchronizing creation
of the IndexSearcher with the write lock, so that the IndexSearcher is not
created whilst the index is being recreated (and vice versa).  But do I need
to ensure that the IndexSearcher cannot search whilst the index is being
recreated as well?

Note that a similar unit test where the threads update the index (rather
than recreate it from scratch) works fine, as expected.

This is running on Windows 2000.

Any help would be much appreciated!

Paul



Re: Multiple Keywords/Keyphrases fields

2005-02-16 Thread Paul Elschot
On Wednesday 16 February 2005 06:49, Owen Densmore wrote:
> > From: Erik Hatcher <[EMAIL PROTECTED]>
> > Date: February 12, 2005 3:09:15 PM MST
> > To: "Lucene Users List" 
> > Subject: Re: Multiple Keywords/Keyphrases fields
> >
> >
> > The real question to answer is what types of queries you're planning 
> > on making.  Rather than look at it from indexing forward, consider it 
> > from searching backwards.
> >
> > How will users query using those keyword phrases?
> 
> Hi Erik.  Good point.
> 
> There are two uses we are making of the keyphrases:
> 
>   - Graphical Navigation: A Flash graphical browser will allow users to 
> fly around in a space of documents, choosing what to be viewing: 
> Authors, Keyphrases and Textual terms.  In any of these cases, the 
> "closeness" of any of the fields will govern how close they will appear 
> graphically.  In the case of authors, we will weight collaboration .. 
> how often the authors work together.  In the case of Keyphrases, we 
> will want to use something like distance vectors like you show in the 
> book using the cosine measure.  Thus the keyphrases need to be separate 
> entities within the document .. it would be a bug for us if the terms 
> leaked across the separate keyphrases within the document.
> 
>   - Textual Search: In this case, we will have two ways to search the 
> keyphrases.  The first would be like the graphical navigation above 
> where searching for "complex system" should require the terms to be in 
> a single keyphrase.  The second way will be looser, where we may simply 
> pool the keyphrases with titles and abstract, and allow them all to be 
> searched together within the document.
> 
> Does this make sense?  So the question from the search standpoint is: 
> do multiple instances of a field act like there are barriers across the 
> instances, or are they somehow treated as a single instance?

Multiple field instances with the same name in a document are concatenated in
the index in the order in which they were added to the document.
For each instance of a field in the document, even when it has the same name,
the analyzer is asked to provide a new token stream.

This happens in org.apache.lucene.index.DocumentWriter.invertDocument().
The last position offset in the field as indexed is maintained for this
purpose.

> In terms of the closeness calculation, for example, can we get separate 
> term vectors for each instance of the keyphrase field, or will we get a 
> single vector combining all the keyphrase terms within a single 
> document?

The positions in the TermVectors are treated in the same way.

To put a barrier between field instances with the same name,
one can put a gap in the indexed term positions. This gap requires a larger
query proximity to match. AND-like queries will still match in the indexed field.

A gap is implemented by having the analyzer provide a token stream
whose first token has a position increment equal to the gap.
For the first field instance with the same name the gap is not needed.
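
To make that concrete, here is a hypothetical filter along those lines (not an
existing Lucene class): it enlarges the position increment of the first token
it emits, and the indexing code would wrap the analyzer's stream with it for
every field instance after the first.

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  /** Hypothetical filter: shifts this field instance 'gap' positions past the
   *  previous instance by enlarging the first token's position increment. */
  public class PositionGapFilter extends TokenFilter {
      private final int gap;
      private boolean first = true;

      public PositionGapFilter(TokenStream input, int gap) {
          super(input);
          this.gap = gap;
      }

      public Token next() throws IOException {
          Token token = input.next();
          if (token != null && first) {
              token.setPositionIncrement(token.getPositionIncrement() + gap);
              first = false;
          }
          return token;
      }
  }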

Regards,
Paul Elschot

> 
> I hope this is clear!  Kinda hard to articulate.
> 
> Owen
> 
> > Erik
> >
> > On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote:
> >
> >> I'm getting a bit more serious about the final form of our lucene 
> >> index.  Each document has DocNumber, Authors, Title, Abstract, and 
> >> Keywords.  By Keywords, I mean a comma separated list, each entry 
> >> having possibly many terms in a phrase like:
> >>temporal infomax, finite state automata, Markov chains,
> >>conditional entropy, neural information processing
> >>
> >> I presume I should be using a field "Keywords" which has many 
> >> "entries" or "instances" per document (one per comma separated 
> >> phrase).  But I'm not sure the right way to handle all this.  My 
> >> assumption is that I should analyze them individually, just as we do 
> >> for free text (the Abstract, for example), thus in the example above 
> >> having 5 entries of the nature
> >>doc.add(Field.Text("Keywords", "finite state automata"));
> >> etc, analyzing them because these are author-supplied strings with no 
> >> canonical form.
> >>
> >> For guidance, I looked in the archive and found the attached email, 
> >> but I didn't see the answer.  (I'm not concerned about the dups, I 
> >> presume that is equivalent to a boost of some sort) Does this seem 
> >> right?
> >>
> >> Thanks once again.
> >>
> >> Owen
> >>
> >>> From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
> >>> Subject: Multiple equal Fields?
> >>> Date: Tue, 17 Feb 2004 12:47:58 +0100
> >>>
> >>> Hi!
> >>> What happens if I do this:
> >>>
> >>> doc.add(Field.Text("foo", "bar"));
> >>> doc.add(Field.Text("foo", "blah"));
> >>>
> >>> Is there a field "foo" with value "blah" or are there two "foo"s 
> >>> (actually not
> >>> possible) or is there one "foo" with the values "bar" and "blah"?
> >>>
> >>> And what does happen in this case:
> >>>
> >>> doc.add(Field.Text("foo"