Re: Query Tuning

2005-02-21 Thread Paul Elschot
On Monday 21 February 2005 19:59, Runde, Kevin wrote:
 Hi All,
 
 How does Lucene handle multi term queries? Does it use short circuiting?
 So if a user entered:
 (a OR b) AND c
 But my program knew testing for c is cheaper than testing for (a OR
 b) and I rewrote the query as:
 c AND (a OR b)
 Would the query run faster?

Exchanging the operands of AND would not make a noticeable difference
in speed. Queries are evaluated by iterating the inverted term index entries
for all query terms  in parallel, with buffering.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Query Tuning

2005-02-21 Thread Paul Elschot
On Monday 21 February 2005 20:43, Todd VanderVeen wrote:
 Runde, Kevin wrote:
 
 Hi All,
 
 How does Lucene handle multi term queries? Does it use short circuiting?
 So if a user entered:
 (a OR b) AND c
 But my program knew testing for c is cheaper than testing for (a OR
 b) and I rewrote the query as:
 c AND (a OR b)
 Would the query run faster?
 
 Sorry if this has already been answered, but for some reason the Archive
 search is not working for me today.
 
 Thanks,
 Kevin
 
 
   
 
 Not sure about what is in CVS, but look at BooleanQuery.scorer(). If all 

It's in svn nowadays.

 of the clauses of the BooleanQuery are required and none of the clauses 
 are BooleanQueries a ConjunctionScorer is returned that offers the 
 optimizations you seek. In the example you gave, there is a clause that 
 is boolean ( a or b) that will have to be evaluated independently with a 
 boolean scorer. This will be performed regardless of the ordering. 
 (BooleanScorer doesn't preserve document order when it returns results 
 and hence it can't utilize the optimal algorithm provided by 
 ConjunctionScorer). Others have been down this path as evidenced by the 
 sigh in the javadoc.

In the svn version a ConjunctionScorer is used for all top level AND queries.
 
 If calculating (a or b) is expensive and the docFreq of a is much less 
 than the union of a and b, you might consider rewriting it to (a and c) 
 or (b and c) using the distributive law. Expansion like this isn't always 
 beneficial and can't be applied blindly. As far as I can tell there is 

In the svn version the subquery (a or b) is only evaluated for documents
matching c. In the current version the expansion to
(a and c) or (b and c)
might help: the tradeoff is between evaluating c twice and having
less work for the OR operator.
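For illustration, a minimal sketch of that expansion with the 1.4 BooleanQuery API (the field name "contents" and the terms a, b and c are placeholders):

  // uses org.apache.lucene.index.Term and org.apache.lucene.search.*
  BooleanQuery aAndC = new BooleanQuery();
  aAndC.add(new TermQuery(new Term("contents", "a")), true, false);
  aAndC.add(new TermQuery(new Term("contents", "c")), true, false);

  BooleanQuery bAndC = new BooleanQuery();
  bAndC.add(new TermQuery(new Term("contents", "b")), true, false);
  bAndC.add(new TermQuery(new Term("contents", "c")), true, false);

  // (a AND c) OR (b AND c)
  BooleanQuery expanded = new BooleanQuery();
  expanded.add(aAndC, false, false);
  expanded.add(bAndC, false, false);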

 no query planning/optimization aside from the merging of related clauses 
 and attempts to rewrite to simpler queries.

One optimization in the current version is the use of ConjunctionScorer
for some cases. One such case, which happens a lot in practice, is a
query that has a few required terms.

Another optimization in the current version is that some scoring is done ahead
for each clause into an unordered buffer.
This helps for top level OR queries, but loses for OR queries that are
subqueries of AND.

The svn version does not score ahead. It relies on the buffering done by
TermScorer. Perhaps the buffering for a TermScorer should be made
dependent on its expected use: more buffering for top level OR, less 
buffering when used under AND.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Optional Terms in a single query

2005-02-21 Thread Paul Elschot
On Monday 21 February 2005 23:23, Luke Shannon wrote:
 Hi;
 
 I'm trying to create a query that looks for a field containing type:181 and
 name doesn't contain tim, bill or harry.

type:181 -(name:tim name:bill name:harry)
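
Built through the query API rather than the parser, that would look roughly like this (a sketch; the field names follow your examples):

  // uses org.apache.lucene.index.Term and org.apache.lucene.search.*
  BooleanQuery names = new BooleanQuery();
  names.add(new TermQuery(new Term("name", "tim")), false, false);
  names.add(new TermQuery(new Term("name", "bill")), false, false);
  names.add(new TermQuery(new Term("name", "harry")), false, false);

  BooleanQuery query = new BooleanQuery();
  query.add(new TermQuery(new Term("type", "181")), true, false); // required
  query.add(names, false, true);                                  // prohibited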

 +(type: 181) +((-name: tim -name:bill -name:harry +oldfaith:stillHere))

stillHere is normally lowercased before searching. Is that ok?

 +(type: 181) +((-name: tim OR bill OR harry +oldfaith:stillHere))
 +(type: 181) +((-name:*(tim bill harry)* +olfaithfull:stillhere))

typo? olfaithfull 

 +(type:1 81) +((-name:*(tim OR bill OR harry)* +olfaithfull:stillhere))

typo? (type:1 81)
 
 I would really like to do this all in one Query. Is this even possible?

How would you want to combine the results?

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene in the Humanities

2005-02-19 Thread Paul Elschot
Erik,

On Saturday 19 February 2005 01:33, Erik Hatcher wrote:
 
 On Feb 18, 2005, at 6:37 PM, Paul Elschot wrote:
 
  On Friday 18 February 2005 21:55, Erik Hatcher wrote:
 
  On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
 
  Erik,
 
  Just curious: it would seem easier to use multiple fields for the
  original case and lowercase searching. Is there any particular reason
  you analyzed the documents to multiple indexes instead of multiple
  fields?
 
  I considered that approach, however to expose QueryParser I'd have to
  get tricky.  If I have title_orig and title_lc fields, how would I
  allow freeform queries of title:something?
 
  By lowercasing the querytext and searching in title_lc ?
 
 Well sure, but how about this query:
 
   title:Something AND anotherField:someOtherValue
 
 QueryParser, as-is, won't be able to do field-name swapping.  I could 
 certainly apply that technique on all the structured queries that I 
 build up with the API, but with QueryParser it is trickier.   I'm 
 definitely open for suggestions on improving how case is handled.  The 

Overriding this (1.4.3 QueryParser.jj, line 286) might work:

protected Query getFieldQuery(String field, String queryText)
throws ParseException { ... }

It will be called by the parser for both parts of the query above, so one
could change the field depending on the requested type of search
and the field name in the query.
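A sketch of such an override, using the signature quoted above (the _orig/_lc field naming convention, the case-sensitivity switch, and the assumption that the analyzer lowercases terms for the _lc fields are all assumptions):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;

  public class CaseSwitchingQueryParser extends QueryParser {
      private final boolean caseSensitive;

      public CaseSwitchingQueryParser(String field, Analyzer analyzer, boolean caseSensitive) {
          super(field, analyzer);
          this.caseSensitive = caseSensitive;
      }

      protected Query getFieldQuery(String field, String queryText) throws ParseException {
          // map the plain field name onto the indexed variant for the requested search type
          String indexedField = caseSensitive ? field + "_orig" : field + "_lc";
          return super.getFieldQuery(indexedField, queryText);
      }
  }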

 only drawback now is that I'm duplicating indexes, but that is only an 
 issue in how long it takes to rebuild the index from scratch (currently 
 about 20 minutes or so on a good day - when the machine isn't swamped).

Once the users get the hang of this, you might end up having to quadruple
the index, or more.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene in the Humanities

2005-02-19 Thread Paul Elschot
On Saturday 19 February 2005 11:02, Erik Hatcher wrote:
 
 On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote:
  By lowercasing the querytext and searching in title_lc ?
 
  Well sure, but how about this query:
 
 title:Something AND anotherField:someOtherValue
 
  QueryParser, as-is, won't be able to do field-name swapping.  I could
  certainly apply that technique on all the structured queries that I
  build up with the API, but with QueryParser it is trickier.   I'm
  definitely open for suggestions on improving how case is handled.  The
 
  Overriding this (1.4.3 QueryParser.jj, line 286) might work:
 
  protected Query getFieldQuery(String field, String queryText)
  throws ParseException { ... }
 
  It will be called by the parser for both parts of the query above, so 
  one
  could change the field depending on the requested type of search
  and the field name in the query.
 
 But that wouldn't work for any other type of query 
 title:somethingFuzzy~

To get that it would be necessary to override all query parser
methods that take a field argument.

 
 Though now that I think more about it, a simple s/title:/title_orig:/ 
 before parsing would work, and of course make the default field 

In the overriding getFieldQuery method something like:

if (caseSensitiveSearch(field) && originalFieldIndexed(field)) {
  field = field + "_orig";
} else { // the other 3 cases
 ...
}
return super.getFieldQuery(field, queryText);

The if statement could be factored out for the other overriding methods.

 dynamic.   I need to evaluate how many fields would need to be done 
 this way - it'd be several.  Thanks for the food for thought!
 
  only drawback now is that I'm duplicating indexes, but that is only an
  issue in how long it takes to rebuild the index from scratch 
  (currently
  about 20 minutes or so on a good day - when the machine isn't 
  swamped).
 
  Once the users get the hang of this, you might end up having to 
  quadruple
  the index, or more.
 
 Why would that be?   They want a case sensitive/insensitive switch.  
 How would it expand beyond that?

With an index for every combination of fields and case sensitivity for these
fields.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Concurrent searching & re-indexing

2005-02-18 Thread Paul Mellor
Ok, I will change my reindex method to delete all documents and then re-add
them all, rather than using an IndexWriter to write a completely new index.

Thanks for the help on this everyone.

Paul

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: 17 February 2005 22:26
To: Lucene Users List
Subject: Re: Concurrent searching & re-indexing


Paul Mellor wrote:
 I've read from various sources on the Internet that it is perfectly safe
to
 simultaneously search a Lucene index that is being updated from another
 Thread, as long as all write access to the index is synchronized.  But
does
 this apply only to updating the index (i.e. deleting and adding
documents),
 or to a complete re-indexing (i.e. create a new IndexWriter with the
 'create' argument true and then re-add all the documents)?
[ ...]
 java.io.IOException: couldn't delete _a.f1
 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
[...]
 This is running on Windows 2000.

On Windows one cannot delete a file while it is still open.  So, no, on 
Windows one cannot remove an index entirely while an IndexReader or 
Searcher is still open on it, since it is simply impossible to remove 
all the files in the index.

We might attempt to patch this by keeping a list of such files and 
attempt to delete them later (as is done when updating an index).  But 
this could cause problems, as a new index will eventually try to use 
these same file names again, and it would then conflict with the open 
IndexReader.  This is not a problem when updating an existing index, 
since filenames (except for a few which are not kept open, like 
segments) are never reused in the lifetime of an index.  So, in order 
for such a fix to work we would need to switch to globally unique 
segment names, e.g., long random strings, rather than increasing integers.

In the meantime, the safe way to rebuild an index from scratch while 
other processes are reading it is simply to delete all of its documents, 
then start adding new ones.
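
A sketch of that approach with the 1.4 API (indexDir and analyzer are placeholders):

  IndexReader reader = IndexReader.open(indexDir);
  for (int i = 0; i < reader.maxDoc(); i++) {
      if (!reader.isDeleted(i)) {
          reader.delete(i);        // mark every existing document as deleted
      }
  }
  reader.close();                  // releases the write lock taken by delete()

  IndexWriter writer = new IndexWriter(indexDir, analyzer, false); // false: do not create
  // re-add all documents here with writer.addDocument(...)
  writer.optimize();
  writer.close();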

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Lucene in the Humanities

2005-02-18 Thread Paul Elschot
Erik,

Just curious: it would seem easier to use multiple fields for the
original case and lowercase searching. Is there any particular reason
you analyzed the documents to multiple indexes instead of multiple fields?

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene in the Humanities

2005-02-18 Thread Paul Elschot
On Friday 18 February 2005 21:55, Erik Hatcher wrote:
 
 On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
 
  Erik,
 
  Just curious: it would seem easier to use multiple fields for the
  original case and lowercase searching. Is there any particular reason
  you analyzed the documents to multiple indexes instead of multiple 
  fields?
 
 I considered that approach, however to expose QueryParser I'd have to 
 get tricky.  If I have title_orig and title_lc fields, how would I 
 allow freeform queries of title:something?

By lowercasing the querytext and searching in title_lc ?

Regards,
Paul Elschot.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Concurrent searching & re-indexing

2005-02-17 Thread Paul Mellor
Otis,

Looking at your reply again, I have a couple of questions -

IndexSearcher (IndexReader, really) does take a snapshot of the index state
when it is opened, so at that time the index segments listed in segments
should be in a complete state.  It also reads index files when searching, of
course.

1. If IndexReader takes a snapshot of the index state when opened and then
reads the files when searching, what would happen if the files it takes a
snapshot of are deleted before the search is performed (as would happen with
a reindexing in the period between opening an IndexSearcher and using it to
search)?

2. Does a similar potential problem exist when optimising an index, if this
combines all the segments into a single file?

Many thanks

Paul

-Original Message-
From: Paul Mellor [mailto:[EMAIL PROTECTED]
Sent: 16 February 2005 17:37
To: 'Lucene Users List'
Subject: RE: Concurrent searching & re-indexing


But all write access to the index is synchronized, so that although multiple
threads are creating an IndexWriter for the same directory and using it to
totally recreate that index, only one thread is doing this at once.

I was concerned about the safety of using an IndexSearcher to perform
queries on an index that is in the process of being recreated from scratch,
but I guess that if the IndexSearcher takes a snapshot of the index when it
is created (and in my code this creation is synchronized with the write
operations as well so that the threads wait for the write operations to
finish before instantiating an IndexSearcher, and vice versa) this can't be
a problem.

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 16 February 2005 17:30
To: Lucene Users List
Subject: Re: Concurrent searching & re-indexing


Hi Paul,

If I understand your setup correctly, it looks like you are running
multiple threads that create IndexWriter for the same directory.  That's
a no no.

This section (first hit) describes all various concurrency issues with
regards to adds, updates, optimization, and searches:
  http://www.lucenebook.com/search?query=concurrent

IndexSearcher (IndexReader, really) does take a snapshot of the index
state when it is opened, so at that time the index segments listed in
segments should be in a complete state.  It also reads index files when
searching, of course.

Otis


--- Paul Mellor [EMAIL PROTECTED] wrote:

 Hi,
 
 I've read from various sources on the Internet that it is perfectly
 safe to
 simultaneously search a Lucene index that is being updated from
 another
 Thread, as long as all write access to the index is synchronized. 
 But does
 this apply only to updating the index (i.e. deleting and adding
 documents),
 or to a complete re-indexing (i.e. create a new IndexWriter with the
 'create' argument true and then re-add all the documents)?
 
 I have a class which encapsulates all access to my index, so that
 writes can
 be synchronized.  This class also exposes a method to obtain an
 IndexSearcher for the index.  I'm running unit tests to test this
 which
 create many threads - each thread does a complete re-indexing and
 then
 obtains an IndexSearcher and does a query.
 
 I'm finding that with sufficiently high numbers of threads, I'm
 getting the
 occasional failure, with the following exception thrown when
 attempting to
 construct a new IndexWriter (during the reindexing) -
 
 java.io.IOException: couldn't delete _a.f1
 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
 at

org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
 at

org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
 at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:151)
 ...
 
 The exception occurs quite infrequently (usually for somewhere
 between 1-5%
 of the Threads).
 
 Does the IndexSearcher take a 'snapshot' of the index at creation? 
 Or does
 it access the filesystem whilst searching?  I am also synchronizing
 creation
 of the IndexSearcher with the write lock, so that the IndexSearcher
 is not
 created whilst the index is being recreated (and vice versa).  But do
 I need
 to ensure that the IndexSearcher cannot search whilst the index is
 being
 recreated as well?
 
 Note that a similar unit test where the threads update the index
 (rather
 than recreate it from scratch) works fine, as expected.
 
 This is running on Windows 2000.
 
 Any help would be much appreciated!
 
 Paul
 
Re: Multiple Keywords/Keyphrases fields

2005-02-16 Thread Paul Elschot
On Wednesday 16 February 2005 06:49, Owen Densmore wrote:
  From: Erik Hatcher [EMAIL PROTECTED]
  Date: February 12, 2005 3:09:15 PM MST
  To: Lucene Users List lucene-user@jakarta.apache.org
  Subject: Re: Multiple Keywords/Keyphrases fields
 
 
  The real question to answer is what types of queries you're planning 
  on making.  Rather than look at it from indexing forward, consider it 
  from searching backwards.
 
  How will users query using those keyword phrases?
 
 Hi Erik.  Good point.
 
 There are two uses we are making of the keyphrases:
 
   - Graphical Navigation: A Flash graphical browser will allow users to 
 fly around in a space of documents, choosing what to be viewing: 
 Authors, Keyphrases and Textual terms.  In any of these cases, the 
 closeness of any of the fields will govern how close they will appear 
 graphically.  In the case of authors, we will weight collaboration .. 
 how often the authors work together.  In the case of Keyphrases, we 
 will want to use something like distance vectors like you show in the 
 book using the cosine measure.  Thus the keyphrases need to be separate 
 entities within the document .. it would be a bug for us if the terms 
 leaked across the separate keyphrases within the document.
 
   - Textual Search: In this case, we will have two ways to search the 
 keyphrases.  The first would be like the graphical navigation above 
 where searching for "complex system" should require the terms to be in 
 a single keyphrase.  The second way will be looser, where we may simply 
 pool the keyphrases with titles and abstract, and allow them all to be 
 searched together within the document.
 
 Does this make sense?  So the question from the search standpoint is: 
 do multiple instances of a field act like there are barriers across the 
 instances, or are they somehow treated as a single instance somehow.  

Multiple field instances with the same name in a document are concatenated in
the index in the order in which they were added to the document.
For each instance of a field in the document, even when it has the same name, 
the analyzer is asked to provide a new tokenstream. 

This happens in org.apache.lucene.index.DocumentWriter.invertDocument().
The last position offset in the field as indexed is maintained for this
purpose.

 In terms of the closeness calculation, for example, can we get separate 
 term vectors for each instance of the keyphrase field, or will we get a 
 single vector combining all the keyphrase terms within a single 
 document?

The positions in the TermVectors are treated in the same way.

To put a barrier between field instances with the same name,
one can put a gap in the indexed term positions. This gap needs a larger
query proximity to match. AND-like queries will still match in the indexed field.

A gap is implemented by providing a token stream from the analyzer
whose first token has a position increment that equals the gap.
For the first field instance with the same name the gap is not needed.
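
A sketch of an analyzer that inserts such a gap (the wrapping of another analyzer, the gap size of 100, and the need to call reset() before each new document are assumptions of this sketch):

  import java.io.IOException;
  import java.io.Reader;
  import java.util.HashSet;
  import java.util.Set;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  public class GapAnalyzer extends Analyzer {
      private final Analyzer delegate;
      private final int gap;
      private final Set seenFields = new HashSet();

      public GapAnalyzer(Analyzer delegate, int gap) {
          this.delegate = delegate;
          this.gap = gap;
      }

      // call this before each new Document is added
      public void reset() { seenFields.clear(); }

      public TokenStream tokenStream(String fieldName, Reader reader) {
          TokenStream ts = delegate.tokenStream(fieldName, reader);
          boolean firstInstance = seenFields.add(fieldName);
          return firstInstance ? ts : new GapFilter(ts, gap);
      }

      private static class GapFilter extends TokenFilter {
          private final int gap;
          private boolean first = true;

          GapFilter(TokenStream in, int gap) { super(in); this.gap = gap; }

          public Token next() throws IOException {
              Token t = input.next();
              if (t != null && first) {
                  t.setPositionIncrement(gap);  // the gap for the first token of this instance
                  first = false;
              }
              return t;
          }
      }
  }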

Regards,
Paul Elschot

 
 I hope this is clear!  Kinda hard to articulate.
 
 Owen
 
  Erik
 
  On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote:
 
  I'm getting a bit more serious about the final form of our lucene 
  index.  Each document has DocNumber, Authors, Title, Abstract, and 
  Keywords.  By Keywords, I mean a comma separated list, each entry 
  having possibly many terms in a phrase like:
 temporal infomax, finite state automata, Markov chains,
 conditional entropy, neural information processing
 
  I presume I should be using a field "Keywords" which has many 
  entries or instances per document (one per comma separated 
  phrase).  But I'm not sure the right way to handle all this.  My 
  assumption is that I should analyze them individually, just as we do 
  for free text (the Abstract, for example), thus in the example above 
  having 5 entries of the nature
  doc.add(Field.Text("Keywords", "finite state automata"));
  etc, analyzing them because these are author-supplied strings with no 
  canonical form.
 
  For guidance, I looked in the archive and found the attached email, 
  but I didn't see the answer.  (I'm not concerned about the dups, I 
  presume that is equivalent to a boost of some sort.) Does this seem 
  right?
 
  Thanks once again.
 
  Owen
 
  From: [EMAIL PROTECTED] [EMAIL PROTECTED]
  Subject: Multiple equal Fields?
  Date: Tue, 17 Feb 2004 12:47:58 +0100
 
  Hi!
  What happens if I do this:
 
  doc.add(Field.Text("foo", "bar"));
  doc.add(Field.Text("foo", "blah"));
 
  Is there a field "foo" with value "blah", or are there two "foo"s (actually not
  possible), or is there one "foo" with the values "bar" and "blah"?
 
  And what does happen in this case:
 
  doc.add(Field.Text("foo", "bar"));
  doc.add(Field.Text("foo", "bar"));
  doc.add(Field.Text("foo", "bar"));
 
  Does lucene store this only once?
 
  Timo
 
 
 
 

RE: Concurrent searching & re-indexing

2005-02-16 Thread Paul Mellor
But all write access to the index is synchronized, so that although multiple
threads are creating an IndexWriter for the same directory and using it to
totally recreate that index, only one thread is doing this at once.

I was concerned about the safety of using an IndexSearcher to perform
queries on an index that is in the process of being recreated from scratch,
but I guess that if the IndexSearcher takes a snapshot of the index when it
is created (and in my code this creation is synchronized with the write
operations as well so that the threads wait for the write operations to
finish before instantiating an IndexSearcher, and vice versa) this can't be
a problem.

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 16 February 2005 17:30
To: Lucene Users List
Subject: Re: Concurrent searching & re-indexing


Hi Paul,

If I understand your setup correctly, it looks like you are running
multiple threads that create IndexWriter for the same directory.  That's
a no no.

This section (first hit) describes all various concurrency issues with
regards to adds, updates, optimization, and searches:
  http://www.lucenebook.com/search?query=concurrent

IndexSearcher (IndexReader, really) does take a snapshot of the index
state when it is opened, so at that time the index segments listed in
segments should be in a complete state.  It also reads index files when
searching, of course.

Otis


--- Paul Mellor [EMAIL PROTECTED] wrote:

 Hi,
 
 I've read from various sources on the Internet that it is perfectly
 safe to
 simultaneously search a Lucene index that is being updated from
 another
 Thread, as long as all write access to the index is synchronized. 
 But does
 this apply only to updating the index (i.e. deleting and adding
 documents),
 or to a complete re-indexing (i.e. create a new IndexWriter with the
 'create' argument true and then re-add all the documents)?
 
 I have a class which encapsulates all access to my index, so that
 writes can
 be synchronized.  This class also exposes a method to obtain an
 IndexSearcher for the index.  I'm running unit tests to test this
 which
 create many threads - each thread does a complete re-indexing and
 then
 obtains an IndexSearcher and does a query.
 
 I'm finding that with sufficiently high numbers of threads, I'm
 getting the
 occasional failure, with the following exception thrown when
 attempting to
 construct a new IndexWriter (during the reindexing) -
 
 java.io.IOException: couldn't delete _a.f1
 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
 at

org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
 at

org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
 at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:151)
 ...
 
 The exception occurs quite infrequently (usually for somewhere
 between 1-5%
 of the Threads).
 
 Does the IndexSearcher take a 'snapshot' of the index at creation? 
 Or does
 it access the filesystem whilst searching?  I am also synchronizing
 creation
 of the IndexSearcher with the write lock, so that the IndexSearcher
 is not
 created whilst the index is being recreated (and vice versa).  But do
 I need
 to ensure that the IndexSearcher cannot search whilst the index is
 being
 recreated as well?
 
 Note that a similar unit test where the threads update the index
 (rather
 than recreate it from scratch) works fine, as expected.
 
 This is running on Windows 2000.
 
 Any help would be much appreciated!
 
 Paul
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Newbie questions

2005-02-14 Thread Paul Jans
Hi again,

So is SqlDirectory recommended for use in a cluster to
work around the accessibility problem, or are people
using NFS or a standalone server instead?

Thanks in advance,
PJ

--- Paul Jans [EMAIL PROTECTED] wrote:

 I've already ordered Lucene in Action :)
 
  There is a LuceneRAR project that is still in its
  infancy here: 
  https://lucenerar.dev.java.net/
 
 I will keep an eye on that for sure.
 
  You can also store a Lucene index in Berkeley DB
  (look at the 
  /contrib/db area of the source code repository)
 
 We're already using Oracle, so would it be possible
 to
 store the index there, thus giving each cluster node
 easy access to it. I read about SqlDirectory in the
 archives but it looks like it didn't make it to the
 API and I don't see it on the contrib page.
 
 I'm more concerned about making the index accessible
 rather than transactional consistency, so NFS may be
 another option like you mention. I'm curious to hear
 about other systems which are clustered and how
 others
 are doing this; lessons learnt and best practices
 etc.
 
 Thanks again for the help. Lucene looks like a first
 class tool.
 
 PJ
 
 --- Erik Hatcher [EMAIL PROTECTED] wrote:
 
  
  On Feb 10, 2005, at 5:00 PM, Paul Jans wrote:
   A couple of newbie questions. I've searched the
   archives and read the Javadoc but I'm still
 having
   trouble figuring these out.
  
  Don't forget to get your copy of Lucene in
 Action
  too :)
  
   1. What's the best way to index and handle
 queries
   like the following:
  
   Find me all users with (a CS degree and a GPA > 3.0)
   or (a Math degree and a GPA > 3.5).
  
  Some suggestions:  index degree as a Keyword
 field. 
  Pad GPA, so that 
  all of them are the form #.# (or #.## maybe). 
  Numerics need to be 
  lexicographically ordered, and thus padded.
  
  With the right analyzer (see the AnalysisParalysis
  page on the wiki) 
  you could use this type of query with
  QueryParser:
  
  degree:cs AND gpa:[3.0 TO 9.9]
  
   2. What are the best practices for using Lucene
 in
  a
   clustered J2EE environment? A standalone
  index/search
   server or storing the index in the database or
   something else ?
  
  There is a LuceneRAR project that is still in its
  infancy here: 
  https://lucenerar.dev.java.net/
  
  You can also store a Lucene index in Berkeley DB
  (look at the 
  /contrib/db area of the source code repository)
  
  However, most projects do fine with cruder
  techniques such as sharing 
  the Lucene index on a common drive and ensuring
 that
  locking is 
  configured to use the common drive also.
  
  Erik
  
  
 

-
  To unsubscribe, e-mail:
  [EMAIL PROTECTED]
  For additional commands, e-mail:
  [EMAIL PROTECTED]
  
  
 
 
 
   
 

-
 To unsubscribe, e-mail:
 [EMAIL PROTECTED]
 For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: chained restrictive queries

2005-02-14 Thread Paul Elschot
On Monday 14 February 2005 15:14, [EMAIL PROTECTED] wrote:
 Hi,
 
 I'm currently working on application using Lucene 1.3 , and have to improve
 the current indexation/search methods with the 1.4.3 version.
 
 
 I was thinking of using the FilteredQuery object to refine my chained queries
 but, after some tests, performance is worse :(.

 The chained queries were like :
 - a first boolean query to retrieve a set of doc id matching some criterias

A FilteredQuery works best when the filter from the criteria can be reused,
e.g. by keeping it in a cache, possibly with CachingWrapperFilter.
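For example (a sketch; criteriaQuery and fuzzyQuery stand for the two chained queries, and the classes are in org.apache.lucene.search):

  Filter criteriaFilter = new CachingWrapperFilter(new QueryFilter(criteriaQuery));
  // the cached bits are reused as long as the same IndexReader stays open
  Hits hits = searcher.search(fuzzyQuery, criteriaFilter);
  // or, equivalently, as a query that can be nested further:
  Query chained = new FilteredQuery(fuzzyQuery, criteriaFilter);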

 - a second query applying a fuzzy criteria to refine it more deeply.
 
 My index contains about 7 million documents in all, and the first query
 should retrieve, at maximum, about 50 000 documents.

 I'm currently working with crossed indexes while doing searches , but i
 want to remove the extra indexes and do all things with only one.
 
 So, is it possible to use the FilteredQuery object or another one to chain
 queries from the most restrictive to the most open one ?

It is possible, but whether it helps performance depends on your
circumstances.

The 1.4.3 filter implementation executes the most open query almost
completely.
It only applies the filter after the score computations for the
query being filtered, just before deciding whether to keep the document
in the query results.
This is done in IndexSearcher.search(). 
A profiler might tell you whether that is a bottleneck for your queries.
If it is, there is some code in development that might help.
In case it turns out that the memory occupied by the BitSet of the filter
is a bottleneck, please check the (very) recent archives of lucene-dev
on BitSet implementation.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Newbie questions

2005-02-11 Thread Paul Jans
I've already ordered Lucene in Action :)

 There is a LuceneRAR project that is still in its
 infancy here: 
 https://lucenerar.dev.java.net/

I will keep an eye on that for sure.

 You can also store a Lucene index in Berkeley DB
 (look at the 
 /contrib/db area of the source code repository)

We're already using Oracle, so would it be possible to
store the index there, thus giving each cluster node
easy access to it. I read about SqlDirectory in the
archives but it looks like it didn't make it to the
API and I don't see it on the contrib page.

I'm more concerned about making the index accessible
rather than transactional consistency, so NFS may be
another option like you mention. I'm curious to hear
about other systems which are clustered and how others
are doing this; lessons learnt and best practices etc.

Thanks again for the help. Lucene looks like a first
class tool.

PJ

--- Erik Hatcher [EMAIL PROTECTED] wrote:

 
 On Feb 10, 2005, at 5:00 PM, Paul Jans wrote:
  A couple of newbie questions. I've searched the
  archives and read the Javadoc but I'm still having
  trouble figuring these out.
 
 Don't forget to get your copy of Lucene in Action
 too :)
 
  1. What's the best way to index and handle queries
  like the following:
 
  Find me all users with (a CS degree and a GPA > 3.0)
  or (a Math degree and a GPA > 3.5).
 
 Some suggestions:  index degree as a Keyword field. 
 Pad GPA, so that 
 all of them are the form #.# (or #.## maybe). 
 Numerics need to be 
 lexicographically ordered, and thus padded.
 
 With the right analyzer (see the AnalysisParalysis
 page on the wiki) 
 you could use this type of query with QueryParser:
 
   degree:cs AND gpa:[3.0 TO 9.9]
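
 A sketch of the indexing side of this suggestion (the field names and the #.# padding follow the advice above; the writer variable is an assumption):

   Document doc = new Document();
   doc.add(Field.Keyword("degree", "cs"));
   doc.add(Field.Keyword("gpa", "3.2"));   // already padded to the #.# form
   writer.addDocument(doc);
   // query: (degree:cs AND gpa:[3.0 TO 9.9]) OR (degree:math AND gpa:[3.5 TO 9.9])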
 
  2. What are the best practices for using Lucene in
 a
  clustered J2EE environment? A standalone
 index/search
  server or storing the index in the database or
  something else ?
 
 There is a LuceneRAR project that is still in its
 infancy here: 
 https://lucenerar.dev.java.net/
 
 You can also store a Lucene index in Berkeley DB
 (look at the 
 /contrib/db area of the source code repository)
 
 However, most projects do fine with cruder
 techniques such as sharing 
 the Lucene index on a common drive and ensuring that
 locking is 
 configured to use the common drive also.
 
   Erik
 
 

-
 To unsubscribe, e-mail:
 [EMAIL PROTECTED]
 For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Problem searching Field.Keyword field

2005-02-10 Thread Paul Elschot
On Thursday 10 February 2005 18:44, Luke Shannon wrote:
 Are there any issues with having a bunch of boolean queries and then adding
 them to one big boolean query (making them all required)?

The 1.4.3 and earlier BooleanScorer has an out of bounds exception
for "More than 32 required/prohibited clauses in query".

In the development version this restriction has gone.

The limitation of the maximum clause count (default 1024,
configurable) is still there.
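
That limit can be raised when needed, for example (the value 4096 is arbitrary):

  // default is 1024; BooleanQuery.TooManyClauses is thrown when it is exceeded
  org.apache.lucene.search.BooleanQuery.setMaxClauseCount(4096);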

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Newbie questions

2005-02-10 Thread Paul Jans
Hi,

A couple of newbie questions. I've searched the
archives and read the Javadoc but I'm still having
trouble figuring these out. 

1. What's the best way to index and handle queries
like the following: 

Find me all users with (a CS degree and a GPA > 3.0)
or (a Math degree and a GPA > 3.5).

2. What are the best practices for using Lucene in a
clustered J2EE environment? A standalone index/search
server or storing the index in the database or
something else ?

Thank you in advance,
PJ





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searching for doc without a field

2005-02-04 Thread Paul Elschot
On Friday 04 February 2005 17:29, Bill Tschumy wrote:
 
 On Feb 4, 2005, at 10:19 AM, Bill Tschumy wrote:
 
 
  On Feb 3, 2005, at 2:04 PM, Paul Elschot wrote:
 
  On Thursday 03 February 2005 20:18, Bill Tschumy wrote:
  Is there any way to construct a query to locate all documents 
  without a
  specific field?  By this I mean the Document was created without ever
  having that field added to it.
 
  One way is to add an extra document field containing the field
  names of all (other) indexed fields in the document.
  Assuming there is always a primary key field the query is then:
 
  +fieldnames:primarykeyfield -fieldnames:specificfield
 
  Regards,
  Paul Elschot
 
  Paul,
 
  Thanks for the suggestion, but I need to do this on an existing 
  database as it is.
 
  It just occurred to me that I should try a query on the field with a 
  value of NULL.  Don't know if that will work or not.
 
 Nope, using null as a search value just results in a 
 NullPointerException.

It's not impossible, but the problem is that the term index is first sorted
by field name, then by term text, then by document number, and then
by term position within document.

That means that the index path is no good for querying by field name and
document number: you have to check all indexed terms in between.

Lucene only allows finding the existence of an indexed field, the
indexed terms (field name + term text) in sorted order from a given term,
and the indexed documents of a term, possibly combined with the
term positions within each document.

The solution above shortcuts the index path by putting the field name
in place of the term text for a special field.

Regards,
Paul Elschot.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Rewrite causes BooleanQuery to loose required terms

2005-02-03 Thread Paul Elschot
On Thursday 03 February 2005 11:38, Nick Burch wrote:
 Hi All
 
 I'm using lucene from CVS, and I've discovered that when rewriting a 
 BooleanQuery created with the old style (Query,boolean,boolean) method,
 the rewrite will cause the required parameters to get lost.
 
 Using old style (Query,boolean,boolean):
 query = +contents:test* +(class:1.2 class:1.2.*)
 rewritten query = (contents:tester contents:testing contents:tests) 
   (class:1.2 (class:1.2.3 class:1.2.4))
 
 Using new style (Query,BooleanClause.Occur.MUST):
 query = +contents:test* +(class:1.2 class:1.2.*)
 rewritten query = +(contents:tester contents:testing contents:tests) 
   +(class:1.2 (class:1.2.3 class:1.2.4))
 
 Attached is a simple RAMDirectory test to show this. I know that the 
 (Query,boolean,boolean) method is deprecated, but should it also be 
 broken?

No.
Currently, the old constructor for BooleanClause does not carry the
old state forward.
The new constructor does carry the new state backward.

I'll post a fix in bugzilla later.

Thanks,
Paul Elschot.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searching for doc without a field

2005-02-03 Thread Paul Elschot
On Thursday 03 February 2005 20:18, Bill Tschumy wrote:
 Is there any way to construct a query to locate all documents without a 
 specific field?  By this I mean the Document was created without ever 
 having that field added to it.

One way is to add an extra document field containing the field
names of all (other) indexed fields in the document.
Assuming there is always a primary key field the query is then:

+fieldnames:primarykeyfield -fieldnames:specificfield
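
A sketch of the indexing side (the field and value names are just examples):

  Document doc = new Document();
  doc.add(Field.Keyword("primarykeyfield", "12345"));
  doc.add(Field.Text("title", "some title"));
  doc.add(Field.Text("specificfield", "some text"));
  // record the names of all other indexed fields:
  doc.add(Field.Keyword("fieldnames", "primarykeyfield"));
  doc.add(Field.Keyword("fieldnames", "title"));
  doc.add(Field.Keyword("fieldnames", "specificfield"));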

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Compile lucene

2005-02-02 Thread Paul Elschot
Helen,

On Wednesday 02 February 2005 20:26, Helen Butler wrote:
 Hi
 
 I'm trying to compile Lucene but am encountering the following error on 
typing ant from the root of Lucene-1.4.3
 
 C:\lucene-1.4.3ant
 Buildfile: build.xml
 
 init:
 
 compile-core:
 
 BUILD FAILED
 C:\lucene-1.4.3\build.xml:140: srcdir C:\lucene-1.4.3\src\java does not exist!

It seems the java source files were not extracted.

How did you obtain the build.xml file?

Once the compilation works, you'll notice that the lucene jar being built
has a 1.5 version number because of an incorrect version number
in the 1.4.3 build.xml.
You need to correct the version property in the build.xml file:
  <property name="version" value="1.4.3"/>

Regards,
Paul Elschot.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Compile lucene

2005-02-02 Thread Paul Elschot
Helen,

I downloaded lucene-1.4.3.zip myself from one of the mirrors
(http://apache.essentkabel.com/jakarta/lucene/binaries/)

It contains the lucene demos, and not the java sources.

The lucene-1.4.3.tar.gz there has the same problem.

It seems something is wrong with the 1.4.3 distribution.

When you need the lucene 1.4.3 jar you can download it from the above mirror,
it looks ok to me.

In case you have done something like this before:

The following command (on a single line) will checkout the source files from cvs
into directory lucene-1.4.3 (make sure that directory is empty beforehand):

cvs -d :pserver:[EMAIL PROTECTED]:/home/cvspublic checkout -r lucene_1_4_3 -d lucene-1.4.3 jakarta_lucene

In there you can correct the build.xml file and do:

ant compile

to compile the source code.

Regards,
Paul Elschot


On Wednesday 02 February 2005 20:55, Helen Butler wrote:
 Hi Paul,
 
 Thanks for your quick response.
 
 The Build.xml was obtained from the Lucene-1.4.3.zip that I downloaded from 
 the apache website.
 
 I changed the version in the xml file as you suggested, however the error 
 persists.
 
 Kind Regards,
 Helen Butler
 
 
 -Original Message-
 From: Paul Elschot [EMAIL PROTECTED]
 To: lucene-user@jakarta.apache.org
 Date: Wed, 2 Feb 2005 20:39:01 +0100
 Subject: Re: Compile lucene
 
 Helen,
 
 On Wednesday 02 February 2005 20:26, Helen Butler wrote:
  Hi
  
  I'm trying to compile Lucene but am encountering the following error on 
 typing ant from the root of Lucene-1.4.3
  
  C:\lucene-1.4.3ant
  Buildfile: build.xml
  
  init:
  
  compile-core:
  
  BUILD FAILED
  C:\lucene-1.4.3\build.xml:140: srcdir C:\lucene-1.4.3\src\java does not exist!
 
 It seems the java source files were not extracted.
 
 How did you obtain the build.xml file?
 
 Once the compilation works, you'll notice that the lucene jar being built
 has a 1.5 version number because of an incorrect version number
 in the 1.4.3 build.xml.
 You need to correct the version property in the build.xml file:
   <property name="version" value="1.4.3"/>
 
 Regards,
 Paul Elschot.
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Subversion conversion

2005-02-02 Thread Paul Elschot
On Wednesday 02 February 2005 21:20, Erik Hatcher wrote:
 The conversion to Subversion is complete.  The new repository is 
 available to users read-only at:
 
   http://svn.apache.org/repos/asf/lucene/java/trunk


Great. I just checked out the trunk:

Checked out revision 151042.

So much for the few minutes instead of hours,

Paul Elschot.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Penalty for storing unrelated field?

2005-01-29 Thread Paul Elschot
On Friday 28 January 2005 22:30, Andy Goodell wrote:
 You should be fine.

For search performance, yes. But the extra field data does slow down
optimization of a modified index because all the field (and index) data
is read and written for that. When the extra data gets bulky, it's normally
better to store it in the file system or in a database.

 On Fri, 28 Jan 2005 15:21:50 -0600, Bill Tschumy [EMAIL PROTECTED] wrote:
   I just want to make sure
  that adding the unrelated field to a single doc won't cause all the
  other documents to increase their storage space. 
  --
 
 I have lots of fields that only occur in one document, but it doesn't
 faze Lucene.  Actually when choosing an indexing solution, we chose
 lucene mostly because of its ability to index and store unlimited
 kinds of metadata.
 
 - andy g
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Suggestions for documentation or LIA

2005-01-26 Thread Paul Elschot
On Wednesday 26 January 2005 18:40, Ian Soboroff wrote:
 jian chen [EMAIL PROTECTED] writes:
 
  Just to continue this discussion. I think right now Lucene's retrieval
  algorithm is based purely on Vector Space Model, which is simple and
  efficient.
 
 As I understand it, it's indeed a tf-idf vector space approach, except
 that the queries are structured and as such, the tf-idf weights are
 totaled as a straight cosine among siblings of a BooleanQuery, but
 other query nodes may do things differently, for example, I haven't
 read it but I assume PhraseQueries require all terms present and
 adjacent to contribute to the score.
 
 There is also a document-specific boost factor in the equation which
 is essentially a hook for document things like recency, PageRank, etc
 etc.
 
 You can tweak this by defining custom Similarity classes which can say
 what the tf, idf, norm, and boost mean.  You can also affect the
 term normalization at the query end in BooleanScorer (I think? through
 the sumOfSquares method?).
 
 We've implemented something kind of like the Similarity class but
 based on a model which decsribes a larger family of similarity
 functions.  (For the curious or similarly IR-geeky, it's from Justin
 Zobel's paper from a few years ago in SIGIR Forum.)  Essentially I
 need more general hooks than the Lucene Similarity provides.  I think
 those hooks might exist, but I'm not sure I know which classes they're
 in.
 
 I'm also interested in things like relevance feedback which can affect
 term weights as well as adding terms to the query... just how many
 places in the code do I have to subclass or change?

None. Create your own TermQuery instances, set their boosts,
and add them to a BooleanQuery.
 
 It's clear that if I'm interested in a completely different model like
 language modeling the IndexReader is the way to go.  In which case,
 what parts of the Lucene class structure should I adapt to maintain
 the incremental-results-return, inverted list skips, and other
 features which make the inverted search fast?

To keep the speed, the one thing you should keep is the performance of
TermQuery. In case you're interested in changing proximity scores,
the same holds for SpanTermQuery.
For a variation on TermQuery that scores query terms by their density in a
document field you can have a look here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31784

On top of these you can implement your own Scorers, but for Zobel's
similarities you probably won't need much more than what BooleanQuery
provides.
To use the inverted list skips, make sure to implement and use skipTo()
on your scorers.
In case you need larger queries in conjunctive normal form:
+(synA1 synA2 ) +(synB1 synB2 ...) +(synC1 synC2 ...) 
the development version of BooleanQuery might be a bit faster
than the current one.

For an interesting twist in the use of idf please search
for fuzzy scoring changes on lucene-dev at the end of 2004.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Filtering w/ Multiple Terms

2005-01-24 Thread Paul Elschot
Jerry,

On Monday 24 January 2005 18:26, Jerry Jalenak wrote:
 I spent some time reading the Lucene in Action book this weekend (great job,
 btw), and came across the section on using custom filters.  Since the data
 that I need to use to filter my hit set with comes from a database, I
 thought it would be worth my effort this morning to write a custom filter
 that would handle the filtering for me.  So, using the example from the book
 (page 210), I've coded an AccountFilter:
 
 public class AccountFilter extends Filter
 {
     public AccountFilter()
     {}

     public BitSet bits(IndexReader indexReader)
         throws IOException
     {
         System.out.println("Entering AccountFilter...");
         BitSet bitSet = new BitSet(indexReader.maxDoc());

         String[] reportingAccounts = new String[] {"0011", "4kfs"};

         int[] docs = new int[1];
         int[] freqs = new int[1];

         for (int i = 0; i < reportingAccounts.length; i++)
         {
             String reportingAccount = reportingAccounts[i];
             if (reportingAccount != null)
             {
                 TermDocs termDocs = indexReader.termDocs(
                     new Term("account", reportingAccount));
                 int count = termDocs.read(docs, freqs);
                 if (count == 1)

Unless account is a primary key field, it's better to loop over the termdocs.

                 {
                     System.out.println("Setting bit on");
                     bitSet.set(docs[0]);
                 }
             }
         }
         System.out.println("Leaving AccountFilter...");
         return bitSet;
     }
 }
 
 I see where the AccountFilter is setting the corresponding 'bits', but I end
 up without any 'hits':
 
 Entering AccountFilter...
 Entering AccountFilter...
 Entering AccountFilter...
 Setting bit on
 Setting bit on
 Setting bit on
 Setting bit on
 Setting bit on
 Leaving AccountFilter...
 Leaving AccountFilter...
 Leaving AccountFilter...

I don't see any recursion in your code, but this output
suggests nesting three deep. Something does not add up here.

 ... Found 0 matching documents in 1000 ms
 
 Can anyone tell me what I've done wrong?

Maybe all query hits were filtered out?
Could you compare the docnrs in the bits of the filter with the
unfiltered query hits docnrs?
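
A sketch of the TermDocs loop suggested above:

  for (int i = 0; i < reportingAccounts.length; i++) {
      TermDocs termDocs = indexReader.termDocs(new Term("account", reportingAccounts[i]));
      try {
          while (termDocs.next()) {
              bitSet.set(termDocs.doc());
          }
      } finally {
          termDocs.close();
      }
  }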

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Opening up one large index takes 940M or memory?

2005-01-22 Thread Paul Elschot
On Saturday 22 January 2005 01:39, Kevin A. Burton wrote:
 Kevin A. Burton wrote:
 
  We have one large index right now... its about 60G ... When I open it 
  the Java VM used 940M of memory.  The VM does nothing else besides 
  open this index.
 
 After thinking about it I guess 1.5% of memory per index really isn't 
 THAT bad.  What would be nice is if there were a way to do this from disk 
 and then use a buffer (either via the filesystem or in-VM memory) to 
 access these variables.

It's even documented. From:
http://jakarta.apache.org/lucene/docs/fileformats.html :

The term info index, or .tii file. 
This contains every IndexIntervalth entry from the .tis file, along with its
location in the tis file. This is designed to be read entirely into memory
and used to provide random access to the tis file. 

My guess is that this is what you see happening.
To see the actual .tii file, you need the non-default file format.

Once searching starts you'll also see that the field norms are loaded,
these take one byte per searched field per document.

 This would be similar to the way the MySQL index cache works...

It would be possible to add another level of indexing to the terms.
No one has done this yet, so I guess it's preferred to buy RAM instead...

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Document 'Context' Relation to each other

2005-01-22 Thread Paul Smith

You wouldn't even need the sequence number.  You'll certainly be 
adding the documents to the index in the proper sequence already 
(right?).  It is easy to random access documents if you know Lucene's 
document ids.  Here's the pseudo-code

- construct an IndexReader
- open an IndexSearcher using the IndexReader
- search, getting Hits back
- for a hit you want to see the context, get hits.id(hit#)
- subtract context size from the id, grab documents using 
reader.document(id)

You don't search for a document by id, but rather jump right to it 
with IndexReader.
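
In code, that pseudo-code might look like this with the 1.4 API (query, contextSize and indexDir are placeholders; note that Lucene document ids can shift after deletes and optimize):

  IndexReader reader = IndexReader.open(indexDir);
  IndexSearcher searcher = new IndexSearcher(reader);
  Hits hits = searcher.search(query);

  int hitId = hits.id(0);                       // Lucene's document id of the first hit
  int from = Math.max(0, hitId - contextSize);
  int to = Math.min(reader.maxDoc() - 1, hitId + contextSize);
  for (int id = from; id <= to; id++) {
      if (!reader.isDeleted(id)) {
          Document context = reader.document(id);   // the events around the match
      }
  }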

Perfect, that's exactly what I was after! It's going to be easier than I 
thought. 

Thanks,
Paul
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Document 'Context' Relation to each other

2005-01-21 Thread Paul Smith
As a log4j developer, I've been toying with the idea of what Lucene 
could do for me, maybe as an excuse to play around with Lucene.

I've started creating a LoggingEvent-Document converter, and thinking 
through how I'd like this utility to work when I came across a question 
I wasn't sure about.

When scanning/searching through logging events, one is usually looking 
for a particular matching event which Lucene does excellently, but what 
a person usually needs is also the context of that matching logging 
event around it. 

With grep, one can use the -CcontextSize argument to grep to provide 
X # of lines around the matching entry. I'd like to be able to do the 
same thing with Lucene.

Now, I could provide a Field to the LoggingEvent Document that has a 
sequence #, and once a user has chosen an appropriate matching event, do 
another search for the documents with a Sequence # between +/- the 
context size. 

My question is, is that going to be an efficient way to do this? The 
sequence # would be treated as text, wouldn't it?  Would the range 
search on an int be the most efficient way to do this?

I know from the Hits documentation that one can retrieve the Document ID 
of a matching entry.  What is the contract on this Document ID?  Is each 
Document added to the Index given an increasing number?  Can one search 
an index by Document ID?  Could one search for Document ID's between a 
range?   (Hope you can see where I'm going here).

If you have any other recommendations about Context searching I would 
appreciate any thoughts.

Many thanks for an excellent API, and kudos to Erik & Otis for a great 
eBook btw.

regards,
Paul Smith
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Span Query Performance

2005-01-06 Thread Paul Elschot
On Thursday 06 January 2005 02:17, Andrew Cunningham wrote:
 Hi all,
 
 I'm currently doing a query similar to the following:
 
 for w in wordset:
 query = w near (word1 V word2 V word3 ... V word1422);
 perform query
 
 and I am doing this through SpanQuery.getSpans(), iterating through the 
 spans and counting
 the matches, which can result in 4782282 matches (essentially I am only 
 after the match count).
 The query works but the performance can be somewhat slow; so I am wondering:
 
 a) Would the query potentially run faster if I used 
 Searcher.search(query) with a custom similarity,
 or do both methods essentially use the same mechanics

It would be somewhat slower, because it loops over the getSpans()
and computes document scores and constructs a Hits from the scores.

 b) Does using a RAMDirectory improve query performance any significant 
 amount.

That depends on your operating system, the size of the index, the amount
of RAM you can use, the file buffering efficiency, other loads on the 
computer ...
 
 c) Is there a faster method to what I am doing I should consider?

Preindexing all word combinations that you're interested in.

Regards,
Paul Elschot
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Span Query Performance

2005-01-06 Thread Paul Elschot
Sorry for the duplicate on lucene-dev, it should have gone to lucene-user 
directly:

A bit more:

On Thursday 06 January 2005 10:22, Paul Elschot wrote:
 On Thursday 06 January 2005 02:17, Andrew Cunningham wrote:
  Hi all,
  
  I'm currently doing a query similar to the following:
  
  for w in wordset:
      query = w near (word1 V word2 V word3 ... V word1422);
      perform query
  
  and I am doing this through SpanQuery.getSpans(), iterating through the 
  spans and counting
  the matches, which can result in 4782282 matches (essentially I am only 
  after the match count).
  The query works but the performance can be somewhat slow; so I am 
wondering:
  
...
  c) Is there a faster method to what I am doing I should consider?
 
 Preindexing all word combinations that you're interested in.
 

In case you know all the words in advance, you could also index a
helper word at the same position as each of those words.
This requires a custom analyzer that inserts the helper word in the
token stream with a zero position increment.
The query then simplifies to:
query = w near helperword
which would probably speed things up significantly.
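
A rough sketch of such a filter, with "_any" as the assumed helper word:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class HelperWordFilter extends TokenFilter {
  private Token pending = null;   // helper token still to be emitted

  public HelperWordFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    if (pending != null) {        // emit the queued helper token
      Token helper = pending;
      pending = null;
      return helper;
    }
    Token token = input.next();
    if (token == null) {
      return null;
    }
    Token helper = new Token("_any", token.startOffset(), token.endOffset());
    helper.setPositionIncrement(0);   // same position as the real token
    pending = helper;
    return token;
  }
}

Wrap this around the TokenStream of your normal analyzer, so every
indexed word also gets the helper term at its position.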

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Deleting index for DB indexing

2004-12-30 Thread Paul
Alternative: create a hashed value which is unique within your DB
(e.g. use md5). Afterwards you can delete documents from the index
with IndexReader.delete(Term).
Without that additional field you can use the IndexSearcher to
retrieve your documents from the index and then use
IndexReader.delete(docNum) to delete these documents.
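
A small sketch of both variants (the "uid" field name is just an
assumption):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class DeleteSketch {
  // variant 1: an indexed unique key, e.g. an md5 of the DB row
  static void deleteByKey(String indexDir, String uid) throws IOException {
    IndexReader reader = IndexReader.open(indexDir);
    reader.delete(new Term("uid", uid));   // deletes all docs with that term
    reader.close();
  }

  // variant 2: no key field; locate the documents with a searcher first
  static void deleteByQuery(String indexDir, Query query) throws IOException {
    IndexReader reader = IndexReader.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(reader);
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
      reader.delete(hits.id(i));   // delete by internal document number
    }
    reader.close();
  }
}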

Paul


On Thu, 30 Dec 2004 07:18:39 -0800 (PST), mahaveer jain
[EMAIL PROTECTED] wrote:
 Hi All,
 
 I am using lucene for my DB indexing. I have 2 columns which are Keyword.
 Now I want to delete my index based on this 2 keyword.
 
 Is it possible ? If no. What is other alternative ?
 
 Thanks
 Mahaveer
 
 
 -
 Do you Yahoo!?
  Yahoo! Mail - 250MB free storage. Do more. Manage less.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



QueryParser, default operator

2004-12-29 Thread Paul
Hi,
the following code
 QueryParser qp = new QueryParser(itemContent, analyzer);
 qp.setOperator(org.apache.lucene.queryParser.QueryParser.DEFAULT_OPERATOR_AND);
 Query query = qp.parse(line, itemContent, analyzer);
doesn't produce the expected result because a query foo bar results in:
 itemContent:foo itemContent:bar
whereas a query foo AND bar results in
 +itemContent:foo +itemContent:bar

If I understand the default operator correctly, then the first query
should have been expanded to the same as the latter one, shouldn't it?

thanks a lot!
Paul

P.S. I sent the mail yesterday as well, but I didn't see it in the
mailinglist, I hope it doesn't appear twice now.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: document boost not showing up in Explanation

2004-12-28 Thread Paul Elschot
On Tuesday 28 December 2004 08:37, Erik Hatcher wrote:
 
 On Dec 27, 2004, at 9:54 PM, Vikas Gupta wrote:
  I am using lucene-1.4.1.jar(with nutch). For some reason, the effect of
  document boost is not showing up in the search results. Also, why is it
  not a part of the Explanation
 
 It actually is part of it
 
  Below is the 'explanation' of a sample query solar. I don't see 
  the
  boost value (1.5514448) being used at all in the calculation of the
  document score - from the 'explanation' below and also from the 
  quality of
  the search.
 
  How can I see the effect of document boost?
 
 Document boost is not stored in the index as-is.  A single 
 normalization factor is stored per-field and is computed at indexing 
 type using field and document boosts, as well as the length 
 normalization factor (and perhaps other factors I'm forgetting at the 
 moment?).

This also means that the explanation can only show a field normalisation
factor as it is available from the index.

One reason that boosting does not necessarily show up in the quality of
the search is that the byte encoding allows only 256 different values to
be stored.
The value stored in the index (called the norm) is the product of the
document boost factor, the field boost factor and the lengthNorm() of
the field.
For the search results to actually change because of the boost factors,
it is necessary that this stored factor is changed to another one of
the 256 possible.

The range of possible values stored in the index is roughly from
7x10^9 to 2x10^-9 . See:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#setBoost(float)
and
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#encodeNorm(float)

The range of stored values (excluding the zero special case) is
about 7x10^9 / 2x10^-9 = 3.5x10^18. The 10 log of that is about 18.5 .
Per factor 10 there are about 255/18.5 = 13.8 encoded values, so two
adjacent encoded values differ by a factor of about 10^(1/13.8) = 1.18 .
A minimum boost factor that should change a document score is therefore
about 1.18 .
Since the default lengthNorm uses the square root of the field length,
a field length should change by at least the square of that (roughly
a factor 1.4) to change the document score (assuming no hits in
the changed field text.)
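
A quick way to see this coarseness is to round-trip two boost values
through the byte encoding (a sketch; it uses the static encode/decode
helpers on Similarity):

import org.apache.lucene.search.Similarity;

public class NormRounding {
  public static void main(String[] args) {
    float a = 1.00f;
    float b = 1.10f;   // a 10% difference in boost * lengthNorm
    byte ea = Similarity.encodeNorm(a);
    byte eb = Similarity.encodeNorm(b);
    System.out.println(a + " -> " + ea + " -> " + Similarity.decodeNorm(ea));
    System.out.println(b + " -> " + eb + " -> " + Similarity.decodeNorm(eb));
    // when ea == eb, the two norms are indistinguishable in the index
  }
}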

Finally, a change in document score only influences the document
ordering in the search results when another document has a score
that is within the range of the change.

Regards,
Paul Elschot.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Word co-occurrences counts

2004-12-22 Thread Paul Elschot
On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote:
 Hi all,
 
 I have a curious problem, and initial poking around with Lucene looks
 like it may only be able to half-handle the problem.
 
  
 
 The problem requires two abilities:
 
 1.To be able to return the number of times the word appears in all
 the documents (which it looks like lucene can do through IndexReader) 
 2.To be able to return the number of word co-occurrences within
 the document set (ie. How many times does "computer" appear within 50
 words of "dog") 

  
 
 Is the second point possible?

You can use the standard query parser with a query like this:
"dog computer"~50
This query is not completely symmetric in the distance computation:
when computer occurs before dog, the allowed distance is 49, iirc.

There is also a SpanNearQuery for more generalized and flexible
distance queries, but this is not supported by the query parser,
so you'll have to construct these queries in your own program code.
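
For example, a sketch of the second point as a SpanNearQuery (the
"contents" field name and the symmetric slop of 50 are assumptions):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class NearSketch {
  // "computer" within 50 positions of "dog", in either order
  public static SpanNearQuery dogNearComputer() {
    SpanQuery dog = new SpanTermQuery(new Term("contents", "dog"));
    SpanQuery computer = new SpanTermQuery(new Term("contents", "computer"));
    return new SpanNearQuery(new SpanQuery[] { dog, computer },
                             50,       // slop: max positions in between
                             false);   // false: order does not matter
  }
}

The number of co-occurrences can then be counted by iterating the
query's getSpans().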

In case you have non standard retrieval requirements, eg. you only
need the number of hits and no further information from the matching
documents, you may consider using your own HitCollector on the
lower level search methods.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Relevance percentage

2004-12-22 Thread Paul Elschot
On Thursday 23 December 2004 08:13, Gururaja H wrote:
 Hi Chuck Williams,
  
 Thanks much for the reply.
  
 If your queries are all BooleanQuery's of
 TermQuery's, then this is very simple. Iterate down the list of
 BooleanClause's and count the number whose score is  0, then divide
 this by the total number of clauses. Take a look at
 BooleanQuery.BooleanWeight.explain() as it does this (along with
 generating the rest of the explanation). If you support the full Lucene
 query language, then you need to look at all the query types and decide
 what exactly you want to compute (as coord is not always well-defined).
  
 We are supporting full Lucene query language.  
  
 My request is, assuming queries are all BooleanQuery please
 post the implementation source code for the same.  ie to calculate the 
coord() method input parameters overlap and maxOverlap.

I don't have the code, but I can give an overview of possible
steps:

First inherit from BooleanScorer to implement a score() method that
returns only the coord() value (preferably a precomputed one).
Then inherit from BooleanQuery.BooleanWeight to return the above
Scorer.
Then inherit from BooleanQuery to use the above Weight in createWeight().
Then inherit from QueryParser to use the above Query in getBooleanQuery().
Finally use such a query in a search: the document scores will be
the coord() values.

Regards,
Paul Elschot.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index size doubled?

2004-12-21 Thread Paul Elschot
On Tuesday 21 December 2004 05:49, aurora wrote:
 I'm testing the rebuilding of the index. I add several hundred documents,  
 optimize and add another few hundred and so on. Right now I have around  
 7000 files. I observed after the index gets to a certain size. Every time
 after optimize, there are two files roughly the same size like below:
 
 12/20/2004  01:57p  13 deletable
 12/20/2004  01:57p  29 segments
 12/20/2004  01:53p  14,460,367 _5qf.cfs
 12/20/2004  01:57p  15,069,013 _5zr.cfs
 
 The total index size is double of what I expect. This is not always
 reproducible. (I'm constantly tuning my program and the set of documents).
 Sometimes I get a decent single file after optimize. What was happening?

Lucene tried to delete the older version (_5qf.cfs above), but got an error
back from the file system. After that it has put the name of that segment in
the deletable file, so it can try later to delete that segment.

This is known behaviour on FAT file systems. These randomly take some time
for themselves to finish closing a file after it has been correctly closed by
a program.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: MergerIndex + Searchables

2004-12-21 Thread Paul Elschot
Karthik,

On Tuesday 21 December 2004 09:04, Karthik N S wrote:
 Hi Guys
 
 Apologies...
 
 
 I have several MERGERINDEXES [  MGR1,MGR2,MGR3].
 
 for searching across these MERGERINDEXES I use the following Code
 IndexSearcher[] indexToSearch = new IndexSearcher[CNTINDXDBOOK];
 
 for(int all=0; all<CNTINDXDBOOK; all++){
 indexToSearch[all] = new IndexSearcher(INDEXEDBOOKS[all]);
  System.out.println(all +  ADDED TO SEARCHABLES  + INDEXEDBOOKS[all]);
 }
 
 MultiSearcher searcher = new MultiSearcher(indexToSearch);
 
 
 Question :
 
 When on search process, how to display that this relevant Document Id
 originated from which MRG?
 
 [ Some thing like this : -  Search word  'ISBN12345' is avalible from
 MRGx ]

I think you are looking for the methods subSearcher() and subDoc() on
MultiSearcher.
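
Roughly (a sketch):

import java.io.IOException;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;

public class OriginSketch {
  static void printOrigins(MultiSearcher searcher, Query query)
      throws IOException {
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
      int id = hits.id(i);                  // doc number in the MultiSearcher
      int sub = searcher.subSearcher(id);   // which of the MRGx it came from
      int localDoc = searcher.subDoc(id);   // doc number within that index
      System.out.println("hit " + i + " from searcher " + sub
          + ", local doc " + localDoc);
    }
  }
}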

Regards,
Paul Elschot




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Optimising A Security Filter

2004-12-20 Thread Paul Elschot
On Sunday 19 December 2004 23:05, Steve Skillcorn wrote:
 Hello All;
 
 I bought the Lucene in Action ebook, which is
 excellent and I can strongly recommend.  One question
 that has arisen from the book though is custom
 filters.
 
 I have the situation where the text of my docs is in
 Lucene, but the permissions are in my RDBMS.  I can
 write a filter (in fact have done so) that loops
 through the documents in the passed IndexReader and
 queries the DB to detect if the user is permissioned
 for them, setting the relevant BitSet.  My results are
 then paged ( last | next ) to a web page.
 
 Does the IndexReader that is passed to the “bits”
 method of the filter represent the entire index, or
 just the results that match the query?

The IndexReader represents the entire index.

 Is not worrying about filters and simply checking the
 returned Hit List before presenting a sensible
 approach?

That is done by the IndexSearcher.search() methods
that take a filter argument.
 
 I can see the point to filters as presented in the
 Lucene in Action ISBN example, but are they a good
 approach where they could end up laboriously marking
 the entire index as True?

The filter is checked only for search results on the query
over the whole index.

The bit filters generally work well, except when you need
a lot of very sparse filters and memory is a concern.
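
For reference, the general shape of such a permission filter (a sketch
only; isPermitted() stands in for your RDBMS check, and the "docKey"
field name is an assumption):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

public class PermissionFilter extends Filter {
  private final String userId;

  public PermissionFilter(String userId) {
    this.userId = userId;
  }

  // one bit per document in the entire index; true = user may see it
  public BitSet bits(IndexReader reader) throws IOException {
    BitSet bits = new BitSet(reader.maxDoc());
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) {
        continue;
      }
      Document doc = reader.document(i);
      if (isPermitted(userId, doc.get("docKey"))) {   // hypothetical DB lookup
        bits.set(i);
      }
    }
    return bits;
  }

  private boolean isPermitted(String userId, String docKey) {
    return true;   // placeholder for the real permission check
  }
}

It is then passed as the filter argument:
searcher.search(query, new PermissionFilter(userId));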

Regards,
Paul Elschot
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Relevance percentage

2004-12-20 Thread Paul Elschot
On Monday 20 December 2004 15:09, Gururaja H wrote:
 Hi,
  
 But, How to calculate the coord() fraction ?  I know by default,
 in DefaultSimilarity the coord() fraction is defined as below:
 
 /** Implemented as codeoverlap / maxOverlap/code. */
 
 public float coord(int overlap, int maxOverlap) {
 
 return overlap / (float)maxOverlap;
 
 }
 How to get the overlap and maxOverlap value in each of the matched 
document(s) ?

In case you only want the coordination factor to have more influence
in the order of your search results you can use a Similarity with
a coord() function that has a power higher than 1:

  public float coord(int overlap, int maxOverlap) {
return (float) Math.pow((overlap / (float)maxOverlap), SOME_POWER);
  }

I'd first try values between 3.0f and 5.0f for SOME_POWER.

The searching code precomputes all coord values once per query
per search, so there is no need to worry about the computing efficiency.
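
To wire this in at search time, something like (a sketch; the class name
is arbitrary):

import org.apache.lucene.search.DefaultSimilarity;

public class CoordPowerSimilarity extends DefaultSimilarity {
  private static final double SOME_POWER = 4.0;   // try 3.0 to 5.0

  public float coord(int overlap, int maxOverlap) {
    return (float) Math.pow(overlap / (float) maxOverlap, SOME_POWER);
  }
}

and then, per searcher:

searcher.setSimilarity(new CoordPowerSimilarity());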

This has the advantage that the other scoring factors are still used
for ranking.

Since the other factors can vary quite a bit, it is difficult to guarantee
that any coord() implementation will provide a score that sorts by the
number of matching clauses. Higher powers as above can come
a long way, though.

Regards,
Paul Elschot


  
 Thanks,
 Gururaja
 
 Mike Snare [EMAIL PROTECTED] wrote:
 I'm still new to Lucene, but wouldn't that be the coord()? My
 understanding is that the coord() is the fraction of the boolean query
 that matched a given document.
 
 Again, I'm new, so somebody else will have to confirm or deny...
 
 -Mike
 
 
 On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H
 wrote:
  How to find out the percentages of matched terms in the document(s) using 
Lucene ?
  Here is an example, of what i am trying to do:
  The search query has 5 terms(ibm, risc, tape, dirve, manual) and there are 
4 matching
  documents with the following attributes:
  Doc#1: contains terms(ibm,drive)
  Doc#2: contains terms(ibm,risc, tape, drive)
  Doc#3: contains terms(ibm,risc, tape,drive)
  Doc#4: contains terms(ibm, risc, tape, drive, manual).
  The percentages displayed would be 100%(Doc#4), 80%(doc#2), 80%(doc#3) and 
40%
  (doc#1).
  
  Any help on how to go about doing this ?
  
  Thanks,
  Gururaja
  
  
  -
  Do you Yahoo!?
  Send a seasonal email greeting and help others. Do good.
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
   
 -
 Do you Yahoo!?
  All your favorites on one personal page – Try My Yahoo!


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Permissioning Documents

2004-12-10 Thread Paul Elschot
On Friday 10 December 2004 07:10, Steve Skillcorn wrote:
 Hi;
  
 I'm currently using Lucene (which I am extremely impressed with BTW) to
 index a knowledge base of documents.  One issue I have is that only certain
 documents are available to certain users (or groups).  The number of
 documents is large, into the 100,000s, and the number of users can be into
 the 1000s.  Obviously, the users permissioned to see certain documents can
 change regularly, so storing the user id's in the Lucene document is
 undesirable, as a permission change could mean a delete and re-add to
 potentially 100s of documents.
  
 Does anyone have any guidance as to how I should approach this?

A typical solution would be to use a Filter for each user group.
Each Filter would be built from categories indexed with the documents.
The moment to build a group Filter could be the first time a user from
a group queries an index after it is opened.
Filters can be cached, see the recent discussion on CachingWrapperFilter
and friends.
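
A sketch of that setup, with the group's categories expressed as a query
and the resulting filter cached per group (the "category" field name is
an assumption; CachingWrapperFilter is the class referred to above):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class GroupFilters {
  private final Map filtersByGroup = new HashMap();   // group name -> Filter

  // categories: the category values this group is allowed to see
  public synchronized Filter getFilter(String group, String[] categories) {
    Filter filter = (Filter) filtersByGroup.get(group);
    if (filter == null) {
      BooleanQuery allowed = new BooleanQuery();
      for (int i = 0; i < categories.length; i++) {
        // optional clause: any one allowed category matches
        allowed.add(new TermQuery(new Term("category", categories[i])),
                    false, false);
      }
      filter = new CachingWrapperFilter(new QueryFilter(allowed));
      filtersByGroup.put(group, filter);
    }
    return filter;
  }
}

The filter is then passed to IndexSearcher.search(query, filter); a
permission change only means dropping the affected group's entry from
the map.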

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Retrieving all docs in the index

2004-12-09 Thread Paul Elschot
On Thursday 09 December 2004 21:18, Ravi wrote:
 That was exactly my original question. I was wondering if there are
 alternatives to this approach.  

In case you need only a few of the top ranking documents,
and the documents are to be sorted by date anyway,
you might consider searching each of the dates in sorted
order separately until you have enough results.

In that way there is no need to use a field with some constant
value. Nonetheless, I can recommend to have a special field
containing all the field names for a document. As all
docs normally contain a primary key, the name of the primary
key field can serve as the constant value.

Regards,
Paul Elschot

 
 -Original Message-
 From: Aviran [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, December 09, 2004 2:08 PM
 To: 'Lucene Users List'
 Subject: RE: Retrieving all docs in the index
 
 In this case you'll have to add another field with a fixed value to all
 the documents and query on that field
 
 
 Aviran
 http://www.aviransplace.com
 
 -Original Message-
 From: Ravi [mailto:[EMAIL PROTECTED]
 Sent: Thursday, December 09, 2004 14:04 PM
 To: Lucene Users List
 Subject: RE: Retrieving all docs in the index
 
 
 I'm sorry I don't think I articulated my question well. We use a date
 filter
 to sort the search results. This works fine when the user provides some
 search criteria. But if he gives an empty search criteria, we need to
 return
 all the documents in the index in the given date range sorted by date.
 So I
 was looking for a query that returns me all documents in the index and
 then
 I want to apply the date filter on it.  
 
 
 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, December 09, 2004 1:55 PM
 To: Lucene Users List
 Subject: Re: Retrieving all docs in the index
 
 On Dec 9, 2004, at 1:35 PM, Ravi wrote:
   Is there any other way to extract all documents from an index apart
  from adding an additional field with the same value to all documents 
  and then doing a term query on that field with the common value?
 
 Of course.  Have a look at the IndexReader API.
 
   Erik
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene in action ebook

2004-12-09 Thread Paul Smith
synchronized(luceneEbook){
luceneEbook.wait();
}
Just waiting for the notifyAll()
Kevin A. Burton wrote:
Erik Hatcher wrote:
I have the e-book PDF in my possession. I have been prodding Manning 
on a daily basis to update the LIA website and get the e-book 
available. It is ready, and I'm sure that its just a matter of them 
pushing it out. There may be some administrative loose ends they are 
tying up before releasing it to the world. It should be available any 
minute now, really. :)

Send off a link to the list when its out...
We're all holding our breath ;)
(seriously)
Kevin
--
*Paul Smith
*Software Architect

*Aconex
* 31 Drummond Street, Carlton, VIC 3053, Australia
*Tel: +61 3 9661 0200  *Fax: +61 3 9654 9946
Email: [EMAIL PROTECTED]  www.aconex.com**
This email and any attachments are intended solely for the addressee. 
The contents may be privileged, confidential and/or subject to copyright 
or other applicable law. No confidentiality or privilege is lost by an 
erroneous transmission. If you have received this e-mail in error, 
please let us know by reply e-mail and delete or destroy this mail and 
all copies. If you are not the intended recipient of this message you 
must not disseminate, copy or take any action in reliance on it. The 
sender takes no responsibility for the effect of this message upon the 
recipient's computer system.**

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: restricting search result

2004-12-04 Thread Paul Elschot
Paul, 

On Friday 03 December 2004 23:31, you wrote:
 Hi,
 how would you restrict the search results for a certain user? I'm

One way to restrict results is by using a Filter.

 indexing all the existing data in my application but there are certain
 access levels so some users should see more results than another.
 Each lucene document has a field with an internal id and I want to
 restrict on that basis. I tried it with adding a long concatenation of
 my ids (+locationId:1 +locationId:3 + ...) but this throws a "More
 than 32 required/prohibited clauses in query." exception.
 Any suggestions?

Using a + before each term requires them all, ie. uses AND, which
would normally have an empty result for an Id field.
You might prefer this query concatenation:

+(locationId:1 locationId:3 ...)

It effectively OR's the locationId content query and requires
only one of the terms to match.

In this case using a Filter would probably be better, though.

Regards,
Paul


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: restricting search result

2004-12-04 Thread Paul
The thing with the different indexes sounds too complicated because the
users (and their rights) as well as the index itself change quite
often.

 One way to restrict results is by using a Filter.

but a filter is applied after the whole search is performed, isn't it?
I thought it might be faster to restrict the search space in advance


 Using a + before each term requires them all, ie. uses AND, which
 would normally have an empty result for an Id field.

d'oh, yes of course..

 You might prefer this query concatenation:
 
 +(locationId:1 locationId:3 ...)

ok, that sounds very nice and works fine. But I will have a closer
look at the filter as well.

Thank you all
Paul


P.S. someone without gmail account? mail me

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: restricting search result

2004-12-04 Thread Paul Elschot
On Saturday 04 December 2004 15:44, Erik Hatcher wrote:
 On Dec 4, 2004, at 6:44 AM, Paul wrote:
  One way to restrict results is by using a Filter.
 
  but a filter is applied after the whole search is performed, isn't it?
 
 Incorrect.  A filter is applied *before* the search truly occurs - in 
 other words it reduces the search space.

Currently a filter is applied during search, after the document
score is computed, but before a document is added to the search results.

In practice, the score computation is much less work than the I/O, so
a filter does reduce the search space.

A filter might also be used to reduce the I/O for searching, but Lucene
doesn't do that now, probably because there was little to gain.


Regards,
Paul Elschot.

P.S. The code doing the filtering is in IndexSearcher.java, from line 97.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexWriter.optimize and memory usage

2004-12-03 Thread Paul Elschot
On Friday 03 December 2004 08:43, Paul Elschot wrote:
 On Friday 03 December 2004 07:50, Chris Hostetter wrote:
...
  So, If I'm understanding you (and the javadocs) correctly, the real key
  here is maxMergeDocs.  It seems like addDocument will never merge a
  segment untill maxMergeDocs have been added? ... meaning that I need a
  value less then the default (Integer.MAX_VALUE) if I want IndexWriter to
  do incrimental merges as I go ...
  
  ...except...
  
  ...if that were the case, then exactly is the meaning of mergeFactor?

oops correction=minMergeDocs should be replaced by mergeFactor:

 maxMergeDocs controls the sizes of the intermediate segments
 when adding documents.
 With maxMergeDocs at default, adding a document can take as much time as
: (and have the same effect as) optimize.  Eg. with mergeFactor at 10, the
 1000'th added document will create a segment of size 1000.
 With maxMergeDocs at a lower value than 1000, the last merge (of the 10
 segments with 100 docs each) will not be done.
: optimize() uses mergeFactor for its final merges, but it ignores
 maxMergeDocs. 

/oops

Meanwhile these fields have been deprecated in the development
version for set... methods.
Setting minMergeDocs is deprecated and to be replaced by
setMaxBufferedDocs(). The javadoc for this reads:

Determines the minimal number of documents required before the buffered 
in-memory documents are merging and a new Segment is created. Since Documents 
are merged in a RAMDirectory, large value gives faster indexing. At the same 
time, mergeFactor limits the number of files open in a FSDirectory.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



restricting search result

2004-12-03 Thread Paul
Hi,
how would you restrict the search results for a certain user? I'm
indexing all the existing data in my application but there are certain
access levels so some users should see more results than another.
Each lucene document has a field with an internal id and I want to
restrict on that basis. I tried it with adding a long concatenation of
my ids (+locationId:1 +locationId:3 + ...) but this throws a "More
than 32 required/prohibited clauses in query." exception.
Any suggestions?
thx!
Paul

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Does Lucene perform ranking in the retrieved set?

2004-11-30 Thread Paul Elschot
On Tuesday 30 November 2004 18:46, Xiangyu Jin wrote:
 
 This might be a stupid question.
 
 When performing retrieval for a query, does Lucene first get
 a subset of candidate matches and then perform the ranking
 on the set? That is, similarity calculation is performed only
 on a subset of the documents for the query.

Yes, Lucene uses  an inverted index for this.

 If so, from which module could I get those candidate docs,
 then I can perform my own similarity calculations (since
 I might need to rewrite the normalization factor, so
 only modify the similarity model seems will not
 work).

To change the normalisation you may consider implementing
your own Weight:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Weight.html
For some example implementations of Weight the Lucene source
code in the org.apache.lucene.search package is the best resource.

Using your own Weight also requires a subclass of Query that returns
this weight in the createWeight() method.

 Or, is there document describe the produre of how Lucene
 perform search?

This describes the scoring:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
See also the DefaultSimilarity.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: URGENT: Help indexing large document set

2004-11-24 Thread Paul Elschot
On Wednesday 24 November 2004 00:37, John Wang wrote:
 Hi:
 
I am trying to index 1M documents, with batches of 500 documents.
 
Each document has an unique text key, which is added as a
 Field.KeyWord(name,value).
 
For each batch of 500, I need to make sure I am not adding a
 document with a key that is already in the current index.
 
   To do this, I am calling IndexSearcher.docFreq for each document and
 delete the document currently in the index with the same key:
  
while (keyIter.hasNext()) {
 String objectID = (String) keyIter.next();
 term = new Term(key, objectID);
 int count = localSearcher.docFreq(term);

To speed this up a bit make sure that the iterator gives
the terms in sorted order. I'd use an index reader instead
of a searcher, but that will probably not make a difference.

Adding the documents can be done with multiple threads.
Last time I checked that, there was a moderate speed up
using three threads instead of one on a single CPU machine.
Tuning the values of minMergeDocs and maxMergeDocs
may also help to increase performance of adding documents.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene Scorers

2004-11-24 Thread Paul Elschot
On Wednesday 24 November 2004 01:31, Ken McCracken wrote:
 Hi,
 
 Thanks for the pointers in your replies.  Would it be possible to include
 some sort of accrual scorer interface somewhere in the Lucene Query
 APIs?  This could be passed into a query similar to
 MaxDisjunctionQuery; and combine the sum, max, tieBreaker, etc.,
 according to the implementor's discretion, to compute the overall
 score for a document.

The DisjunctionScorer is currently not part of Lucene.
You might try and subclass Similarity to provide what you need and
pass that to your Query.

I'm using a few subclasses of DisjunctionScorer to provide the actual
score value ao. for max and sum.
For each of these scorers,  I use a separate Query and Weight.
This gives a parallel class hierarchy for Query, Weight and Scorer.

I guess it's time to have a look at Design Patterns and/or Refactoring
on how to get rid of the parallel class hierarchy. That could also
involve some sort of accrual scorer and Lucene's Similarity.

Regards,
Paul Elschot

 -Ken
 
 On Sat, 13 Nov 2004 12:07:05 +0100, Paul Elschot [EMAIL PROTECTED] 
wrote:
  On Friday 12 November 2004 22:56, Chuck Williams wrote:
  
  
   I had a similar need and wrote MaxDisjunctionQuery and
   MaxDisjunctionScorer.  Unfortunately these are not available as a patch
   but I've included the original message below that has the code (modulo
   line breaks added by simple text email format).
  
   This code is functional -- I use it in my app.  It is optimized for its
   stated use, which involves a small number of clauses.  You'd want to
   improve the incremental sorting (e.g., using the bucket technique of
   BooleanQuery) if you need it for large numbers of clauses.
  
  When you're interested, you can also have a look here for
  yet another DisjunctionScorer:
  http://issues.apache.org/bugzilla/show_bug.cgi?id=31785
  
  It has the advantage that it implements skipTo() so that it can
  be used as a subscorer of ConjunctionScorer, ie. it can be
  faster in situations like this:
  
  aa AND (bb OR cc)
  
  where bb and cc are treated by the DisjunctionScorer.
  When aa is a filter this can also be used to implement
  a filtering query.
  
  
  
  
   Re. Paul's suggested steps below, I did not integrate this with query
   parser as I didn't need that functionality (since I'm generating the
   multi-field expansions for which max is a much better scoring choice
   than sum).
  
   Chuck
  
   Included message:
  
   -Original Message-
   From: Chuck Williams [mailto:[EMAIL PROTECTED]
   Sent: Monday, October 11, 2004 9:55 PM
   To: [EMAIL PROTECTED]
   Subject: Contribution: better multi-field searching
  
   The files included below (MaxDisjunctionQuery.java and
   MaxDisjunctionScorer.java) provide a new mechanism for searching across
   multiple fields.
  
  The maximum indeed works well, also when the fields differ a lot in length.
  
  Regards,
  Paul
  
  
  
  
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
  
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Numeric Range Restrictions: Queries vs Filters

2004-11-23 Thread Paul Elschot
Chris,

On Tuesday 23 November 2004 03:25, Hoss wrote:
 (NOTE: numbers in [] indicate Footnotes)
 
 I'm rather new to Lucene (and this list), so if I'm grossly
 misunderstanding things, forgive me.
 
 One of my main needs as I investigate Search technologies is to restrict
 results based on Ranges of numeric values.  Looking over the archives of
 this list, it seems that lots of people have run into problems dealing
 with this.  In particular, whenever someone asks a question about Numeric
 Ranges the question seem to always involve one (or more) of the
 following:
 
(a) Lexical sorting puts 11 in the range 1 TO 5
(b) Dates (or Dates and Times)
(c) BooleanQuery$TooManyClauses Exceptions
(d) Should I use a filter?

FWIW, the javadoc of the development version of
BooleanQuery.maxClauseCount reads:

  The maximum number of clauses permitted. Default value is 1024. Use the  
  org.apache.lucene.maxClauseCount system property to override. 

  TermQuery clauses are generated from for example prefix queries and
  fuzzy queries. Each TermQuery needs some buffer space during search,
  so this parameter indirectly controls the maximum buffer requirements for
  query search. Normally the buffers are allocated by the JVM. When using
  for example MMapDirectory the buffering is left to the operating system.

MMapDirectory uses memory mapped files for the index.

It would be useful to also provide a reference to filters (DateFilter)
and to LongField in case it is added to the code base.

...
 The Query API on the other hand ... I freely admit, that I can't make
 heads or tails out of it.  I don't even know where I would begin to try
 and write a new subclass of Query if I wanted to.

In a nutshell:

A Query either rewrites to another Query, or it provides a Weight.
A Weight first does normalisation and then provides a Scorer
to be used during search.

RangeQuery is a good example:

A RangeQuery rewrites to a BooleanQuery over TermQuery's
for the matching terms.
A BooleanQuery provides a BooleanScorer via its Weight.
A TermQuery provides a TermScorer via its Weight.

Regards,
Paul


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



experiences with PDF files

2004-11-23 Thread Paul
Hi,
I read a lot of mails about the time consuming pdf-parsing and tried
some solutions myself. My example PDF file has 181 pages in 1.5 MB
(mostly text, nearly no graphics).
-with pdfbox.org's toolkit it took 17m32s to parse/read its content
-after installing ghostscript and ps2text / ps2ascii my parsing failed
after page 54 and 2m51s because of irregular fonts
-installing XPDF and using its tool pdftotext, parsing completed after
7-10 seconds

My machine is a Celeron 1700 with VMWare Workstation 3.2 (128 MB
assigned) and Linux Suse 7.3.

I will parse my pdf files with xpdf and something like
Runtime.getRuntime().exec("pdftotext -nopgbrk -raw " + pdfFileName +
    " " + txtFileName);


Paul

P.S. look at http://www.jguru.com/faq/view.jsp?EID=1074237 for links and tips

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



retrieving added document

2004-11-23 Thread Paul
Hi,
I'm creating a document and adding it with a writer to the index. For
some reason I need to add data to this specific document later on
(minutes, not hours or days). Is it possible to retrieve it and add
additional data?
I found the document(int n) - method within the IndexReader (btw: the
description makes no sense for me: Returns the stored fields of the
nth Document in this index. - but it returns a Document and not a
list of fields..) but where do I get that number from? (and the
numbers change, I know..)

thanks for any help

Paul

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Using multiple analysers within a query

2004-11-22 Thread Paul Elschot
On Monday 22 November 2004 05:02, Kauler, Leto S wrote:
 Hi Lucene list,
 
 We have the need for analysed and 'not analysed/not tokenised' clauses
 within one query.  Imagine an unparsed query like:
 
 +title:Hello World +path:Resources\Live\1
 
 In the above example we would want the first clause to use
 StandardAnalyser and the second to use an analyser which returns the
 term as a single token.  So a parsed result might look like:
 
 +(title:hello title:world) +path:Resources\Live\1
 
 Would anyone have any suggestions on how this could be done?  I was
 thinking maybe the QueryParser would have to be changed/extended to
 accept a separator other than colon :, something like = for example
 to indicate this clause is not to be tokenised.  Or perhaps this can all
 be done using a single analyser?

Overriding QueryParser.getFieldQuery() might work for you.
It is given the field and the query text so an analyzer can be chosen
depending on the field.
In case you don't use the latest cvs head, it may be worthwhile to
have a look. Some of the getFieldQuery methods have been
deprecated, but I don't know when.
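
If you would rather keep a single analyzer object, another way to get
the same effect is a PerFieldAnalyzerWrapper around a trivial one-token
analyzer for the path field (a sketch; the field names are taken from
the example above):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class Analyzers {
  // returns the whole input as a single token
  public static class SingleTokenAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, final Reader reader) {
      return new TokenStream() {
        private boolean done = false;
        public Token next() throws IOException {
          if (done) {
            return null;
          }
          done = true;
          StringBuffer text = new StringBuffer();
          char[] buf = new char[256];
          int len;
          while ((len = reader.read(buf)) != -1) {
            text.append(buf, 0, len);
          }
          return new Token(text.toString(), 0, text.length());
        }
      };
    }
  }

  public static Analyzer build() {
    PerFieldAnalyzerWrapper wrapper =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    wrapper.addAnalyzer("path", new SingleTokenAnalyzer());
    return wrapper;
  }
}

The same wrapper should be used for both indexing and query parsing, so
the path clause stays a single term on both sides.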

Regards,
Paul.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene and SVD

2004-11-18 Thread Paul Elschot
On Wednesday 17 November 2004 23:57, DES wrote:
 Hi
 
 I need some kind of implementation of SVD (singular value decomposition) or 
 LSI with the Lucene engine. Has anyone any ideas how to create a query table 
 for decomposition? The table must have documents as rows and terms as 
 columns; if a term is present in the document, the corresponding field 
 contains 1 and a 0 if not. Then the SVD will be applied to this table, 

From Lucene, with TermVector and field norm, one could use the term
density instead of a presence bit.

 and with the first 2 columns documents will be displayed in a 2D-space.
 Does anyone work on a project like this?

I don't know. Is there a good SVD package for Java?

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: boolean/set operations on lucene queries

2004-11-18 Thread Paul Elschot
On Thursday 18 November 2004 16:57, Rupinder Singh Mazara wrote:
 hi all
 
  I needed some help in solving the following problem
  a user executes query1 and query2
 
  both the queries( not result sets ) get stored, over time the user
  wants to find
  which documents from query1 are common to documents in query2, basically an
 intersection of the results of query1 with query2
 
 
  and similarly the union and difference between the results of query1 and
 query2
 
  without having to run the queries and storing the results into some kind of
 datastructure
  does lucene provide some capabilities, i was reading about QueryFilter,

The queries can be added as clauses to a BooleanQuery.
Such clauses can be optional, required or prohibited.
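
A sketch of the three combinations with the BooleanQuery add(query,
required, prohibited) calls:

import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class QueryCombinations {
  // documents matching both query1 and query2
  public static Query intersection(Query query1, Query query2) {
    BooleanQuery bq = new BooleanQuery();
    bq.add(query1, true, false);    // required
    bq.add(query2, true, false);    // required
    return bq;
  }

  // documents matching query1 or query2 (or both)
  public static Query union(Query query1, Query query2) {
    BooleanQuery bq = new BooleanQuery();
    bq.add(query1, false, false);   // optional
    bq.add(query2, false, false);   // optional
    return bq;
  }

  // documents matching query1 but not query2
  public static Query difference(Query query1, Query query2) {
    BooleanQuery bq = new BooleanQuery();
    bq.add(query1, true, false);    // required
    bq.add(query2, false, true);    // prohibited
    return bq;
  }
}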

Regards,
Paul


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Need help with filtering

2004-11-17 Thread Paul Elschot
On Wednesday 17 November 2004 01:20, Edwin Tang wrote:
 Hello,
 
 I have been using DateFilter to limit my search results to a certain date
 range. I am now asked to replace this filter with one where my search 
results
 have document IDs greater than a given document ID. This document ID is
 assigned during indexing and is a Keyword field.
 
 I've browsed around the FAQs and archives and see that I can either use
 QueryFilter or BooleanQuery. I've tried both approaches to limit the 
document
 ID range, but am getting the BooleanQuery.TooManyClauses exception in both
 cases. I've also tried bumping max number of clauses via 
setMaxClauseCount(),
 but that number has gotten pretty big.
 
 Is there another approach to this? ...

Recoding DateFilter to a DocumentIdFilter should be straightforward.

The trick is to use only one document enumerator at a time for all
terms. Document enumerators take buffer space, and that is the
reason why BooleanQuery has an exception for too many clauses.
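
A sketch of such a filter, modelled on DateFilter's use of a single term
enumerator (the "docId" field name is an assumption, and it assumes the
IDs compare correctly as strings, e.g. zero-padded):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.Filter;

public class DocumentIdFilter extends Filter {
  private final String fromId;   // exclusive lower bound

  public DocumentIdFilter(String fromId) {
    this.fromId = fromId;
  }

  public BitSet bits(IndexReader reader) throws IOException {
    BitSet bits = new BitSet(reader.maxDoc());
    TermEnum terms = reader.terms(new Term("docId", fromId));
    TermDocs termDocs = reader.termDocs();
    try {
      do {
        Term term = terms.term();
        if (term == null || !"docId".equals(term.field())) {
          break;
        }
        if (term.text().compareTo(fromId) > 0) {   // strictly greater
          termDocs.seek(terms);                    // one enumerator, reused
          while (termDocs.next()) {
            bits.set(termDocs.doc());
          }
        }
      } while (terms.next());
    } finally {
      termDocs.close();
      terms.close();
    }
    return bits;
  }
}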

Regards,
Paul


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: COUNT SUBINDEX [IN MERGERINDEX]

2004-11-17 Thread Paul Elschot
On Wednesday 17 November 2004 07:10, Karthik N S wrote:
 Hi guy's
 
 
 Apologies.
 
 
   So a Merged Index is again a Single [ addition of subIndexes... ],
 
  In that case, If One of the Field Types is of type 'Field.Keyword'
 which is Unique across the subIndexes [Before Merging],
 
  and If I want to Count this Unique Field in a MergerIndex [After it's been
 Merged ], How do I do this please.

IndexReader.numDocs() will give the number of docs in an index.

Lucene has no direct support for unique fields. After merging, if the
same unique field value occurs in both source indexes, the merged
index will contain two documents with that value.
In case one wants the merged index to keep unique field values, the non-unique
values in one of the source indexes need to be deleted before merging.

See IndexReader.termDocs(term) on how to get the document numbers
for (unique) terms via a TermDocs, and IndexReader.delete(docNum)
for deleting docs.

Regards,
Paul.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: BooleanQuery - TooManyClauses Issue

2004-11-16 Thread Paul Elschot
On Tuesday 16 November 2004 21:35, Joe Krause wrote:
 Hey Folks, I just inherited a deployed Lucene based application that
 started throwing the following exception:
 
 org.apache.lucene.search.BooleanQuery$TooManyClauses
...
 I did some research regarding this error and found out that the default
 number of clauses a BooleanQuery can contain are 1024 (a limitation, but
 one that seems reasonable to work with). I outputted the contents of the
 org.apache.lucene.search.Query object and the
 org.apache.lucene.search.Sort objects right before I sent them to the
 org.apache.lucene.search.IndexSearcher - to see if there are too many
 clauses being accidentally produced. This is what I get:
 
 2004-11-16 12:09:40,302 DEBUG  com.multivision.util.search.HitIndex -
 Query = +(affiliate:teeth market:teeth dma_rank:teeth program:teeth
 station:teeth text:teeth) +air_date:[040101 TO 0411162359]
 
 2004-11-16 12:09:40,302 DEBUG com.multivision.util.search.HitIndex -
 Sort = air_date!,dma_rank
 
 So there appears to be far fewer than 1024 clauses. Is there any other
 reasons why I would be getting this exception? I am new to Lucene, so at
 this point I am stumped.

The range query:
+air_date:[040101 TO 0411162359]
is almost certainly causing your problems. It expands further to all terms
in the range. Several solutions to this have been discussed earlier,
ao. splitting dates into day and time components.
Once you approach 1000 days, you'll get the same problem again,
so you might want to use a filter for the dates.
See DateFilter and the archives on MMDD.

Regards,
Paul Elschot.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene Scorers

2004-11-13 Thread Paul Elschot
On Friday 12 November 2004 22:56, Chuck Williams wrote:
 I had a similar need and wrote MaxDisjunctionQuery and
 MaxDisjunctionScorer.  Unfortunately these are not available as a patch
 but I've included the original message below that has the code (modulo
 line breaks added by simple text email format).

 This code is functional -- I use it in my app.  It is optimized for its
 stated use, which involves a small number of clauses.  You'd want to
 improve the incremental sorting (e.g., using the bucket technique of
 BooleanQuery) if you need it for large numbers of clauses.

When you're interested, you can also have a look here for
yet another DisjunctionScorer:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31785

It has the advantage that it implements skipTo() so that it can 
be used as a subscorer of ConjunctionScorer, ie. it can be
faster in situations like this:

aa AND (bb OR cc)

where bb and cc are treated by the DisjunctionScorer.
When aa is a filter this can also be used to implement
a filtering query.

 
 Re. Paul's suggested steps below, I did not integrate this with query
 parser as I didn't need that functionality (since I'm generating the
 multi-field expansions for which max is a much better scoring choice
 than sum).
 
 Chuck
 
 Included message:
 
 -Original Message-
 From: Chuck Williams [mailto:[EMAIL PROTECTED] 
 Sent: Monday, October 11, 2004 9:55 PM
 To: [EMAIL PROTECTED]
 Subject: Contribution: better multi-field searching
 
 The files included below (MaxDisjunctionQuery.java and
 MaxDisjunctionScorer.java) provide a new mechanism for searching across
 multiple fields.

The maximum indeed works well, also when the fields differ a lot in length.
 
Regards,
Paul


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-13 Thread Paul Elschot
On Saturday 13 November 2004 09:16, Sanyi wrote:
  - leave the current implementation, raising an exception;
  - handle the exception and limit the boolean query to the first 1024
  (or what ever the limit is) terms;
  - select, between the possible terms, only the first 1024 (or what
  ever the limit is) more meaningful ones, leaving out all the others.
 
 I like this idea and I would finalize to myself like this:
 I'd also create a default rule for that to avoid handling exceptions for 
people who're happy with
 the default behavior:
 
 Keep and search for only the longest 1024 fragments, so it'll throw 
a,an,at,and,add,etc.., but
 it'll automatically keep 1024 variations like 
alpha,alfa,advanced,automatical,etc..

Wouldn't it be counterintuitive to only use the longest matches
for truncations?
To have only longer matches one can also use queries with
multiple ? characters, each matching exactly one character.

I think it would be better to encourage the users to use longer
and maybe also more prefixes. This gives more precise results
and is more efficient to execute.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-12 Thread Paul Elschot
On Friday 12 November 2004 07:57, Sanyi wrote:
  That's the point: there is no query optimizer in Lucene.
 
 Sorry, I'm not very much into Lucene's internal Classes, I'm just telling 
you the viewpoint of a
 user. You know my users aren't technicians, so answers like yours won't make 
them happy.
 They will only see that I randomly don't allow them to search (with the 1024 
limit). They won't
 understand why am I displaying Please restrict your search a bit more.. 
when they've just
 searched for dodge AND vip* and there are only a few documents mathcing 
this criteria.
 
 So, is the only way to make them able to search happily by setting the max. 
clause limit to
 MaxInt?!

The problem is that there is a lot of freedom in choosing a query, but there
is a limited amount of resources available to search each query.

It is normally possible to reduce the numbers of such complaints a lot 
by imposing a minimum prefix length and eg. doubling or tripling the max. nr.
of clauses.

This reduces the freedom of the users because their queries
must be (a bit) more specific. The actual tradeoff depends on the user
requirements and the time and memory available on the server,
so the users get what they pay for.

Imposing a minimum prefix length can be done by overriding the method
in QueryParser that provides a prefix query.

Regards,
Paul Elschot



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene Scorers

2004-11-12 Thread Paul Elschot
On Friday 12 November 2004 20:48, Ken McCracken wrote:
 Hi,
 
 I am looking at the Similarity class overview, and wondering if I can
 replace the SUM operator with a MAX operator, or any other operator
 (across the terms in a query).
 
 For example, if I search for car OR automobile, a BooleanScorer is
 used to add the values from each subexpression together.  In the
 BooleanScorer from lucene_1_4_final, in the inner class Collector, we
 have in the collect(...) method, the line
 
  bucket.score += score; // increment score
 
 that I may want replace with a MAX operator such as 
 
  if (score > bucket.score) bucket.score = score; // take the max
 
 I may also want to keep track of both the max and the sum, by
 extending the inner class Bucket.
 
 Do you have any suggestions on how to implement such a change? 
 Ideally, I would like to have the ability to define my choice of
 scoring algorithm at search time (at run time), and use the Lucene SUM
 scorer for some searches, and the MAX scorer for other searches.
 
 Thanks for you help.
 
 -Ken
 
 PS.  The code I'm talking about falls in the follwoing area, for my
 example search car OR automobile.  If I walk the code during search,
 I see that the BooleanScorer$Collector is created by the Weight that
 was just created, in BooleanQuery$BooleanWeight.scorer(...), as it
 adds the subscorers for each of the terms in the BooleanScorer.  When
 that collector is asked to collect(...), its bucketTable is filled in.
  Since the collectors for each of the terms use the same bucketTable,
 if the document already appears in the bucketTable, then it's score is
 added to implement a SUM operator.

Since you are that far already, you can (in reverse order):
- replace the BooleanScorer by another one that takes the max
 instead of summing.
- replace the weight to return that scorer.
- replace the BooleanQuery to return that weight.
- override QueryParser.getBooleanQuery() to return that query
 in the cases you want, that is when all clauses are optional.

'replace' usually means 'inherit from' in new code.
When you need more info on this, try lucene-dev.

Regards,
Paul Elschot.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Query#rewrite Question

2004-11-11 Thread Paul Elschot

On Thursday 11 November 2004 03:51, Satoshi Hasegawa wrote:
 Hello,
 
 Our program accepts input in the form of Lucene query syntax from the user, 
 but we wish to perform additional tasks such as thesaurus expansion. So I 
 want to manipulate the Query object that results from parsing.
 
 My question is, is the result of the Query#rewrite method guaranteed to be 
 either a TermQuery, a PhraseQuery, or a BooleanQuery, and if it is a 
 BooleanQuery, do all the constituent clauses also reduce to one of the above 
 three classes? If not, what if the original Query object was the one that 
 was obtained from QueryParser#parse method? Can I assume the above in this 
 restricted case?

 I experimented with the current version, and the above seems to be positive 
 in this version; I'm asking if this could change in the future. Thank you. 
 
In general, a Query should either rewrite to another query, or provide a
Weight. During search, the Weight then provides a Scorer to score the docs.

The only other type of query currently available is SpanQuery, which is
a generalization of PhraseQuery. It does not rewrite and provides a Weight.

However, the current QueryParser does not have support for SpanQuery.
So, as long as the QueryParser does not support more than the current types
of queries, and you only use the QueryParser to obtain queries, all the
constituent clauses will reduce as you indicate above.

SpanQuery could be useful for thesaurus expansion. The generalization
it provides is that it allows nested distance queries. For example, in:
"word1 word2"~2
word2 can be expanded to:
word2 or "word3 word4"~4
leading to a query that is not supported by the current QueryParser:
"word1 (word2 or "word3 word4"~4)"~2

SpanQueries can also enforce an order on the matching subqueries,
but that is difficult to express in the current query syntax.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: What is the difference between these searches?

2004-11-09 Thread Paul Elschot
Luke,

On Tuesday 09 November 2004 20:58, you wrote:
 Hi,
 
 I've implemented a converter to translate our system's internal Query
 objects to Lucene's Query model.
 
 I recently realized that my implementation of OR NOT was not working
 as I would expect and I was wondering if anyone on this list could give
 me some advice.

Could you explain OR NOT ? 

Lucene has no provision for matching by being prohibited only. This can
be achieved by indexing something for each document that can be
used in queries to match always, combined with something prohibited
in a query.
But doing this is bad for performance for querying larger nrs of docs.

Lucene's - prefix in queries means AND NOT, ie. the term with the - prefix
prohibits the matching of a document.
 
 I am converting a query that means foo or not bar into the following:
 
 +item_type:xyz +(field_name:foo -field_name:bar)
 
 This returns only Documents where field_name contains foo. I would
 expect it to return all the Documents where field_name contains foo or
 field_name doesn't contain bar.
 
 Fiddling around with the Lucene Index Toolbox, I think that this query
 does what I want:
 
 +item_type:xyz field_name:foo -field_name:bar
 
 Can someone explain to me why these queries return different results?

A bit dense, but anyway:

Anything prefixed with + is required.
Anything not having + or - prefix is optional and only influences the score.
In case there is nothing required by a + prefix, at least one of the things
without prefix is required.

Regards,
Paul Elschot.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: can lucene be backed to have an update field

2004-11-09 Thread Paul Elschot
Chris,

On Tuesday 09 November 2004 22:54, Chris Fraschetti wrote:
 Is it possible to modify the lucene source to create an
 updateDocument(doc#, FIELD, value)  function ? 

It's possible, but an implementation would not be efficient
when the field is indexed. The current index structure
has no room to spare for insertions, and no provision for
deleted terms.

Some time ago an extra level was added in the index
for skipping ahead more efficiently. Perhaps that could
be combined with a gap for insertions. But when such a gap
would fill up there would again be no choice but to delete and add 
the changed document.
Also adding a document without optimizing is quite efficient
already, so there is probably not much interest in adding
such gaps.

In case the field is stored only and the value would have the
same length as the currently stored value it would be possible
to replace the value efficiently.

The only updates available are on the field norms.
 
Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: What is the difference between these searches?

2004-11-09 Thread Paul Elschot
On Tuesday 09 November 2004 23:14, Luke Francl wrote:
 On Tue, 2004-11-09 at 16:00, Paul Elschot wrote:
 
  Lucene has no provision for matching by being prohibited only. This can
  be achieved by indexing something for each document that can be
  used in queries to match always, combined with something prohibited
  in a query.
  But doing this is bad for performance for querying larger nrs of docs.
 
 I'm familiar with Lucene's restrictions on prohibited queries, and I
 have a required clause for a field that will always be part of the query
 (it's not a nonsense value, it's the item type of the object in a CMS). 

That might also be mapped  to a filter.
 
 My problem is that I have been considering the whole query object that
 I've generated. Every BooleanQuery that's a part of my finished query
 must also have a required clause if it has a prohibited clause.
 
 I'm thinking of refactoring my code so that instead of joining together
 Query objects into a large BooleanQuery, it passes around BooleanClauses
 and assembles them into a single BooleanQuery.

It may not be possible to flatten a boolean query to a single level, eg:
(+aa +bb) (+cc +dd)
+(a1 a2) +(b1 b2)

These will generate nested BooleanQuery's iirc.

Regards,
Paul


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search speed

2004-11-02 Thread Paul Elschot
On Monday 01 November 2004 21:02, Jeff Munson wrote:
 I'm looking for tips on speeding up searches since I am a relatively new
 user of Lucene.  
 
 I've created a single index with 4.5 million documents.  The index has
 about 22 fields and one of those fields is the contents of the body tag
 which can range from 5K to 35K.  When I create the field (named
 contents) that houses the contents of the body tag, the field is
 stored, indexed, and tokenized.  The term position vectors are not
 stored.  
 
 Single word searches return pretty fast, but when I try phrases,
 searching seems to slow considerably.  When constructing the query I am
 using the standard query object where analyzer is the StandardAnalyzer:
 
 Code Example:
 Query objQuery = QueryParser.parse(sSearchString, "contents", analyzer);
 
 For example, the following query,  contents:Zanesville, it returns over
 163,000 hits in 78 milliseconds.  
 
 However, if I use this query, contents:"all parts including picture tube
 guaranteed", it returns hits in 2890 milliseconds.  Other phrases take
 longer as well.  
 
 My question is, are there any indexing tips (storing term vectors?) or
 query tips that I can use to speed up the searching of phrases?

Term vectors should not influence search times for phrases.

What you're seeing is this: for each term in your query Lucene
has to walk all the documents containing the term. For a single
term there is no speed problem because the document set for the term
is stored in a compact way on disk.
For multiple terms with large document sets the disk head needs to
move between the document sets of the terms because all sets
need to be walked synchronously over the documents to compute
the document scores.
For phrases even more disk accesses are needed to access the
term positions within the documents.
Normally the disk head seeks are degrading the performance.

One way to avoid the disk head seeks is to use fewer terms in the phrases.
Another way is to avoid using the term positions by querying for words
instead of phrases.

In case you have hardware/resources there are more options
like using faster disks and/or using RAM for critical parts of the index.
Lucene can use extra RAM in various ways. To configure that one may have
to do some java coding. Profiling can guide you there.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search speed

2004-11-02 Thread Paul Elschot
On Tuesday 02 November 2004 17:50, Jeff Munson wrote:
 Thanks for the info Paul.  The requirements of my search engine are that
 I need to search for phrases like death notice or world war ii.  You
 suggested that I break the phrases into words.  Is there a way to break
 the phrases into words, do the search, and just return the documents
 with the phrase?  I'm just looking for a way to speed up the phrase
 searches.

If you know the phrases in advance, ie. before indexing, you can index
and search them as terms with a special purpose analyzer.
It's an unusual solution, though.
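
One simple way to get that effect without writing a full analyzer is to add each
known phrase as a single un-tokenized keyword in an extra field at indexing time
and query that field with a TermQuery. A sketch (untested; the "phrases" field
name is made up, and case normalisation is glossed over):

  // Indexing: the whole phrase becomes one term, so no position data is needed.
  Document doc = new Document();
  doc.add(Field.Text("contents", bodyText));
  if (bodyText.toLowerCase().indexOf("world war ii") != -1) {
      doc.add(Field.Keyword("phrases", "world war ii"));
  }

  // Searching: a single TermQuery instead of a PhraseQuery.
  Query q = new TermQuery(new Term("phrases", "world war ii"));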

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: When do document ids change

2004-10-29 Thread Paul Elschot
Justin,

On Friday 29 October 2004 20:48,  you wrote:
 Given an FSDirectory based index A.
 Documents are added to A with an IndexWriter
   minMergeDocs = 2
   mergeFactor = 3
 
 Documents are never deleted.
 
 Once the RAMDirectory merges documents to the index:
 
 a) will the documentID values for index A ever change?

A document id may change after deleting a document that was
added earlier than that document: adding more documents may then
change the id, and optimizing the index will then change it.

 b) can a mapping between a term in the document and newly created
 documentID be made?

Yes. See below on how.
 
 Why I am asking this question:
 I have a database with about 10M rows in it.  My search engine needs
 to be able to quickly
 get all the rows back from the database that match a query.  All the
 rows need to be
 returned at once, because the entire result set is sorted based on user input.  

Did you try IndexSearcher.search() or Search.search() with a Sort argument?

 What I want to do:
 When a documentID gets assigned to a document, I want to update the
 database row with
 that matches the document field id with the lucene documentID.  That
 way, I can use a
 hitcollector to gather just the documentID values from the search and
 insert them into a
 temporary cache table, then grab the matching rows from the database. 
 This will work assuming the documentID values for the given document
 never change.

It will work on the condition that documents are never (in the absolute sense)
deleted from the lucene index, and that one never merges indexes.
 
 Currently, running an IndexSearcher.search() and getting all the rows
 back takes between
 5 and 30 seconds for most queries, which is certainly not fast enough.
  The time it takes to collect the documentIDs however is less than 1
 second.  All the time is taken by calling
 hits.doc() for each document to get the id field to insert into the database. 

One can speed up retrieving data from Lucene indexes by retrieving
in the order of docId, via indexReader.document(docId). Make sure
no other threads are using the index at the same time.
One can also store the Lucene files with the stored fields on another disk,
but for that some coding is needed.

You may have to implement your own HitCollector. 
Lucene does not guarantee that the hits are collected in 
order of docId, but the collecting order is normally not far off.
 
 So finally,  will what I want to do work, and if so, how can I go

It will work, but I would not recommend it. Just retrieve what you
need from the Lucene index in the order of the docId's.
Try and store as little data per document as possible.

 about updating the database when the documentID is created?

To know the docId use an indexed primary key in lucene and search
for it using IndexReader.termDocs(new Term(keyField, keyValue)).
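
For example, a lookup along these lines (a sketch, untested; the index path is a
placeholder and keyField/keyValue are your primary key field and value):

  IndexReader reader = IndexReader.open("/path/to/index");
  TermDocs td = reader.termDocs(new Term("keyField", keyValue));
  int docId = -1;
  if (td.next()) {
      docId = td.doc();   // at most one hit when keyField is unique
  }
  td.close();
  reader.close();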

Regards,
Paul Elschot.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Need advice: what Word/Excel/PowerPoint lib to use?

2004-10-25 Thread Genty Jean-Paul
At 17:05 25/10/2004, you wrote:
of course POI, for open source.
There are some commercial products based on POI also.
for WORD consider textmining.org
for XLS, POI does anything you need
for powerpoint  there is one commercial (it's about 1000$), but you can 
also find some source code in archives.
 And what do you think about using Open Office's UNO APIs  ?
 If someone did, does it scale well ? (I just did some unit testing )
Jean-Paul  


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Need advice: what Word/Excel/PowerPoint lib to use?

2004-10-25 Thread Genty Jean-Paul
At 19:42 25/10/2004, you wrote:
At 17:05 25/10/2004, you wrote:
of course POI, for open source.
There are some commercial products based on POI also.
for WORD consider textmining.org
for XLS, POI does anything you need
for powerpoint  there is one commercial (it's about 1000$), but you can 
also find some source code in archives.

 And what do you think about using Open Office's UNO APIs  ?
I didn't know about them. Are they implemented in Java?
Yes
 Check out  http://api.openoffice.org/ , They have good examples, I can 
also provide you my small test.
 You can do some amazing things with their API.

Do they support all MSOffice formats (97/2000/XP)?
Check http://www.openoffice.org/product/docs/OOoFlyer11s.pdf
Jean-Paul

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


problems deleting documents / design question

2004-10-22 Thread Paul
Hi,
I'm creating an index from several database tables. Every item within
every table has a unique id which is saved in some kind of id-field
and the table name in an other one. So together they form a unique
identifier within the index. When deleting / updating an item I need
to retrieve it. My first idea was
indexreader.delete(new Term(id, id-value));
but this could delete several entries as id-value may appear in
several databases.
My second idea was to combine database name and id to form a kind of
unique identifier but this seems to be not the right way as the
problem may occur again with some sub-ids within a certain table.
So my question is: is it possible to determine the item to be deleted
by more than one term?

thx,
Paul

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: threading and indexing......

2004-10-16 Thread Paul Elschot
On Saturday 16 October 2004 02:14, Otis Gospodnetic wrote:
 If all 4 threads use the same instance of IndexWriter everything should
 be okay, as Lucene synchronizes vital blocks.

And on a single CPU with a single disk using up to three threads even
gives a bit of a speed up over one thread, 10-15% iirc. More threads
were of no use for me in that case.

Regards,
Paul Elschot

 Otis

 --- Chris Fraschetti [EMAIL PROTECTED] wrote:
  if i have four threads all trying to call my index function, will
  lucene do what is necessary for each thread to wait until the writer
  is available.. or will the threads get an exception?
 
  --
  ___
  Chris Fraschetti, Student CompSci System Admin
  University of San Francisco
  e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: sorting and score ordering

2004-10-13 Thread Paul Elschot
On Wednesday 13 October 2004 19:53, Chris Fraschetti wrote:
 Is there a way I can (without recompiling) ... make the score have
 priority and then my sort take effect when two results have the same
 rank?

 Along with that, is there a simple way to assign a new scorer to the
 searcher? So I can use the same lucene algorithm for my hits, but
 tweak it a little to fit my needs?

There is no one-to-one relationship between a searcher and a scorer.

When a query consists eg. of two terms, there will be three scorers
executing the search for that query: one TermScorer for each term,
and one scorer to combine the other two to provide the search results,
usually a BooleanScorer or a ConjunctionScorer.
For proximity queries, other scorers are used.
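
The scorers themselves are not pluggable, but the formula they use is: the usual
hook for tweaking scoring without recompiling Lucene is a custom Similarity set
on the searcher. A minimal sketch (untested; the tf() override is just an
example, and the index path is a placeholder):

  Similarity mySimilarity = new DefaultSimilarity() {
      public float tf(float freq) {
          return freq > 0 ? 1.0f : 0.0f;   // e.g. ignore term frequency
      }
  };
  IndexSearcher searcher = new IndexSearcher("/path/to/index");
  searcher.setSimilarity(mySimilarity);
  Hits hits = searcher.search(query);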

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Special field values

2004-10-12 Thread Paul Elschot
On Tuesday 12 October 2004 15:02, Otis Gospodnetic wrote:
 Hello Michael,

 This is something you'd have to code on your own.

 Otis

 --- Michael Hartmann [EMAIL PROTECTED] wrote:
  Hi everybody,
 
  I am thinking about extending the Lucene search with metadata in the
  following way
 
  Field   Value

 ---

  Title   (n1, n2, n3, ..., nm) | ni element of {0,1} and m amount of
  distinct
  metadata values for title
 
  Expressed in an informal way, I want to store a tuple of values in a
  field.
  The values in the tuple show whether a value is used in the title or
  not.

A Lucene index can easily be used to determine whether or not a term is
in a field of a document:

IndexReader.open(indexName).termDocs(new Term(field, term)).skipTo(documentNr)

returns the boolean indicating that.
What do you need the {0,1} values for?

Regards,
Paul Elschot.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Special field values

2004-10-12 Thread Paul Elschot
On Tuesday 12 October 2004 19:27, Paul Elschot wrote:


 IndexReader.open(indexName).termDocs(new Term(field,
 term)).skipTo(documentNr)

 returns the boolean indicating that.

Well, almost. When it returns true one still needs to check the TermDocs
for being at the documentNr.
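
Put together, the check could look like this (sketch, untested; reader is an
open IndexReader, and the field, term and docNr are placeholders):

  TermDocs td = reader.termDocs(new Term("title", "someterm"));
  boolean contains = td.skipTo(docNr) && td.doc() == docNr;
  td.close();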

Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to pull document scoring values

2004-09-29 Thread Paul Elschot
Zia,

On Tuesday 28 September 2004 21:22, you wrote:
 Hi,

 I'm trying to learn the Scoring mechanism of Lucene. I want to fetch
 each parameter value individually as they are collectively dumped out by
 Explanation. I've managed to pull out TF and IDF values using
 DefaultSimilarity and FilterIndexReader, but not sure from where to get
 the fieldNorm and queryNorm from.

The norms are here:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#norms(java.lang.String)
The resulting array is indexed by the document number for the IndexReader.
With the default similarity, each norm is the inverse square root of the number of 
indexed terms in the 
document field. However, there are only 8 bits available to encode this value, so it's 
quite rough.

The default queryNorm is here:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/DefaultSimilarity.html#queryNorm(float)
There is an explanation of the scoring in the javadocs of Similarity.
There has been some discussion on an idf factor that was missing from this
documentation; I don't know whether the docs have been adapted for this.

 Also is there any reference about how normalisation has been
 implemented?

See above, DefaultSimilarity is the default implementation of the Similarity interface.
queryNorm() takes a sumOfSquaredWeights, where the weights are the term weights
from the query. It returns the square root.

It may be that the sum of squared weights should be a sum of square rooted weights
and that queryNorm should return a square then.
I posted this on lucene-user on 20 September:
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=10023

It's only a normalisation, so it doesn't affect the order of the search results much.
Taking the square roots of the query term weights would have
the query weights directly applied to the query term density in the document field,
whereas now the weights seem to be applied to the square root of the density.
The density value is an approximation, see above for the rough field norms.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to pull document scoring values

2004-09-29 Thread Paul Elschot
On Wednesday 29 September 2004 15:41, Zia Syed wrote:
 Hi Paul,
 Thanks for your detailed reply! It really helped a lot.
 However, I am experiencing some conflicts.

 For one of the documents in result set, when i use

 IndexReader fir=FilterIndexReader.open(index);
 byte[] fNorm=fir.norms("Body");
 System.out.println("FNorm: " + fNorm[306]);
 Document d=fir.document(306);
 Field f=d.getField("Body");

 System.out.println("Body: " + f.stringValue());

 This gives me out fNorm 113, whereas total number of term (including
 stop-words) are 42 in this particular field of selected document. In the
 explanation , fieldNorm (field=Body, doc=306) is 0.1562, which is approx
 41 term words for that field in that documents. So explanation values
 makes sense with real data, while including all stop words like to,it,
 the  etc.

 So, my question is,

  Am i getting the norm values from right place?

Yes, but the stored norms are encoded/decoded:
byte Similarity.encodeNorm(float)
float Similarity.decodeNorm(byte)

  Is there any way to find out number of indexed terms for each

 document?

By default, the stored norm is the inverse square root of 
the number of indexed terms of an indexed document field.
The encoding/decoding is somewhat rough, though.
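
For example, to turn a stored norm back into an approximate indexed term count
(sketch, untested, reusing the fir reader and document 306 from the code above):

  byte[] norms = fir.norms("Body");
  float norm = Similarity.decodeNorm(norms[306]);   // roughly 1/sqrt(numTerms)
  float approxNumTerms = 1.0f / (norm * norm);      // rough: only 8 bits of precision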

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: PHP and Lucene

2004-09-22 Thread Paul Waite

Erik Hatcher wrote:

 On Sep 15, 2004, at 1:45 PM, Karthik N S wrote:
  1) Is a there a PHP version of Lucene Implemantation avaliable , If so
  Where?

 Using the Java version of Lucene from PHP is my recommendation.  There
 is not a PHP version.  I'm not familiar with PHP details, but I suspect
 you can very easily use the Java version somehow.


A bit tardy, but I was in-between versions, hence wanted to wait until
I had posted the new ones up.

We have developed a java-based daemon we call Luceneserver, which
listens on a port and understands either of two text protocols, one
line-based and one XML. This allows people to set up a server box
centrally, and then use Php, Perl, Java, or whatever to index/search a
central Lucene repository pretty easily. It has been designed such that
you can partition off separate domains (eg. websites) within the
same index, if you wish.

In particular we've also developed a family of Php classes to talk to the
above via the XML protocol, included in an opensource web development
platform we call Axyl.

Taken together, all of this might (or might not) be of some use to the
original poster, as a starting point, or just for ideas.

Version 2.1.1-1 of Axyl and Axyl-Lucene is available at:
 http://sourceforge.net/projects/axyl

Cheers,
Paul.
- -- 
LIBRA (Sept. 23 - Oct. 22)
 Major achievements, new friends, and a previously unexplored way
 to make a lot of money will come to a lot of people today, but
 unfortunately you won't be one of them.  Consider not getting out
 of bed today.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: WildCardQuery

2004-09-21 Thread Paul Elschot
On Tuesday 21 September 2004 06:50, Raju, Robinson (Cognizant) wrote:
 Is there a limitation in Lucene when it comes to wildcard search ?
 Is it a problem if we use less than 3 characters along with a
 wildcard(*).
 Gives me error if I try using 45* , *34 , *3 ..etc .
 Too Many Clauses Error
 Doesn't happen if '?' is used instead of '*'.
 The intriguing thing is , that it is not consistent . 00* doesn't fail.
 Am I missing something ?

The number of clauses added to the query equals the number of
indexed terms that match the wildcard. As each clause ends up using
some buffer memory internally, a maximum was introduced to
avoid running out of memory.
You can change the maximum nr of added clauses using
BooleanQuery.setMaxClauseCount() but then it is advisable
to monitor memory usage, and possibly increase heap space for the JVM.
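
For example (the value is arbitrary; the default is 1024):

  // Raise the clause limit before the wildcard query is rewritten,
  // and keep an eye on heap usage (e.g. start the JVM with -Xmx256m).
  BooleanQuery.setMaxClauseCount(10000);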

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: displaying 'pages' of search results...

2004-09-21 Thread Paul Elschot
On Tuesday 21 September 2004 21:33, Chris Fraschetti wrote:
 I was wondering what the best way was to go about returning say
 1,000,000 results, divided up into say 50 element sections and then
 accessing them via the first 50, second 50, etc etc.

 Is there a way to keep the query around so that lucene doesn't need to
 search again, or would the search be cached and no delay arise?

 Just looking for some ideas and possibly some implementational issues...

Lucene's Hits class is designed for paging through search results.

In which order would you need the 1.000.000 results?
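
A paging loop over Hits could look like this (sketch, untested; page and
pageSize come from the application):

  Hits hits = searcher.search(query);
  int start = page * pageSize;
  int end = Math.min(start + pageSize, hits.length());
  for (int i = start; i < end; i++) {
      Document doc = hits.doc(i);   // documents are fetched lazily, per page
      // ... render doc ...
  }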

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Too many boolean clauses

2004-09-20 Thread Paul Elschot
On Monday 20 September 2004 18:27, Shawn Konopinsky wrote:
 Hello There,

 Due to the fact that the [# TO #] range search works lexicographically, I am
 forced to build a rather large boolean query to get range data from my
 index.

 I have an ID field that contains about 500,000 unique ids. If I want to
 query all records with ids [1-2000],  I build a boolean query containing
 all the numbers in the range. eg. id:(1 2 3 ... 1999 2000)

 The problem with this is that I get the following error :
 org.apache.lucene.queryParser.ParseException: Too many boolean clauses

 Any ideas on how I might circumvent this issue by either finding a way to
 rewrite the query, or avoid the error?

You can use this as an example:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/DateFilter.java

(Just click view on the latest version to see the code).

and iterate over your doc ids instead of over dates.
This will give you a filter for the doc ids you want to query.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Similarity scores: tf(), lengthNorm(), sumOfSquaredWeights().

2004-09-20 Thread Paul Elschot

After last week's discussion on idf() of the similarity score computation
I looked into the score computation a bit deeper.

In the DefaultSimilarity tf() is the sqrt() and lengthNorm() is the inverse
of sqrt(). That means that the factor (docTf * docNorm) actually
implements the square root of the density of the query term in the
document field (ignoring the encoding and decoding of the norm).

Summing these weighted square roots resembles a 
Salton OR p-Norm for p = 1/2, except that Salton defined
the p-Norm's for p = 1, and the result is more like an AND
p-Norm because it depends mostly on the minimum argument.

The p-Norm also requires that the sum is taken to the power 1/p,
but this is not necessary as it would not change the ranking.
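
For reference, the weighted OR p-Norm from the extended boolean model has, as
far as I recall, the form

  sim_{or,p}(q,d) = \left( \frac{\sum_i w_i^p \, x_i^p}{\sum_i w_i^p} \right)^{1/p}

where x_i is the weight of query term i in the document and w_i its query
weight; with p = 1/2 and the final power 1/p dropped this reduces to the sum of
weighted square roots described above.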

I looked around for p-Norm's with 0 < p < 1, but I didn't find
anything. Is there really nothing about this? A good discussion is here:
http://elvis.slis.indiana.edu/irpub/SIGIR/1994/cite19.htm

I would guess that since the sqrt() has an infinite derivative at zero, it
might well be that this OR p-Norm for p = 1/2 behaves much like a
rather high power AND p-Norm.

The basic summing form of the OR p-Norm also allows a very easy
implementation by just summing the weighted square roots; an AND
p-Norm for p = 1 would have needed some more calculations.
Is this perhaps one of the reasons for using a power p < 1?

Taking this a bit further, I also wonder about the name of
sumOfSquaredWeights() in the Weight interface.
Shouldn't that rather be  sumOfPowerWeights() and 
by default implement a sum of square roots?
This would allow a more straightforward comprehension
of the term weights as directly weighing the term densities.

Section 5 of the reference above has the full weighted
p-Norm formula's. The OR p-Norm there is very close
to the Lucene formula without coord().

Regards,
Paul Elschot

On Tuesday 14 September 2004 23:49, Doug Cutting wrote:
 Your analysis sounds correct.

 At base, a weight is a normalized tf*idf.  So a document weight is:

docTf * idf * docNorm

 and a query weight is:

queryTf * idf * queryNorm

 where queryTf is always one.

 So the product of these is (docTf * idf * docNorm) * (idf * queryNorm),
 which indeed contains idf twice.  I think the best documentation fix
 would be to add another idf(t) clause at the end of the formula, next to
 queryNorm(q), so this is clear.  Does that sound right to you?

 Doug

 Ken McCracken wrote:
  Hi,
 
  I was looking through the score computation when running search, and
  think there may be a discrepancy between what is _documented_ in the
  org.apache.lucene.search.Similarity class overview Javadocs, and what
  actually occurs in the code.
 
  I believe the problem is only with the documentation.
 
  I'm pretty sure that there should be an idf^2 in the sum.  Look at
  org.apache.lucene.search.TermQuery, the inner class TermWeight.  You
  can see that first sumOfSquaredWeights() is called, followed by
  normalize(), during search.  Further, the resulting value stored in
  the field value is set as the weightValue on the TermScorer.
 
  If we look at what happens to TermWeight, sumOfSquaredWeights() sets
  queryWeight to idf * boost.  During normalize(), queryWeight is
  multiplied by the query norm, and value is set to queryWeight * idf
  == idf * boost * query norm * idf == idf^2 * boost * query norm.  This
  becomes the weightValue in the TermScorer that is then used to
  multiply with the appropriate tf, etc., values.
 
  The remaining terms in the Similarity description are properly
  appended.  I also see that the queryNorm effectively cancels out
  (dimensionally, since it is a 1/ square root of a sum of squares of
  idfs) one of the idfs, so the formula still ends up being roughly a
  TF-IDF formula.  But the idf^2 should still be there, along with the
  expansion of queryNorm.
 
  Am I mistaken, or is the documentation off?
 
  Thanks for your help,
  -Ken
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Too many boolean clauses

2004-09-20 Thread Paul Elschot
On Monday 20 September 2004 20:54, Shawn Konopinsky wrote:
 Hey Paul,

 Thanks for the quick reply. Excuse my ignorance, but what do I do with the
 generated BitSet?

You can return it in the bits() method of the object implementing your
org.apache.lucene.search.Filter (http://jakarta.apache.org/lucene/docs/api/index.html)
Then pass the Filter to IndexSearcher.search() with the query.
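
A minimal filter along these lines could look like this (sketch, untested,
1.4-era API; the "id" field name and the integer range handling are
simplifications):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

public class IdRangeFilter extends Filter {
    private final int low, high;

    public IdRangeFilter(int low, int high) {
        this.low = low;
        this.high = high;
    }

    // Set a bit for every document whose "id" term falls in [low, high];
    // for good disk behaviour the terms would ideally be visited in their
    // sorted (lexicographic) order.
    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int id = low; id <= high; id++) {
            TermDocs termDocs = reader.termDocs(new Term("id", Integer.toString(id)));
            while (termDocs.next()) {
                bits.set(termDocs.doc());
            }
            termDocs.close();
        }
        return bits;
    }
}

Then searcher.search(query, new IdRangeFilter(1, 2000)) restricts the query to
those ids, and the filter can be cached and reused as long as the index does
not change.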

Regards,
Paul


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Too many boolean clauses

2004-09-20 Thread Paul Elschot
On Monday 20 September 2004 20:54, Shawn Konopinsky wrote:
 Hey Paul,

...

 Also - we are using a pooling feature which contains a pool of
 IndexSearchers that are used and tossed back each time we need to search.
 I'd hate to have to work around this and open up an IndexReader for this
 particular search, where all other searches use the pool. Suggestions?

You could use a map from the IndexSearcher back to the IndexReader that was
used to create it. (It's a bit of a waste because the IndexSearcher has a reader
attribute internally.)

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



problem with locks when updating the data of a previous stored do cument

2004-09-16 Thread Paul Williams
Hi,

Using lucene-1.4.1.jar on WinXP  

I am having trouble with locking and updating an existing Lucene document. I
delete the old document from the index and then add the new document to the
index writer. I am using minMergeDocs set to 100 (much quicker!!) and
close the writer once the batch is done, so the documents are flushed to the
filesystem.

The problem I am having is I can't delete the old version of the document
(after the first document has been added) using reader.delete because there
is a lock on the index due to the IndexWriter being open.

Am I doing this wrong or is there a simple way round this?

Regards,

Paul


Code snippets of the update code (I have just cut and pasted the relevant
lines from my app to give an idea)


reader = IndexReader.open(location);
// Delete old doc/term if present
if (reader.docFreq(docNumberTerm) > 0) {
reader.delete(docNumberTerm);
.
.
.

IndexWriter writer = null;

// get the writer from the hash table so last few are cached and don't
have to be restarted
synchronized(IndexWriterCache) {

   String dbstring = "" + ldb;
   writer = (IndexWriter)IndexWriterCache.get(dbstring);

   if (writer == null) {
   //Not in cache so create one and add to cache for next time

   writer = new IndexWriter(location, new StandardAnalyzer(),
new_index);

   writer.setUseCompoundFile(true);

   // Set the maximum number of entries per field. Default is
10,000
   writer.maxFieldLength = MaxFieldCount;

   // Set how many docs will be stored in memory before being
saved to disk
   writer.minMergeDocs = (int) DocsInMemory;

   IndexWriterCache.remove(dbstring);
   IndexWriterCache.put(dbstring, writer);
}
.
.
.
  
// Add the documents to the Lucene index
writer.addDocument(doc);




.
. Some time later after a batch of docs been added
 
   writer.close();








-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Build problems

2004-09-03 Thread Paul Elschot
Danny,

On Friday 03 September 2004 20:53, [EMAIL PROTECTED] wrote:
 I'm trying to build Lucene with ant (in XP) from the prompt
 I got the ant-optional.jar from
 http://archive.apache.org/dist/ant/binaries/ because I
 couldn't find it anywhere else.  I'm running the newest
 version of ant and when I go into the lucene base directory
 and type 'ant' it finds the build.xml file but then gives the
 following error:

 BUILD FAILED
 C:\lucene\build.xml:140: srcdir C:\lucene\src\java does not
 exist!


The src/java directory normally contains the java source files.

Since that directory doesn't exist you may want to create it
by installing the sources, eg. by checking out from cvs, or
from a jar that contains the java sources, available here:
http://dist.apache.easynet.nl/jakarta/lucene/source/

Lucene 1.4.1 is out, but it's not available there yet.
In case you want that version please ask on lucene-dev.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Using 2nd Index to constraing Search

2004-08-27 Thread Paul Elschot
On Friday 27 August 2004 20:10, Mike Upshon wrote:
 Hi

 Just starting to evaluate Lucene and hope somone can answer this
 question.

 I am looking at using Lucene to index a very large database. There is a
 documents table and a few other tables that define what users can view
 what documents. My question is, is it possible to have an index of the

The normal way of doing that is to:
- make a list of all doc id's for the user.
- from this list construct a Filter for use in the full text index.
Sort the doc id's, use an IndexReader on the full text index, construct
a Term for each doc id, walk the termDocs() for the Term, and set
a bit in the filter to allow the document number for the doc id.
- keep this filter to restrict the searches for the user by
IndexSearcher.search(Query,Filter)
- rebuild the filter when the doc id's for the user change, or when
the full text index changes (a document deletion followed
by an optimize or an add can change any other document's number).

Hmm, this is getting to be a FAQ.

 full text contents of the documents and another index that contains the
 document id's and the user id's and then use the 2nd index to qualify
 the full text search over the document table. The reason I want to do
 this is to reduce the numbers of documents that the full text query will
 run.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Question concerning speed of Lucene.....

2004-08-27 Thread Paul Elschot
Oliver,

On Friday 27 August 2004 22:20, you wrote:
 Hi,

 I guess this one of the most often asked question on this mailing list, but
 hopefully my question is more specific, so that I can get some input from
 you.

 My project is to implement an agency system for newspapers. So I have to
 handle about 30 days of text and IPTC data. The later is taken from images
 provided by the agencies. I basically get a constant stream of text
 messages from the agencies (roughly 2000 per day per agency) and images
 (roughly 1000 per day per agency). I have to deal with 4 text and 6 image
 agencies. So my daily input is 8000 text messages and 6000 images. The
 extracted documents from these text messages and images have a size of
 about 1kb.

 The extraction of the data and converting them to Document objects is
 already finished and the search using lucence works like a charm. Brilliant
 software!

 But now to my questions. In order to understand what I am doing, like to
 talk a little about the kind of queries and data I have to deal with.

 * Every message has a priority. An integer value ranging from 1 to 6.
 * Every message has a receive date.
 * Every message has an agency assigned, basically a unique string
 identifier for it.
 * Every message has some header data, that is also indexed for refined
 searches.
 * And of course the actual text included in the text message itself or
 the IPTC header of an image.

 Typically I have two kinds of queries.

 * Typical relational queries

 * Show every text messages from a certain agency in the last X days.

Probably good for a date filter; see the wiki on RangeQuery, and possibly my
previous message on filters (Using 2nd Index to constrain Search). Lucene has
no facilities for primary keys, so that is up to you.
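
For the first kind of query a sketch could be (untested; the field names, the
agency value and the YYYYMMDD date format are assumptions):

  // All messages from one agency in a given date range; dates indexed as
  // "YYYYMMDD" strings so that lexicographic order equals chronological order.
  BooleanQuery q = new BooleanQuery();
  q.add(new TermQuery(new Term("agency", "reuters")), true, false);        // required
  q.add(new RangeQuery(new Term("date", "20040801"),
                       new Term("date", "20040827"), true), true, false);  // required, inclusive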

 * Show every image or text message with a higher priority then Y and
 from a certain period of time.

RangeQuery again for the priority.
One can store images in Lucene, but currently only in String format, ie.
they'll need some conversion. There was some talk on binary
objects not too long ago, but that is still in development. I'd probably store the
images in a file system or in another db for now. OTOH, if you're willing
to help storing binary images lucene-dev is nearby.

 * Fulltext search

Yes :)

 * A real fulltext search over all elements using the full power of
 lucences query language.

Span queries are currently not supported by the query language,
you might have a look at the org.apache.lucene.search.spans package.

 It is absolutely no question anymore, that the later queries will be done
 using Lucene. But can the first type of query is the thing I am thinking
 about. Can this be done effeciently with Lucene? So far we use a system

Lucene can be as fast as relational databases, provided your lower level
java code on IndexReader plays nice with system resources like disk heads
and RAM.
That means using filters, sorting on index order before using an index
and evt. sorting on document number before retrieving stored fields.
Lucene's IndexSearcher for searching text queries is quite well behaved
in that respect. 

 that uses a SQL database engine for storing the relevant data and is used
 in these queries. But if Lucene is fast enough with these queries too, I am
 willing to skip the SQL database at all. But I have to remind, that I will
 be indexing about 400.000 messages per month.

To easily keep the primary keys in sync between the SQL db and Lucene,
I'd start by keeping the images and the full text only in the SQL db.
Lucene optimisations (needed after adding/deleting docs) copy all data
so it pays to keep the Lucene indexes small.

Later you might need multiple indexes, MultiSearcher, and occasionally
a merge of the indexes.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How not to show results with the same score?

2004-08-25 Thread Paul Elschot
On Wednesday 25 August 2004 12:21, B. Grimm [Eastbeam GmbH] wrote:
 hi there,

 i browsed through the list and had some different searches but i do not
 find, what i'm looking for.

 i got an index which is generated by a bot, collecting websites. there
 are sites like www.domain.de/article/1 and www.domain.de/article/1?page=1
 these different urls have the same content and when u search for a word,
 matching, both are returned, which is correct.

 they have excatly the same score because of there content an so one, so
 i would like to know if its possible to group by (mysql, of course)
 the returned score, so that only the first match is collected into
 Hits and all following matches with the same score are ignored.

 it would be great if anyone has an idea how to do that.

You can implement your own HitCollector and pass it to IndexSearcher.search()
Have a look at the javadocs of the org.apache.lucene.search package,
it's quite straightforward. The PriorityQueue from the
util package is useful to collect results. For every distinct score you could
store an int[] of document nrs in there while collecting the hits.
Basically you'll end up implementing your own Hits class.
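
A stripped-down sketch of such a collector (untested; it groups strictly on
equal float scores, which is what was asked for here but fragile in general,
and it ignores result ordering):

import java.util.ArrayList;
import java.util.HashSet;
import org.apache.lucene.search.HitCollector;

// Keeps only the first document seen for each distinct score value.
public class FirstPerScoreCollector extends HitCollector {
    private final HashSet seenScores = new HashSet();
    public final ArrayList docIds = new ArrayList();

    public void collect(int doc, float score) {
        if (seenScores.add(new Float(score))) {
            docIds.add(new Integer(doc));
        }
    }
}

Pass an instance to IndexSearcher.search(query, collector) and read the
collected document numbers afterwards.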

For URL's that have the same content, it's better
to store multiple URL's for the same document. However, this
merging is normally done by a crawler because the same contents
means the same outgoing URL's. Crawlers also keep track
of multiple host names resolving to the same IP address.

In case you need to crawl and index an intranet or more, have a look
at Nutch.

Regards,
Paul Elschot




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Paul Elschot
On Wednesday 18 August 2004 22:44, Rob Jose wrote:
 Hello
 I have indexed several thousand (52 to be exact) text files and I keep
 running out of disk space to store the indexes.  The size of the documents
 I have indexed is around 2.5 GB.  The size of the Lucene indexes is around
 287 GB.  Does this seem correct?  I am not storing the contents of the

As noted, one would expect the index size to be about 35%
of the original text, ie. about 2.5 GB * 35% = 875 MB.
That is two orders of magnitude off from what you have.

Could you provide some more information about the field structure,
ie. how many fields, which fields are stored, which fields are indexed,
any use of non-standard analyzers, and any non-standard
Lucene settings?

You might also try to change to non compound format to have a look
at the sizes of the individual index files, see file formats on the lucene
web site.
You can then see the total disk size of for example the stored fields.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: PDFBox Issue

2004-08-17 Thread Paul Smith
What version of the log4j jar are you using? 

 -Original Message-
 From: Don Vaillancourt [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, June 29, 2004 8:06 AM
 To: Lucene Users List
 Subject: PDFBox Issue
 
 Hi all,
 
 I know that this is a Lucene list but wanted to know if any of you have
 gotten this error before using PDFBox?
 
 I've gotten the latest version of PDFBox and it is giving me the following
 error:
 
 java.lang.VerifyError: (class: org/apache/log4j/LogManager, method:
 clinit signature: ()V) Incompatible argument to function
 at org.apache.log4j.Logger.getLogger(Logger.java:94)
 at org.pdfbox.pdfparser.PDFParser.clinit(PDFParser.java:57)
 at
 org.pdfbox.searchengine.lucene.LucenePDFDocument.addContent(LucenePDFDocum
 ent.java:197)
 at
 org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocu
 ment.java:118)
 at Index.indexFile(Index.java:287)
 at Index.indexDirectory(Index.java:265)
 at Index.update(Index.java:63)
 at Lucene.main(Lucene.java:26)
 Exception in thread main
 
 I am using all the jar files that came with PDFBox.
 
 Anyone run into this problem.  I am using the following line of code:
 
 Document doc = LucenePDFDocument.getDocument(f);
 
 Thanks
 
 
 Don Vaillancourt
 Director of Software Development
 
 WEB IMPACT INC.
 416-815-2000 ext. 245
 email: [EMAIL PROTECTED]
 web: http://www.web-impact.com
 
 
 
 
 This email message is intended only for the addressee(s)
 and contains information that may be confidential and/or
 copyright.  If you are not the intended recipient please
 notify the sender by reply email and immediately delete
 this email. Use, disclosure or reproduction of this email
 by anyone other than the intended recipient(s) is strictly
 prohibited. No representation is made that this email or
 any attachments are free of viruses. Virus scanning is
 recommended and is the responsibility of the recipient.
 
 
 
 
 
 
 
 
 
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: PDFBox Issue

2004-08-17 Thread Paul Smith
I actually thought it might have been trying to use the log4j 1.3 'alpha'
build (there is no 'alpha' build yet, but notionally the latest HEAD isn't
too far from it).  There has been a subtle change to log4j in recent months
that could have a similar impact.
Cheers,

Paul Smith
 -Original Message-
 From: Ben Litchfield [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, August 17, 2004 10:48 PM
 To: Lucene Users List
 Subject: Re: PDFBox Issue
 
 
 PDFBox comes with log4j version 1.2.5(according to MANIFEST.MF in jar
 file), I believe that 1.2.8 is the latest.  I will make sure that the next
 version of PDFBox includes the latest log4j version, which I assume is
 what everybody would like to use.
 
 But, by looking at the below error message it appears that you might have
 an older log4j in your classpath
 
 Logger.getLogger( Class ) is available in 1.2.5 and 1.2.8
 
 
 Ben
 
 
 On Tue, 17 Aug 2004, Don Vaillancourt wrote:
 
  Wow, this is an old message.
 
  I managed to get my code to work by using the previous version of
  PDFBox.  I had used the version of log4j that had come with PDFBox.
 
  Someone had mentioned recompiling log4j, but I couldn't get the project
  to import the source into Eclipse, so I gave up.  But things work great
  with the version of PDFBox that I compiled with so I am fine with that.
 
  As for the version of log4j, I could not tell you, as I said above it
  came with PDFBox, so I'm guessing that it had probably not been tested
  with the version of log4j it was being distributed with.
 
  Paul Smith wrote:
 
  What version of the log4j jar are you using?
  
  
  
  -Original Message-
  From: Don Vaillancourt [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, June 29, 2004 8:06 AM
  To: Lucene Users List
  Subject: PDFBox Issue
  
  Hi all,
  
  I know that this is a Lucene list but wanted to know if any of you
 have
  gotten this error before using PDFBox?
  
  I've gotten the latest version of PDFBox and it is giving me the
 following
  error:
  
  java.lang.VerifyError: (class: org/apache/log4j/LogManager, method:
  clinit signature: ()V) Incompatible argument to function
  at org.apache.log4j.Logger.getLogger(Logger.java:94)
  at org.pdfbox.pdfparser.PDFParser.clinit(PDFParser.java:57)
  at
 
 org.pdfbox.searchengine.lucene.LucenePDFDocument.addContent(LucenePDFDoc
 um
  ent.java:197)
  at
 
 org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDo
 cu
  ment.java:118)
  at Index.indexFile(Index.java:287)
  at Index.indexDirectory(Index.java:265)
  at Index.update(Index.java:63)
  at Lucene.main(Lucene.java:26)
  Exception in thread main
  
  I am using all the jar files that came with PDFBox.
  
  Anyone run into this problem.  I am using the following line of code:
  
  Document doc = LucenePDFDocument.getDocument(f);
  
  Thanks
  
  
  Don Vaillancourt
  Director of Software Development
  
  WEB IMPACT INC.
  416-815-2000 ext. 245
  email: [EMAIL PROTECTED]
  web: http://www.web-impact.com
  
  
  
  
  This email message is intended only for the addressee(s)
  and contains information that may be confidential and/or
  copyright.  If you are not the intended recipient please
  notify the sender by reply email and immediately delete
  this email. Use, disclosure or reproduction of this email
  by anyone other than the intended recipient(s) is strictly
  prohibited. No representation is made that this email or
  any attachments are free of viruses. Virus scanning is
  recommended and is the responsibility of the recipient.
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
  
  
  
 
 
  --
  *Don Vaillancourt
  Director of Software Development
  *
  *WEB IMPACT INC.*
  phone: 416-815-2000 ext. 245
  fax: 416-815-2001
  email: [EMAIL PROTECTED] mailto:[EMAIL PROTECTED]
  web: http://www.web-impact.com
 
 
 
  / This email message is intended only for the addressee(s)
  and contains information that may be confidential and/or
  copyright. If you are not the intended recipient please
  notify the sender by reply email and immediately delete
  this email. Use, disclosure or reproduction of this email
  by anyone other than the intended recipient(s) is strictly
  prohibited. No representation is made that this email or
  any attachments are free of viruses. Virus scanning is
  recommended and is the responsibility of the recipient.
  /
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance when computing computing a filter using hundreds of diff terms.

2004-08-06 Thread Paul Elschot
Kevin,


On Thursday 05 August 2004 23:32, Kevin A. Burton wrote:
 I'm trying to compute a filter to match documents in our index by a set
 of terms.

 For example some documents have a given field 'category' so I need to
 compute a filter with mulitple categories.

 The problem is that our category list is > 200 items so it takes about
 80 seconds to compute.  We cache it of course but this seems WAY too slow.

 Is there anything I could do to speed it up?  Maybe run the queries
 myself and then combine the bitsets?

That would be a first step.

 We're using a BooleanQuery with nested TermQueries to build up the
 filter...

I suppose that is a BooleanQuery with all terms optional?
Depending on the number of docs in the index and the distribution of
the categories over the classes that might lead to a lot of disk head
movements.

Recently some code was posted to compute a filter for date ranges.
For each date (ie. Term) in the range it would walk all documents and
set the corresponding bit in a bitset. You can use the same approach.
See IndexReader.termDocs(Term) for starters, and preferably iterate
over the categories (Terms) in sorted order.

A BooleanQuery would do much the same thing, but it has to work
in document order for all Term's at the same time, which can cause
extra disk seeks between the TermDocs.
You can avoid those disk seeks by iterating over the TermDocs yourself
and keeping the results in the bitset.

If you do this with sorted terms, ideally the disk head would move in
a single direction for the whole process. For maximum performance 
you might want to avoid searching other Query's or similar TermDoc
iterators at the same time. Also avoid retrieving documents
while this is going on, just keep that disk head moving only where you
want it to.

For further CPU speedup you can cache the TermDocs using the
read() method. Lucene's TermScorer does this, see 
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/TermScorer.java
and use 'view' on the latest revision. A bigger cache size than 32 would seem
appropriate for your case.
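
The per-category loop with the bulk read() call could look roughly like this
(sketch, untested; bits is the BitSet being filled and the buffer size is
arbitrary):

  int[] docs = new int[512];
  int[] freqs = new int[512];
  TermDocs termDocs = reader.termDocs(new Term("category", categoryValue));
  int n;
  while ((n = termDocs.read(docs, freqs)) > 0) {
      for (int i = 0; i < n; i++) {
          bits.set(docs[i]);
      }
  }
  termDocs.close();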

Could you evt. report the speedup? I guess you should be able
to bring it down to at most twenty seconds or so.

After that, replication over multiple disks might help, giving each of them
an interval of the sorted categories to search.

Good luck,
Paul







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Question on number of fields in a document.

2004-08-04 Thread Paul Elschot
On Wednesday 04 August 2004 18:22, John Z wrote:
 Hi

 I had a question related to number of fields in a document. Is there any
 limit to the number of fields you can have in an index.

 We have around 25-30 fields per document at present, about 6 are keywords, 
 Around 6 stored, but not indexed and rest of them are text, which is
 analyzed and indexed fields. We are planning on adding around 24 more
 fields , mostly keywords.

 Does anyone see any issues with this? Impact to search or index ?

During search one byte of RAM is needed per searched field per document
for the normalisation factors, even if a document field is empty.
This RAM is occupied the first time a field is searched after opening
an index reader.
Supposing your queries would actually search 50 fields before
closing the index reader, the norms would occupy 50 bytes/doc, or
1 GB per 20 million documents.

Regards,
Paul



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: pdfbox performance.

2004-07-28 Thread Paul Smith
The first thing that I would do is wrap the FileInputStream with a
BufferedInputStream.
Change: 
   FileInputStream reader = new FileInputStream(file);
To:
InputStream reader = new BufferedInputStream(new FileInputStream(file));
You get a significant boost reading in from a buffer, particularly as the
size of the file grows.
Try that first, and then rebenchmark.
Cheers
Paul Smith
 -Original Message-
 From: Miroslaw Milewski [mailto:[EMAIL PROTECTED]
 Sent: Thursday, July 29, 2004 7:24 AM
 To: [EMAIL PROTECTED]
 Subject: pdfbox performance.
 
 
   Hi,
 
   I have a serious performance problem while extracting text from pdf.
 
   Here is the code (w/o try/catch blocks):
 
   File file = new File(test.pdf);
   FileInputStream reader = new FileInputStream(file);
 
   PDFParser parser = new PDFParser(reader);
   parser.parse();
   PDDocument pdDoc = parser.getPDDocument();
 
   PDFTextStripper stripper = new PDFTextStripper();
   String pdftext = stripper.getText(pdDoc);
 
   pdDoc.close();
 
   Now, the whole process takes:
   - 37,4 sec w. a 74 kB file (parsing took 5,3 sec.)
   - 156,7 sec w. a 150 kB file (parsing: 11,0 sec.)
   - 157,8 sec w. a 270 kB file (parsing: 34,3 sec.)
   - 313,3 sec w. a 151 kB file (parsing: 5,9 sec.)
 
   Now, I can't really get the point here. Is this performance standard
 for pdfbox? Or is it my system (win2k, PIII 700, 512 RAM), or the code,
 or maybe the pdf docs (text only, the last one with some UML diags.)
 
   I am writing a knowledge base system at the moment, and planned to do
 real-time text extraction and indexing (using Lucene.) But this is not
 realistic, considering the extraction time.
   Then maybe it is a better idea to run the extraction and indexing once
 every 24 h, processing all the documents added during that period.
 
   TIA for any comments/suggestions.
 
 --
   Miroslaw Milewski
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Rebuild and corruption

2004-07-28 Thread Paul

Steve Rajavuori wrote:

 I have two questions.

 1) Can anyone recommend the best way to avoid any possibility of corruption
 in the case where an IndexWriter doesn't get closed properly? (It seems
 that termination during a merge operation is the most vulnerable point.)

 2) Is there any way to recover a corrupted index, other than rebuilding
 from scratch?


I am also extremely interested in any answers to these questions.

Cheers,
Paul.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Caching of TermDocs

2004-07-26 Thread Paul Elschot
On Monday 26 July 2004 21:41, John Patterson wrote:

 Is there any way to cache TermDocs?  Is this a good idea?

Lucene does this internally by buffering
up to 32 document numbers in advance for a query Term.
You can view the details here in case you're interested:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/TermScorer.java
It uses the TermDocs.read() method to fill a buffer of document numbers.

Is this what you had in mind?

Regards,
Paul


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


