Duplicate hits using ParallelMultiSearcher

2005-01-24 Thread Jason Polites
Hello all,
I am looking for a strategy to exclude duplicate entries when searching 
multiple indexes which may contain the same document.  I have an email 
system which archives and indexes emails on a per-recipient basis.  So, each 
email recipient has their own index.  In the case where the same email is 
delivered to more than one recipient, each recipient's index stores a record 
of effectively the same document.  Now, there is a requirement to perform a 
search across multiple indexes, for which I am using the 
ParallelMultiSearcher.  The problem is that this results in duplicate 
entries in the Hits returned.  I can easily transfer the results into some 
form of java.util.Set to guarantee uniqueness, however I have a problem with 
the length() of the Hits object returned.  Ideally I need a way of filtering 
the Hits based on a no-duplicates rule.  I am aware of the Filter object 
however the unique identifier of my document is a field within the lucene 
document itself (messageid); and I am reluctant to access this field using 
the public API for every Hit as I fear it will have drastic performance 
implications.

The ideal solution for me would be to specify a field during the search 
which is guaranteed to be unique across the Hits returned.  Anyone know of 
an elegant way to do this?  Alternatively is there a way I can de-dupe the 
list myself without loading every document?

Apologies for the length of this question.
P.S.  The separation of indexes per-recipient is a mandatory requirement. 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: keep indexes as files or save them in database

2005-01-24 Thread Miles Barr
On Sun, 2005-01-23 at 22:09 -0800, Otis Gospodnetic wrote:
 A number of people have tried putting Lucene indices in RDBMS.  As far
 as I know, all were slower than FSDirectory.

Do you know if the Berkeley DB back end also has a performance hit?



-- 
Miles Barr [EMAIL PROTECTED]
Runtime Collective Ltd.




Re: Duplicate hits using ParallelMultiSearcher

2005-01-24 Thread PA
On Jan 24, 2005, at 09:14, Jason Polites wrote:
I am aware of the Filter object however the unique identifier of my 
document is a field within the lucene document itself (messageid); and 
I am reluctant to access this field using the public API for every Hit 
as I fear it will have drastic performance implications.
Well... I don't see any way around that as you basically want to 
uniquely identify your messages based on their Message-ID.

That said, you don't need to do it during the search itself. You could 
simply perform your search as you do now and then create a set of 
unique messages while preserving Lucene Hits sort ordering for 
relevance purpose.
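A minimal sketch of that post-search de-duplication in plain Java. The message-id strings here stand in for values you would read from each hit's messageid field; a LinkedHashSet keeps only the first (highest-ranked) occurrence while preserving the original Hits ordering:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class HitDeduper {
    // Keeps the first occurrence of each message-id; LinkedHashSet
    // remembers insertion order, so relevance ranking is preserved.
    public static List dedupe(List messageIds) {
        return new ArrayList(new LinkedHashSet(messageIds));
    }

    public static void main(String[] args) {
        List ranked = Arrays.asList(new String[] {"id-1", "id-2", "id-1", "id-3"});
        System.out.println(dedupe(ranked)); // [id-1, id-2, id-3]
    }
}
```

Note this does not help with the total-count problem raised later in the thread: the deduped size can only be known after walking every hit.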

HTH.
Cheers
--
PA
http://alt.textdrive.com/


RE: Stemming

2005-01-24 Thread Kevin L. Cobb
Do stemming algorithms take into consideration abbreviations too? Some
examples:

mg = milligrams
US = United States
CD = compact disc
vcr = video cassette recorder

And, the next logical question, if stemming does not take care of
abbreviations, are there any solutions that include abbreviations inside
or outside of Lucene?

Thanks,

Kevin


-Original Message-
From: Chris Lamprecht [mailto:[EMAIL PROTECTED] 
Sent: Friday, January 21, 2005 5:51 PM
To: Lucene Users List
Subject: Re: Stemming

Also if you can't wait, see page 2 of
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html

or the LIA e-book ;)

On Fri, 21 Jan 2005 09:27:42 -0500, Kevin L. Cobb
[EMAIL PROTECTED] wrote:
 OK, OK ... I'll buy the book. I guess it's about time since I am deeply
 and forever in love with Lucene. Might as well take the final plunge.
 
 
 -Original Message-
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
 Sent: Friday, January 21, 2005 9:12 AM
 To: Lucene Users List
 Subject: Re: Stemming
 
 Hi Kevin,
 
 Stemming is an optional operation and is done in the analysis step.
 Lucene comes with a Porter stemmer and a Filter that you can use in an
 Analyzer:
 
 ./src/java/org/apache/lucene/analysis/PorterStemFilter.java
 ./src/java/org/apache/lucene/analysis/PorterStemmer.java
 
 You can find more about it here:
 http://www.lucenebook.com/search?query=stemming
 You can also see mentions of SnowballAnalyzer in those search results,
 and you can find an adapter for SnowballAnalyzers in Lucene Sandbox.
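 For reference, an analyzer wired up that way might look like the following
 (a sketch against the Lucene 1.4-era analysis API; the class name and
 filter ordering are illustrative, not from the original message):

 ```java
 import java.io.Reader;
 import org.apache.lucene.analysis.*;
 import org.apache.lucene.analysis.standard.StandardFilter;
 import org.apache.lucene.analysis.standard.StandardTokenizer;

 public class StemmingAnalyzer extends Analyzer {
     public TokenStream tokenStream(String fieldName, Reader reader) {
         TokenStream stream = new StandardTokenizer(reader);
         stream = new StandardFilter(stream);
         stream = new LowerCaseFilter(stream);
         stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
         return new PorterStemFilter(stream);   // Porter stemming applied last
     }
 }
 ```

 The same analyzer must be used at both index and query time, or stemmed
 and unstemmed terms will not match each other.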
 
 Otis
 
 --- Kevin L. Cobb [EMAIL PROTECTED] wrote:
 
  I want to understand how Lucene uses stemming but can't find any
  documentation on the Lucene site. I'll continue to google but hope
  that
  this list can help narrow my search. I have several questions on the
  subject currently but hesitate to list them here since finding a
good
  document on the subject may answer most of them.
 
 
 
  Thanks in advance for any pointers,
 
 
 
  Kevin
 
 
 
 
 
 
 





Re: Stemming

2005-01-24 Thread Erik Hatcher
On Jan 24, 2005, at 7:24 AM, Kevin L. Cobb wrote:
Do stemming algorithms take into consideration abbreviations too?
No, they don't.  Adding abbreviations, aliases, synonyms, etc. is not 
stemming.

And, the next logical question, if stemming does not take care of
abbreviations, are there any solutions that include abbreviations 
inside
or outside of Lucene?
Nothing built into Lucene does this, but the infrastructure allows it 
to be added in the form of a custom analysis step.  There are two basic 
approaches, adding aliases at indexing time, or adding them at query 
time by expanding the query.  I created some example analyzers in 
Lucene in Action (grab the source code from the site linked below) that 
demonstrate how this can be done using WordNet (and mock) synonym 
lookup.  You could extrapolate this into looking up abbreviations and 
adding them into the token stream.
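One shape such a filter could take, sketched against the 1.4-era TokenStream API. This is not the book's WordNet code; the in-memory map is a stand-in for a real abbreviation source, and the expansion is injected at the same position as the original token so either form matches:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class AbbreviationFilter extends TokenFilter {
    private final Map expansions = new HashMap();     // abbreviation -> expansion
    private final LinkedList pending = new LinkedList();

    public AbbreviationFilter(TokenStream in) {
        super(in);
        expansions.put("mg", "milligrams");           // illustrative entries
        expansions.put("cd", "disc");
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            return (Token) pending.removeFirst();     // emit queued expansion
        }
        Token token = input.next();
        if (token == null) return null;
        String expansion = (String) expansions.get(token.termText());
        if (expansion != null) {
            Token alias = new Token(expansion, token.startOffset(), token.endOffset());
            alias.setPositionIncrement(0);            // same position as the abbreviation
            pending.add(alias);
        }
        return token;
    }
}
```

Multi-word expansions would need to be tokenized and queued word by word; the single-token case above shows only the position-increment trick.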

http://www.lucenebook.com/search?query=synonyms
Erik


RE: Filtering w/ Multiple Terms

2005-01-24 Thread Jerry Jalenak
I spent some time reading the Lucene in Action book this weekend (great job,
btw), and came across the section on using custom filters.  Since the data
that I need to use to filter my hit set with comes from a database, I
thought it would be worth my effort this morning to write a custom filter
that would handle the filtering for me.  So, using the example from the book
(page 210), I've coded an AccountFilter:

public class AccountFilter extends Filter
{
    public AccountFilter()
    {}

    public BitSet bits(IndexReader indexReader)
        throws IOException
    {
        System.out.println("Entering AccountFilter...");
        BitSet bitSet = new BitSet(indexReader.maxDoc());

        String[] reportingAccounts = new String[] {"0011", "4kfs"};

        int[] docs = new int[1];
        int[] freqs = new int[1];

        for (int i = 0; i < reportingAccounts.length; i++)
        {
            String reportingAccount = reportingAccounts[i];
            if (reportingAccount != null)
            {
                TermDocs termDocs = indexReader.termDocs(new Term("account", reportingAccount));
                int count = termDocs.read(docs, freqs);
                if (count == 1)
                {
                    System.out.println("Setting bit on");
                    bitSet.set(docs[0]);
                }
            }
        }
        System.out.println("Leaving AccountFilter...");
        return bitSet;
    }
}

I see where the AccountFilter is setting the corresponding 'bits', but I end
up without any 'hits':

Entering AccountFilter...
Entering AccountFilter...
Entering AccountFilter...
Setting bit on
Setting bit on
Setting bit on
Setting bit on
Setting bit on
Leaving AccountFilter...
Leaving AccountFilter...
Leaving AccountFilter...
... Found 0 matching documents in 1000 ms

Can anyone tell me what I've done wrong?

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


 -Original Message-
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
 Sent: Friday, January 21, 2005 8:15 AM
 To: Lucene Users List
 Subject: RE: Filtering w/ Multiple Terms
 
 
 This:
 http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.TooManyClauses.html
 ?
 
 You can control that limit via
 http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.html#maxClauseCount
 
 Otis
 
 
 --- Jerry Jalenak [EMAIL PROTECTED] wrote:
 
  OK.  But isn't there a limit on the number of 
 BooleanQueries that can
  be
  combined with AND / OR / etc?
  
  
  
  Jerry Jalenak
  Senior Programmer / Analyst, Web Publishing
  LabOne, Inc.
  10101 Renner Blvd.
  Lenexa, KS  66219
  (913) 577-1496
  
  [EMAIL PROTECTED]
  
  
   -Original Message-
   From: Erik Hatcher [mailto:[EMAIL PROTECTED]
   Sent: Thursday, January 20, 2005 5:05 PM
   To: Lucene Users List
   Subject: Re: Filtering w/ Multiple Terms
   
   
   
   On Jan 20, 2005, at 5:02 PM, Jerry Jalenak wrote:
   
In looking at the examples for filtering of hits, it looks 
   like I can 
only
specify a single term; i.e.
   
Filter f = new QueryFilter(new TermQuery(new Term("acct", "acct1")));
   
I need to specify more than one term in my filter.  Short of
  using 
something
like ChainFilter, how are others handling this?
   
   You can make as complex of a Query as you want for 
   QueryFilter.  If you 
   want to filter on multiple terms, construct a BooleanQuery 
   with nested 
   TermQuery's, either in an AND or OR fashion.
   
 Erik
   
   
  
  
   
   
  
  This transmission (and any information attached to it) may be
  confidential and
  is intended solely for the use of the individual or entity to which
  it is
  addressed. If you are not the intended recipient or the person
  responsible for
  delivering the transmission to the intended recipient, be advised
  that you
  have received this transmission in error and that any use,
  dissemination,
  forwarding, printing, or copying of this information is strictly
  prohibited.
  If you have received this transmission in error, please immediately
  notify
  LabOne at the following email address:
  [EMAIL PROTECTED]
  
  
  
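Erik's quoted suggestion, spelled out as a sketch against the 1.4-era API (where BooleanQuery.add takes required/prohibited booleans); the field name and account values are illustrative:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class AccountQueryFilterExample {
    // OR several account terms together and use the resulting
    // BooleanQuery as a QueryFilter for the real search.
    public static Hits search(Searcher searcher, Query query) throws java.io.IOException {
        BooleanQuery accounts = new BooleanQuery();
        accounts.add(new TermQuery(new Term("account", "0011")), false, false); // optional clause
        accounts.add(new TermQuery(new Term("account", "4kfs")), false, false); // optional clause
        return searcher.search(query, new QueryFilter(accounts));
    }
}
```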

Re: keep indexes as files or save them in database

2005-01-24 Thread Andi Vajda

On Sun, 2005-01-23 at 22:09 -0800, Otis Gospodnetic wrote:
A number of people have tried putting Lucene indices in RDBMS.  As far
as I know, all were slower than FSDirectory.
Do you know if the Berkeley DB back end also has a performance hit?
Try it, it all depends on how you configure it. And that depends on your 
needs. I posted examples to the list last week.

Andi..


Re: Filtering w/ Multiple Terms

2005-01-24 Thread Paul Elschot
Jerry,

On Monday 24 January 2005 18:26, Jerry Jalenak wrote:
 I spent some time reading the Lucene in Action book this weekend (great job,
 btw), and came across the section on using custom filters.  Since the data
 that I need to use to filter my hit set with comes from a database, I
 thought it would be worth my effort this morning to write a custom filter
 that would handle the filtering for me.  So, using the example from the book
 (page 210), I've coded an AccountFilter:
 
 public class AccountFilter extends Filter
 {
   public AccountFilter()
   {}
 
   public BitSet bits(IndexReader indexReader)
   throws IOException
   {
   System.out.println("Entering AccountFilter...");
   BitSet bitSet = new BitSet(indexReader.maxDoc());
 
   String[] reportingAccounts = new String[] {"0011", "4kfs"};
 
   int[] docs = new int[1];
   int[] freqs = new int[1];
 
   for (int i = 0; i < reportingAccounts.length; i++)
   {
   String reportingAccount = reportingAccounts[i];
   if (reportingAccount != null)
   {
   TermDocs termDocs = indexReader.termDocs(new Term("account", reportingAccount));
   int count = termDocs.read(docs, freqs);
   if (count == 1)

Unless account is a primary key field, it's better to loop over the TermDocs.

   {
   System.out.println("Setting bit on");
   bitSet.set(docs[0]);
   }
   }
   }
   System.out.println("Leaving AccountFilter...");
   return bitSet;
   }
 }
 
 I see where the AccountFilter is setting the corresponding 'bits', but I end
 up without any 'hits':
 
 Entering AccountFilter...
 Entering AccountFilter...
 Entering AccountFilter...
 Setting bit on
 Setting bit on
 Setting bit on
 Setting bit on
 Setting bit on
 Leaving AccountFilter...
 Leaving AccountFilter...
 Leaving AccountFilter...

I don't see any recursion in your code, but this output
suggests nesting three deep. Something does not add up here.

 ... Found 0 matching documents in 1000 ms
 
 Can anyone tell me what I've done wrong?

Maybe all query hits were filtered out?
Could you compare the docnrs in the bits of the filter with the
unfiltered query hits docnrs?

Regards,
Paul Elschot





Re: Filtering w/ Multiple Terms

2005-01-24 Thread Erik Hatcher
On Jan 24, 2005, at 12:26 PM, Jerry Jalenak wrote:
I spent some time reading the Lucene in Action book this weekend 
(great job,
btw)
Thanks!
public class AccountFilter extends Filter
I see where the AccountFilter is setting the corresponding 'bits', but 
I end
up without any 'hits':

Entering AccountFilter...
Entering AccountFilter...
Entering AccountFilter...
Setting bit on
Setting bit on
Setting bit on
Setting bit on
Setting bit on
Leaving AccountFilter...
Leaving AccountFilter...
Leaving AccountFilter...
... Found 0 matching documents in 1000 ms
Can anyone tell me what I've done wrong?
A filter constrains which documents will be consulted during a search, 
but the Query needs to match some documents that are turned on by the 
filter bits.  I'm guessing that your Query did not match any of the 
documents you turned on.

Erik


RE: Filtering w/ Multiple Terms

2005-01-24 Thread Jerry Jalenak
Paul / Erik - 

I'm using the ParallelMultiSearcher to search three indexes concurrently -
hence the three entries into AccountFilter.  If I remove the filter from my
query, and simply enter the query on the command line, I get two hits back.
In other words, I can enter this:

smith AND (account:0011)

and get hits back.  When I add the filter back in (which should take care of
the account:0011 part of the query), and enter only smith as my query, I get
0 hits.



Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED]
 Sent: Monday, January 24, 2005 1:07 PM
 To: Lucene Users List
 Subject: Re: Filtering w/ Multiple Terms
 
 
 
 On Jan 24, 2005, at 12:26 PM, Jerry Jalenak wrote:
  I spent some time reading the Lucene in Action book this weekend 
  (great job,
  btw)
 
 Thanks!
 
  public class AccountFilter extends Filter
 I see where the AccountFilter is setting the corresponding 
 'bits', but 
  I end
  up without any 'hits':
 
  Entering AccountFilter...
  Entering AccountFilter...
  Entering AccountFilter...
  Setting bit on
  Setting bit on
  Setting bit on
  Setting bit on
  Setting bit on
  Leaving AccountFilter...
  Leaving AccountFilter...
  Leaving AccountFilter...
  ... Found 0 matching documents in 1000 ms
 
  Can anyone tell me what I've done wrong?
 
 A filter constrains which documents will be consulted during 
 a search, 
 but the Query needs to match some documents that are turned on by the 
 filter bits.  I'm guessing that your Query did not match any of the 
 documents you turned on.
 
   Erik
 
 
 
 






Re: Filtering w/ Multiple Terms

2005-01-24 Thread Erik Hatcher
As Paul suggested, output the Lucene document numbers from your Hits, 
and also output which bit you're setting in your filter.  Do those sets 
overlap?

Erik
On Jan 24, 2005, at 2:13 PM, Jerry Jalenak wrote:
Paul / Erik -
I'm using the ParallelMultiSearcher to search three indexes concurrently 
-
hence the three entries into AccountFilter.  If I remove the filter 
from my
query, and simply enter the query on the command line, I get two hits 
back.
In other words, I can enter this:

smith AND (account:0011)
and get hits back.  When I add the filter back in (which should take 
care of
the account:0011 part of the query), and enter only smith as my query, 
I get
0 hits.


Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496
[EMAIL PROTECTED]

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, January 24, 2005 1:07 PM
To: Lucene Users List
Subject: Re: Filtering w/ Multiple Terms

On Jan 24, 2005, at 12:26 PM, Jerry Jalenak wrote:
I spent some time reading the Lucene in Action book this weekend
(great job,
btw)
Thanks!
public class AccountFilter extends Filter
I see where the AccountFilter is setting the corresponding
'bits', but
I end
up without any 'hits':
Entering AccountFilter...
Entering AccountFilter...
Entering AccountFilter...
Setting bit on
Setting bit on
Setting bit on
Setting bit on
Setting bit on
Leaving AccountFilter...
Leaving AccountFilter...
Leaving AccountFilter...
... Found 0 matching documents in 1000 ms
Can anyone tell me what I've done wrong?
A filter constrains which documents will be consulted during
a search,
but the Query needs to match some documents that are turned on by the
filter bits.  I'm guessing that your Query did not match any of the
documents you turned on.
Erik




Re: WordNet code updated, now with query expansion -- Re: SYNONYM + GOOGLE

2005-01-24 Thread David Spencer
Pierrick Brihaye wrote:
Hi,
David Spencer wrote:
One example of expansion with the synonym boost set to 0.9 is the 
query "big dog", which expands to:

Interesting.
Do you plan to add expansion on other WordNet relationships? Hypernyms 
and hyponyms would be a good starting point for thesaurus-like search, 
wouldn't they?
Good point, I hadn't considered this - but how would it work? Just 
treat these 2 relationships as "synonyms" (thus easier to use), or keep 
them separate (too academic?)
However, I'm afraid that this kind of feature would require refactoring, 
probably based on WordNet-dedicated libraries. JWNL 
(http://jwordnet.sourceforge.net/) may be a good candidate for this.
Good point, should leverage existing code.

Thank you for your work.
thx,
 Dave
Cheers,



RE: Filtering w/ Multiple Terms

2005-01-24 Thread Jerry Jalenak
<sheepish-look-on-face/>

After re-reading the book (again), and the javadocs (again), it dawned on my
little brain that I needed doc and freq arrays *the size of maxDoc()* for the
index reader.  I also needed to iterate through the docs array and call
bitSet.set for each valid entry in docs.  Everything is good now.
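For the archives, the simpler fix Paul suggested earlier in the thread, iterating the TermDocs enumeration instead of sizing arrays to maxDoc, would look roughly like this (a sketch against the 1.4-era API; account values are illustrative):

```java
import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

public class AccountFilter extends Filter {
    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bitSet = new BitSet(reader.maxDoc());
        String[] reportingAccounts = new String[] {"0011", "4kfs"};
        for (int i = 0; i < reportingAccounts.length; i++) {
            TermDocs termDocs = reader.termDocs(new Term("account", reportingAccounts[i]));
            try {
                while (termDocs.next()) {       // one bit per matching document
                    bitSet.set(termDocs.doc());
                }
            } finally {
                termDocs.close();
            }
        }
        return bitSet;
    }
}
```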

Thanks!

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED]
 Sent: Monday, January 24, 2005 1:27 PM
 To: Lucene Users List
 Subject: Re: Filtering w/ Multiple Terms
 
 
 As Paul suggested, output the Lucene document numbers from your Hits, 
 and also output which bit you're setting in your filter.  Do 
 those sets 
 overlap?
 
   Erik
 
 On Jan 24, 2005, at 2:13 PM, Jerry Jalenak wrote:
 
  Paul / Erik -
 
  I'm use the ParallelMultiSearcher to search three indexes 
 concurrently 
  -
  hence the three entries into AccountFilter.  If I remove the filter 
  from my
  query, and simply enter the query on the command line, I 
 get two hits 
  back.
  In other words, I can enter this:
 
  smith AND (account:0011)
 
  and get hits back.  When I add the filter back in (which 
 should take 
  care of
  the account:0011 part of the query), and enter only smith 
 as my query, 
  I get
  0 hits.
 
 
 
  Jerry Jalenak
  Senior Programmer / Analyst, Web Publishing
  LabOne, Inc.
  10101 Renner Blvd.
  Lenexa, KS  66219
  (913) 577-1496
 
  [EMAIL PROTECTED]
 
 
  -Original Message-
  From: Erik Hatcher [mailto:[EMAIL PROTECTED]
  Sent: Monday, January 24, 2005 1:07 PM
  To: Lucene Users List
  Subject: Re: Filtering w/ Multiple Terms
 
 
 
  On Jan 24, 2005, at 12:26 PM, Jerry Jalenak wrote:
  I spent some time reading the Lucene in Action book this weekend
  (great job,
  btw)
 
  Thanks!
 
  public class AccountFilter extends Filter
  I see where the AccountFilter is setting the cooresponding
  'bits', but
  I end
  up without any 'hits':
 
  Entering AccountFilter...
  Entering AccountFilter...
  Entering AccountFilter...
  Setting bit on
  Setting bit on
  Setting bit on
  Setting bit on
  Setting bit on
  Leaving AccountFilter...
  Leaving AccountFilter...
  Leaving AccountFilter...
  ... Found 0 matching documents in 1000 ms
 
  Can anyone tell me what I've done wrong?
 
  A filter constrains which documents will be consulted during
  a search,
  but the Query needs to match some documents that are 
 turned on by the
  filter bits.  I'm guessing that your Query did not match any of the
  documents you turned on.
 
 Erik
 
 
  
 
 
 
 
 
  
 
 






Re: Duplicate hits using ParallelMultiSearcher

2005-01-24 Thread Jason Polites
Agreed on the set of unique messages, however the problem I have is with 
the count of the Hits.  The Hits object may contain 100 results (for 
example), of which only 90 are unique.  Because I am paging through results 
10 at a time, I need to know the total count without loading each document. 
If I get a count of 100 but a Collection of only 90, my paging breaks.

After careful consideration I have decided that the better approach is to 
create a separate "global" index in which all messages are stored.  This 
will not only relieve my duplication issue but should also scale better 
if/when there are several hundred or several thousand distinct indexes.

Thanks,
- JP
- Original Message - 
From: PA [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Monday, January 24, 2005 10:43 PM
Subject: Re: Duplicate hits using ParallelMultiSearcher


On Jan 24, 2005, at 09:14, Jason Polites wrote:
I am aware of the Filter object however the unique identifier of my 
document is a field within the lucene document itself (messageid); and I 
am reluctant to access this field using the public API for every Hit as I 
fear it will have drastic performance implications.
Well... I don't see any way around that as you basically want to uniquely 
identify your messages based on their Message-ID.

That said, you don't need to do it during the search itself. You could 
simply perform your search as you do now and then create a set of unique 
messages while preserving Lucene Hits sort ordering for relevance 
purpose.

HTH.
Cheers
--
PA
http://alt.textdrive.com/


Sort Performance Problems across large dataset

2005-01-24 Thread Peter Hollas
I am working on a publicly accessible Struts-based species database project 
where the number of species names is currently at 2.3 million, and in the 
near future will be somewhere nearer 4 million (probably the largest there 
is). The species names are typically 1 to 7 words in length, and the broad 
requirement is to be able to do a fulltext search across them. It is also 
necessary to sort the results into alphabetical order by species name.

Currently we can issue a simple search query and expect a response back in 
about 0.2 seconds (~3,000 results) with the Lucene index that we have built. 
Lucene gives a much more predictable and faster average query time than 
using standard fulltext indexing with MySQL. This however returns results in 
score order, and not alphabetically.

To sort the resultset into alphabetical order, we added the species names as 
a separate keyword field, and sorted using it whilst querying. This solution 
works fine, but is unacceptable since a query that returns thousands of 
results can take upwards of 30 seconds to sort them.

My question is whether it is possible to somehow return the names in 
alphabetical order without using a String SortField. My last resort will be 
to perform a monthly index rebuild, and return results by index order (about 
a day to re-index!). But ideally there might be a way to modify the Lucene 
API to incorporate a scoring system in a way that scores by lexical order.

Any ideas are appreciated!
Many thanks, Peter.



Re: Sort Performance Problems across large dataset

2005-01-24 Thread Stefan Groschupf
Hi,
Do you optimize the index?
Have you tried implementing your own hit collector?
Stefan
Am 25.01.2005 um 01:01 schrieb Peter Hollas:
I am working on a public accessible Struts based species database 
project where the number of species names is currently at 2.3 million, 
and in the near future will be somewhere nearer 4 million (probably 
the largest there is). The species names are typically 1 to 7 words in 
length, and the broad requirement is to be able to do a fulltext 
search across them. It is also necessary to sort the results into 
alphabetical order by species name.

Currently we can issue a simple search query and expect a response 
back in about 0.2 seconds (~3,000 results) with the Lucene index that 
we have built. Lucene gives a much more predictable and faster average 
query time than using standard fulltext indexing with mySQL. This 
however returns results in score order, and not alphabetically.

To sort the resultset into alphabetical order, we added the species 
names as a separate keyword field, and sorted using it whilst 
querying. This solution works fine, but is unacceptable since a query 
that returns thousands of results can take upwards of 30 seconds to 
sort them.

My question is whether it is possible to somehow return the names in 
alphabetical order without using a String SortField. My last resort 
will be to perform a monthly index rebuild, and return results by 
index order (about a day to re-index!). But ideally there might be a 
way to modify the Lucene API to incorporate a scoring system in a way 
that scores by lexical order.

Any ideas are appreciated!
Many thanks, Peter.


---
company:http://www.media-style.com
forum:  http://www.text-mining.org
blog:   http://www.find23.net
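One way to act on the hit-collector suggestion: keep only as many documents as the displayed pages need, ordered by a cached per-document name, instead of asking Lucene to sort the full result set. A sketch assuming the 1.4-era HitCollector and FieldCache APIs; the field name "name" and the page budget are illustrative:

```java
import java.util.Comparator;
import java.util.TreeSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class AlphabeticalTopDocs {
    // Returns up to "needed" doc ids in alphabetical order of the cached
    // "name" field; only O(needed) entries are held during collection.
    public static Integer[] top(Searcher searcher, IndexReader reader,
                                Query query, final int needed) throws java.io.IOException {
        final String[] names = FieldCache.DEFAULT.getStrings(reader, "name");
        final TreeSet top = new TreeSet(new Comparator() {
            public int compare(Object a, Object b) {
                int ai = ((Integer) a).intValue(), bi = ((Integer) b).intValue();
                int c = names[ai].compareTo(names[bi]);
                return c != 0 ? c : ai - bi;          // break ties by doc id
            }
        });
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                top.add(new Integer(doc));
                if (top.size() > needed) {
                    top.remove(top.last());           // drop the name sorting last
                }
            }
        });
        return (Integer[]) top.toArray(new Integer[top.size()]);
    }
}
```

The first search after opening a reader still pays the cost of populating the string cache; subsequent searches avoid both that cost and the full sort.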


Re: Sort Performance Problems across large dataset

2005-01-24 Thread Xiaohong Yang \(Sharon\)
Hi Peter,
I just got on the list a few hours ago.  I am still reading the source code.  I 
am not going to send this to the list.
 
I would like to know about the 0.2 sec query time for 2 million fields: should it 
display only the first page (100 or so), not the whole 3,000 found?  It is very 
fast, I agree.  
 
If the alphabetic index displays only a link, not the content, then it should 
not be very slow, since you only need to sort part of what a user needs.  Maybe 
display only the first "A" page, as with the regular scored results.  
Just my thought; it might not work for you.
 
Do you store the Lucene index in the database or in a text file?
 
Best,
Sharon
LangPower Computing, Inc.
http://www.indexingonline.com

Peter Hollas [EMAIL PROTECTED] wrote:
I am working on a public accessible Struts based species database project 
where the number of species names is currently at 2.3 million, and in the 
near future will be somewhere nearer 4 million (probably the largest there 
is). The species names are typically 1 to 7 words in length, and the broad 
requirement is to be able to do a fulltext search across them. It is also 
necessary to sort the results into alphabetical order by species name.

Currently we can issue a simple search query and expect a response back in 
about 0.2 seconds (~3,000 results) with the Lucene index that we have built. 
Lucene gives a much more predictable and faster average query time than 
using standard fulltext indexing with mySQL. This, however, returns results in 
score order, and not alphabetically.

To sort the resultset into alphabetical order, we added the species names as 
a separate keyword field, and sorted using it whilst querying. This solution 
works fine, but is unacceptable since a query that returns thousands of 
results can take upwards of 30 seconds to sort them.

My question is whether it is possible to somehow return the names in 
alphabetical order without using a String SortField. My last resort will be 
to perform a monthly index rebuild, and return results by index order (about 
a day to re-index!). But ideally there might be a way to modify the Lucene 
API to incorporate a scoring system in a way that scores by lexical order.

Any ideas are appreciated!

Many thanks, Peter.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Sort Performance Problems across large dataset

2005-01-24 Thread Erik Hatcher
On Jan 24, 2005, at 7:01 PM, Peter Hollas wrote:
I am working on a publicly accessible Struts-based
Well there's the problem right there :))
(just kidding)
To sort the resultset into alphabetical order, we added the species 
names as a separate keyword field, and sorted using it whilst 
querying. This solution works fine, but is unacceptable since a query 
that returns thousands of results can take upwards of 30 seconds to 
sort them.
30 seconds... wow.
My question is whether it is possible to somehow return the names in 
alphabetical order without using a String SortField. My last resort 
will be to perform a monthly index rebuild, and return results by 
index order (about a day to re-index!). But ideally there might be a 
way to modify the Lucene API to incorporate a scoring system in a way 
that scores by lexical order.
What about assigning a numeric value field for each document with the 
number indicating the alphabetical ordering?  Off the top of my head, 
I'm not sure how this could be done, but perhaps some clever hashing 
algorithm could do this?  Or consider each character position one digit 
in base 26 (or 27 to include a space) and construct a number from 
that?  (though that would be an enormous number and probably too large) 
- sorry my off-the-cuff estimating skills are not what they should be.

Certainly sorting by a numeric value is far less resource intensive 
than by String - so perhaps that is worth a try?  At the very least, 
give each document a random number and try sorting by that field (the 
value of the field can be Integer.toString()) to see how it compares 
performance-wise.
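Erik's ordinal idea can be sketched without touching Lucene at all: precompute an integer rank for every species name in one alphabetical pass, store that rank in a field, and sort by the int instead of the String. A minimal stand-alone sketch (the class and method names here are hypothetical, not Lucene API; the Lucene side would store the rank and sort with an integer SortField):

```java
import java.util.*;

public class OrdinalRank {
    // Assign each distinct name an integer equal to its alphabetical rank.
    // Stored as an indexed field, this lets the sort compare 4-byte ints
    // instead of caching and comparing full String terms.
    static Map<String, Integer> buildOrdinals(Collection<String> names) {
        // TreeSet gives us the distinct names already in sorted order.
        List<String> sorted = new ArrayList<String>(new TreeSet<String>(names));
        Map<String, Integer> ordinals = new HashMap<String, Integer>();
        for (int i = 0; i < sorted.size(); i++) {
            ordinals.put(sorted.get(i), Integer.valueOf(i));
        }
        return ordinals;
    }

    public static void main(String[] args) {
        Map<String, Integer> o = buildOrdinals(Arrays.asList(
            "Panthera leo", "Aquila chrysaetos", "Ursus arctos"));
        // Aquila gets rank 0, Panthera 1, Ursus 2
        System.out.println(o.get("Aquila chrysaetos") + " " + o.get("Ursus arctos"));
    }
}
```

The obvious trade-off is that adding a new name can shift every ordinal after it, so the ranks need rebuilding (or gaps left between them) when the name list grows.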

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Sort Performance Problems across large dataset

2005-01-24 Thread Matt Quail
Peter,
Currently we can issue a simple search query and expect a response back 
in about 0.2 seconds (~3,000 results) 
You may want to try something like the following (I do this in FishEye; 
it seems performant for moderately large field-spaces).

Use a custom HitCollector, and store all the matching doc-ids in a 
java.util.BitSet. This will still give you your 0.2-second performance.

Then, use a TermDocs iterator to visit each term in your species name 
field, printing out (or whatever) each species name if it contains a 
docid in your bitset. Something like this pseudocode:

BitSet docs = doSearch(query); // 0.2 seconds
TermEnum te = reader.terms(new Term("species-name", ""));
TermDocs td = reader.termDocs();
Term t = te.term();
while (t != null && t.field().equals("species-name")) {
  td.seek(te);
  while (td.next()) {
    int docid = td.doc();
    if (docs.get(docid)) {
      System.out.println("match: " + docid);
      break; // this term matched; try the next term
    }
  }
  if (!te.next()) {
    break;
  }
  t = te.term();
}
te.close();
td.close();
Now, with 2.3 million (or 4 million!) species names, I'm not sure how 
fast it will be to iterate through all the species-name termdocs. But 
I would be interested to find out; if you give this code a try, could 
you report back your results?
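The key property Matt is relying on is that Lucene's term dictionary is already sorted, so walking it and keeping only terms whose postings intersect the hit set yields the matching names pre-alphabetized, with no sort step at all. A self-contained model of that idea (a TreeMap stands in for TermEnum/TermDocs, a BitSet for the collected hits; names here are illustrative):

```java
import java.util.*;

public class TermWalk {
    // Walk the (sorted) term dictionary in order; a term survives if its
    // postings intersect the query's hit set. The output list is therefore
    // already in alphabetical order -- no separate sort over the results.
    static List<String> matchesInOrder(SortedMap<String, BitSet> termDocs, BitSet hits) {
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, BitSet> e : termDocs.entrySet()) {
            if (e.getValue().intersects(hits)) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    static BitSet bits(int... ids) {
        BitSet b = new BitSet();
        for (int id : ids) b.set(id);
        return b;
    }

    public static void main(String[] args) {
        SortedMap<String, BitSet> dict = new TreeMap<String, BitSet>();
        dict.put("Ursus arctos", bits(3));
        dict.put("Aquila chrysaetos", bits(1));
        dict.put("Panthera leo", bits(7));
        BitSet hits = bits(1, 3); // docs matched by the query
        System.out.println(matchesInOrder(dict, hits)); // alphabetical, unsorted
    }
}
```

The cost is proportional to the number of distinct terms in the field, which is exactly the open question with 2.3 to 4 million species names.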

=Matt
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


LUCENE + EXCEPTION

2005-01-24 Thread Karthik N S




Hi Guys,

Apologies..

On STANDALONE usage of UPDATION/DELETION/ADDITION of documents into the
merger index, my code runs PERFECTLY without any problems.

But when the same code is plugged into a WEBAPP on TOMCAT with a servlet
running in SINGLE THREAD MODE, I frequently get the error below:

java.io.IOException: read past EOF
        at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:218)
        at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:356)
        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:323)
        at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:64)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
        at org.apache.lucene.search.Hits.init(Hits.java:43)
        at org.apache.lucene.search.Searcher.search(Searcher.java:33)
        at org.apache.lucene.search.Searcher.search(Searcher.java:27)

Somebody please tell me why this is happening.

O/S    = Gentoo
JAVA   = JDK 1.4.2
WEBAPP = TOMCAT
Lucene = 1.4.3

Thx in advance
Karthik

WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK ]


Re: LUCENE + EXCEPTION

2005-01-24 Thread Chris Lamprecht
Hi Karthik,

If you are talking about SingleThreadModel (i.e. your servlet
implements javax.servlet.SingleThreadModel), this does not guarantee
that two different instances of your servlet won't be run at the same
time.  It only guarantees that each instance of your servlet will only
be run by one thread at a time.  See:

http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/servlet/SingleThreadModel.html

If you are accessing a shared resource (a lucene index), you'll have
to prevent concurrent modifications somehow other than
SingleThreadModel.
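One way to do that, sketched below, is to funnel every index modification through a single JVM-wide monitor, so it does not matter how many servlet instances or threads Tomcat creates. This is a hypothetical helper (the class name `IndexWriteGate` and the `Runnable` hand-off are my invention, not Lucene or Servlet API); the actual IndexWriter open/add/close calls would go inside the locked region:

```java
// Sketch: serialize all index writes through one JVM-wide lock,
// independently of SingleThreadModel or servlet instance count.
public class IndexWriteGate {
    private static final Object WRITE_LOCK = new Object();
    private static int updates = 0;

    public static void withWriteLock(Runnable indexWork) {
        synchronized (WRITE_LOCK) {
            indexWork.run(); // e.g. open IndexWriter, add/delete docs, close
            updates++;
        }
    }

    // Visible for sanity checks: how many writes have been serialized.
    public static int updateCount() { return updates; }

    public static void main(String[] args) {
        withWriteLock(new Runnable() {
            public void run() { /* index update would happen here */ }
        });
        System.out.println("updates so far: " + updateCount());
    }
}
```

A read-past-EOF like the one above is also a classic symptom of a writer modifying segment files while a reader opened against the old segments is still searching, so serializing writes alone may not be enough; reopening searchers after each update matters too.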

I think they've finally deprecated SingleThreadModel in the latest
(maybe not even out yet) servlet spec.

-chris

 
 On STANDALONE usage of UPDATION/DELETION/ADDITION of documents into the
 merger index, my code runs PERFECTLY without any problems.

 But when the same code is plugged into a WEBAPP on TOMCAT with a servlet
 running in SINGLE THREAD MODE, I frequently get the error below

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: LUCENE + EXCEPTION

2005-01-24 Thread Karthik N S
Hi

  OK, I still have the exception. Even if I keep the servlet to a single
instance [maybe by an authentication process], and I have made sure that
Lucene's merger indexing is controlled by a single initiation...

  But even without any shared resources, the exception still pops up
frequently:

   java.io.IOException: read past EOF
        at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:218)
        at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:356)
        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:323)
        at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:64)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
        at org.apache.lucene.search.Hits.init(Hits.java:43)
        at org.apache.lucene.search.Searcher.search(Searcher.java:33)
        at org.apache.lucene.search.Searcher.search(Searcher.java:27)

  Please help me.

 [I could not find any solution on the Lucene forum for this; maybe I am the
only one with the issue.]

Karthik

-----Original Message-----
From: Chris Lamprecht [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 25, 2005 9:48 AM
To: Lucene Users List
Subject: Re: LUCENE + EXCEPTION


Hi Karthik,

If you are talking about SingleThreadModel (i.e. your servlet
implements javax.servlet.SingleThreadModel), this does not guarantee
that two different instances of your servlet won't be run at the same
time.  It only guarantees that each instance of your servlet will only
be run by one thread at a time.  See:

http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/servlet/SingleThreadModel.html

If you are accessing a shared resource (a lucene index), you'll have
to prevent concurrent modifications somehow other than
SingleThreadModel.

I think they've finally deprecated SingleThreadModel in the latest
(maybe not even out yet) servlet spec.

-chris


 On STANDALONE usage of UPDATION/DELETION/ADDITION of documents into the
 merger index, my code runs PERFECTLY without any problems.

 But when the same code is plugged into a WEBAPP on TOMCAT with a servlet
 running in SINGLE THREAD MODE, I frequently get the error below

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]